Rule F019

Avoid inferSchema=True and mergeSchema=True in Spark read options

Severity

🟡 MEDIUM — Moderate performance impact and a source of non-deterministic schemas.

PySpark version

Compatible with PySpark 2.0 and later.

Information

Both options instruct Spark to determine the schema at runtime by scanning the source data rather than using a schema you have declared explicitly (inferSchema applies to CSV and JSON readers, mergeSchema to Parquet). This causes:

  • An extra full scan of the dataset before the actual job starts — at scale this means reading gigabytes or terabytes of data just to guess column types
  • Non-deterministic schemas that silently change when source data changes: new files with extra columns, type drift between partitions, or different null ratios can produce a different schema on every run
  • Hard-to-debug production failures when a schema inferred in development no longer matches what arrives in production
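
The schema-drift failure mode is easy to reproduce. The sketch below is a hypothetical, heavily simplified stand-in for CSV type inference, not Spark's actual implementation (the infer_type helper and sample values are invented for illustration): a column that inferred as long on every previous run flips to string the moment one malformed value lands in the source data.

```python
# Simplified illustration of type inference (NOT Spark's real logic):
# guess a column's type from the values observed in the data.
def infer_type(values):
    def is_long(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    # If every value parses as an integer, infer long; otherwise string.
    if all(is_long(v) for v in values):
        return "long"
    return "string"

run_1 = ["101", "102", "103"]   # files present on the first run
run_2 = run_1 + ["N/A"]         # a later file contains one bad value

print(infer_type(run_1))        # prints: long
print(infer_type(run_2))        # prints: string  (same column, new schema)
```

Downstream code written against the long column from development now receives strings in production, which is exactly the hard-to-debug failure described above.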

mergeSchema=True compounds the problem by unioning the schemas of every Parquet part-file, which can silently introduce unexpected nullable columns or widen column types.
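
To make the unioning behavior concrete, here is a minimal sketch of what a mergeSchema-style union does. This is an invented illustration, not Spark's actual merge implementation: the merge_schemas helper, the WIDEN table, and the partition schemas are all hypothetical, and real Parquet merging has more rules (and can fail outright on incompatible types).

```python
# Pairs of numeric types and the wider type a merge resolves them to.
WIDEN = {
    ("int", "long"): "long",     ("long", "int"): "long",
    ("int", "double"): "double", ("double", "int"): "double",
    ("long", "double"): "double", ("double", "long"): "double",
}

def merge_schemas(a, b):
    """Union two {column: type} schemas, widening conflicting numeric types."""
    merged = dict(a)
    for col, typ in b.items():
        if col not in merged:
            merged[col] = typ  # column present in only one partition
        elif merged[col] != typ:
            merged[col] = WIDEN[(merged[col], typ)]  # silent widening
    return merged

part_1 = {"id": "int", "name": "string"}
part_2 = {"id": "long", "score": "double"}  # type drift + an extra column

print(merge_schemas(part_1, part_2))
# {'id': 'long', 'name': 'string', 'score': 'double'}
```

Neither the widened id nor the extra score column was ever declared anywhere; both appear in the result only because of what happened to be in the data.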

Best practices

Define schemas explicitly using StructType / StructField so the schema is a visible contract in your code and in code review:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id",      LongType(),   nullable=False),
    StructField("country", StringType(), nullable=True),
])

df = spark.read.schema(schema).csv("s3://bucket/data/")

Rule of thumb: Schema inference is for exploration in a notebook — never in a pipeline.

Example

Bad:

df = spark.read.option("inferSchema", "true").csv("s3://bucket/data/")
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/data/")
df = spark.read.csv("s3://bucket/data/", inferSchema=True)

Good:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])
df = spark.read.schema(schema).csv("s3://bucket/data/")