Rule F019
Avoid inferSchema=True and mergeSchema=True in Spark read options
Severity
🟡 MEDIUM — Moderate performance impact; inferred schemas can also drift and cause correctness issues.
PySpark version
Compatible with PySpark 2.0 and later.
Information
Both options instruct Spark to determine the schema at runtime by scanning the source data, rather than reading a schema you have declared explicitly. This causes:
- An extra full scan of the dataset before the actual job starts — at scale this means reading gigabytes or terabytes of data just to guess column types
- Non-deterministic schemas that silently change when source data changes: new files with extra columns, type drift between partitions, or different null ratios can produce a different schema on every run
- Hard-to-debug production failures when a schema inferred in development no longer matches what arrives in production
mergeSchema=True compounds the problem: Spark reads the schema from every Parquet part file and unions them, which can silently introduce unexpected nullable columns or widen types across partitions.
Best practices
Define schemas explicitly using StructType / StructField so the schema is
a visible contract in your code and in code review:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("country", StringType(), nullable=True),
])

df = spark.read.schema(schema).csv("s3://bucket/data/")
Rule of thumb: Schema inference is for exploration in a notebook — never in a pipeline.
Example
Bad:
# Extra full pass over the data just to guess column types
df = spark.read.option("inferSchema", "true").csv("s3://bucket/data/")
# Unions every partition schema; columns and types can drift silently
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/data/")
# Same inference cost via the keyword argument
df = spark.read.csv("s3://bucket/data/", inferSchema=True)
Good:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("country", StringType(), nullable=True),
])

df = spark.read.schema(schema).csv("s3://bucket/data/")
df = spark.read.schema(schema).parquet("s3://bucket/data/")