Rule S007
Avoid using repartition(1) or coalesce(1)
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Using repartition(1) or coalesce(1) forces Spark to write all data to a single partition, which can lead to:
- Performance bottlenecks: All data is shuffled to a single executor, causing serialization overhead and slower processing.
- Reduced parallelism: Spark’s ability to process partitions in parallel is lost, defeating the purpose of distributed computation.
- Memory pressure: Large datasets may exceed the memory of a single executor, causing failures.
- Scalability issues: Workloads that run fine on small data may fail or slow dramatically on larger datasets.
While downstream systems sometimes require a smaller number of output files, the performance and scalability costs of forcing a single partition are usually not justified.
Best practices
- Allow Spark to determine partitioning automatically for optimal parallelism
- If you must reduce output files, use repartition(n) or coalesce(n) with a reasonable number of partitions
- Consider merging files after writing using external tools instead of forcing a single partition in Spark
Rule of thumb: Never force a single output partition with repartition(1) or coalesce(1); prioritize distributed processing and scalability.
Example
Bad:
Good: