Rule S007

Avoid using repartition(1) or coalesce(1)

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.4 and later.

Information

Using repartition(1) or coalesce(1) forces Spark to collapse all data into a single partition, which can lead to:

  • Performance bottlenecks: All data is funneled to a single executor, adding shuffle and serialization overhead and slowing processing.
  • Reduced parallelism: Spark's ability to process partitions in parallel is lost, defeating the purpose of distributed computation.
  • Memory pressure: Large datasets may exceed the memory of a single executor, causing task or job failures.
  • Scalability issues: Workloads that run fine on small data may fail or slow dramatically on larger datasets.

While downstream systems sometimes request a single output file, the performance and scalability trade-offs of producing it inside Spark are usually not justified.

Best practices

  • Allow Spark to determine partitioning automatically for optimal parallelism
  • If you must reduce output files, use repartition(n) or coalesce(n) with a reasonable number of partitions
  • Consider merging files after writing using external tools instead of forcing a single partition in Spark
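For the post-write merge approach, part files of a row-oriented text format such as headerless CSV can simply be concatenated outside Spark. A minimal stdlib sketch (the merge_csv_parts name and the part-* glob pattern are assumptions; this works for headerless CSV, not for columnar formats like Parquet, whose files cannot be byte-concatenated):

```python
# Hedged sketch: merge Spark CSV part files into one file without coalesce(1).
import glob
import shutil

def merge_csv_parts(part_dir: str, dest: str) -> None:
    """Concatenate headerless CSV part files from a Spark output directory."""
    with open(dest, "wb") as out:
        # Spark names its output splits part-*; sort for deterministic order.
        for part in sorted(glob.glob(f"{part_dir}/part-*")):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

This keeps the Spark job fully parallel and moves the single-file requirement to a cheap sequential step at the end.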

Rule of thumb: Never force a single output partition with repartition(1) or coalesce(1); prioritize distributed processing and scalability.

Example

Bad:

df.repartition(1)  # full shuffle of all data into a single partition
df.coalesce(1)     # collapses all existing partitions into one task

Good:

# let Spark choose the partitioning
df.write.mode("overwrite").parquet("output/")
# or reduce to a small, reasonable number of partitions
df.coalesce(4).write.parquet("output/")