Rule S007
Avoid using repartition(1) or coalesce(1)
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Using repartition(1) or coalesce(1) forces Spark to write all data to a single partition, which can lead to:
- Performance bottlenecks: All data is shuffled to a single executor, causing serialization overhead and slower processing.
- Reduced parallelism: Spark’s ability to process partitions in parallel is lost, defeating the purpose of distributed computation.
- Memory pressure: Large datasets may exceed the memory of a single executor, causing failures.
- Scalability issues: Workloads that run fine on small data may fail or slow dramatically on larger datasets.
While downstream systems sometimes require a smaller number of output files, the performance and scalability costs of forcing a single partition are usually not justified.
Best practices
- Allow Spark to determine partitioning automatically for optimal parallelism
- If you must reduce output files, use repartition(n) or coalesce(n) with a reasonable number of partitions
- Consider merging files after writing using external tools instead of forcing a single partition in Spark
Rule of thumb: Never force a single output partition with repartition(1) or coalesce(1); prioritize distributed processing and scalability.
Example
Bad:
Good: