Rule S005
.repartition() with fewer partitions than default
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
The default number of shuffle partitions is set by spark.sql.shuffle.partitions (200 unless overridden). Calling .repartition(n) with n lower than this default can cause:
- Skewed workloads
- Reduced parallelism
- Potential performance degradation on large datasets
Best practices
- Avoid reducing partitions below the default unless justified by data size
- Use .coalesce(n) instead of .repartition(n) to reduce partitions without a full shuffle
- Tune n based on cluster size and dataset volume
Rule of thumb: Maintain or carefully adjust partition counts to balance parallelism and overhead.
Example
Bad:
Good: