Rule S005
.repartition() with fewer partitions than default
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
The default number of shuffle partitions is set by spark.sql.shuffle.partitions (200 unless overridden). Calling .repartition(n) with n lower than this default can cause:
- Skewed workloads
- Reduced parallelism
- Potential performance degradation on large datasets
Best practices
- Avoid reducing partitions below the default unless justified by data size
- Use .coalesce(n) instead of .repartition(n) to reduce partitions without a full shuffle
- Tune n based on cluster size and dataset volume
Rule of thumb: Maintain or carefully adjust partition counts to balance parallelism and overhead.
Example
Bad:
Good: