Rule S005

.repartition() with fewer partitions than default

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

The default number of shuffle partitions is controlled by spark.sql.shuffle.partitions (200 by default). Calling .repartition(n) with n lower than this value still triggers a full shuffle, but into fewer partitions, which can cause:

  • Skewed workloads
  • Reduced parallelism
  • Potential performance degradation on large datasets

Best practices

  • Avoid reducing partitions below the default unless justified by data size
  • Use .coalesce(n) instead of .repartition(n) to reduce partitions without a full shuffle
  • Tune n based on cluster size and dataset volume

Rule of thumb: Maintain or carefully adjust partition counts to balance parallelism and overhead.
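One common sizing heuristic (an assumption for illustration, not part of this rule: aim for roughly 128 MiB per partition while keeping at least a couple of tasks per core) can be sketched as a plain helper; the function name and constants are hypothetical.

```python
def suggest_num_partitions(data_size_bytes, total_cores,
                           target_partition_bytes=128 * 1024 * 1024):
    """Hypothetical helper: pick a partition count that keeps partitions
    near a target size while still giving every core work to do."""
    by_size = -(-data_size_bytes // target_partition_bytes)  # ceiling division
    # 2 tasks per core is a common floor so stragglers don't idle the cluster.
    return max(by_size, 2 * total_cores)

# 10 GiB on a 16-core cluster -> 80 partitions (size-driven)
print(suggest_num_partitions(10 * 1024**3, 16))
# 1 MiB on a 16-core cluster -> 32 partitions (parallelism-driven floor)
print(suggest_num_partitions(1024**2, 16))
```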

Example

Bad:

df.repartition(2)  # fewer than Spark default 200

Good:

df.repartition(200)
# or use coalesce() to reduce partitions without a full shuffle
df.coalesce(10)