Rule S006

.repartition() with more partitions than default

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Calling .repartition(n) with n greater than the default spark.sql.shuffle.partitions (200 unless overridden) forces a full shuffle and can lead to:

  • Many small tasks, increasing task scheduling overhead
  • Higher shuffle and network costs
  • Potential memory pressure on executors and extra task bookkeeping on the driver
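The "many small tasks" point can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions (a ~1 GB dataset and a commonly cited ~128 MB per-partition target), not measurements:

```python
# Illustrative only: repartitioning a modest dataset into many partitions
# yields tiny tasks whose scheduling overhead outweighs any parallelism gain.
total_mb = 1024        # assumed dataset size: ~1 GB
partitions = 500       # as in df.repartition(500)
per_partition_mb = total_mb / partitions
print(per_partition_mb)  # → 2.048
# ~2 MB per task, far below a ~128 MB per-partition target
```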

Best practices

  • Only increase partitions when the dataset is very large and benefits from higher parallelism
  • Monitor the Spark UI to ensure tasks are neither too small nor too numerous
  • Consider .coalesce(n) after large transformations when reducing partitions is sufficient; it avoids a full shuffle

Rule of thumb: Avoid unnecessary increases in partitions; tune n based on data size and cluster capacity.
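One way to apply this rule of thumb is to derive n from the estimated input size and a per-partition size target. The helper below is a minimal sketch: the function name, the 128 MB target, and the caller-supplied byte estimate are all illustrative assumptions, not Spark APIs:

```python
def suggest_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024, minimum=1):
    """Suggest a partition count so each partition holds roughly
    target_partition_bytes of data (128 MB is a common rule of thumb;
    this helper and its defaults are illustrative, not part of Spark)."""
    # Ceiling division without importing math
    return max(minimum, -(-total_bytes // target_partition_bytes))

# For a ~10 GB dataset, this suggests 80 partitions of ~128 MB each:
print(suggest_partitions(10 * 1024**3))  # → 80
```

The result could then be passed to df.repartition(...), keeping the count driven by data size rather than a hard-coded number.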

Example

Bad:

df.repartition(500)  # exceeds the Spark default of 200

Good:

df.repartition(200)  # stays within the default spark.sql.shuffle.partitions

df.coalesce(50)  # or reduce partitions without a full shuffle