Rule S006
.repartition() with more partitions than default
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Calling .repartition(n) with n greater than the default spark.sql.shuffle.partitions (200 unless configured otherwise) can lead to:
- Many small tasks, increasing task scheduling overhead
- Higher shuffle and network costs
- Potential memory pressure on the driver and executors
Best practices
- Only increase partitions when the dataset is very large and benefits from higher parallelism
- Monitor Spark UI to ensure tasks are not too small or excessive
- Consider using .coalesce(n) after large transformations if reducing partitions is sufficient
Rule of thumb: Avoid unnecessary increases in partitions; tune n based on data size and cluster capacity.
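The rule of thumb can be made concrete with a simple sizing heuristic. The 128 MB target, the cap, and the helper name below are assumptions for illustration, not part of any Spark API:

```python
# Hypothetical heuristic: choose n so each partition holds roughly 128 MB.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def suggest_partitions(total_bytes: int, cap: int = 2000) -> int:
    """Suggest a partition count from input size, capped to limit scheduling overhead."""
    needed = -(-total_bytes // TARGET_PARTITION_BYTES)  # ceiling division
    return max(1, min(cap, needed))

print(suggest_partitions(10 * 1024**3))  # 10 GiB -> 80 partitions
print(suggest_partitions(50 * 1024**2))  # 50 MiB -> 1 partition
```

A helper like this keeps n proportional to data size rather than hard-coding a large value.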
Example
Bad:
Good: