Rule S006

.repartition() with more partitions than default

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Calling .repartition(n) with n greater than the default spark.sql.shuffle.partitions (200 unless overridden) forces a full shuffle and can lead to:

  • Many small tasks, increasing task scheduling overhead
  • Higher shuffle and network costs
  • Potential memory pressure on executors and extra task bookkeeping on the driver
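The "many small tasks" point can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions (a ~1 GB dataset and a commonly cited ~128 MB per-partition target), not measurements:

```python
# Illustrative only: repartitioning a modest dataset into many partitions
# yields tiny tasks whose scheduling overhead outweighs any parallelism gain.
total_mb = 1024        # assumed dataset size: ~1 GB
partitions = 500       # as in df.repartition(500)
per_partition_mb = total_mb / partitions
print(per_partition_mb)  # → 2.048
# ~2 MB per task, far below a ~128 MB per-partition target
```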

Best practices

  • Only increase partitions when the dataset is very large and benefits from higher parallelism
  • Monitor the Spark UI to ensure tasks are neither too small nor too numerous
  • Consider .coalesce(n) after large transformations when reducing partitions is sufficient; it avoids a full shuffle

Rule of thumb: Avoid unnecessary increases in partitions; tune n based on data size and cluster capacity.
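One way to apply this rule of thumb is to derive n from the estimated input size and a per-partition size target. The helper below is a minimal sketch: the function name, the 128 MB target, and the caller-supplied byte estimate are all illustrative assumptions, not Spark APIs:

```python
def suggest_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024, minimum=1):
    """Suggest a partition count so each partition holds roughly
    target_partition_bytes of data (128 MB is a common rule of thumb;
    this helper and its defaults are illustrative, not part of Spark)."""
    # Ceiling division without importing math
    return max(minimum, -(-total_bytes // target_partition_bytes))

# For a ~10 GB dataset, this suggests 80 partitions of ~128 MB each:
print(suggest_partitions(10 * 1024**3))  # → 80
```

The result could then be passed to df.repartition(...), keeping the count driven by data size rather than a hard-coded number.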

Example

Bad:

df.repartition(500)  # exceeds the Spark default of 200

Good:

df.repartition(200)  # stays within the default spark.sql.shuffle.partitions

df.coalesce(50)  # or reduce partitions without a full shuffle