Rule S001

Missing .coalesce(numPartitions) after .union() / .unionByName()

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

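Applies to PySpark 2.0 and later for .union() and PySpark 2.3 and later for .unionByName(). DataFrame.coalesce() itself has been available since PySpark 1.4.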

Information

A DataFrame produced by .union() or .unionByName() typically has as many partitions as its inputs combined, so the partition count grows with every DataFrame added to the chain. If not controlled, this may lead to:

  • Too many small partitions
  • Increased task scheduling overhead
  • Degraded performance

Best practices

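  • Use .coalesce(n) after .union() or .unionByName() to reduce the partition count when it is higher than the data needs; .coalesce() merges existing partitions without a full shuffle
  • Choose n based on data size and cluster resources
  • Use .repartition(n) instead if you need to rebalance data evenly; it performs a full shuffle, so it costs more than .coalesce(n)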
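
As a rough illustration of choosing n, the sketch below derives a partition count from an estimated data size and the number of executor cores. The function name suggest_partition_count, the 128 MiB target partition size, and the rounding to a multiple of the core count are all assumptions made for this example, not part of PySpark or of this rule; tune them for your workload.

```python
import math


def suggest_partition_count(total_bytes, total_cores,
                            target_partition_bytes=128 * 1024 * 1024):
    """Suggest n for .coalesce(n). Hypothetical helper, not a PySpark API.

    Assumption: roughly 128 MiB of data per partition is a reasonable
    starting point. Adjust target_partition_bytes for your workload.
    """
    # Size-based estimate: enough partitions to hold the data, at least one.
    by_size = max(1, math.ceil(total_bytes / target_partition_bytes))
    if by_size <= total_cores:
        # Small data: do not create extra partitions just to fill the cores.
        return by_size
    # Larger data: round up to a multiple of the core count so tasks run
    # in full waves across the cluster.
    return math.ceil(by_size / total_cores) * total_cores
```

For example, with about 9 GiB of data on 16 cores, the size-based estimate is 72 partitions, which this sketch rounds up to 80.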

Rule of thumb: After .union() or .unionByName(), check the partition count of the result and reduce it explicitly when it is higher than the data size warrants.

Example

Bad:

df_all = df1.union(df2)

Good:

df_all = df1.union(df2).coalesce(4)
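
When more than two DataFrames are combined, the same pattern can be wrapped in a small helper that coalesces only when the result exceeds a chosen cap. This is a minimal sketch: union_and_coalesce, max_partitions, and by_name are hypothetical names introduced here, and it assumes df.rdd is available in your session so that the current partition count can be read with getNumPartitions().

```python
from functools import reduce


def union_and_coalesce(dfs, max_partitions, by_name=False):
    """Union DataFrames, then coalesce if the result has too many partitions.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if not dfs:
        raise ValueError("dfs must contain at least one DataFrame")
    if by_name:
        # Match columns by name rather than by position.
        combined = reduce(lambda left, right: left.unionByName(right), dfs)
    else:
        combined = reduce(lambda left, right: left.union(right), dfs)
    # Assumption: df.rdd.getNumPartitions() reports the current partition count.
    if combined.rdd.getNumPartitions() > max_partitions:
        # coalesce() only reduces the count, so it is skipped when the
        # result is already at or below the cap.
        combined = combined.coalesce(max_partitions)
    return combined
```

Usage would look like df_all = union_and_coalesce([df1, df2, df3], max_partitions=4), where 4 is an illustrative cap, as in the example above.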