Rule S001

Missing .coalesce(numPartitions) after .union() / .unionByName()

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

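Compatible with PySpark 1.4 and later, the release that introduced DataFrame.coalesce(). Note that DataFrame.union() was added in PySpark 2.0 (replacing unionAll()) and .unionByName() in PySpark 2.3.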

Information

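.union() and .unionByName() do not shuffle data. The resulting DataFrame typically keeps every partition of its inputs, so its partition count is the sum of the inputs' partition counts. This grows quickly when many DataFrames are combined. If not controlled, this may lead to: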

Too many small partitions
Increased task scheduling overhead
Degraded performance
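
The growth in partition count can be checked directly. The following is a minimal sketch, assuming two existing PySpark DataFrames. The helper name union_partition_counts is hypothetical and not part of PySpark; it relies only on DataFrame.union() and DataFrame.rdd.getNumPartitions().

```python
def union_partition_counts(df1, df2):
    """Return (left, right, combined) partition counts for df1.union(df2).

    Hypothetical helper for illustration; expects PySpark DataFrames.
    """
    left = df1.rdd.getNumPartitions()
    right = df2.rdd.getNumPartitions()
    # union() does not shuffle; the inputs' partitions are typically
    # carried over as-is into the result.
    combined = df1.union(df2).rdd.getNumPartitions()
    return left, right, combined
```

For example, calling union_partition_counts(spark.range(100).repartition(8), spark.range(100).repartition(12)) with an active SparkSession named spark should typically report a combined count equal to the sum of the two inputs.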

Best practices

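Use .coalesce(n) after .union() or .unionByName() to reduce partitions when appropriate; it merges existing partitions without a full shuffle and cannot increase the partition count
Choose n based on data size and cluster resources
Use .repartition(n) instead if you need to rebalance data evenly; unlike .coalesce(n), it triggers a full shuffle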
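
One possible way to choose n from data size is sketched below. The helper suggest_partitions is hypothetical, and both the 128 MiB target partition size and the min_partitions floor (for instance, the total number of executor cores) are assumptions for illustration, not values prescribed by this rule or by PySpark.

```python
def suggest_partitions(total_bytes,
                       target_partition_bytes=128 * 1024 * 1024,
                       min_partitions=1):
    """Suggest a partition count n to pass to .coalesce(n).

    Hypothetical helper: the 128 MiB default is an assumed target size.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_partition_bytes <= 0 or min_partitions < 1:
        raise ValueError("target_partition_bytes and min_partitions must be positive")
    # Integer ceiling division: number of target-sized partitions needed.
    needed = (total_bytes + target_partition_bytes - 1) // target_partition_bytes
    # Never go below the floor (e.g. available cores), and never below 1.
    return max(min_partitions, needed)
```

A caller with a rough size estimate could then write df_all = df1.union(df2).coalesce(suggest_partitions(estimated_bytes)), where estimated_bytes is supplied by the caller.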

Rule of thumb: After .union() or .unionByName(), ensure partition count is controlled to avoid unnecessary overhead.

Example

Bad:

df_all = df1.union(df2)

Good:

df_all = df1.union(df2).coalesce(4)  # 4 is illustrative; choose n from data size and cluster resources
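
When more than two DataFrames are combined, the same pattern can be applied once at the end of the chain. This is a minimal sketch, assuming a non-empty list of PySpark DataFrames with compatible schemas; union_all_capped and its max_partitions parameter are hypothetical names, not part of PySpark.

```python
from functools import reduce


def union_all_capped(dfs, max_partitions):
    """Union DataFrames by position, then coalesce once at the end.

    Hypothetical helper for illustration; expects PySpark DataFrames.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("dfs must contain at least one DataFrame")
    if max_partitions < 1:
        raise ValueError("max_partitions must be at least 1")
    combined = reduce(lambda left, right: left.union(right), dfs)
    # coalesce() cannot increase the partition count, so only call it
    # when the union actually exceeds the cap.
    if combined.rdd.getNumPartitions() > max_partitions:
        combined = combined.coalesce(max_partitions)
    return combined
```

Usage: df_all = union_all_capped([df1, df2, df3], max_partitions=4). Coalescing once after the final union avoids inserting a separate coalesce step after every intermediate union.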