Rule S001
Missing .coalesce(numPartitions) after .union() / .unionByName()
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
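The result of .union() or .unionByName() in PySpark generally keeps every partition of every input, so its partition count is the sum of the inputs' counts. Chaining unions across multiple DataFrames can therefore inflate the partition count quickly. If not controlled, this may lead to: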
- Too many small partitions
- Increased task scheduling overhead
- Degraded performance
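To observe the effect on real data, partition counts can be compared before and after a union. The helper below is a minimal sketch: the name union_partition_report is hypothetical, and it assumes both arguments are PySpark DataFrames with compatible schemas.

```python
def union_partition_report(df_a, df_b):
    """Return partition counts for two DataFrames and for their union.

    Hypothetical helper: df_a and df_b are assumed to be
    pyspark.sql.DataFrame objects with compatible schemas.
    """
    combined = df_a.union(df_b)
    return {
        "left": df_a.rdd.getNumPartitions(),
        "right": df_b.rdd.getNumPartitions(),
        # Usually left + right, because union() keeps every input partition.
        "union": combined.rdd.getNumPartitions(),
    }
```

For two inputs with 8 partitions each, the "union" entry would typically be 16.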
Best practices
- Use .coalesce(n) after .union() or .unionByName() to reduce partitions when appropriate
- Choose n based on data size and cluster resources
- Use .repartition(n) instead if you need to rebalance data evenly
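One way to apply these practices is sketched below. Both function names are hypothetical, and the 128 MiB per-partition target is only an assumed starting point, not something this rule prescribes; tune it to the workload and cluster.

```python
import math

# Assumption: ~128 MiB per partition is a common starting target, not a rule.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def pick_partition_count(total_bytes, target_bytes=DEFAULT_TARGET_BYTES):
    """Estimate a partition count n from an approximate data size in bytes."""
    return max(1, math.ceil(total_bytes / target_bytes))


def control_partitions(df, n, rebalance=False):
    """Apply coalesce(n) by default, or repartition(n) when rebalancing.

    coalesce(n) merges existing partitions without a full shuffle and can
    only lower the count; repartition(n) shuffles to spread rows evenly.
    """
    return df.repartition(n) if rebalance else df.coalesce(n)
```

Because coalesce(n) avoids a full shuffle it is usually the cheaper choice after a union; repartition(n) costs a shuffle but helps when the unioned inputs differ widely in size.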
Rule of thumb: After .union() or .unionByName(), ensure partition count is controlled to avoid unnecessary overhead.
Example
Bad:
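A minimal sketch of the flagged pattern, assuming both arguments are PySpark DataFrames with the same schema (the function and parameter names are illustrative only):

```python
def combine_orders(orders_2023, orders_2024):
    """Union two DataFrames and return the result unchanged."""
    # No .coalesce() after the union: the result keeps every partition
    # from both inputs.
    return orders_2023.union(orders_2024)
```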
Good:
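A minimal sketch of the compliant pattern, assuming both arguments are PySpark DataFrames with the same schema. The names and the default of 8 partitions are illustrative placeholders; choose the count from data size and cluster resources.

```python
def combine_orders(orders_2023, orders_2024, num_partitions=8):
    """Union two DataFrames, then cap the partition count."""
    # Coalesce right after the union so the result does not carry the
    # combined partition count of both inputs downstream.
    return orders_2023.union(orders_2024).coalesce(num_partitions)
```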