Rule S008
Avoid overusing explode() or explode_outer()
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 2.3 and later.
Information
Using explode() or explode_outer() frequently or unnecessarily can lead to:
- Performance degradation: Each call can dramatically increase the number of rows, causing large shuffles and slow processing.
- Memory pressure: Exploded datasets can become very large, increasing executor memory usage and risk of OOM errors.
- Complex execution plans: Multiple explosions make query plans harder to understand and maintain.
- Data skew: Exploding uneven arrays or maps can create partitions with very different sizes, reducing parallelism efficiency.
Best practices
- Only use explode() or explode_outer() when strictly necessary
- Consider alternative approaches such as inline(), posexplode(), or transforming the data with select and higher-order functions (transform(), filter())
- If multiple explosions are needed, combine transformations carefully to minimize row multiplication
Rule of thumb: Minimize the use of explode() and explode_outer(); prefer higher-order functions or carefully planned transformations to maintain performance and scalability.
Example
Bad:
from pyspark.sql.functions import col, explode

df.withColumn("a", explode(col("arr1"))) \
  .withColumn("b", explode(col("arr2"))) \
  .withColumn("c", explode(col("arr3"))) \
  .withColumn("d", explode(col("arr4")))  # exceeds threshold
Good: