Rule S008

Avoid overusing explode() or explode_outer()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 2.3 and later.

Information

Using explode() or explode_outer() frequently or unnecessarily can lead to:

  • Performance degradation: Each call can dramatically increase the number of rows, causing large shuffles and slow processing.
  • Memory pressure: Exploded datasets can become very large, increasing executor memory usage and risk of OOM errors.
  • Complex execution plans: Multiple explosions make query plans harder to understand and maintain.
  • Data skew: Exploding uneven arrays or maps can create partitions with very different sizes, reducing parallelism efficiency.

Best practices

  • Only use explode() or explode_outer() when strictly necessary.
  • Consider alternative approaches such as inline(), posexplode(), or transforming the data with select and higher-order functions (transform(), filter()).
  • If multiple explosions are needed, combine transformations carefully to minimize row multiplication.

Rule of thumb: Minimize the use of explode() and explode_outer(); prefer higher-order functions or carefully planned transformations to maintain performance and scalability.

Example

Bad:

from pyspark.sql.functions import col, explode

df.withColumn("a", explode(col("arr1"))) \
  .withColumn("b", explode(col("arr2"))) \
  .withColumn("c", explode(col("arr3"))) \
  .withColumn("d", explode(col("arr4")))  # exceeds threshold

Good:

# Explode only what you need, or use inline/posexplode
from pyspark.sql.functions import col, posexplode

df.select("id", posexplode(col("arr1")).alias("pos", "val"))