Rule F014

Avoid explode_outer() — handle nulls explicitly with higher-order functions

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

explode_outer() is available from PySpark 2.3. The Python higher-order function API recommended below (transform(), filter(), forall(), exists(), aggregate()) was added in PySpark 3.1; on 2.4–3.0 the same functions are reachable through SQL lambda syntax in expr().

Information

explode_outer() is often used as a workaround to preserve rows where the array column is null or empty. While it works, it signals that null handling has been pushed into the explosion step rather than being addressed explicitly upstream.

  • Higher-order functions (transform, filter, aggregate, forall, exists) let you handle nulls and empty arrays in-place without materializing extra rows
  • explode_outer() produces one row with a null value per null/empty array, and downstream code typically has to filter those rows back out again, adding an extra pass over the data
  • Keeping null handling close to the data transformation improves clarity and reduces unnecessary data movement

Best practices

  • Use transform() or filter() (higher-order) to clean or process array columns before exploding
  • If a null row must be preserved, use coalesce(col("arr"), array()) before a regular explode() to make the intent explicit
  • Reserve explode_outer() only when the null row itself is meaningful business data

Example

Bad:

df.withColumn("item", explode_outer(col("items")))

Good:

# make the null explicit, then use a regular explode
from pyspark.sql.functions import array, coalesce, col, explode

df.withColumn("items", coalesce(col("items"), array())) \
  .withColumn("item", explode(col("items")))
# note: an empty array() literal defaults to array<string>;
# cast it when the column's element type differs

# or filter inside the array with a higher-order function (PySpark >= 3.1);
# the F. prefix avoids shadowing Python's built-in filter()
from pyspark.sql import functions as F

df.withColumn("items", F.filter(F.col("items"), lambda x: x.isNotNull())) \
  .withColumn("item", F.explode(F.col("items")))