Rule F014
Avoid explode_outer() — handle nulls explicitly with higher-order functions
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 2.3 and later (explode_outer()); the higher-order functions referenced below reached the SQL engine in 2.4 and the pyspark.sql.functions API in 3.1.
Information
explode_outer() is often used as a workaround to preserve rows where the array column is null or empty. While it works, it signals that null handling has been pushed into the explosion step rather than being addressed explicitly upstream.
- Higher-order functions (transform, filter, aggregate, forall, exists) let you handle nulls and empty arrays in place without materializing extra rows
- explode_outer() produces one null row per null/empty array, which must then be filtered out again, adding an extra processing step
- Keeping null handling close to the data transformation improves clarity and reduces unnecessary data movement
Best practices
- Use transform() or filter() (higher-order) to clean or process array columns before exploding
- If rows with null arrays should be dropped, use coalesce(col("arr"), array()) before a regular explode() to make the intent explicit; if such a row must be preserved, coalesce to array(lit(None)) instead
- Reserve explode_outer() for cases where the null row itself is meaningful business data
Example
Bad:
# null handling pushed implicitly into the explosion step
df.withColumn("item", explode_outer(col("items")))
Good:
from pyspark.sql.functions import array, coalesce, col, explode, filter

# drop rows with null arrays explicitly, then explode
df.withColumn("items", coalesce(col("items"), array())) \
  .withColumn("item", explode(col("items")))
# or drop null elements inside the array with a higher-order function
# (filter() as a DataFrame function requires PySpark 3.1+)
df.withColumn("items", filter(col("items"), lambda x: x.isNotNull())) \
  .withColumn("item", explode(col("items")))