Rule U006
Avoid all() inside a UDF body — use pyspark.sql.functions.forall instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 3.1 and later.
Information
Python's built-in all() used inside a UDF iterates over a collection entirely
in Python, row by row, with no Spark optimisation:
- The array column must be deserialised from the JVM to Python
- all() runs in the Python interpreter with no vectorisation
- The boolean result is re-serialised back to the JVM
- Catalyst cannot see into the predicate or push it down
pyspark.sql.functions.forall(col, predicate) evaluates the predicate over
every array element using Spark's native execution engine — no UDF boundary,
no serialisation round-trip, and the predicate is visible to the optimiser.
Reference: pyspark.sql.functions.forall
Best practices
Replace all(...) inside a UDF body with forall(col, lambda x: ...) applied
directly to the DataFrame column.
Rule of thumb: If your UDF body's purpose is to check a condition across all
elements of an array, forall does the same thing without leaving the JVM.
Example
Bad:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def all_positive(items):
    return all(x > 0 for x in items)

@udf(returnType=BooleanType())
def all_non_empty(items):
    return all(items)
Good: