Rule U006

Avoid all() inside a UDF body — use pyspark.sql.functions.forall instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 3.1 and later.

Information

Python's built-in all() used inside a UDF body evaluates its predicate entirely in the Python interpreter, row by row, with no Spark optimisation:

  • The array column must be deserialised from the JVM to Python
  • all() runs in the Python interpreter with no vectorisation
  • The boolean result is re-serialised back to the JVM
  • Catalyst cannot see into the predicate or push it down

pyspark.sql.functions.forall(col, predicate) evaluates the predicate over every array element using Spark's native execution engine — no UDF boundary, no serialisation round-trip, and the predicate stays visible to the Catalyst optimiser.
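As a rough sketch of the semantics forall implements — a pure-Python model, not the actual JVM implementation, and forall_model is a hypothetical name for illustration. Spark evaluates this natively with SQL three-valued logic:

```python
def forall_model(arr, predicate):
    """Pure-Python model of forall's SQL semantics (illustration only).

    A NULL (None) array yields NULL, any false element yields false,
    and a NULL predicate result with no false element yields NULL.
    """
    if arr is None:          # NULL array -> NULL result
        return None
    saw_null = False
    for x in arr:
        result = predicate(x)
        if result is False:  # one false element decides the outcome
            return False
        if result is None:   # SQL unknown: only matters if nothing is false
            saw_null = True
    return None if saw_null else True
```

For example, forall_model([1, 2, 3], lambda x: x > 0) is True, while forall_model(None, lambda x: x > 0) is None — mirroring SQL null propagation.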

Reference: pyspark.sql.functions.forall

Best practices

Replace all(...) inside a UDF with forall(col, lambda x: ...) applied directly to the DataFrame column.

Rule of thumb: If your UDF body's purpose is to check a condition across all elements of an array, forall does the same thing without leaving the JVM.

Example

Bad:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def all_positive(items):
    return all(x > 0 for x in items)

@udf(returnType=BooleanType())
def all_non_empty(items):
    return all(items)

Good:

from pyspark.sql.functions import forall, col

df.withColumn("all_positive", forall(col("items"), lambda x: x > 0))
df.withColumn("all_non_empty", forall(col("items"), lambda x: x != ""))  # assuming string elements; all(items) checked truthiness
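One behavioural caveat worth knowing when migrating: if the array column is NULL, the UDF receives None and Python's all() raises a TypeError (surfacing as a job failure), whereas forall simply propagates NULL. Both treat an empty array as vacuously true. A pure-Python illustration of the UDF side:

```python
# What the UDF body sees when the array column is NULL: Python gets None,
# and all(None) raises TypeError — the Spark job fails with a PythonException.
try:
    all(None)
except TypeError as exc:
    print(f"UDF would fail: {exc}")

# Empty arrays are safe in both approaches — vacuously true.
print(all([]))  # True, matching forall on an empty array
```

forall, by contrast, returns NULL for a NULL input array (standard SQL null propagation), so no explicit null guard is needed after the rewrite.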