Rule U007
Avoid any() inside a UDF body — use pyspark.sql.functions.exists instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 3.1 and later.
Information
Python's built-in any() used inside a UDF iterates over a collection entirely
in Python, row by row, with no Spark optimisation:
- The array column must be deserialised from the JVM to Python
- any() runs in the Python interpreter with no vectorisation
- The boolean result is re-serialised back to the JVM
- Catalyst cannot see into the predicate or push it down
pyspark.sql.functions.exists(col, predicate) evaluates the predicate over
every array element using Spark's native execution engine — no UDF boundary,
no serialisation round-trip, and the predicate is visible to the optimizer.
Reference: pyspark.sql.functions.exists
Best practices
Replace any(...) in a UDF with exists(col, lambda x: ...) applied directly
on the DataFrame column.
Rule of thumb: If your UDF body's purpose is to check whether any element of
an array satisfies a condition, exists does the same thing without leaving the JVM.
Example
Bad:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def has_negative(items):
    return any(x < 0 for x in items)

@udf(returnType=BooleanType())
def has_any(items):
    return any(items)
Good: