
Rule U007

Avoid any() inside a UDF body — use pyspark.sql.functions.exists instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 3.1 and later.

Information

Python's built-in any() inside a UDF runs the whole check in the Python interpreter, once per row, with no Spark optimisation:

  • The array column must be deserialised from the JVM to Python
  • any() runs in the Python interpreter with no vectorisation
  • The boolean result is re-serialised back to the JVM
  • Catalyst cannot see into the predicate or push it down

pyspark.sql.functions.exists(col, predicate) evaluates the predicate over every array element using Spark's native execution engine — no UDF boundary, no serialisation round-trip, and the predicate is visible to the optimiser.

Reference: pyspark.sql.functions.exists

Best practices

Replace any(...) in a UDF with exists(col, lambda x: ...) applied directly on the DataFrame column.

Rule of thumb: If your UDF body's purpose is to check whether any element of an array satisfies a condition, exists does the same thing without leaving the JVM.

Example

Bad:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def has_negative(items):
    return any(x < 0 for x in items)

@udf(returnType=BooleanType())
def has_any(items):
    return any(items)

Good:

from pyspark.sql.functions import exists, col

df.withColumn("has_negative", exists(col("items"), lambda x: x < 0))

# Equivalent of any(items) for numeric elements: test truthiness explicitly
df.withColumn("has_any", exists(col("items"), lambda x: x != 0))