Rule U003

Avoid using UDFs

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Python UDFs run outside the JVM, so in general they can:

  • Bypass Spark’s Catalyst optimizer, which treats them as opaque black boxes
  • Increase execution time due to row-at-a-time serialization between the JVM and Python workers
  • Make code harder to maintain and debug

Best practices

  • Always check for built-in DataFrame functions before creating a UDF
  • Use pyspark.sql.functions for transformations
  • Reserve UDFs only for complex logic not achievable with built-in functions

Rule of thumb: Minimize or eliminate UDFs to maintain performance and scalability.

Example

Bad:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def score(x):
    return x * 1.5

df = df.withColumn("score", score(col("value")))

Good:

from pyspark.sql.functions import col
df = df.withColumn("score", col("value") * 1.5)