Rule U003

Avoid using UDFs

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Python UDFs run outside the JVM, so in general they can:

  • Bypass Spark’s Catalyst optimizer, which treats them as opaque black boxes
  • Increase execution time due to row-at-a-time serialization between the JVM and Python workers
  • Make code harder to maintain and debug

Best practices

  • Always check for built-in DataFrame functions before creating a UDF
  • Use pyspark.sql.functions for transformations
  • Reserve UDFs only for complex logic not achievable with built-in functions

Rule of thumb: Minimize or eliminate UDFs to maintain performance and scalability.

Example

Bad:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def score(x):
    return x * 1.5

df = df.withColumn("score", score(col("value")))

Good:

from pyspark.sql.functions import col
df = df.withColumn("score", col("value") * 1.5)