Rule U001
Avoid using UDFs on string columns
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using UDFs on string columns bypasses Spark’s built-in optimizations and can lead to:
- Slower execution, since every row is shipped to a Python worker process
- Increased serialization/deserialization overhead between the JVM and Python
- Loss of Catalyst optimizations such as expression codegen and predicate pushdown
Best practices
- Use Spark built-in string functions from pyspark.sql.functions instead (e.g. lower, trim, regexp_replace); see the PySpark String Functions documentation
- Only use UDFs when no built-in function can achieve the required transformation
Rule of thumb: Prefer built-in functions over UDFs for string processing.
Example
Bad:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(StringType())
def clean(s):
    return s.strip().lower() if s is not None else None

df = df.withColumn("clean", clean(col("name")))
Good:
from pyspark.sql.functions import col, lower, trim

df = df.withColumn("clean", lower(trim(col("name"))))