Rule U001

Avoid using UDFs on string columns

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Using Python UDFs on string columns forces each row through a Python worker process, bypassing Spark’s built-in optimizations. This can lead to:

  • Slower execution, since rows are processed in Python instead of the JVM
  • Serialization/deserialization overhead when moving data between the JVM and Python workers
  • Loss of Catalyst optimizations such as expression codegen and predicate pushdown

Best practices

  • Use Spark built-in string functions instead: PySpark String Functions
  • Only use UDFs when no built-in function can achieve the required transformation

Rule of thumb: Prefer built-in functions over UDFs for string processing.

Example

Bad:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Every row is serialized to a Python worker, processed, and sent back
@udf(StringType())
def clean(s):
    return s.strip().lower() if s is not None else None

df.withColumn("clean", clean(col("name")))

Good:

from pyspark.sql.functions import col, lower, trim

# Built-in functions run in the JVM and stay visible to Catalyst
df.withColumn("clean", trim(lower(col("name"))))