Rule U001
Avoid using UDFs on string columns
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using UDFs on string columns bypasses Spark’s built-in optimizations and can lead to:
- Slower execution, since every row is shipped to a Python worker process
- Increased serialization/deserialization overhead between the JVM and Python
- Loss of Catalyst optimizations such as expression codegen and predicate pushdown
Best practices
- Use Spark built-in string functions from pyspark.sql.functions instead (e.g. lower, trim, regexp_replace); see the PySpark String Functions documentation
- Only use UDFs when no built-in function can achieve the required transformation
Rule of thumb: Prefer built-in functions over UDFs for string processing.
Example
Bad:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(StringType())
def clean(s):
    return s.strip().lower() if s is not None else None

df = df.withColumn("clean", clean(col("name")))
Good:
from pyspark.sql.functions import col, lower, trim

df = df.withColumn("clean", lower(trim(col("name"))))