# Rule U003: Avoid using UDFs
## Severity
🔴 HIGH — Major performance impact.
## PySpark version
Compatible with PySpark 1.3 and later.
## Information
In general, user-defined functions (UDFs) can:
- Bypass Spark’s Catalyst optimizer
- Increase execution time due to row-by-row serialization between the JVM and the Python worker
- Make code harder to maintain and debug
## Best practices
- Always check for built-in DataFrame functions before creating a UDF
- Use `pyspark.sql.functions` for transformations
- Reserve UDFs for complex logic that built-in functions cannot express
**Rule of thumb:** minimize or eliminate UDFs to maintain performance and scalability.
## Example
**Bad:**
**Good:**