Rule U002
Avoid using UDFs on array columns
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Applying UDFs to array columns can:
- Reduce performance, because each array is serialized between the JVM and a Python worker
- Prevent Spark from optimizing the query plan, because a UDF is opaque to the optimizer
- Increase memory and CPU usage
Best practices
- Use Spark's built-in array functions, such as array_contains, size, transform, and filter (see PySpark Array Functions)
- Only use UDFs when a transformation cannot be done with built-in functions
Rule of thumb: Leverage built-in array functions instead of UDFs whenever possible.
Example
Bad:
Good: