Rule U004
Avoid nested UDF calls
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Calling one UDF from inside another UDF body compounds the performance penalty:
- Each UDF boundary is opaque to Spark's Catalyst optimizer — it cannot optimise across it
- Every UDF incurs Python serialisation/deserialisation overhead for each row
- Chaining or nesting UDFs means each row crosses the JVM↔Python boundary twice, doubling the serialisation and deserialisation cost
- Neither UDF can be fused, pipelined, or code-generated by Catalyst
Best practices
- Merge the logic of both UDFs into a single UDF to cross the serialisation boundary only once
- Prefer extracting shared helper logic into plain Python functions (not UDFs) and calling those helpers inside one UDF
- Consider whether both transformations can be replaced by built-in Spark functions entirely
Rule of thumb: One serialisation boundary is unavoidable when UDFs are necessary — don't add a second one by calling UDFs from within UDFs.
Example
Bad:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def normalize(x):
    return x.strip().lower()

@udf(returnType=StringType())
def process(x):
    return normalize(x) + "_processed"  # nested UDF call: a second serialisation boundary
Good: