Rule U004
Avoid nested UDF calls
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Calling one UDF from inside another UDF body compounds the performance penalty:
- Each UDF boundary is opaque to Spark's Catalyst optimizer — it cannot optimise across it
- Every UDF incurs Python serialisation/deserialisation overhead for each row
- Chaining or nesting UDFs means each row crosses the JVM↔Python boundary twice, doubling the serialisation and deserialisation cost
- Neither UDF can be fused, pipelined, or code-generated by Catalyst
Best practices
- Merge the logic of both UDFs into a single UDF to cross the serialisation boundary only once
- Prefer extracting shared helper logic into plain Python functions (not UDFs) and calling those helpers inside one UDF
- Consider whether both transformations can be replaced by built-in Spark functions entirely
Rule of thumb: One serialisation boundary is unavoidable when UDFs are necessary — don't add a second one by calling UDFs from within UDFs.
Example
Bad:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def normalize(x):
    return x.strip().lower()

@udf(returnType=StringType())
def process(x):
    return normalize(x) + "_processed"  # nested UDF call: a second serialisation boundary
Good: