
Rule U004

Avoid nested UDF calls

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Calling one UDF from inside another UDF's body compounds the performance penalty:

  • Each UDF boundary is opaque to Spark's Catalyst optimizer, which cannot optimise across it
  • Every UDF call incurs Python serialisation/deserialisation overhead for each row
  • Nesting UDF calls means the data crosses the JVM↔Python boundary twice per row, roughly doubling the serialisation cost
  • Neither UDF can be fused, pipelined, or code-generated by Catalyst

Best practices

  • Merge the logic of both UDFs into a single UDF to cross the serialisation boundary only once
  • Prefer extracting shared helper logic into plain Python functions (not UDFs) and calling those helpers inside one UDF
  • Consider whether both transformations can be replaced by built-in Spark functions entirely

Rule of thumb: One serialisation boundary is unavoidable when UDFs are necessary; don't add a second one by calling UDFs from within UDFs.

Example

Bad:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def normalize(x):
    return x.strip().lower()

@udf(returnType=StringType())
def process(x):
    # nested UDF call: after decoration, normalize is no longer a plain
    # Python function, so this adds a second UDF boundary
    return normalize(x) + "_processed"

Good:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def _normalize(x):          # plain Python helper, no UDF decorator
    return x.strip().lower()

@udf(returnType=StringType())
def process(x):
    return _normalize(x) + "_processed"  # single UDF boundary
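A useful side-effect of the helper approach: the combined logic is plain Python and can be unit-tested without a Spark session (`process_value` is an illustrative name, standing in for the function the single UDF wraps):

```python
def process_value(x):
    # same logic the single UDF wraps: normalise, then append the tag
    return x.strip().lower() + "_processed"

# plain-Python unit test, no SparkSession required
assert process_value("  Hello ") == "hello_processed"
```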