Rule U002

Avoid using UDFs on array columns

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Applying UDFs to array columns can:

  • Degrade performance, because every row must be serialized between the JVM and a Python worker
  • Prevent Spark's Catalyst optimizer from optimizing the query plan, since a UDF is opaque to it
  • Increase memory and CPU usage on executors

Best practices

  • Prefer Spark's built-in array functions (see the PySpark Array Functions reference)
  • Fall back to a UDF only when the transformation cannot be expressed with built-in functions

Rule of thumb: Leverage built-in array functions instead of UDFs whenever possible.

Example

Bad:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Every row is pickled to a Python worker and the result serialized back
@udf(ArrayType(StringType()))
def split_words(s):
    return s.split(" ")

Good:

from pyspark.sql.functions import col, split

# split is evaluated inside the JVM, so Catalyst can optimize around it
df.withColumn("words", split(col("text"), " "))