Rule U002

Avoid using UDFs on array columns

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Applying UDFs to array columns can:

  • Degrade performance, because every row must be serialized between the JVM and a Python worker
  • Prevent Spark's Catalyst optimizer from optimizing the query plan, since a UDF is opaque to it
  • Increase memory and CPU usage on executors

Best practices

  • Prefer Spark's built-in array functions (see the PySpark Array Functions reference)
  • Fall back to a UDF only when the transformation cannot be expressed with built-in functions

Rule of thumb: Leverage built-in array functions instead of UDFs whenever possible.

Example

Bad:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Every row is pickled to a Python worker and the result serialized back
@udf(ArrayType(StringType()))
def split_words(s):
    return s.split(" ")

Good:

from pyspark.sql.functions import col, split

# split is evaluated inside the JVM, so Catalyst can optimize around it
df.withColumn("words", split(col("text"), " "))