Rule S013

Avoid reduceByKey() — use the DataFrame API instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.6 and later.

Information

reduceByKey() is an RDD operation. Falling back to RDD operations loses all the optimizations that the DataFrame/Dataset API provides.

  • The Catalyst optimizer cannot inspect or optimize RDD transformations
  • Tungsten's memory management and code generation do not apply to RDD operations
  • In PySpark, reduceByKey() forces every record to be serialized between the JVM and the Python worker (pickled and unpickled), which is significantly slower than Tungsten's in-JVM columnar processing
  • The code becomes harder to read and maintain compared to a groupBy().agg() equivalent

Best practices

  • Replace rdd.reduceByKey(lambda a, b: a + b) with df.groupBy("key").agg(sum("value"))
  • Use groupBy().agg() with the appropriate built-in aggregate function (sum, count, max, min, collect_list, …)
  • Only drop down to RDD if there is no DataFrame equivalent — and document why

Example

Bad:

# df.rdd yields Row objects, so the records must first be mapped to
# (key, value) pairs before reduceByKey() will even run
result = df.rdd.map(lambda row: (row["key"], row["value"])).reduceByKey(lambda a, b: a + b)

Good:

# alias avoids shadowing Python's built-in sum
from pyspark.sql.functions import sum as spark_sum

result = df.groupBy("key").agg(spark_sum("value"))