Rule S013

Avoid reduceByKey() — use the DataFrame API instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.6 and later.

Information

reduceByKey() is an RDD operation. Falling back to RDD operations loses all the optimizations that the DataFrame/Dataset API provides.

  • The Catalyst optimizer cannot inspect or optimize RDD transformations
  • Tungsten's memory management and code generation do not apply to RDD operations
  • In PySpark, reduceByKey() forces every record to be serialized between the JVM and the Python worker (pickled and unpickled), which is significantly slower than Tungsten's in-JVM columnar processing
  • The code becomes harder to read and maintain compared to a groupBy().agg() equivalent

Best practices

  • Replace rdd.reduceByKey(lambda a, b: a + b) with df.groupBy("key").agg(sum("value"))
  • Use groupBy().agg() with the appropriate built-in aggregate function (sum, count, max, min, collect_list, …)
  • Only drop down to RDD if there is no DataFrame equivalent — and document why

Example

Bad:

# df.rdd yields Row objects, so the records must first be mapped to
# (key, value) pairs before reduceByKey() will even run
result = df.rdd.map(lambda row: (row["key"], row["value"])).reduceByKey(lambda a, b: a + b)

Good:

# alias avoids shadowing Python's built-in sum
from pyspark.sql.functions import sum as spark_sum

result = df.groupBy("key").agg(spark_sum("value"))