Rule S013
Avoid reduceByKey() — use the DataFrame API instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.6 and later.
Information
`reduceByKey()` is an RDD operation. Falling back to the RDD API loses the optimizations that the DataFrame/Dataset API provides:
- The Catalyst optimizer cannot inspect or optimize RDD transformations
- Tungsten's memory management and code generation do not apply to RDD operations
- `reduceByKey()` requires explicit serialization and deserialization of every record, which is significantly slower than columnar processing
- The code becomes harder to read and maintain than a `groupBy().agg()` equivalent
Best practices
- Replace `rdd.reduceByKey(lambda a, b: a + b)` with `df.groupBy("key").agg(sum("value"))`
- Use `groupBy().agg()` with the appropriate built-in aggregate function (`sum`, `count`, `max`, `min`, `collect_list`, …)
- Only drop down to the RDD API when there is no DataFrame equivalent, and document why
Example
Bad:
Good: