Rule S014
.distinct() or .dropDuplicates() called before .groupBy() — redundant shuffle
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Calling .distinct() or .dropDuplicates() immediately before .groupBy() triggers two expensive shuffle operations where one would suffice. groupBy already collapses rows that share the same grouping keys during aggregation, so the preceding deduplication produces no correctness benefit and adds:
- An extra full-dataset shuffle
- Increased memory pressure on executors
- Longer job runtime
Best practices
- Remove .distinct() or .dropDuplicates() when it is immediately followed by .groupBy()
Rule of thumb: Never pay for two shuffles when one will do — groupBy subsumes the deduplication that .distinct() or .dropDuplicates() was trying to achieve.
Example
Bad:
df.distinct().groupBy("country").agg(count("*"))
df.dropDuplicates(["country"]).groupBy("country").agg(count("*"))
Good:
df.groupBy("country").agg(count("*"))