Rule S014
.distinct() or .dropDuplicates() called before .groupBy() — redundant shuffle
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Calling .distinct() or .dropDuplicates() immediately before .groupBy() triggers two expensive shuffle operations where one would suffice. groupBy already collapses rows that share the same grouping keys during aggregation, so the preceding deduplication produces no correctness benefit and adds:
- An extra full-dataset shuffle
- Increased memory pressure on executors
- Longer job runtime
Best practices
- Remove .distinct() or .dropDuplicates() when it is immediately followed by .groupBy()
Rule of thumb: Never pay for two shuffles when one will do — groupBy subsumes the deduplication that .distinct() or .dropDuplicates() was trying to achieve.
Example
Bad:
df.distinct().groupBy("country").agg(count("*"))
df.dropDuplicates(["country"]).groupBy("country").agg(count("*"))
Good:
df.groupBy("country").agg(count("*"))