Rule S003
.groupBy() followed by .distinct() or .dropDuplicates()
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Using .groupBy() followed by .distinct() or .dropDuplicates() is redundant because aggregation already produces one row per group key. The extra deduplication step triggers a second unnecessary shuffle, leading to:
- Unnecessary extra computation
- Increased shuffle overhead
- Slower performance
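Why the extra pass is a no-op can be seen with a plain-Python analogy (this is an assumed illustration using `collections.Counter`, not the PySpark API): counting rows per key already yields exactly one entry per distinct key, so deduplicating the result afterwards changes nothing.

```python
from collections import Counter

# Plain-Python analogy for df.groupBy("country").agg(count("*")):
# Counter produces exactly one (key, count) pair per distinct key.
rows = ["US", "FR", "US", "DE", "FR"]
counts = list(Counter(rows).items())

# A further "distinct" pass is a no-op: the keys are already unique,
# so every row of the aggregated result is already unique.
deduped = list(dict.fromkeys(counts))
assert deduped == counts
```

In PySpark the situation is the same, except that the pointless deduplication is not free: `.distinct()` compares full rows across partitions, which costs a second shuffle.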
Best practices
- Remove .distinct() or .dropDuplicates() when .groupBy() already ensures uniqueness per group key.
Rule of thumb: Avoid chaining .groupBy() with .distinct() or .dropDuplicates() — the aggregation result is already deduplicated by the group keys.
Example
Bad:
df.groupBy("country").agg(count("*")).distinct()
df.groupBy("country").agg(count("*")).dropDuplicates()
Good:
df.groupBy("country").agg(count("*"))
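This pattern can also be caught mechanically. Below is a minimal, hypothetical detection sketch using Python's standard `ast` module; it is not the actual implementation of this rule, and the helper names (`chain_methods`, `find_s003`) are assumptions for illustration. It flags any call chain where `.distinct()` or `.dropDuplicates()` sits on top of a chain containing `.groupBy()`.

```python
import ast

# Deduplication calls that are redundant after a groupBy aggregation.
DEDUP_CALLS = {"distinct", "dropDuplicates"}

def chain_methods(node):
    """Collect method names in a call chain, outermost call first."""
    names = []
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        names.append(node.func.attr)
        node = node.func.value
    return names

def find_s003(source):
    """Return the method chains in `source` that violate Rule S003."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        methods = chain_methods(node)
        # Outermost call is a dedup, and groupBy appears earlier in the chain.
        if methods and methods[0] in DEDUP_CALLS and "groupBy" in methods[1:]:
            violations.append(".".join(reversed(methods)))
    return violations

# The bad example from above is flagged; the good one is not.
assert find_s003('df.groupBy("country").agg(count("*")).distinct()')
assert not find_s003('df.groupBy("country").agg(count("*"))')
```

A real linter would also need to track DataFrame variables across statements; this sketch only catches the single-expression chain shown in the examples.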