Rule S003
.groupBy() followed by .distinct() or .dropDuplicates()
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Using .groupBy() followed by .distinct() or .dropDuplicates() is redundant because aggregation already produces one row per group key. The extra deduplication step triggers a second unnecessary shuffle, leading to:
- Unnecessary extra computation
- Increased shuffle overhead
- Slower performance
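Why the extra pass is a no-op can be seen with a plain-Python analogy (this is an assumed illustration using `collections.Counter`, not the PySpark API): counting rows per key already yields exactly one entry per distinct key, so deduplicating the result afterwards changes nothing.

```python
from collections import Counter

# Plain-Python analogy for df.groupBy("country").agg(count("*")):
# Counter produces exactly one (key, count) pair per distinct key.
rows = ["US", "FR", "US", "DE", "FR"]
counts = list(Counter(rows).items())

# A further "distinct" pass is a no-op: the keys are already unique,
# so every row of the aggregated result is already unique.
deduped = list(dict.fromkeys(counts))
assert deduped == counts
```

In PySpark the situation is the same, except that the pointless deduplication is not free: `.distinct()` compares full rows across partitions, which costs a second shuffle.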
Best practices
- Remove .distinct() or .dropDuplicates() when .groupBy() already ensures uniqueness per group key.
Rule of thumb: Avoid chaining .groupBy() with .distinct() or .dropDuplicates() — the aggregation result is already deduplicated by the group keys.
Example
Bad:
df.groupBy("country").agg(count("*")).distinct()
df.groupBy("country").agg(count("*")).dropDuplicates()
Good:
df.groupBy("country").agg(count("*"))
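This pattern can also be caught mechanically. Below is a minimal, hypothetical detection sketch using Python's standard `ast` module; it is not the actual implementation of this rule, and the helper names (`chain_methods`, `find_s003`) are assumptions for illustration. It flags any call chain where `.distinct()` or `.dropDuplicates()` sits on top of a chain containing `.groupBy()`.

```python
import ast

# Deduplication calls that are redundant after a groupBy aggregation.
DEDUP_CALLS = {"distinct", "dropDuplicates"}

def chain_methods(node):
    """Collect method names in a call chain, outermost call first."""
    names = []
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        names.append(node.func.attr)
        node = node.func.value
    return names

def find_s003(source):
    """Return the method chains in `source` that violate Rule S003."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        methods = chain_methods(node)
        # Outermost call is a dedup, and groupBy appears earlier in the chain.
        if methods and methods[0] in DEDUP_CALLS and "groupBy" in methods[1:]:
            violations.append(".".join(reversed(methods)))
    return violations

# The bad example from above is flagged; the good one is not.
assert find_s003('df.groupBy("country").agg(count("*")).distinct()')
assert not find_s003('df.groupBy("country").agg(count("*"))')
```

A real linter would also need to track DataFrame variables across statements; this sketch only catches the single-expression chain shown in the examples.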