Rule S003

.groupBy() followed by .distinct() or .dropDuplicates()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.4 and later.

Information

Using .groupBy() followed by .distinct() or .dropDuplicates() is redundant: aggregation already produces exactly one row per group key, so there is nothing left to deduplicate. The extra step forces Spark to run a second deduplication pass (itself an aggregation, typically with another shuffle), leading to:

  • Unnecessary extra computation
  • Increased shuffle overhead
  • Slower performance

Best practices

  • Remove .distinct() or .dropDuplicates() when .groupBy() already ensures uniqueness per group key

Rule of thumb: Avoid chaining .groupBy() with .distinct() or .dropDuplicates() — the aggregation result is already deduplicated by the group keys.

Example

Bad:

from pyspark.sql.functions import count

df.groupBy("country").agg(count("*")).distinct()        # redundant: one row per country already
df.groupBy("country").agg(count("*")).dropDuplicates()  # same problem

Good:

df.groupBy("country").agg(count("*"))