Rule S004

Too many .distinct() operations

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Excessive use of .distinct() in PySpark can be very costly because each .distinct() triggers a full shuffle of the data across the cluster. This can lead to:

  • Significant performance degradation
  • High network I/O and memory usage
  • Longer job execution times

Best practices

  • Minimize .distinct() calls by combining them with other transformations when possible
  • Consider aggregations (.groupBy()) to achieve uniqueness more efficiently

Rule of thumb: Use .distinct() sparingly and only when necessary to reduce shuffle overhead.

Example

Bad:

df1 = df.distinct()
df2 = df.select("a").distinct()
df3 = df.select("b").distinct()
df4 = df.select("c").distinct()
df5 = df.select("d").distinct()
df6 = df.select("e").distinct()  # exceeds threshold

Good:

# Consolidate deduplication into a single pass: one shuffle instead of six
df_deduped = df.dropDuplicates(["a", "b", "c", "d", "e"])