Rule S004
Too many .distinct() operations
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Excessive use of .distinct() in PySpark can be very costly because each .distinct() triggers a full shuffle of the data across the cluster. This can lead to:
- Significant performance degradation
- High network I/O and memory usage
- Longer job execution times
Best practices
- Minimize .distinct() calls by combining them with other transformations when possible
- Consider aggregations (.groupBy()) to achieve uniqueness more efficiently
Rule of thumb: Use .distinct() sparingly and only when necessary to reduce shuffle overhead.
Example
Bad:
df1 = df.distinct()
df2 = df.select("a").distinct()
df3 = df.select("b").distinct()
df4 = df.select("c").distinct()
df5 = df.select("d").distinct()
df6 = df.select("e").distinct() # exceeds threshold
Good: