Rule S004

Too many .distinct() operations

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Excessive use of .distinct() in PySpark can be very costly because each .distinct() triggers a full shuffle of the data across the cluster. This can lead to:

  • Significant performance degradation
  • High network I/O and memory usage
  • Longer job execution times

Best practices

  • Minimize .distinct() calls by combining them with other transformations when possible
  • Consider aggregations (.groupBy()) to achieve uniqueness more efficiently

Rule of thumb: Use .distinct() sparingly and only when necessary to reduce shuffle overhead.

Example

Bad:

df1 = df.distinct()
df2 = df.select("a").distinct()
df3 = df.select("b").distinct()
df4 = df.select("c").distinct()
df5 = df.select("d").distinct()
df6 = df.select("e").distinct()  # exceeds threshold

Good:

# Consolidate deduplication into a single pass: one shuffle instead of six
df_deduped = df.dropDuplicates(["a", "b", "c", "d", "e"])