Rule ARR003
Avoid array_distinct(collect_set(...)) — collect_set already returns distinct values
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 2.4 and later.
Information
collect_set() deduplicates values during the aggregation shuffle — the resulting
array is guaranteed to contain no duplicates. Wrapping it in array_distinct() therefore:
- Runs a second deduplication pass over data that is already unique
- Wastes CPU and memory on a comparison pass that removes nothing
- Makes the code misleading — it implies the input could have duplicates when it cannot
This applies to both the plain aggregate form and the window aggregate form:
array_distinct(collect_set(col("x"))) # aggregate
array_distinct(collect_set(col("x")).over(w)) # window
Best practices
Remove the outer array_distinct() — collect_set() is sufficient on its own.
Rule of thumb: collect_set = distinct by definition. array_distinct(collect_set(...)) is always a no-op.
Example
Bad:
df.groupBy("id").agg(array_distinct(collect_set(col("tag"))).alias("tags"))   # aggregate
df.withColumn("tags", array_distinct(collect_set(col("tag")).over(w)))        # window
Good:
df.groupBy("id").agg(collect_set(col("tag")).alias("tags"))   # aggregate
df.withColumn("tags", collect_set(col("tag")).over(w))        # window