Rule ARR003
Avoid array_distinct(collect_set(...)) — collect_set already returns distinct values
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 2.4 and later.
Information
collect_set() deduplicates values during the aggregation shuffle — the resulting
array is guaranteed to contain no duplicates. Wrapping it in array_distinct() therefore:
- Runs a second deduplication pass over data that is already unique
- Wastes CPU and memory on a comparison pass that removes nothing
- Makes the code misleading — it implies the input could have duplicates when it cannot
This applies to both the plain aggregate form and the window aggregate form:
array_distinct(collect_set(col("x"))) # aggregate
array_distinct(collect_set(col("x")).over(w)) # window
Best practices
Remove the outer array_distinct() — collect_set() is sufficient on its own.
Rule of thumb: collect_set = distinct by definition. array_distinct(collect_set(...)) is always a no-op.
Example
Bad:
df.groupBy("id").agg(array_distinct(collect_set(col("tag"))).alias("tags"))   # aggregate
df.withColumn("tags", array_distinct(collect_set(col("tag")).over(w)))        # window
Good:
df.groupBy("id").agg(collect_set(col("tag")).alias("tags"))   # aggregate
df.withColumn("tags", collect_set(col("tag")).over(w))        # window