Rule ARR001
Avoid array_distinct(collect_list()) — use collect_set() instead
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 2.4 and later.
Information
Using array_distinct(collect_list(col)) to deduplicate collected elements is redundant. collect_set() does the same thing in a single aggregation step without materializing the duplicates first.
- `collect_list()` gathers all values including duplicates, then `array_distinct()` removes them — two operations where one suffices
- The split form (`withColumn("a", collect_list(...))` then `withColumn("a", array_distinct(col("a")))`) is even worse: it forces a full column rewrite to undo the duplicates that were just collected
- `collect_set()` deduplicates during aggregation, consuming less memory and producing a smaller shuffle
Best practices
- Replace `array_distinct(collect_list(x))` with `collect_set(x)` directly
- The result is an unordered set — if ordering matters, sort after collecting
Example
Bad:
from pyspark.sql.functions import array_distinct, col, collect_list

# inline form
df.agg(array_distinct(collect_list(col("item"))).alias("items"))

# split form
df.withColumn("items", collect_list(col("item"))) \
  .withColumn("items", array_distinct(col("items")))
Good: