Rule ARR001
Avoid array_distinct(collect_list()) — use collect_set() instead
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 2.4 and later.
Information
Using array_distinct(collect_list(col)) to deduplicate collected elements is redundant. collect_set() does the same thing in a single aggregation step without materializing the duplicates first.
- `collect_list()` gathers all values including duplicates, then `array_distinct()` removes them — two operations where one suffices
- The split form (`withColumn("a", collect_list(...))` then `withColumn("a", array_distinct(col("a")))`) is even worse: it forces a full column rewrite to undo the duplicates that were just collected
- `collect_set()` deduplicates during aggregation, consuming less memory and producing a smaller shuffle
Best practices
- Replace `array_distinct(collect_list(x))` with `collect_set(x)` directly
- The result is an unordered set — if ordering matters, sort after collecting
Example
Bad:
from pyspark.sql.functions import array_distinct, col, collect_list

# inline form
df.agg(array_distinct(collect_list(col("item"))).alias("items"))

# split form
df.withColumn("items", collect_list(col("item"))) \
  .withColumn("items", array_distinct(col("items")))
Good: