Rule ARR004
Avoid size(collect_set(...)) inside .agg() — use count_distinct() instead
Severity
🟢 LOW — Minor impact — avoidable overhead at scale.
PySpark version
Compatible with PySpark 3.2 and later.
Information
size(collect_set(col)) is a two-step anti-pattern when used as an aggregation:
1. collect_set(col) performs a full shuffle and deduplication, materialising all distinct values into an in-memory array.
2. size(...) then counts the length of that array.
count_distinct(col) performs the exact same semantic operation — counting distinct values — in a single, optimised aggregation pass without materialising the intermediate array. It is cheaper in memory, faster in execution, and communicates intent more clearly.
Best practices
Replace size(collect_set(col)) with count_distinct(col) inside any .agg() call.
Example
Bad:
df.agg(size(collect_set(col("product"))).alias("distinct_products"))
df.agg(
    size(collect_set("user_id")).alias("unique_users"),
    size(collect_set("country")).alias("unique_countries"),
)
Good:
df.agg(count_distinct(col("product")).alias("distinct_products"))
df.agg(
    count_distinct("user_id").alias("unique_users"),
    count_distinct("country").alias("unique_countries"),
)