Rule ARR004

Avoid size(collect_set(...)) inside .agg() — use count_distinct() instead

Severity

🟢 LOW — Minor impact; avoidable overhead at scale.

PySpark version

Compatible with PySpark 3.2 and later.

Information

size(collect_set(col)) is a two-step anti-pattern when used as an aggregation:

  1. collect_set(col) performs a full shuffle and deduplication, materialising all distinct values into an in-memory array
  2. size(...) then counts the length of that array

count_distinct(col) computes the same result — the number of distinct values — in a single, optimised aggregation pass without materialising the intermediate array. Both functions ignore NULLs, so the substitution does not change the result. count_distinct is cheaper in memory, faster in execution, and communicates intent more clearly.

Best practices

Replace size(collect_set(col)) with count_distinct(col) inside any .agg() call.

Example

Bad:

from pyspark.sql.functions import col, collect_set, size

df.agg(size(collect_set(col("product"))).alias("distinct_products"))
df.agg(
    size(collect_set("user_id")).alias("unique_users"),
    size(collect_set("country")).alias("unique_countries"),
)

Good:

from pyspark.sql.functions import col, count_distinct

df.agg(count_distinct(col("product")).alias("distinct_products"))
df.agg(
    count_distinct("user_id").alias("unique_users"),
    count_distinct("country").alias("unique_countries"),
)