Rule ARR005
Avoid size(collect_list(...)) inside .agg() — use count() instead
Severity
🟢 LOW — Minor impact — avoidable overhead at scale.
PySpark version
Compatible with PySpark 1.6 and later.
Information
size(collect_list(col)) counts non-null values the expensive way: it first collects every value (duplicates included) into an in-memory array, then measures the array's length. Because the full array must be assembled, every value is shuffled and buffered in executor memory just to produce a single number:
- collect_list(col) gathers every value into an in-memory array
- size(...) counts the elements
count(col) produces the same result, counting non-null values, in a single optimised aggregation pass that never materialises the intermediate array. It is cheaper in both memory and shuffle terms, and it states the intent directly.
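The memory difference can be sketched outside Spark with plain Python (an analogy for the two aggregation strategies, not Spark's actual execution): one approach materialises a full list and measures its length, the other counts in a single streaming pass.

```python
values = ["a", "b", "b", None, "c"]

# size(collect_list(...)) analogue: materialise every non-null value
# into a list, then measure its length.
collected = [v for v in values if v is not None]
array_count = len(collected)

# count(...) analogue: one streaming pass, no intermediate list.
stream_count = sum(1 for v in values if v is not None)

print(array_count, stream_count)  # both count the 4 non-null values
```

Both approaches agree on the answer; only the second avoids holding every value in memory at once, which is exactly why count(col) scales better.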
Best practices
Replace size(collect_list(col)) with count(col) inside any .agg() call.
Example
Bad:
df.agg(size(collect_list(col("event"))).alias("total_events"))
df.agg(
    size(collect_list("order_id")).alias("order_count"),
)
Good:
df.agg(count(col("event")).alias("total_events"))
df.agg(
    count("order_id").alias("order_count"),
)