Rule ARR005
Avoid size(collect_list(...)) inside .agg() — use count() instead
Severity
🟢 LOW — Minor impact — avoidable overhead at scale.
PySpark version
Compatible with PySpark 1.6 and later.
Information
size(collect_list(col)) counts non-null values the expensive way: it first collects every value (duplicates included) into an in-memory array, then measures the array's length. Because the full array must be assembled, every value is shuffled and buffered in executor memory just to produce a single number:
- collect_list(col) gathers every value into an in-memory array
- size(...) counts the elements
count(col) produces the same result, counting non-null values, in a single optimised aggregation pass that never materialises the intermediate array. It is cheaper in both memory and shuffle terms, and it states the intent directly.
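The memory difference can be sketched outside Spark with plain Python (an analogy for the two aggregation strategies, not Spark's actual execution): one approach materialises a full list and measures its length, the other counts in a single streaming pass.

```python
values = ["a", "b", "b", None, "c"]

# size(collect_list(...)) analogue: materialise every non-null value
# into a list, then measure its length.
collected = [v for v in values if v is not None]
array_count = len(collected)

# count(...) analogue: one streaming pass, no intermediate list.
stream_count = sum(1 for v in values if v is not None)

print(array_count, stream_count)  # both count the 4 non-null values
```

Both approaches agree on the answer; only the second avoids holding every value in memory at once, which is exactly why count(col) scales better.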
Best practices
Replace size(collect_list(col)) with count(col) inside any .agg() call.
Example
Bad:
df.agg(size(collect_list(col("event"))).alias("total_events"))
df.agg(
    size(collect_list("order_id")).alias("order_count"),
)
Good:
df.agg(count(col("event")).alias("total_events"))
df.agg(
    count("order_id").alias("order_count"),
)