Rule ARR006
Avoid size(collect_list(...).over(w)) — use count(...).over(w) instead
Severity
🟢 LOW — Minor impact: avoidable overhead at scale.
PySpark version
Compatible with PySpark 1.6 and later.
Information
size(collect_list(col).over(w)) counts the number of rows in a window by first
materialising every value in that window into an in-memory array, then measuring its length:
1. collect_list(col).over(w) collects all values for each window frame into an array.
2. size(...) counts the elements of that array.
count(col).over(w) computes the same per-window row count in a single pass directly
inside the window aggregation, without allocating the intermediate array. It uses less
memory and executes faster, especially on large windows or high-cardinality partitions.
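The difference can be illustrated with a plain-Python sketch of one ordered partition. The helper names below are illustrative, not Spark APIs: the collect_list route keeps the entire frame alive in memory just to take its length, while the count route carries a single integer per frame.

```python
def running_count_via_list(values):
    """Mimics size(collect_list(...)): materialise the frame, then measure it."""
    out = []
    frame = []  # grows to the size of the partition
    for v in values:
        frame.append(v)          # O(n) extra memory over the partition
        out.append(len(frame))   # length of the materialised array
    return out

def running_count_via_counter(values):
    """Mimics count(...): one integer of state, one pass, no array."""
    out = []
    n = 0
    for _ in values:
        n += 1                   # O(1) state per frame
        out.append(n)
    return out

events = ["login", "click", "click", "logout"]
assert running_count_via_list(events) == running_count_via_counter(events) == [1, 2, 3, 4]
```

Both produce identical running counts; only the intermediate state differs, which is exactly the overhead this rule flags.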
Best practices
Replace size(collect_list(col).over(w)) with count(col).over(w). The rewrite is exact: collect_list skips NULL values of col and count(col) does too, so both expressions return the same per-window count.
Example
Bad:
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, size

w = Window.partitionBy("user_id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("running_count", size(collect_list(col("event")).over(w)))
Good:
from pyspark.sql import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("user_id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("running_count", count(col("event")).over(w))