Rule ARR006

Avoid size(collect_list(...).over(w)) — use count(...).over(w) instead

Severity

🟢 LOW — Minor impact — avoidable overhead at scale.

PySpark version

Compatible with PySpark 1.6 and later.

Information

size(collect_list(col).over(w)) counts the number of rows in a window by first materialising every value in that window into an in-memory array, then measuring its length:

  1. collect_list(col).over(w) collects all values for each window frame into an array
  2. size(...) counts the elements of that array

count(col).over(w) computes the same per-window row count in a single pass directly inside the window aggregation, without allocating the intermediate array. It uses less memory and executes faster, especially on large windows or high-cardinality partitions.

Best practices

Replace size(collect_list(col).over(w)) with count(col).over(w). Both expressions skip null values of col (collect_list drops nulls when building the array, and count(col) counts only non-null values), so the replacement produces identical results.

Example

Bad:

from pyspark.sql.functions import col, collect_list, size
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("running_count", size(collect_list(col("event")).over(w)))

Good:

from pyspark.sql.functions import col, count
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
df.withColumn("running_count", count(col("event")).over(w))