Rule S016
first() or last() with .over(Window.partitionBy(...)) without orderBy() — non-deterministic result
Severity
🟡 MEDIUM — Non-deterministic results across runs.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Using first() or last() as window functions with a Window.partitionBy(...) that has no orderBy() is non-deterministic. Without an explicit ordering inside the window specification, the "first" or "last" row within each partition is undefined and depends on how Spark distributes data across executors.
As with first() and last() inside groupBy().agg(), the order in which rows arrive after a shuffle is not guaranteed. Without ordering in the window spec, first() and last() return arbitrary values that can change between runs, even on the same cluster with the same data.
Run 1: window partition "user_1"    Run 2: window partition "user_1"
┌─────────────┐                     ┌─────────────┐
│ row A       │ <── first()         │ row C       │ <── first()
│ row B       │                     │ row A       │
│ row C       │                     │ row B       │
└─────────────┘                     └─────────────┘
This leads to:
- Silent correctness bugs: results differ from run to run without any error
- Unreproducible analytics: dashboards and reports that shift between runs
- Hard-to-debug issues: the non-determinism often shows up only at scale, once a partition's rows arrive from several tasks
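The effect described above can be reproduced without a cluster. The following is a plain-Python simulation, not Spark itself: it treats every permutation of a partition's rows as a possible arrival order after a shuffle. The sample rows and the helper name `first_values` are made up for illustration.

```python
from itertools import permutations

# Illustrative rows for one window partition (made-up data).
ROWS = [
    {"user_id": "user_1", "email": "a@example.com", "created_at": 1},
    {"user_id": "user_1", "email": "b@example.com", "created_at": 2},
    {"user_id": "user_1", "email": "c@example.com", "created_at": 3},
]


def first_values(rows):
    """Return the sets of 'first' emails seen across every possible arrival order."""
    unordered, ordered = set(), set()
    for arrival in permutations(rows):
        # No ordering: "first" is whichever row happened to arrive first.
        unordered.add(arrival[0]["email"])
        # Explicit ordering: pick the row with the smallest created_at.
        ordered.add(min(arrival, key=lambda r: r["created_at"])["email"])
    return unordered, ordered


if __name__ == "__main__":
    unordered, ordered = first_values(ROWS)
    print(sorted(unordered))  # every email can end up as "first"
    print(sorted(ordered))    # only the earliest row qualifies
```

Without an ordering, any of the three emails can be reported as the first one; with an ordering on `created_at`, the answer is the same for every arrival order.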
Best practices
- Always include orderBy() in the Window specification when using first() or last()
- Alternatively, replace first()/last() with a deterministic function such as min() or max()
Rule of thumb: Never use first() or last() with a window that only has partitionBy() — always add orderBy() to the window spec.
Example
Bad:
from pyspark.sql import functions as F
from pyspark.sql.functions import first
from pyspark.sql.window import Window
# Non-deterministic: no ordering in the window spec
w = Window.partitionBy("user_id")
df.withColumn("first_email", first("email").over(w))
# Inline window spec without orderBy
df.select(first("email").over(Window.partitionBy("user_id")))
# F-qualified calls are also flagged
df.withColumn("last_login", F.last("login_date").over(Window.partitionBy("user_id")))
Good:
from pyspark.sql import functions as F
from pyspark.sql.functions import first
from pyspark.sql.window import Window
# Deterministic: orderBy in the window spec
w = Window.partitionBy("user_id").orderBy("created_at")
df.withColumn("first_email", first("email").over(w))
# Inline with orderBy
df.select(first("email").over(Window.partitionBy("user_id").orderBy("created_at")))
# Use a deterministic aggregate instead (F.min, so Python's built-in min is not called by mistake)
df.withColumn("min_email", F.min("email").over(Window.partitionBy("user_id")))