Rule S016

first() or last() with .over(Window.partitionBy(...)) without orderBy() — non-deterministic result

Severity

🟡 MEDIUM — Non-deterministic results across runs.

PySpark version

Compatible with PySpark 1.4 and later.

Information

Using first() or last() as window functions with a Window.partitionBy(...) that has no orderBy() is non-deterministic. Without an explicit ordering inside the window specification, the "first" or "last" row within each partition is undefined and depends on how Spark distributes data across executors.

As with first() and last() inside groupBy().agg(), the order in which rows arrive within a partition after a shuffle is not guaranteed. Without ordering in the window spec, first() and last() return whichever row happens to come first or last, and that can change between runs, even on the same cluster with the same data.

Run 1 — Window partition "user_1"       Run 2 — Window partition "user_1"
┌─────────────┐                         ┌─────────────┐
│  row A  <── first()                   │  row C  <── first()
│  row B                                │  row A
│  row C                                │  row B
└─────────────┘                         └─────────────┘

This leads to:

  • Silent correctness bugs — different results each run without any error
  • Unreproducible analytics — dashboards and reports that shift between runs
  • Hard-to-debug issues — the non-determinism often surfaces only at scale, when data spans many partitions
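The effect can be imitated without a cluster. The following is a plain-Python analogy, not Spark code: a seeded shuffle stands in for the arbitrary order in which rows reach a window partition, and the row data and function names are hypothetical. It only illustrates why picking "the first row" is unstable unless the rows are sorted first.

```python
import random

# Hypothetical rows for one window partition ("user_1"): (created_at, email).
ROWS = [
    (3, "c@example.com"),
    (1, "a@example.com"),
    (2, "b@example.com"),
]


def arrival_order(rows, seed):
    """Stand-in for the shuffle-dependent order in which rows arrive."""
    arrived = list(rows)
    random.Random(seed).shuffle(arrived)
    return arrived


def first_without_order(rows, seed):
    """Mimic first() over a partition with no orderBy()."""
    return arrival_order(rows, seed)[0][1]


def first_with_order(rows, seed):
    """Mimic first() over a partition ordered by created_at."""
    ordered = sorted(arrival_order(rows, seed), key=lambda row: row[0])
    return ordered[0][1]


# Each seed plays the role of one "run" of the same job on the same data.
unordered_results = {first_without_order(ROWS, seed) for seed in range(50)}
ordered_results = {first_with_order(ROWS, seed) for seed in range(50)}

print("without ordering:", sorted(unordered_results))
print("with ordering:   ", sorted(ordered_results))
```

Across the simulated runs, the unordered variant yields more than one distinct "first" email, while the ordered variant always yields the email with the smallest created_at.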

Best practices

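  • Always include orderBy() in the Window specification when using first() or last(), ordering by a column (or columns) that uniquely orders the rows within each partition; ties in the ordering still leave the result non-deterministic
  • With orderBy(), the default window frame ends at the current row, so last() returns the value from the current row (or its last peer) rather than from the end of the partition; add rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) if the partition-wide last value is intended
  • Alternatively, when any stable value is acceptable, replace first() / last() with a deterministic aggregate such as min() or max(); note that these return the smallest or largest value, not the value from the earliest or latest row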

Rule of thumb: Never use first() or last() with a window that only has partitionBy() — always add orderBy() to the window spec.

Example

Bad:

from pyspark.sql import functions as F
from pyspark.sql.functions import first
from pyspark.sql.window import Window

# df is assumed to be an existing DataFrame with user_id, email and login_date columns

# Non-deterministic: no ordering in the window spec
w = Window.partitionBy("user_id")
df.withColumn("first_email", first("email").over(w))

# Inline window spec without orderBy
df.select(first("email").over(Window.partitionBy("user_id")))

# F-qualified calls are also flagged
df.withColumn("last_login", F.last("login_date").over(Window.partitionBy("user_id")))

Good:

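from pyspark.sql import functions as F
from pyspark.sql.functions import first
from pyspark.sql.window import Window

# df is assumed to be an existing DataFrame with user_id, email and created_at columns

# Deterministic, provided created_at is unique within each user_id
w = Window.partitionBy("user_id").orderBy("created_at")
df.withColumn("first_email", first("email").over(w))

# Inline with orderBy
df.select(first("email").over(Window.partitionBy("user_id").orderBy("created_at")))

# Use a deterministic aggregate instead (F.min, not Python's built-in min)
df.withColumn("min_email", F.min("email").over(Window.partitionBy("user_id")))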
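For readers curious how the Bad and Good patterns might be told apart statically, here is a minimal sketch built on Python's standard ast module. It is not the actual implementation of rule S016: the function names are hypothetical, and it assumes that window specs are either written inline or bound once to a simple variable name. Anything more dynamic is silently skipped.

```python
import ast

TARGETS = {"first", "last"}


def _call_name(call):
    """Return the called name for first(...) or F.first(...), else None."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def _chain_methods(node):
    """Collect method names in a chain like Window.partitionBy(...).orderBy(...)."""
    names = set()
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        names.add(node.func.attr)
        node = node.func.value
    return names


def find_unordered_first_last(source):
    """Return line numbers where first()/last() run over an unordered window."""
    tree = ast.parse(source)

    # Remember simple assignments such as `w = Window.partitionBy(...)`.
    specs = {}
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign) and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)):
            specs[node.targets[0].id] = node.value

    hits = []
    for node in ast.walk(tree):
        # Match <something>.over(<spec>) with exactly one argument.
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "over"
                and len(node.args) == 1):
            continue
        inner = node.func.value
        if not (isinstance(inner, ast.Call) and _call_name(inner) in TARGETS):
            continue
        spec = node.args[0]
        if isinstance(spec, ast.Name):
            spec = specs.get(spec.id)  # resolve `w` to its assigned value
        if spec is None:
            continue  # unknown spec: this sketch does not guess
        methods = _chain_methods(spec)
        if "partitionBy" in methods and "orderBy" not in methods:
            hits.append(node.lineno)
    return sorted(hits)


BAD = (
    'w = Window.partitionBy("user_id")\n'
    'df.withColumn("first_email", first("email").over(w))\n'
    'df.withColumn("last_login", F.last("login_date").over(Window.partitionBy("user_id")))\n'
)
GOOD = (
    'w = Window.partitionBy("user_id").orderBy("created_at")\n'
    'df.select(first("email").over(w))\n'
    'df.withColumn("min_email", F.min("email").over(Window.partitionBy("user_id")))\n'
)

print(find_unordered_first_last(BAD))
print(find_unordered_first_last(GOOD))
```

The sketch parses source text only, so it needs no Spark installation. A real checker would also have to handle reassigned variables, aliased imports, and specs built across several statements.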