Rule F015

Avoid multiple consecutive .filter() calls — combine conditions into one

Severity

🟢 LOW — Minor impact; mainly readability and plan size.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Chaining several .filter() or .where() calls instead of combining their conditions into a single call produces a longer unoptimized logical plan and obscures the intent of the query.

  • Each separate .filter() adds a node to the logical plan, making it harder to read and reason about
  • Catalyst's CombineFilters rule merges them during optimization, so performance is usually unaffected, but a single predicate states the intent directly
  • Mixing .filter() and .where() across a chain is particularly confusing, since the two are aliases

Best practices

  • Combine conditions with & (and) or | (or) inside a single .filter() call
  • Use parentheses to group compound conditions clearly

Example

Bad:

df.filter(col("age") > 18).filter(col("country") == "FR")
df.where(col("status") == "active").where(col("score") > 0.5)

Good:

df.filter((col("age") > 18) & (col("country") == "FR"))
df.filter((col("status") == "active") & (col("score") > 0.5))