Rule F015
Avoid multiple consecutive .filter() calls — combine conditions into one
Severity
🟢 LOW — Minor performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Chaining multiple .filter() or .where() calls instead of combining their conditions into a single call produces an unnecessarily long logical plan.
- Each separate .filter() adds a step to the logical plan, making it harder to read and reason about
- Spark will eventually combine them during optimization, but the intent is clearer when written as a single predicate
- Mixing .filter() and .where() across a chain is particularly confusing since they are aliases
Best practices
- Combine conditions with & (and) or | (or) inside a single .filter() call
- Use parentheses to group compound conditions clearly
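When the set of conditions is built dynamically, the same single-call pattern still applies: fold the predicates into one before filtering. The sketch below is plain Python (no Spark dependency) showing the folding idea with row predicates standing in for Column conditions; in PySpark the equivalent would be reducing Column expressions with the & operator:

```python
# Hypothetical row predicates standing in for PySpark Column conditions.
predicates = [
    lambda row: row["age"] > 18,
    lambda row: row["country"] == "FR",
]

def combine_all(preds):
    """Fold many predicates into a single conjunctive one,
    mirroring the reduction of Column conditions with &."""
    return lambda row: all(p(row) for p in preds)

combined = combine_all(predicates)

rows = [
    {"age": 25, "country": "FR"},
    {"age": 16, "country": "FR"},
    {"age": 40, "country": "DE"},
]
# A single "filter" pass with the combined predicate.
kept = [r for r in rows if combined(r)]
print(kept)
```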
Example
Bad:
df.filter(col("age") > 18).filter(col("country") == "FR")
df.where(col("status") == "active").where(col("score") > 0.5)
Good:
df.filter((col("age") > 18) & (col("country") == "FR"))
df.filter((col("status") == "active") & (col("score") > 0.5))