Rule F007

Prefer using filter() before select() for clarity

Severity

🟢 LOW — Minor performance impact.

Compatible with PySpark 1.3 and later.

Applying select() before filter() may work correctly, but can make the code harder to read and understand:

Makes it less clear which columns the filter condition depends on
Reduces readability when chaining multiple transformations
Spark’s Catalyst optimizer normally pushes the filter below the projection either way, so writing filter() first mainly improves code clarity rather than speed
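One way to check the optimizer behavior on your own data is to print the physical plan for both orderings with DataFrame.explain(). The sketch below assumes an existing PySpark DataFrame with columns "a" and "b". The helper name print_both_plans is hypothetical, and the condition is written as a SQL string only to keep the snippet free of extra imports.

```python
def print_both_plans(df):
    """Print the physical plan for each ordering of the same query.

    `df` is assumed to be a PySpark DataFrame with columns "a" and "b".
    """
    # select() first: the filter is written after the projection.
    df.select("a", "b").filter("a > 1").explain()

    # filter() first: the ordering this rule recommends.
    df.filter("a > 1").select("a", "b").explain()
```

If the predicate is pushed down as expected, both printed plans should place the Filter step beneath the Project step, which suggests the choice of ordering is about readability rather than execution cost. Plan text varies by Spark version and data source, so treat this as a quick check, not a guarantee.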

Apply filter() before select() so that row conditions are stated before the output columns
Keeping this order consistent makes DataFrame pipelines easier to read and maintain
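For a longer chain, the filter-first ordering might look like the sketch below. The function name large_completed_orders and the columns status, amount, order_id, and customer_id are hypothetical, chosen only for illustration; df is assumed to be an existing PySpark DataFrame that has those columns.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only; nothing here is needed at runtime.
    from pyspark.sql import DataFrame


def large_completed_orders(df: "DataFrame", min_amount: float) -> "DataFrame":
    """Hypothetical pipeline: row conditions first, column list last."""
    return (
        df
        # 1. Row conditions come first, where they are easy to spot.
        .filter(df["status"] == "completed")
        .filter(df["amount"] > min_amount)
        # 2. The projection comes last, so the final line shows the output columns.
        .select("order_id", "customer_id", "amount")
    )
```

Reading top to bottom, the chain answers "which rows?" before "which columns?", and the last line doubles as a summary of the result's schema.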

Rule of thumb: Use filter() before select() primarily for code clarity and readability.

Bad:

from pyspark.sql.functions import col
df.select("a", "b").filter(col("a") > 1)

Good:

from pyspark.sql.functions import col
df.filter(col("a") > 1).select("a", "b")