Rule F007

Prefer using filter() before select() for clarity

Severity

🟢 LOW: mainly a readability concern, with little or no performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Applying select() before filter() may work correctly, but can make the code harder to read and understand:

  • Makes it less clear which columns are being filtered
  • Reduces readability when chaining multiple transformations
  • Spark’s Catalyst optimizer usually pushes the filter below the projection, so both orders tend to compile to the same plan; writing filter() first still makes the intent explicit

Best practices

  • Apply filter() before select() to make the transformation logic easier to follow
  • This approach makes DataFrame pipelines more understandable and maintainable

Rule of thumb: Use filter() before select() primarily for code clarity and readability.

Example

Bad:

from pyspark.sql.functions import col
df.select("a", "b").filter(col("a") > 1)

Good:

from pyspark.sql.functions import col
df.filter(col("a") > 1).select("a", "b")