Rule F007
Prefer using filter() before select() for clarity
Severity
🟢 LOW: mainly a readability concern; any performance impact is minor.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Applying select() before filter() may work correctly, but can make the code harder to read and understand:
- Makes it less clear which columns are being filtered
- Reduces readability when chaining multiple transformations
- Even though Spark’s AQE and Catalyst optimizer handle filter pushdown, an explicit filter() first improves code clarity
Best practices
- Apply filter() before select() to make the transformation logic easier to follow
- This approach makes DataFrame pipelines more understandable and maintainable
Rule of thumb: Use filter() before select() primarily for code clarity and readability.
Example
Bad:
Good: