Rule F002
Avoid using drop()
Severity
🟢 LOW — Minor performance impact.
PySpark version
Compatible with PySpark 1.4 and later.
Information
Using drop() can make transformations less clear and harder to maintain:
- You may lose track of which columns remain in the DataFrame
- Chained drop() calls can create confusion about the final schema
- It reduces control over column order and subsequent transformations
Best practices
- Prefer using select() to explicitly choose columns you want to keep
- Using select() improves readability and maintainability of your transformation logic
- It makes it easier to reason about what happens next in your pipeline
Rule of thumb: Use select() instead of drop() for better control and clarity of your DataFrame transformations.
Example
Bad:
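A minimal sketch of the pattern this rule flags. The function name and the column names (id, name, email, ssn, created_at, updated_at) are hypothetical and chosen only for illustration; df is assumed to be a pyspark.sql.DataFrame.

```python
def contact_view_with_drop(df):
    """Flagged pattern: the result is 'whatever is left' after the drops."""
    # Assumed (hypothetical) input columns:
    #   id, name, email, ssn, created_at, updated_at
    # The reader has to know the full input schema to work out the output,
    # and any column added upstream later passes through unnoticed.
    return (
        df.drop("ssn")
          .drop("created_at")
          .drop("updated_at")
    )
```

Nothing in this function states which columns come out of it; that depends entirely on what the input happens to contain.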
Good:
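A minimal sketch of the preferred pattern, under the same assumptions: the function name and the column names (id, name, email) are hypothetical, and df is assumed to be a pyspark.sql.DataFrame that contains at least those columns.

```python
def contact_view_with_select(df):
    """Preferred pattern: the output columns and their order are explicit."""
    # Only the listed (hypothetical) columns are kept, in this order,
    # regardless of what else the input DataFrame contains.
    return df.select("id", "name", "email")
```

Here the output schema can be read directly from the code, and a column added upstream does not appear in the result unless it is added to the select() call.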