Rule F002

Avoid using drop()

Severity

🟢 LOW — Minor performance impact.

Compatible with PySpark 1.4 and later.

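Using drop() can make transformations less clear and harder to maintain, because the output schema is defined only by what was removed and changes silently whenever the input gains a column.

Prefer select() to explicitly list the columns you want to keep.
An explicit select() documents the output schema, which improves the readability and maintainability of your transformation logic.
It also makes it easier to reason about what the next step in your pipeline receives.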

Rule of thumb: Use select() instead of drop() for better control and clarity of your DataFrame transformations.

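Bad:

# Assuming df has columns col_a, col_b, col_c: the result is "everything except col_a"
df.drop("col_a")

Good:

# The result is exactly col_b and col_c, whatever else df contains
df.select("col_b", "col_c")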
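
When replacing an existing drop() call, the explicit column list for select() can be derived once from the current schema. The following is a minimal sketch, not part of the rule itself: explicit_select_columns is a hypothetical helper name, and the three-column input schema is assumed from the Bad/Good example above. It works on plain lists of names (such as df.columns), so it runs without a Spark session.

```python
from typing import Iterable, List


def explicit_select_columns(current: Iterable[str], dropped: Iterable[str]) -> List[str]:
    """Return the columns a drop() call would keep, in their original order.

    `current` is the schema before the drop (for example df.columns) and
    `dropped` holds the names that were passed to drop().
    """
    current = list(current)
    dropped = set(dropped)
    # drop() ignores string names that are not in the schema; fail loudly instead.
    missing = dropped - set(current)
    if missing:
        raise ValueError(f"not in schema: {sorted(missing)}")
    return [name for name in current if name not in dropped]


# Column names taken from the Bad/Good example; the input schema is assumed.
kept = explicit_select_columns(["col_a", "col_b", "col_c"], ["col_a"])
print(kept)  # ['col_b', 'col_c']
```

The printed list is meant to be pasted into the select() call as literal column names. Calling the helper inside the pipeline would bring back the implicit schema this rule is trying to avoid.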