Rule F002

Avoid using drop()

Severity

🟢 LOW — Minor performance impact.

PySpark version

Compatible with PySpark 1.4 and later.

Information

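Using drop() defines the result by what is removed, so the schema that remains is implicit. This makes transformations less clear and harder to maintain:

  • The columns that remain are not visible at the call site; you need to know the input schema to work them out
  • Chained drop() calls make the final schema even harder to follow
  • Column order is inherited from the input, so you cannot control it, and later transformations have no explicit column list to rely on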
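
The points above can be illustrated with a minimal sketch. The function names and the column names col_a, col_b, col_c and tmp_flag are hypothetical, and df is assumed to be a pyspark.sql.DataFrame that contains those columns.

```python
# Hypothetical names for illustration; `df` is assumed to be a
# pyspark.sql.DataFrame with columns col_a, col_b, col_c and tmp_flag.

def prune_with_drop(df):
    # The result is "whatever came in, minus two names". A column added
    # upstream later passes straight through, and drop() ignores a name
    # that is not in the schema instead of reporting it.
    return df.drop("col_a").drop("tmp_flag")


def prune_with_select(df):
    # The output columns and their order are spelled out at the call site.
    return df.select("col_b", "col_c")
```

Both functions return the same columns for the assumed input. If an upstream step later adds a column, only prune_with_drop() lets it through to the output.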

Best practices

  • Prefer select() to list explicitly the columns you want to keep
  • An explicit select() makes your transformation logic easier to read and maintain
  • It also makes it easier to reason about what the next step in your pipeline receives

Rule of thumb: Use select() instead of drop() for better control and clarity of your DataFrame transformations.

Example

Bad (assuming df has the columns col_a, col_b and col_c):

df.drop("col_a")

Good:

df.select("col_b", "col_c")
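
When migrating an existing drop() call, it can help to generate the explicit column list once and then paste it into the code as literals. The helper below is a hypothetical sketch, not part of PySpark; it works on plain lists of column names, so with a real DataFrame you would pass df.columns as the first argument.

```python
def columns_after_drop(columns, dropped):
    """Return the columns that would survive a drop(), in their original order."""
    unknown = [name for name in dropped if name not in columns]
    if unknown:
        # drop() would ignore these names silently; surface them during migration.
        raise ValueError(f"not in schema: {unknown}")
    return [name for name in columns if name not in dropped]


# With a real DataFrame: columns_after_drop(df.columns, ["col_a"])
keep = columns_after_drop(["col_a", "col_b", "col_c"], ["col_a"])
print(f"df.select({', '.join(repr(name) for name in keep)})")
# prints: df.select('col_b', 'col_c')
```

The printed line is meant to be copied into the source. Calling the helper inside the pipeline itself would bring back the implicit schema this rule is trying to avoid.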