Rule F003

Avoid using selectExpr(); prefer select()

Severity

🟢 LOW: minor impact, mainly on readability and maintainability rather than performance.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Using selectExpr() can make transformations less readable and harder to maintain:

  • Expressions written as SQL strings are opaque to editors and linters, so typos go unnoticed until Spark parses the string at runtime
  • Column lineage is harder to trace in complex transformations because column references are buried inside strings
  • Debugging is more difficult than with select() and column objects

Best practices

  • Prefer select() with col() or column expressions for clarity
  • Using select() improves readability and makes transformations easier to maintain
  • select() with column objects works better with IDEs, static analysis, and refactoring tools

Rule of thumb: Use select() instead of selectExpr() for more readable, maintainable, and safer DataFrame transformations.

Example

Bad:

df.selectExpr("age * 2 as double_age")

Good:

from pyspark.sql.functions import col

df.select((col("age") * 2).alias("double_age"))