Rule F020
Avoid select("*") — use explicit column names
Severity
🟢 LOW — Minor performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Passing the wildcard string "*" to select() is the DataFrame equivalent of
SELECT * in SQL. It silently pulls every column from the DataFrame
without any guarantee about which columns will be present at runtime:
- Adding, removing, or reordering a column upstream changes your DataFrame's schema without any error or warning downstream.
- Code reviewers and future maintainers have no idea which columns are actually needed, making the intent of the transformation invisible.
- In wide tables with dozens of columns, a select("*") followed by joins or aggregations inflates the shuffle size and slows the job.
Always name the columns you need explicitly. If you genuinely want all columns
plus extras, use df.columns to build the list programmatically so the intent
is clear.
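The df.columns approach can be sketched as follows. This is a minimal illustration rather than part of the rule: the helper name select_all_plus and its parameters are hypothetical, and df is assumed to be a PySpark DataFrame.

```python
def select_all_plus(df, *extra_columns):
    """Select every existing column by name, followed by the extras.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame and
    `extra_columns` are column names or Column expressions to append.
    """
    # df.columns is a plain list of column-name strings, so the expanded
    # list can be logged or asserted on before the select runs.
    columns = [*df.columns, *extra_columns]
    return df.select(*columns)
```

A call such as select_all_plus(df, F.current_timestamp().alias("loaded_at")), with F imported from pyspark.sql.functions, would keep the existing columns and append one derived column.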
Best practices
Bad:
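A minimal sketch of the flagged pattern. The function name and the orders DataFrame are hypothetical, chosen only for illustration.

```python
def prepare_orders(orders):
    # `orders` is assumed to be a pyspark.sql.DataFrame (hypothetical input).
    # The wildcard passes through whatever columns exist upstream, so a
    # schema change there silently changes this function's output.
    return orders.select("*")
```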
Good:
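A minimal sketch of the explicit form, assuming a hypothetical orders DataFrame that has order_id, customer_id, and amount columns. The function and column names are illustrative only.

```python
def prepare_orders(orders):
    # `orders` is assumed to be a pyspark.sql.DataFrame; the column names
    # below are hypothetical.
    # Naming the columns documents what downstream code depends on, and a
    # missing column fails here with an AnalysisException instead of
    # surfacing later in the pipeline.
    return orders.select("order_id", "customer_id", "amount")
```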