
Rule F020

Avoid select("*") — use explicit column names

Severity

🟢 LOW — Minor performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Passing the wildcard string "*" to select() is the DataFrame equivalent of SELECT * in SQL. It silently pulls every column from the DataFrame without any guarantee about which columns will be present at runtime:

  • Adding, removing, or reordering a column upstream changes your DataFrame's schema without any error or warning downstream.
  • Code reviewers and future maintainers have no idea which columns are actually needed, making the intent of the transformation invisible.
  • In wide tables with dozens of columns, a select("*") followed by joins or aggregations can carry columns you never use through the shuffle, which increases its size and slows the job.
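
The first point can be illustrated without a Spark session. The sketch below is a toy model, not Spark's implementation: `expand_select` is a hypothetical function that mimics how the "*" entry expands to every column of the current schema, and plain lists of names stand in for a DataFrame's columns.

```python
from typing import List


def expand_select(schema: List[str], requested: List[str]) -> List[str]:
    """Toy model of select(): "*" expands to every column in `schema`, in order."""
    result: List[str] = []
    for name in requested:
        if name == "*":
            result.extend(schema)
        elif name in schema:
            result.append(name)
        else:
            raise KeyError(f"Column not found: {name}")
    return result


# Hypothetical upstream schema before and after a change
# (two columns reordered, one column added).
before = ["id", "name", "country"]
after = ["id", "country", "name", "internal_notes"]

print(expand_select(before, ["*"]))                     # follows the old schema
print(expand_select(after, ["*"]))                      # silently follows the new schema
print(expand_select(after, ["id", "name", "country"]))  # same shape as before the change
```

In this model the wildcard result changes shape whenever the upstream schema does, while the explicit list keeps the same columns in the same order and raises if one of them disappears.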

Always name the columns you need explicitly. If you genuinely want all columns plus extras, use df.columns to build the list programmatically so the intent is clear.
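
When building the list from df.columns, it can help to check that the extra column names do not already exist upstream, since a select can otherwise return two columns with the same name. The following is a minimal sketch under stated assumptions: `check_new_columns` is a hypothetical helper, not part of PySpark, and it only inspects the list of names that df.columns returns.

```python
from typing import Iterable, List


def check_new_columns(existing: Iterable[str], new_names: Iterable[str]) -> List[str]:
    """Return the new names unchanged, or raise if any already exists.

    `existing` is meant to be `df.columns`. The comparison ignores case
    because Spark resolves column names case-insensitively by default.
    """
    taken = {name.lower() for name in existing}
    new_list = list(new_names)
    clashes = [name for name in new_list if name.lower() in taken]
    if clashes:
        raise ValueError(f"Columns already present upstream: {clashes}")
    return new_list


# A plain list stands in for df.columns so the sketch runs without Spark.
upstream_columns = ["id", "name", "country"]
print(check_new_columns(upstream_columns, ["flag"]))
```

With a real DataFrame, the check would run just before the select, for example check_new_columns(df.columns, ["flag"]) followed by df.select(*df.columns, lit(1).alias("flag")).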

Best practices

Bad:

df.select("*")
df.select("*", "extra_col")
df.select("id", "*")

Good:

from pyspark.sql.functions import lit

# Name the columns you actually need
df.select("id", "name", "country")

# If you want all existing columns plus a new one, build the list explicitly
df.select(*df.columns, lit(1).alias("flag"))