Rule F016

Avoid long renaming chains — overwrite the same DataFrame variable instead

Severity

🟢 LOW — Readability concern; negligible performance impact.

Experimental rule

This rule is experimental. Detection relies on static variable-name tracking across assignment statements and cannot follow aliasing through function calls, conditional branches, or dynamic attribute access. False positives and false negatives are possible; review every finding before acting on it.

PySpark version

Compatible with PySpark 1.0 and later.

Information

Renaming a DataFrame at every transformation step (df_a = df.filter(...), df_b = df_a.distinct(), df_c = df_b.join(...)) creates a long chain of variables that clutters the namespace while carrying no meaningful information about what changed at each step.

  • Each new name forces the reader to trace a chain of assignments to understand the current state of the data
  • Short, informative renames are fine; a chain of 3 or more steps without a clear reason to keep every intermediate is a code smell
  • If the dataset is the same logical entity throughout, reuse the same variable name

This rule fires when more than 2 consecutive renames are detected: a = x.m(), b = a.m(), c = b.m() → flag on the third step.
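The detection described above can be sketched as a small AST walk over straight-line assignments. This is a hypothetical illustration of the tracking strategy, not the linter's actual implementation, and it shares the limitation noted earlier: it only follows simple `new = old.method(...)` assignments at the top level.

```python
import ast

def flag_rename_chains(source: str, max_chain: int = 2) -> list[str]:
    """Flag assignments that extend a rename chain past `max_chain` steps.

    A "rename" is `new = old.method(...)` with new != old. Sketch only:
    module-level, straight-line code; no branches, functions, or aliasing.
    """
    chain_depth: dict[str, int] = {}  # variable -> renames that produced it
    flagged = []
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target, value = node.targets[0], node.value
        if not (isinstance(target, ast.Name) and isinstance(value, ast.Call)):
            continue
        func = value.func
        # Match `name.method(...)` where `name` is a bare variable
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            source_name = func.value.id
            if target.id == source_name:
                continue  # in-place overwrite, e.g. df = df.filter(...)
            depth = chain_depth.get(source_name, 0) + 1
            chain_depth[target.id] = depth
            if depth > max_chain:  # third consecutive rename -> flag
                flagged.append(target.id)
    return flagged
```

Run against the example below, this flags `df_enriched` (the third rename) in the bad version and nothing in the good version, since in-place overwrites never extend a chain.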

Best practices

  • Reuse the variable name for pure pipeline steps
  • Only introduce a new name when the result represents a genuinely different entity

Example

Bad:

df_filtered  = df.filter(col("active") == True)
df_deduped   = df_filtered.distinct()
df_enriched  = df_deduped.join(ref, "id")   # third rename — flag

Good:

df = df.filter(col("active") == True)
df = df.distinct()
df = df.join(ref, "id")