Rule F016
Avoid long renaming chains — overwrite the same DataFrame variable instead
Severity
🟢 LOW — Readability and maintainability impact; the rule has no effect on the execution plan.
Experimental rule
This rule is experimental. Detection relies on static variable-name tracking across assignment statements and cannot follow aliasing through function calls, conditional branches, or dynamic attribute access. False positives and false negatives are possible; review every finding before acting on it.
PySpark version
Compatible with PySpark 1.3 and later (the release that introduced the DataFrame API).
Information
Renaming a DataFrame at every transformation step (df_a = df.filter(...), df_b = df_a.distinct(), df_c = df_b.join(...)) creates a long chain of variables that clutters the namespace, especially when the names carry no meaningful information about what changed.
- Each new name forces the reader to trace a chain of assignments to understand the current state of the data
- Short, informative renames are fine; a chain of 3 or more steps without a clear reason to keep every intermediate is a code smell
- If the dataset is the same logical entity throughout, reuse the same variable name
This rule fires when more than 2 consecutive renames are detected: a = x.m(), b = a.m(), c = b.m() → flag on the third step.
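The detection heuristic above can be sketched with a small AST walk. This is a minimal illustration under stated assumptions, not the rule's actual implementation: it only matches single-target assignments of the form b = a.method(...) where a is a plain name, and (as the experimental-rule note warns) it cannot follow aliasing through branches or function calls.

```python
import ast

def rename_chain_flags(source: str):
    """Flag assignments where a rename chain grows past 2 steps.

    Hypothetical helustration of the heuristic: `b = a.method(...)` with
    b != a extends the chain ending at `a`; overwriting the same name
    (`a = a.method(...)`) resets it. Returns (lineno, name) pairs for
    every step beyond the second rename.
    """
    chain_len = {}   # variable name -> length of the rename chain ending there
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target, call = node.targets[0], node.value
        # Match `new = old.method(...)` where `old` is a plain name
        if (isinstance(target, ast.Name)
                and isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and isinstance(call.func.value, ast.Name)):
            old, new = call.func.value.id, target.id
            if new == old:
                chain_len[new] = 0          # overwrite: chain resets
            else:
                chain_len[new] = chain_len.get(old, 0) + 1
                if chain_len[new] > 2:      # third consecutive rename
                    flagged.append((node.lineno, new))
    return flagged
```

On the chain from the Information section, only the third step is reported, while a pipeline that keeps overwriting the same name produces no findings.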
Best practices
- Reuse the variable name for pure pipeline steps
- Only introduce a new name when the result represents a genuinely different entity
Example
Bad:
from pyspark.sql.functions import col

df_filtered = df.filter(col("active") == True)
df_deduped = df_filtered.distinct()
df_enriched = df_deduped.join(ref, "id")  # third rename — flag
Good:
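Following the best practices above, a compliant version reuses the same variable name for every pure pipeline step (sketch mirroring the bad example; df and ref are the same DataFrames as above):

```python
from pyspark.sql.functions import col

df = df.filter(col("active") == True)
df = df.distinct()
df = df.join(ref, "id")  # same logical entity throughout — no new names needed
```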