Rule L002

Avoid using while loops with DataFrames

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.0 and later.

Information

Using while loops in PySpark to repeatedly process DataFrames can lead to:

  • A transformation lineage (query plan) that grows with every iteration
  • Excessive task creation and memory usage
  • Performance degradation or job failure

Best practices

  • Use DataFrame transformations like .filter() and .join(), or .foreachBatch() in Structured Streaming, for iterative processing
  • Use .localCheckpoint() or .checkpoint() if iterative logic is unavoidable
  • Prefer structured streaming or batch operations instead of manual loops

Rule of thumb: Avoid while loops; leverage Spark’s distributed operations for iterations.

Example

Bad:

from pyspark.sql.functions import col

# each iteration adds another filter to the plan, so the
# lineage grows for as long as the loop runs
while condition:
    df = df.filter(col("status") != "done")

Good:

from pyspark.sql.functions import col

# a single transformation expresses the whole operation and
# lets Spark execute it across the cluster in one pass
df = df.filter(col("status") != "done")