Rule L002

Avoid using while loops with DataFrames

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.0 and later.

Information

Using while loops in PySpark to repeatedly process DataFrames can lead to:

  • A transformation lineage (query plan) that grows with every iteration
  • Excessive task creation and memory usage
  • Performance degradation or job failure

Best practices

  • Use DataFrame transformations like .filter() and .join(), or .foreachBatch() in Structured Streaming, for iterative processing
  • Use .localCheckpoint() or .checkpoint() if iterative logic is unavoidable
  • Prefer structured streaming or batch operations instead of manual loops

Rule of thumb: Avoid while loops; leverage Spark’s distributed operations for iterations.

Example

Bad:

from pyspark.sql.functions import col

# each iteration adds another filter to the plan, so the
# lineage grows for as long as the loop runs
while condition:
    df = df.filter(col("status") != "done")

Good:

from pyspark.sql.functions import col

# a single transformation expresses the whole operation and
# lets Spark execute it across the cluster in one pass
df = df.filter(col("status") != "done")