Rule L002
Avoid using while loops with DataFrames
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.0 and later.
Information
Using while loops in PySpark to repeatedly process DataFrames can lead to:
- An ever-growing lineage of transformations, which makes query planning increasingly expensive
- Excessive task creation and memory usage
- Performance degradation or job failure
Best practices
- Use Spark transformations like .map(), .filter(), or .foreachBatch() for iterative processing
- Use .localCheckpoint() or .checkpoint() if iterative logic is unavoidable
- Prefer Structured Streaming or batch operations instead of manual loops
Rule of thumb: Avoid while loops; leverage Spark’s distributed operations for iterations.
Example
Bad:
Good: