Rule L001
Avoid looping without .localCheckpoint() or .checkpoint()
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 2.3 and later.
Information
Looping over a DataFrame, or repeatedly applying transformations to it, without checkpointing can cause Spark to:
- Build a very long lineage of transformations
- Spend rapidly growing time analyzing and optimizing the query plan on every action
- Risk a StackOverflowError or outright job failure once the plan grows too deep
Best practices
- Use .localCheckpoint() on iterative or intermediate DataFrames to truncate lineage
- Use .checkpoint() for long-running jobs, or when the checkpointed data must survive executor loss; it writes to the directory set via sc.setCheckpointDir()
- Cache DataFrames that are reused multiple times inside loops
Rule of thumb: Always truncate lineage with checkpointing in loops to prevent excessive tasks and job failures.
Example
Bad:
Good: