Rule L001

Avoid looping without .localCheckpoint() or .checkpoint()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 2.3 and later.

Information

Looping over a DataFrame or repeatedly applying transformations without checkpointing can cause Spark to:

  • Build a very long lineage (query plan) that the driver must re-analyze on every action
  • Slow down planning and optimization, since each iteration adds another node to the plan
  • Risk a StackOverflowError or job failure once the plan grows too deep

Best practices

  • Use .localCheckpoint() for iterative or intermediate DataFrames to truncate lineage:
    df = df.localCheckpoint()
    
  • Use .checkpoint() for long-running jobs or when the checkpoint must survive executor loss; it writes to reliable storage and requires spark.sparkContext.setCheckpointDir() to be set first
  • Cache DataFrames with .cache() if they are reused multiple times within a loop

Rule of thumb: Always truncate lineage with checkpointing in loops to prevent excessive tasks and job failures.

Example

Bad:

from pyspark.sql.functions import col  # assumes df is an existing DataFrame

for i in range(100):
    df = df.withColumn("x", col("x") + i)
    # no checkpoint: the plan grows by one projection per iteration

Good:

from pyspark.sql.functions import col  # assumes df is an existing DataFrame

for i in range(100):
    df = df.withColumn("x", col("x") + i)
    if i % 10 == 0:
        df = df.localCheckpoint()  # truncate lineage every 10 iterations