Rule L001

Avoid looping without .localCheckpoint() or .checkpoint()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 2.3 and later.

Information

Looping over a DataFrame or repeatedly applying transformations without checkpointing can cause Spark to:

  • Build a very long lineage (query plan) that the driver must re-analyze on every action
  • Slow down planning and optimization, since each iteration adds another node to the plan
  • Risk a StackOverflowError or job failure once the plan grows too deep

Best practices

  • Use .localCheckpoint() for iterative or intermediate DataFrames to truncate lineage:
    df = df.localCheckpoint()
    
  • Use .checkpoint() for long-running jobs or when the checkpoint must survive executor loss; it writes to reliable storage and requires spark.sparkContext.setCheckpointDir() to be set first
  • Cache DataFrames with .cache() if they are reused multiple times within a loop

Rule of thumb: Always truncate lineage with checkpointing in loops to prevent excessive tasks and job failures.

Example

Bad:

from pyspark.sql.functions import col  # assumes df is an existing DataFrame

for i in range(100):
    df = df.withColumn("x", col("x") + i)
    # no checkpoint: the plan grows by one projection per iteration

Good:

from pyspark.sql.functions import col  # assumes df is an existing DataFrame

for i in range(100):
    df = df.withColumn("x", col("x") + i)
    if i % 10 == 0:
        df = df.localCheckpoint()  # truncate lineage every 10 iterations