Skip to content

Rule PERF005

DataFrame persisted but never unpersisted

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Every .persist() call pins the DataFrame's partitions in memory (and/or disk) for the rest of the Spark session. Forgetting to call .unpersist() causes:

  • Memory pressure that accumulates with every job run, eventually evicting other cached data and forcing expensive recomputation
  • Silent leaks — the cached blocks remain pinned until the session ends or the executor is killed, with no warning in logs
  • OOM crashes in long-running applications or notebooks that persist many DataFrames without cleaning up

Every DataFrame that is persisted should have a matching .unpersist() call once the cached data is no longer needed.

Best practices

  • Call .unpersist() explicitly once the DataFrame is no longer needed downstream
  • If two variables hold the same persisted DataFrame, unpersist both names — each assignment that received a .persist() result must be unpersisted

Rule of thumb: Every .persist() should have a paired .unpersist().

Example

Bad:

df  = df.persist()
df2 = df2.persist()

df.unpersist()
# df2 is never unpersisted — memory leak

df  = df.persist()
df2 = df.persist()   # df2 holds the same persisted ref

df.unpersist()
# df2.unpersist() was never called — still a leak

Good:

df  = df.persist()
df2 = df2.persist()

# ... use df and df2 ...

df.unpersist()
df2.unpersist()