Rule PERF005
DataFrame persisted but never unpersisted
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Every .persist() call marks the DataFrame's partitions for caching in memory and/or on disk,
and the cached blocks are kept for the rest of the Spark session unless they are released. Forgetting to call .unpersist() causes:
- Memory pressure that accumulates with every job run, eventually evicting other cached data and forcing expensive recomputation
- Silent leaks — the cached blocks remain pinned until the session ends or the executor is killed, with no warning in logs
- OOM crashes in long-running applications or notebooks that persist many DataFrames without cleaning up
Every DataFrame that is persisted should have a matching .unpersist() call
once the cached data is no longer needed.
Best practices
- Call .unpersist() explicitly once the DataFrame is no longer needed downstream
- If two variables hold the same persisted DataFrame, unpersist both names: each assignment that received a .persist() result must be unpersisted
Rule of thumb: Every .persist() should have a paired .unpersist().
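One way to make the pairing hard to forget is to tie the two calls to a `with` block. The sketch below is illustrative only: `persisted` is a hypothetical helper, not part of PySpark, and `FakeDataFrame` is a stand-in class used so the example runs without a Spark installation. The helper assumes only that the object exposes .persist() and .unpersist().

```python
from contextlib import contextmanager


@contextmanager
def persisted(df, *persist_args):
    """Persist `df` for the duration of a `with` block, then unpersist it.

    Hypothetical helper (not a PySpark API). Any arguments are forwarded
    to df.persist(), e.g. a storage level.
    """
    cached = df.persist(*persist_args)
    try:
        yield cached
    finally:
        # Runs even if the body raises, so the cache is always released.
        cached.unpersist()


class FakeDataFrame:
    """Stand-in for a DataFrame so the sketch runs without Spark."""

    def __init__(self):
        self.is_cached = False

    def persist(self, *args):
        self.is_cached = True
        return self  # supports the chained style: df = df.persist()

    def unpersist(self):
        self.is_cached = False
        return self


df = FakeDataFrame()
with persisted(df) as cached_df:
    inside = cached_df.is_cached  # cached while the block runs
after = df.is_cached              # released once the block exits
```

With a real DataFrame the same helper would be used as `with persisted(df) as cached_df: ...`, with every action that benefits from the cache placed inside the block. A plain try/finally around the work achieves the same pairing without a helper.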
Example
Bad:
df = df.persist()
df2 = df.persist() # df2 holds the same persisted ref
df.unpersist()
# df2.unpersist() was never called — still a leak
Good: