Rule PERF002
Avoid multiple getOrCreate() calls — use getActiveSession() instead
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 3.0 and later.
Information
SparkSession.builder.getOrCreate() is expensive: it checks whether a session already exists and, if not, initializes a new one. Calling it multiple times throughout a codebase causes repeated overhead and makes session lifecycle management unclear.
- Each getOrCreate() call acquires a lock and inspects the session registry
- Scattering getOrCreate() across modules makes it impossible to know which call actually created the session
- SparkSession.getActiveSession() returns the already-running session without any initialization cost and returns None if no session exists, making the dependency explicit
This rule fires when more than one getOrCreate() call is found in the same file.
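As an illustration of the trigger condition only, the following minimal sketch counts getOrCreate() calls in one file's source using Python's ast module. It is not the rule's actual implementation; the function names count_get_or_create_calls and violates_perf002 are hypothetical, and the sketch matches any method named getOrCreate (including SparkContext.getOrCreate()), which a real checker may or may not do.

```python
import ast


def count_get_or_create_calls(source: str) -> int:
    """Count call expressions of the form <something>.getOrCreate(...)."""
    tree = ast.parse(source)
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "getOrCreate"
    )


def violates_perf002(source: str) -> bool:
    """Hypothetical check: more than one getOrCreate() call in the same file."""
    return count_get_or_create_calls(source) > 1
```

Under this sketch, a file with a single getOrCreate() call passes, and a second call in the same file is what triggers the rule.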
Best practices
- Call getOrCreate() exactly once at the entry point of your application
- Everywhere else, retrieve the session with SparkSession.getActiveSession() or pass it as a parameter
Example
Bad:
# module_a.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# module_b.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # unnecessary second call
Good: