Rule D009
Avoid using .count() as a boolean truth value; use .isEmpty() instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 3.3 and later, where DataFrame.isEmpty() was introduced. On earlier versions, len(df.take(1)) == 0 is a common equivalent.
Information
Using .count() as a boolean condition (e.g. if df.count():) triggers a
full distributed scan to count every row before the branch is taken. This is
wasteful when the only question is "does any row exist?". It can cause:
- A complete scan of the DataFrame (no early exit)
- High memory and CPU usage on large datasets
- Slower pipeline execution compared to short-circuit alternatives
.isEmpty() can return as soon as the first row is found, so in the best case it
does O(1) work instead of the O(n) full scan that .count() always performs.
Best practices
Replace boolean .count() uses with .isEmpty() or not df.isEmpty():
| Anti-pattern | Replacement |
|---|---|
| if df.count(): | if not df.isEmpty(): |
| if not df.count(): | if df.isEmpty(): |
| if x and df.count(): | if x and not df.isEmpty(): |
| if x and not df.count(): | if x and df.isEmpty(): |
Rule of thumb: Reserve .count() for when you need the exact row count.
For presence/absence checks, always prefer .isEmpty().
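A checker for this rule could be sketched with Python's ast module. This is a hypothetical illustration of how boolean uses of .count() might be detected, not the rule's actual implementation:

```python
import ast

def find_boolean_count(source: str) -> list[int]:
    """Return line numbers where a .count() call is used as a boolean
    condition (directly, under `not`, or inside an and/or chain)."""
    flagged = []

    def is_count_call(node: ast.AST) -> bool:
        # Unwrap `not ...` so `if not df.count():` is flagged too.
        while isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            node = node.operand
        return (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "count"
        )

    for node in ast.walk(ast.parse(source)):
        # Direct condition: `if df.count():`, `while not df.count():`
        if isinstance(node, (ast.If, ast.While)) and is_count_call(node.test):
            flagged.append(node.lineno)
        # Operand of a boolean chain: `if x and df.count():`
        elif isinstance(node, ast.BoolOp):
            if any(is_count_call(v) for v in node.values):
                flagged.append(node.lineno)
    return flagged

code = (
    "if df.count():\n"
    "    pass\n"
    "n = df.count()\n"          # legitimate use: not flagged
    "if x and not df.count():\n"
    "    pass\n"
)
print(sorted(find_boolean_count(code)))  # flags lines 1 and 4 only
```

Note that the plain assignment `n = df.count()` is left alone: the rule targets only boolean uses, not counting itself.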
Example
Bad:
if df.filter(col("status") == "active").count():
    process(df)

if not df.count():
    raise ValueError("empty DataFrame")
Good:
if not df.filter(col("status") == "active").isEmpty():
    process(df)

if df.isEmpty():
    raise ValueError("empty DataFrame")