Rule D006
Avoid using df.count() == 0; use .isEmpty() instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 3.3 and later.
Information
Using df.count() == 0 to check whether a DataFrame is empty forces Spark to run a full job that scans every partition just to compute the total row count, which is expensive on large datasets. This can lead to:
- High computation and memory usage
- Slower performance
- Unnecessary resource consumption
Best practices
- Use .isEmpty() for an efficient emptiness check; it only needs to find a single row
- Reserve .count() for cases where you actually need the exact number of rows
Rule of thumb: Use .isEmpty() for emptiness checks to avoid costly full scans.
Example
Bad:
Good: