Rule D007
Avoid using df.filter(...).count() == 0; use .isEmpty() instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Applies to PySpark 3.3.0 and later, where DataFrame.isEmpty() was introduced.
Information
Using df.filter(...).count() == 0 triggers a full scan of the filtered DataFrame, which is inefficient for large datasets. This can cause:
- High computation and memory usage
- Slower execution
- Unnecessary resource consumption
Best practices
- Use .filter(...).isEmpty() to check whether any rows match the filter; Spark can stop as soon as it finds a single matching row
- Use .count() only when the exact number of filtered rows is required
Rule of thumb: Prefer .isEmpty() over .count() == 0 for emptiness checks after filtering.
Example
Bad:
Good: