Rule D007
Avoid using df.filter(...).count() == 0; use .isEmpty() instead
Severity
🔴 HIGH — Major performance impact.
PySpark version
Applies to PySpark 3.3.0 and later, where DataFrame.isEmpty() was introduced.
Information
Using df.filter(...).count() == 0 triggers a full scan of the filtered DataFrame, which is inefficient for large datasets. This can cause:
- High computation and memory usage
- Slower execution
- Unnecessary resource consumption
Best practices
- Use .filter(...).isEmpty() to check whether any rows match the filter; Spark can stop as soon as it finds a single matching row
- Use .count() only when the exact number of filtered rows is required
Rule of thumb: Prefer .isEmpty() over .count() == 0 for emptiness checks after filtering.
Example
Bad:
Good: