Rule D007

Avoid using df.filter(...).count() == 0; use .isEmpty() instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 3.3 and later.

Information

Using df.filter(...).count() == 0 forces Spark to count every matching row, scanning the entire filtered DataFrame even though only existence matters. On large datasets this can cause:

  • High computation and memory usage
  • Slower execution
  • Unnecessary resource consumption

Best practices

  • Use .filter(...).isEmpty() to check efficiently whether any rows match:
    if df.filter(condition).isEmpty():
        # handle empty result
  • Use .count() only when the exact number of filtered rows is required

Rule of thumb: Prefer .isEmpty() over .count() == 0 for emptiness checks after filtering.

Example

Bad:

if df.filter(col("a") > 1).count() == 0:  # full scan just to test emptiness
    pass

Good:

if df.filter(col("a") > 1).isEmpty():  # stops after finding a single row
    pass