Rule D006

Avoid using df.count() == 0; use .isEmpty() instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 3.3 and later.

Information

Using df.count() == 0 to check whether a DataFrame is empty forces Spark to scan every partition and count all rows, which is expensive on large datasets. This can lead to:

  • High computation and memory usage
  • Slower performance
  • Unnecessary resource consumption

Best practices

  • Use .isEmpty() for an efficient empty check:

        if df.isEmpty():
            # handle empty DataFrame

  • Use .count() only when you need the exact number of rows

Rule of thumb: Use .isEmpty() for emptiness checks to avoid costly full scans.

Example

Bad:

if df.count() == 0:  # full scan: counts every row just to compare with zero
    ...

Good:

if df.isEmpty():  # reads at most one row, then stops
    ...