Rule D009

Avoid using .count() as a boolean truth value; use .isEmpty() instead

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 3.3 and later, where DataFrame.isEmpty() is available.

Information

Using .count() as a boolean condition (e.g. if df.count():) triggers a full distributed scan to count every row before the branch is taken. This is wasteful when the only question is "does any row exist?". It can cause:

  • A complete scan of the DataFrame (no early exit)
  • High memory and CPU usage on large datasets
  • Slower pipeline execution compared to short-circuit alternatives

.isEmpty() stops as soon as the first row is found: in the common case it materializes only a single row, turning an O(n) full count into an effectively O(1) presence check.
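The cost difference can be illustrated outside Spark with a minimal plain-Python sketch. The FakeDataFrame class below is purely illustrative (it is not part of PySpark); it mimics a lazily evaluated row source and tracks how many rows each call actually materializes.

```python
class FakeDataFrame:
    """Illustrative stand-in for a lazily evaluated DataFrame (not PySpark API)."""

    def __init__(self, rows):
        self._rows = rows       # any iterable of rows
        self.rows_scanned = 0   # how many rows have been materialized so far

    def count(self):
        # An exact count has no choice but to consume every row.
        n = 0
        for _ in self._rows:
            self.rows_scanned += 1
            n += 1
        return n

    def isEmpty(self):
        # Short-circuits after the first row, mirroring DataFrame.isEmpty().
        for _ in self._rows:
            self.rows_scanned += 1
            return False
        return True


big = FakeDataFrame(iter(range(1_000_000)))
assert not big.isEmpty()
print(big.rows_scanned)  # 1: only the first row was materialized
```

Calling count() on a fresh million-row FakeDataFrame would scan all one million rows before the branch is taken; isEmpty() touched exactly one.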

Best practices

Replace boolean .count() uses with .isEmpty() or not df.isEmpty():

Anti-pattern                Replacement
if df.count():              if not df.isEmpty():
if not df.count():          if df.isEmpty():
if x and df.count():        if x and not df.isEmpty():
if x and not df.count():    if x and df.isEmpty():

Rule of thumb: Reserve .count() for when you need the exact row count. For presence/absence checks, always prefer .isEmpty().
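Each rewrite in the table rests on the equivalence "the row count is truthy exactly when the DataFrame is not empty". A quick sanity check of that boolean logic with plain integers (no Spark involved):

```python
# For any non-negative row count n, bool(n) and (n == 0) are exact opposites,
# so the original condition and its replacement always take the same branch.
for n in [0, 1, 7, 1_000]:
    is_empty = (n == 0)
    assert bool(n) == (not is_empty)    # if df.count():      <-> if not df.isEmpty():
    assert (not bool(n)) == is_empty    # if not df.count():  <-> if df.isEmpty():
```

The `x and ...` rows follow immediately, since wrapping both sides in the same `x and` prefix preserves the equivalence.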

Example

Bad:

if df.filter(col("status") == "active").count():
    process(df)

if not df.count():
    raise ValueError("empty DataFrame")

Good:

if not df.filter(col("status") == "active").isEmpty():
    process(df)

if df.isEmpty():
    raise ValueError("empty DataFrame")