Rule D004
Avoid using .count() on large DataFrames
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using .count() triggers a full scan of the DataFrame, which can be very expensive on large datasets. This can lead to:
- High computation and memory usage
- Increased job execution time
- Potential performance bottlenecks
Best practices
- Use .isEmpty() or .take(1) to check for non-empty DataFrames
- Use approximate counts with .approx_count_distinct() or DataFrame.summary() if exact numbers are not required
- Cache or persist the DataFrame before counting if it will be used multiple times
Rule of thumb: Avoid .count() unless an exact number is truly necessary; prefer lightweight alternatives.
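When only a rough cardinality is needed, `approx_count_distinct` from `pyspark.sql.functions` (HyperLogLog-based) avoids a full distinct-and-count pass. A minimal sketch, using a hypothetical helper name not taken from this rule:

```python
def approx_distinct(df, column, rsd=0.05):
    """Estimate the number of distinct values in `column` within a
    relative standard deviation of `rsd`, far cheaper than
    df.select(column).distinct().count() on large data."""
    # Lazy import: PySpark is only required when the helper is called.
    from pyspark.sql import functions as F

    return df.agg(F.approx_count_distinct(column, rsd)).collect()[0][0]
```

The `rsd` parameter trades accuracy for speed; the default 0.05 matches the Spark SQL function's own default.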
Example
Bad:
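The original snippet appears to be missing here; a minimal sketch of the anti-pattern, with a hypothetical helper name and a `df` assumed to be any PySpark DataFrame:

```python
def has_rows_slow(df):
    """Anti-pattern: .count() forces Spark to scan and tally every row
    in every partition just to answer a yes/no question."""
    return df.count() > 0
```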
Good:
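A corresponding sketch of the cheap emptiness check, again with a hypothetical helper name:

```python
def has_rows_fast(df):
    """Fetches at most one row, so Spark can stop as soon as any
    partition yields data instead of scanning the whole DataFrame.
    On recent PySpark versions, df.isEmpty() is an equivalent,
    more readable alternative."""
    return len(df.take(1)) > 0
```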