Rule D004
Avoid using .count() on large DataFrames
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using .count() triggers a full scan of the DataFrame, which can be very expensive on large datasets. This can lead to:
- High computation and memory usage
- Increased job execution time
- Potential performance bottlenecks
Best practices
- Use .isEmpty() or .take(1) to check for non-empty DataFrames
- Use approximate counts with .approx_count_distinct() or DataFrame.summary() if exact numbers are not required
- Cache or persist the DataFrame before counting if it will be used multiple times
Rule of thumb: Avoid .count() unless an exact number is truly necessary; prefer lightweight alternatives.
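When only a rough cardinality is needed, `approx_count_distinct` from `pyspark.sql.functions` (HyperLogLog-based) avoids a full distinct-and-count pass. A minimal sketch, using a hypothetical helper name not taken from this rule:

```python
def approx_distinct(df, column, rsd=0.05):
    """Estimate the number of distinct values in `column` within a
    relative standard deviation of `rsd`, far cheaper than
    df.select(column).distinct().count() on large data."""
    # Lazy import: PySpark is only required when the helper is called.
    from pyspark.sql import functions as F

    return df.agg(F.approx_count_distinct(column, rsd)).collect()[0][0]
```

The `rsd` parameter trades accuracy for speed; the default 0.05 matches the Spark SQL function's own default.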
Example
Bad:
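The original snippet appears to be missing here; a minimal sketch of the anti-pattern, with a hypothetical helper name and a `df` assumed to be any PySpark DataFrame:

```python
def has_rows_slow(df):
    """Anti-pattern: .count() forces Spark to scan and tally every row
    in every partition just to answer a yes/no question."""
    return df.count() > 0
```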
Good:
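A corresponding sketch of the cheap emptiness check, again with a hypothetical helper name:

```python
def has_rows_fast(df):
    """Fetches at most one row, so Spark can stop as soon as any
    partition yields data instead of scanning the whole DataFrame.
    On recent PySpark versions, df.isEmpty() is an equivalent,
    more readable alternative."""
    return len(df.take(1)) > 0
```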