Rule D004

Avoid using .count() on large DataFrames

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Using .count() triggers a full scan of the DataFrame, which can be very expensive on large datasets. This can lead to:

  • High computation and memory usage
  • Increased job execution time
  • Potential performance bottlenecks

Best practices

  • Use .take(1) (or .isEmpty() on PySpark 3.3+) to check whether a DataFrame has any rows
  • Use approximate aggregates such as pyspark.sql.functions.approx_count_distinct() when an estimate of distinct values is sufficient, or DataFrame.summary() if you already need other column statistics
  • Cache or persist the DataFrame before counting if it will be reused, so the scan is paid only once

Rule of thumb: Avoid .count() unless an exact number is truly necessary; prefer lightweight alternatives.

Example

Bad:

n = df.count()  # full scan of the entire DataFrame

Good:

if not df.isEmpty():  # or: if df.take(1): on versions before 3.3
    ...