Rule D100
Avoid using collect()
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Calling .collect() in PySpark is generally a bad practice because it pulls the entire dataset from the distributed executors into the driver process. This defeats Spark's distributed execution model and can easily cause:
- Out-of-memory (OOM) errors if the dataset is large
- Performance bottlenecks due to data transfer over the network
- Driver overload, making your application unstable
Best practices
Use distributed operations such as .select(), .filter(), and .groupBy() instead of bringing data to the driver
Use .show() for quick inspection instead of .collect()
Use .limit() before .collect() if you really need a small sample
Write results to storage (e.g., .write.parquet()) instead of collecting
Use .take(n) or .head(n) to fetch small subsets safely
Rule of thumb: Only use .collect() when you're 100% sure the data is small enough to fit in driver memory.
Example
Bad:
Good: