Rule D100

Avoid using collect()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Using .collect() in PySpark is generally a bad practice because it pulls all data from distributed workers into the driver node. This defeats Spark’s distributed nature and can easily cause:

  • Out-of-memory (OOM) errors if the dataset is large
  • Performance bottlenecks due to data transfer over the network
  • Driver overload, making your application unstable

Best practices

  • Use distributed operations like .select(), .filter(), and .groupBy() instead of bringing data to the driver
  • Use .show() for quick inspection instead of .collect()
  • Use .limit() before .collect() if you really need a small sample
  • Write results to storage (e.g., .write.parquet()) instead of collecting
  • Use .take(n) or .head(n) to retrieve small subsets safely

Rule of thumb: Only use .collect() when you're 100% sure the data is small enough to fit in driver memory.

Example

Bad:

rows = df.collect()

Good:

df.show(5)         # inspect a few rows without moving the full dataset
rows = df.take(5)  # bring back only a small, bounded subset