Rule PERF001

Avoid .rdd.collect() — use .toPandas() instead

Severity

🔴 HIGH — Major performance impact.

Compatible with PySpark 1.3 and later.

Calling .rdd.collect() to bring data to the driver and then manually converting to Python objects bypasses Spark's optimized data transfer path.

.rdd.collect() forces full deserialization of every row through the RDD layer, which is significantly slower than the columnar transfer used by .toPandas()
Memory usage is higher because the data is materialized as Java/Python objects rather than Arrow buffers
With Arrow optimization enabled (spark.sql.execution.arrow.pyspark.enabled = true), .toPandas() uses zero-copy columnar transfer and is orders of magnitude faster
.rdd.collect() returns a list of Row objects that still need to be converted manually, adding extra processing overhead

Use .toPandas() for small datasets that need to be brought to the driver
Enable Arrow optimization for maximum performance: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
For large datasets, avoid collecting at all — process in Spark and write results to storage

Bad:

rows = df.rdd.collect()
data = [(r["id"], r["value"]) for r in rows]

Good:

pdf = df.toPandas()