Rule P001

.toPandas() without enabling Arrow

Severity

🔴 HIGH — Major performance impact.

PySpark version

Applies to PySpark 3.0 and later, where the option is named spark.sql.execution.arrow.pyspark.enabled (PySpark 2.x used the older name spark.sql.execution.arrow.enabled).

Information

Calling .toPandas() without Arrow enabled forces PySpark to collect rows to the driver and serialize/deserialize them one at a time between the JVM and Python, which is slow and memory-hungry. With Arrow enabled, data is transferred as columnar batches in a format both sides understand, making the conversion significantly faster and cheaper.

Best practices

  • Enable Arrow for faster conversion:
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    
  • Ensure the driver has enough memory: .toPandas() materializes the entire DataFrame in driver memory
  • Consider sampling or limiting the data first if the table is very large

Rule of thumb: Always enable Arrow when using .toPandas() for better performance and lower memory overhead.
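For very large tables, Arrow alone is not enough, because .toPandas() still collects everything to the driver. A minimal sketch combining the two best practices above (the helper name, fraction, and seed are illustrative, not part of any API):

```python
# Config key for Arrow-accelerated conversion (Spark 3.0+ name).
ARROW_ENABLED = "spark.sql.execution.arrow.pyspark.enabled"


def to_pandas_sample(df, fraction=0.01, seed=42):
    """Convert a sampled subset of a Spark DataFrame to Pandas.

    Sampling before .toPandas() keeps the driver's memory footprint
    bounded; Arrow (enabled on the session below) speeds up the
    transfer of whatever rows remain.
    """
    return df.sample(fraction=fraction, seed=seed).toPandas()


if __name__ == "__main__":
    # Requires a PySpark installation; imported lazily so the helper
    # above can be reused or tested without one.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("arrow-demo")
        .config(ARROW_ENABLED, "true")  # enable Arrow before converting
        .getOrCreate()
    )
    df = spark.range(1_000_000)
    pdf = to_pandas_sample(df, fraction=0.001)
    print(len(pdf))
```

Sampling changes the result, of course; use it for exploration and plotting, not for computations that need every row.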

Example

Bad:

pdf = df.toPandas()  # Arrow not enabled

Good:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.toPandas()
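A related safeguard: if a column's type cannot be converted via Arrow, Spark can either raise an error or silently fall back to the slow non-Arrow path, controlled by spark.sql.execution.arrow.pyspark.fallback.enabled. A small sketch that applies both settings to an existing session (the dict and helper are illustrative):

```python
# Arrow-related configs (Spark 3.x names); set before calling .toPandas().
ARROW_CONFS = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    # Fall back to the non-Arrow conversion path instead of failing
    # when a column's type is not Arrow-convertible.
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "true",
}


def enable_arrow(spark):
    """Apply the Arrow configs to a live SparkSession's runtime conf."""
    for key, value in ARROW_CONFS.items():
        spark.conf.set(key, value)
```

Keeping fallback enabled makes .toPandas() robust at the cost of occasionally taking the slow path; watch the logs for fallback warnings if performance matters.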