Rule P001
.toPandas() without enabling Arrow
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 3.0 and later.
Information
Calling .toPandas() in PySpark without Arrow optimization enabled forces rows to be serialized and deserialized one at a time between the JVM and Python, resulting in slower conversion and higher memory usage.
Best practices
- Enable Arrow for faster conversion by setting spark.sql.execution.arrow.pyspark.enabled to true
- Ensure your system has enough memory to hold the resulting Pandas DataFrame, since .toPandas() collects all data to the driver
- Consider sampling or filtering the data if it is very large
Rule of thumb: Always enable Arrow when using .toPandas() for better performance and lower memory overhead.
Example
Bad:
Good: