Rule S009
Prefer using mapPartitions() over map()
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 1.0 and later.
Information
Using map() applies a function to each row individually, which can lead to:
- Higher serialization/deserialization overhead for each row
- Poorer performance on large datasets
- Increased pressure on the communication layer between the JVM and the Python workers
mapPartitions() applies a function to an entire partition at once, which:
- Reduces serialization/deserialization overhead
- Improves performance for large datasets
- Allows more efficient resource utilization per executor
Best practices
- Use mapPartitions() when performing row-level transformations that can be applied to an entire partition
- Only use map() for simple transformations on small datasets or when partition-level access is unnecessary
Rule of thumb: Favor mapPartitions() over map() to improve performance and reduce overhead in PySpark transformations.
Example
Bad:
Good: