Rule S009

Prefer using mapPartitions() over map()

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 1.0 and later.

Information

map() applies a function to each row individually, which can lead to:

  • Higher serialization/deserialization overhead for each row
  • Poorer performance on large datasets
  • Increased pressure on the JVM-to-Python communication layer (when using PySpark)

mapPartitions() applies a function to an iterator over an entire partition, which:

  • Reduces serialization/deserialization overhead
  • Improves performance for large datasets
  • Allows more efficient resource utilization per executor
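The per-partition benefit is easiest to see in the shape of the function itself: mapPartitions() hands your function an iterator over a whole partition, so any expensive setup (opening a connection, loading a lookup table) runs once per partition instead of once per row. A minimal sketch, run here on a plain Python list standing in for one partition so no Spark cluster is required; the names and the "expensive setup" are illustrative, not part of the rule:

```python
def process_partition(rows):
    # Expensive setup runs ONCE per partition, not once per row.
    # (Illustrative stand-in: in real code this might open a DB
    # connection or load a model/lookup table.)
    multiplier = 2

    for row in rows:
        # Per-row work reuses the partition-level setup.
        yield (row["id"], row["value"] * multiplier)

# Simulate one partition's worth of rows as plain dicts.
partition = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
print(list(process_partition(partition)))  # → [(1, 20), (2, 40)]
```

In actual PySpark code this function would be passed as `df.rdd.mapPartitions(process_partition)`, and Spark would call it once per partition with that partition's rows.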

Best practices

  • Use mapPartitions() when performing row-level transformations that can be applied to an entire partition
  • Only use map() for simple transformations on small datasets or when partition-level access is unnecessary

Rule of thumb: Favor mapPartitions() over map() to improve performance and reduce overhead in PySpark transformations.

Example

Bad:

df.rdd.map(lambda row: (row["id"], row["value"] * 2))

Good:

df.rdd.mapPartitions(lambda rows: ((r["id"], r["value"] * 2) for r in rows))
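Both snippets produce the same (id, doubled value) pairs; only the call granularity differs. A quick local sanity check of that equivalence, using plain dicts as stand-ins for Row objects so it runs without a SparkSession:

```python
rows = [{"id": 1, "value": 3}, {"id": 2, "value": 5}]

# Row-at-a-time, as map() would invoke the lambda:
via_map = [(row["id"], row["value"] * 2) for row in rows]

# Partition-at-a-time, as mapPartitions() would invoke it:
partition_fn = lambda rs: ((r["id"], r["value"] * 2) for r in rs)
via_map_partitions = list(partition_fn(iter(rows)))

assert via_map == via_map_partitions  # both [(1, 6), (2, 10)]
```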