Rule S009

Prefer using mapPartitions() over map()

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 1.0 and later.

Information

map() applies a function to each row individually, which can lead to:

  • Higher serialization/deserialization overhead for each row
  • Poorer performance on large datasets
  • Increased pressure on the JVM-to-Python communication layer (when using PySpark)

mapPartitions() applies a function to an iterator over an entire partition, which:

  • Reduces serialization/deserialization overhead
  • Improves performance for large datasets
  • Allows more efficient resource utilization per executor
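The per-partition benefit is easiest to see in the shape of the function itself: mapPartitions() hands your function an iterator over a whole partition, so any expensive setup (opening a connection, loading a lookup table) runs once per partition instead of once per row. A minimal sketch, run here on a plain Python list standing in for one partition so no Spark cluster is required; the names and the "expensive setup" are illustrative, not part of the rule:

```python
def process_partition(rows):
    # Expensive setup runs ONCE per partition, not once per row.
    # (Illustrative stand-in: in real code this might open a DB
    # connection or load a model/lookup table.)
    multiplier = 2

    for row in rows:
        # Per-row work reuses the partition-level setup.
        yield (row["id"], row["value"] * multiplier)

# Simulate one partition's worth of rows as plain dicts.
partition = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
print(list(process_partition(partition)))  # → [(1, 20), (2, 40)]
```

In actual PySpark code this function would be passed as `df.rdd.mapPartitions(process_partition)`, and Spark would call it once per partition with that partition's rows.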

Best practices

  • Use mapPartitions() when performing row-level transformations that can be applied to an entire partition
  • Only use map() for simple transformations on small datasets or when partition-level access is unnecessary

Rule of thumb: Favor mapPartitions() over map() to improve performance and reduce overhead in PySpark transformations.

Example

Bad:

df.rdd.map(lambda row: (row["id"], row["value"] * 2))

Good:

df.rdd.mapPartitions(lambda rows: ((r["id"], r["value"] * 2) for r in rows))
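Both snippets produce the same (id, doubled value) pairs; only the call granularity differs. A quick local sanity check of that equivalence, using plain dicts as stand-ins for Row objects so it runs without a SparkSession:

```python
rows = [{"id": 1, "value": 3}, {"id": 2, "value": 5}]

# Row-at-a-time, as map() would invoke the lambda:
via_map = [(row["id"], row["value"] * 2) for row in rows]

# Partition-at-a-time, as mapPartitions() would invoke it:
partition_fn = lambda rs: ((r["id"], r["value"] * 2) for r in rs)
via_map_partitions = list(partition_fn(iter(rows)))

assert via_map == via_map_partitions  # both [(1, 6), (2, 10)]
```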