Rule D101

Avoid using .rdd

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Using .rdd in PySpark is generally a bad practice because it drops you from the optimized DataFrame API to low-level RDDs. This bypasses Spark’s Catalyst optimizer and Tungsten execution engine, leading to:

  • Loss of query optimization
  • Slower performance
  • More complex and less maintainable code

Best practices

  • Use DataFrame/SQL APIs (.select(), .filter(), .groupBy()) instead
  • Leverage built-in functions from pyspark.sql.functions
  • Use SQL queries when transformations become complex
  • Only use .rdd for cases not supported by DataFrames

Rule of thumb: Stay in the DataFrame API unless you have a clear need for RDD-level control.

Example

Bad:

rdd = df.rdd

Good:

# Stay in the DataFrame API instead (the "value" column is illustrative)
result = df.filter(df["value"] > 0)