Rule D101
Avoid using .rdd
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using .rdd in PySpark is generally a bad practice because it drops you from the optimized DataFrame API to low-level RDDs. This bypasses Spark’s Catalyst optimizer and Tungsten execution engine, leading to:
- Loss of query optimization
- Slower performance
- More complex and less maintainable code
Best practices
- Use DataFrame/SQL APIs (.select(), .filter(), .groupBy()) instead
- Leverage built-in functions from pyspark.sql.functions
- Use SQL queries when transformations become complex
- Only use .rdd for cases not supported by DataFrames
Rule of thumb: Stay in the DataFrame API unless you have a clear need for RDD-level control.
Example
Bad:
Good: