Rule D101
Avoid using .rdd
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using .rdd in PySpark is generally a bad practice because it drops you from the optimized DataFrame API to low-level RDDs. This bypasses Spark’s Catalyst optimizer and Tungsten execution engine, leading to:
- Loss of query optimization
- Slower performance
- More complex and less maintainable code
Best practices
- Use DataFrame/SQL APIs (.select(), .filter(), .groupBy()) instead
- Leverage built-in functions from pyspark.sql.functions
- Use SQL queries when transformations become complex
- Only use .rdd for cases not supported by DataFrames
Rule of thumb: Stay in the DataFrame API unless you have a clear need for RDD-level control.
Example
Bad:
Good: