Rule D005
Avoid using .rdd.isEmpty(); use .isEmpty() on DataFrames
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 3.3 and later.
Information
Calling .rdd.isEmpty() converts the DataFrame to an RDD, bypassing Spark’s optimized DataFrame API. This can lead to:
- Loss of Catalyst optimizations
- Slower execution
- Less readable and maintainable code
Best practices
- Use .isEmpty() directly on the DataFrame
- Only use RDD methods when a DataFrame operation is not available
Rule of thumb: Stick to DataFrame APIs to benefit from Spark optimizations; avoid falling back to .rdd unnecessarily.
Example
Bad:
Good: