Rule D005

Avoid using .rdd.isEmpty(); use .isEmpty() on DataFrames

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 3.3 and later.

Information

Calling .rdd.isEmpty() converts the DataFrame to an RDD, bypassing Spark’s optimized DataFrame API. The conversion deserializes internal rows into Python Row objects, and the resulting RDD job cannot be planned by Catalyst. This can lead to:

  • Loss of Catalyst and Tungsten optimizations
  • Slower execution due to JVM-to-Python serialization
  • Less readable and maintainable code

Best practices

  • Use .isEmpty() directly on the DataFrame (available since PySpark 3.3):
    if df.isEmpty():
        # handle the empty DataFrame
        ...
    
  • Only use RDD methods when a DataFrame operation is not available

Rule of thumb: Stick to DataFrame APIs to benefit from Spark optimizations; avoid falling back to .rdd unnecessarily.

Example

Bad:

if df.rdd.isEmpty():
    ...

Good:

if df.isEmpty():
    ...