Rule F004

Avoid using spark.sql(); prefer native PySpark DataFrame operations

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 2.0 and later.

Information

Using spark.sql() can introduce maintainability and performance issues:

  • Queries as strings are prone to typos and harder to refactor
  • Debugging and tracing column lineage is more difficult
  • Native DataFrame operations integrate better with Spark optimizations and type safety
  • Errors in a SQL string surface only at runtime, when Spark parses the query, rather than being caught by linters or static analysis

Best practices

  • Use native PySpark DataFrame API (select(), filter(), withColumn(), etc.) whenever possible
  • Only use spark.sql() for legacy SQL queries that cannot be expressed with the DataFrame API
  • Native operations provide better IDE support, static analysis, and maintainability

Rule of thumb: Favor native PySpark DataFrame operations over spark.sql() for safer, more maintainable, and optimized code.

Example

Bad:

spark.sql("SELECT * FROM my_table WHERE age > 18")

Good:

from pyspark.sql.functions import col

df.filter(col("age") > 18)