Rule F004
Avoid using spark.sql(); prefer native PySpark DataFrame operations
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 2.0 and later.
Information
Using spark.sql() can introduce maintainability and performance issues:
- Queries as strings are prone to typos and harder to refactor
- Debugging and tracing column lineage is more difficult
- Native DataFrame operations integrate better with Spark optimizations and type safety
- Using spark.sql() may bypass Catalyst optimizations for some operations
Best practices
- Use the native PySpark DataFrame API (select(), filter(), withColumn(), etc.) whenever possible
- Only use spark.sql() for legacy SQL queries that cannot be expressed with the DataFrame API
- Native operations provide better IDE support, static analysis, and maintainability
Rule of thumb: Favor native PySpark DataFrame operations over spark.sql() for safer, more maintainable, and optimized code.
Example
Bad:
Good: