Rule F004

Avoid using spark.sql(); prefer native PySpark DataFrame operations

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 2.0 and later.

Information

Using spark.sql() can introduce maintainability and performance issues:

  • Queries as strings are prone to typos and harder to refactor
  • Debugging and tracing column lineage is more difficult
  • Native DataFrame operations integrate better with Spark optimizations and type safety
  • Errors in a SQL string surface only at runtime, when Spark parses the query, rather than being caught by linters or static analysis

Best practices

  • Use native PySpark DataFrame API (select(), filter(), withColumn(), etc.) whenever possible
  • Only use spark.sql() for legacy SQL queries that cannot be expressed with the DataFrame API
  • Native operations provide better IDE support, static analysis, and maintainability

Rule of thumb: Favor native PySpark DataFrame operations over spark.sql() for safer, more maintainable, and optimized code.

Example

Bad:

spark.sql("SELECT * FROM my_table WHERE age > 18")

Good:

from pyspark.sql.functions import col

df.filter(col("age") > 18)