Rule F017

Avoid expr() — use native PySpark functions instead

Severity

🟢 LOW — No performance impact (expr() compiles to the same Catalyst plan); affects readability and tooling support.

PySpark version

Compatible with PySpark 1.5 and later.

Information

expr() embeds a raw SQL string inside a DataFrame API call. It bypasses the Python type system, IDE autocompletion, and static analysis — errors are only caught at runtime when Spark parses the SQL fragment.

  • expr("a + b") and col("a") + col("b") are equivalent, but only the latter is refactorable and statically analysable
  • SQL strings inside expr() cannot be linted, renamed, or traced by standard tooling
  • Mixing expr() with the DataFrame API is the same footgun as mixing in spark.sql() calls — it fragments your code between two paradigms
  • Nearly every operation expressible in expr() has a native PySpark equivalent: arithmetic, string functions, conditionals, window functions, etc.

Best practices

Replace SQL string expressions with their native PySpark equivalents:

# Bad
from pyspark.sql.functions import expr

df.withColumn("total", expr("price * quantity"))
df.withColumn("name", expr("upper(first_name) || ' ' || upper(last_name)"))
df.withColumn("flag", expr("CASE WHEN status = 'A' THEN 1 ELSE 0 END"))
df.select(expr("count(distinct id) as cnt"))

# Good
from pyspark.sql.functions import col, concat, countDistinct, lit, upper, when

df.withColumn("total", col("price") * col("quantity"))
df.withColumn("name", concat(upper(col("first_name")), lit(" "), upper(col("last_name"))))
df.withColumn("flag", when(col("status") == "A", 1).otherwise(0))
df.select(countDistinct("id").alias("cnt"))