Rule F017

Avoid expr() — use native PySpark functions instead

Severity

🟢 LOW — No performance impact (expr() compiles to the same Catalyst plan); affects readability and tooling support.

PySpark version

Compatible with PySpark 1.5 and later.

Information

expr() embeds a raw SQL string inside a DataFrame API call. It bypasses the Python type system, IDE autocompletion, and static analysis — errors are only caught at runtime when Spark parses the SQL fragment.

  • expr("a + b") and col("a") + col("b") are equivalent, but only the latter is refactorable and statically analysable
  • SQL strings inside expr() cannot be linted, renamed, or traced by standard tooling
  • Mixing expr() with the DataFrame API is the same footgun as mixing in spark.sql() calls — it fragments your code between two paradigms
  • Nearly every operation expressible in expr() has a native PySpark equivalent: arithmetic, string functions, conditionals, window functions, etc.

Best practices

Replace SQL string expressions with their native PySpark equivalents:

# Bad
from pyspark.sql.functions import expr

df.withColumn("total", expr("price * quantity"))
df.withColumn("name", expr("upper(first_name) || ' ' || upper(last_name)"))
df.withColumn("flag", expr("CASE WHEN status = 'A' THEN 1 ELSE 0 END"))
df.select(expr("count(distinct id) as cnt"))

# Good
from pyspark.sql.functions import col, concat, countDistinct, lit, upper, when

df.withColumn("total", col("price") * col("quantity"))
df.withColumn("name", concat(upper(col("first_name")), lit(" "), upper(col("last_name"))))
df.withColumn("flag", when(col("status") == "A", 1).otherwise(0))
df.select(countDistinct("id").alias("cnt"))