Rules
All rules are organized by category. Each rule page explains what is flagged, why it matters, and how to fix it.
ARR — Array
| Rule | Title |
| --- | --- |
| ARR001 | Avoid array_distinct(collect_list()) — use collect_set() instead |
| ARR002 | Avoid array_except(col, None/lit(None)) — use array_compact() instead |
D — Driver
| Rule | Title |
| --- | --- |
| D001 | Avoid using collect() |
| D002 | Avoid accessing .rdd |
| D003 | Avoid .show() in production |
| D004 | Avoid .count() on large DataFrames |
| D005 | Avoid .rdd.isEmpty() — use .isEmpty() directly |
| D006 | Avoid df.count() == 0 — use .isEmpty() |
| D007 | Avoid .filter(...).count() == 0 — use .filter(...).isEmpty() |
| D008 | Avoid .display() in production |
F — Function

| Rule | Title |
| --- | --- |
| F001 | Avoid chaining withColumn() and withColumnRenamed() |
| F002 | Avoid drop() — use select() for explicit columns |
| F003 | Avoid selectExpr() — prefer select() with col() |
| F004 | Avoid spark.sql() — prefer native DataFrame API |
| F005 | Avoid stacking multiple withColumn() — use withColumns() |
| F006 | Avoid stacking multiple withColumnRenamed() — use withColumnsRenamed() |
| F007 | Prefer filter() before select() for clarity |
| F008 | Avoid print() — prefer the logging module |
| F009 | Avoid nested when() — use stacked .when().when().otherwise() |
| F010 | Always include otherwise() at the end of a when() chain |
| F011 | Avoid backslash line continuation — use parentheses |
| F012 | Always wrap literal values with lit() |
| F013 | Avoid reserved column names with __ prefix and __ suffix |
| F014 | Avoid explode_outer() — handle nulls with higher-order functions |
| F015 | Avoid multiple consecutive filter() calls — combine conditions |
| F016 | Avoid long DataFrame renaming chains — overwrite the same variable |
| F017 | Avoid expr() — use native PySpark functions instead |
| F018 | Use Spark native datetime functions instead of Python datetime objects |
L — Looping
| Rule | Title |
| --- | --- |
| L001 | Avoid looping without .localCheckpoint() or .checkpoint() |
| L002 | Avoid while loops with DataFrames |
| L003 | Avoid calling withColumn() inside a loop |
P — Pandas
| Rule | Title |
| --- | --- |
| P001 | .toPandas() without enabling Arrow optimization |
PERF — Performance

| Rule | Title |
| --- | --- |
| PERF001 | Avoid .rdd.collect() — use .toPandas() for driver-side consumption |
| PERF002 | Too many getOrCreate() calls — use getActiveSession() everywhere else |
| PERF003 | Too many shuffle operations without a checkpoint |
S — Shuffle
| Rule | Title |
| --- | --- |
| S001 | Missing .coalesce() after .union() / .unionByName() |
| S002 | Join without a broadcast or merge hint |
| S003 | .groupBy() directly followed by .distinct() |
| S004 | Too many .distinct() operations in one file |
| S005 | .repartition() with fewer partitions than the Spark default |
| S006 | .repartition() with more partitions than the Spark default |
| S007 | Avoid repartition(1) or coalesce(1) |
| S008 | Overusing explode() / explode_outer() |
| S009 | Prefer mapPartitions() over map() for row-level transforms |
| S010 | Avoid crossJoin() — produces a Cartesian product |
| S011 | Join without join conditions causes a nested-loop scan |
| S012 | Avoid inner join followed by filter — prefer leftSemi join |
| S013 | Avoid reduceByKey() — use DataFrame groupBy().agg() instead |
U — UDF
| Rule | Title |
| --- | --- |
| U001 | Avoid UDFs that return StringType — use built-in string functions |
| U002 | Avoid UDFs that return ArrayType — use built-in array functions |
| U003 | Avoid UDFs — use Spark built-in functions instead |
| U004 | Avoid nested UDF calls — merge logic or use plain Python helpers |