# Rules

All rules are organized by category. Each rule page explains what is flagged, why it matters, and how to fix it.


## ARR — Array

| Rule | Title |
| --- | --- |
| ARR001 | Avoid `array_distinct(collect_list())` — use `collect_set()` instead |
| ARR002 | Avoid `array_except(col, None/lit(None))` — use `array_compact()` instead |

## D — Driver

| Rule | Title |
| --- | --- |
| D001 | Avoid using `collect()` |
| D002 | Avoid accessing `.rdd` |
| D003 | Avoid `.show()` in production |
| D004 | Avoid `.count()` on large DataFrames |
| D005 | Avoid `.rdd.isEmpty()` — use `.isEmpty()` directly |
| D006 | Avoid `df.count() == 0` — use `.isEmpty()` |
| D007 | Avoid `.filter(...).count() == 0` — use `.filter(...).isEmpty()` |
| D008 | Avoid `.display()` in production |

## F — Format

| Rule | Title |
| --- | --- |
| F001 | Avoid chaining `withColumn()` and `withColumnRenamed()` |
| F002 | Avoid `drop()` — use `select()` for explicit columns |
| F003 | Avoid `selectExpr()` — prefer `select()` with `col()` |
| F004 | Avoid `spark.sql()` — prefer the native DataFrame API |
| F005 | Avoid stacking multiple `withColumn()` calls — use `withColumns()` |
| F006 | Avoid stacking multiple `withColumnRenamed()` calls — use `withColumnsRenamed()` |
| F007 | Prefer `filter()` before `select()` for clarity |
| F008 | Avoid `print()` — prefer the `logging` module |
| F009 | Avoid nested `when()` — use stacked `.when().when().otherwise()` |
| F010 | Always include `otherwise()` at the end of a `when()` chain |
| F011 | Avoid backslash line continuation — use parentheses |
| F012 | Always wrap literal values with `lit()` |
| F013 | Avoid reserved column names with `__` prefix and `__` suffix |
| F014 | Avoid `explode_outer()` — handle nulls with higher-order functions |
| F015 | Avoid multiple consecutive `filter()` calls — combine conditions |
| F016 | Avoid long DataFrame renaming chains — overwrite the same variable |
| F017 | Avoid `expr()` — use native PySpark functions instead |
| F018 | Use Spark native datetime functions instead of Python `datetime` objects |

## L — Looping

| Rule | Title |
| --- | --- |
| L001 | Avoid looping without `.localCheckpoint()` or `.checkpoint()` |
| L002 | Avoid `while` loops with DataFrames |
| L003 | Avoid calling `withColumn()` inside a loop |

## P — Pandas

| Rule | Title |
| --- | --- |
| P001 | `.toPandas()` without enabling Arrow optimization |

## PERF — Performance

| Rule | Title |
| --- | --- |
| PERF001 | Avoid `.rdd.collect()` — use `.toPandas()` for driver-side consumption |
| PERF002 | Too many `getOrCreate()` calls — call `getOrCreate()` once and use `getActiveSession()` everywhere else |
| PERF003 | Too many shuffle operations without a checkpoint |

## S — Shuffle

| Rule | Title |
| --- | --- |
| S001 | Missing `.coalesce()` after `.union()` / `.unionByName()` |
| S002 | Join without a broadcast or merge hint |
| S003 | `.groupBy()` directly followed by `.distinct()` |
| S004 | Too many `.distinct()` operations in one file |
| S005 | `.repartition()` with fewer partitions than the Spark default |
| S006 | `.repartition()` with more partitions than the Spark default |
| S007 | Avoid `repartition(1)` or `coalesce(1)` |
| S008 | Overusing `explode()` / `explode_outer()` |
| S009 | Prefer `mapPartitions()` over `map()` for row-level transforms |
| S010 | Avoid `crossJoin()` — produces a Cartesian product |
| S011 | Join without join conditions causes a nested-loop scan |
| S012 | Avoid inner join followed by `filter()` — prefer a left-semi join |
| S013 | Avoid `reduceByKey()` — use DataFrame `groupBy().agg()` instead |

## U — UDF

| Rule | Title |
| --- | --- |
| U001 | Avoid UDFs that return `StringType` — use built-in string functions |
| U002 | Avoid UDFs that return `ArrayType` — use built-in array functions |
| U003 | Avoid UDFs — use Spark built-in functions instead |
| U004 | Avoid nested UDF calls — merge logic or use plain Python helpers |