# Rules

All rules are organized by category. Each rule page explains what is flagged, why it matters, and how to fix it.


## ARR — Array

| Rule | Title |
| --- | --- |
| ARR001 | Avoid `array_distinct(collect_list())` — use `collect_set()` instead |
| ARR002 | Avoid `array_except(col, None/lit(None))` — use `array_compact()` instead |

## D — Driver

| Rule | Title |
| --- | --- |
| D001 | Avoid using `collect()` |
| D002 | Avoid accessing `.rdd` |
| D003 | Avoid `.show()` in production |
| D004 | Avoid `.count()` on large DataFrames |
| D005 | Avoid `.rdd.isEmpty()` — use `.isEmpty()` directly |
| D006 | Avoid `df.count() == 0` — use `.isEmpty()` |
| D007 | Avoid `.filter(...).count() == 0` — use `.filter(...).isEmpty()` |
| D008 | Avoid `.display()` in production |

## F — Format

| Rule | Title |
| --- | --- |
| F001 | Avoid chaining `withColumn()` and `withColumnRenamed()` |
| F002 | Avoid `drop()` — use `select()` for explicit columns |
| F003 | Avoid `selectExpr()` — prefer `select()` with `col()` |
| F004 | Avoid `spark.sql()` — prefer the native DataFrame API |
| F005 | Avoid stacking multiple `withColumn()` calls — use `withColumns()` |
| F006 | Avoid stacking multiple `withColumnRenamed()` calls — use `withColumnsRenamed()` |
| F007 | Prefer `filter()` before `select()` for clarity |
| F008 | Avoid `print()` — prefer the `logging` module |
| F009 | Avoid nested `when()` — use stacked `.when().when().otherwise()` |
| F010 | Always include `otherwise()` at the end of a `when()` chain |
| F011 | Avoid backslash line continuation — use parentheses |
| F012 | Always wrap literal values with `lit()` |
| F013 | Avoid reserved column names with `__` prefix and `__` suffix |
| F014 | Avoid `explode_outer()` — handle nulls with higher-order functions |
| F015 | Avoid multiple consecutive `filter()` calls — combine conditions |
| F016 | Avoid long DataFrame renaming chains — overwrite the same variable |
| F017 | Avoid `expr()` — use native PySpark functions instead |
| F018 | Use Spark native datetime functions instead of Python `datetime` objects |

## L — Looping

| Rule | Title |
| --- | --- |
| L001 | Avoid looping without `.localCheckpoint()` or `.checkpoint()` |
| L002 | Avoid `while` loops with DataFrames |
| L003 | Avoid calling `withColumn()` inside a loop |

## P — Pandas

| Rule | Title |
| --- | --- |
| P001 | `.toPandas()` without enabling Arrow optimization |

## PERF — Performance

| Rule | Title |
| --- | --- |
| PERF001 | Avoid `.rdd.collect()` — use `.toPandas()` for driver-side consumption |
| PERF002 | Too many `getOrCreate()` calls — call `getOrCreate()` once and use `getActiveSession()` everywhere else |
| PERF003 | Too many shuffle operations without a checkpoint |

## S — Shuffle

| Rule | Title |
| --- | --- |
| S001 | Missing `.coalesce()` after `.union()` / `.unionByName()` |
| S002 | Join without a broadcast or merge hint |
| S003 | `.groupBy()` directly followed by `.distinct()` |
| S004 | Too many `.distinct()` operations in one file |
| S005 | `.repartition()` with fewer partitions than the Spark default |
| S006 | `.repartition()` with more partitions than the Spark default |
| S007 | Avoid `repartition(1)` or `coalesce(1)` |
| S008 | Overusing `explode()` / `explode_outer()` |
| S009 | Prefer `mapPartitions()` over `map()` for row-level transforms |
| S010 | Avoid `crossJoin()` — produces a Cartesian product |
| S011 | Join without join conditions causes a nested-loop scan |
| S012 | Avoid inner join followed by `filter()` — prefer a left-semi join |
| S013 | Avoid `reduceByKey()` — use DataFrame `groupBy().agg()` instead |

## U — UDF

| Rule | Title |
| --- | --- |
| U001 | Avoid UDFs that return `StringType` — use built-in string functions |
| U002 | Avoid UDFs that return `ArrayType` — use built-in array functions |
| U003 | Avoid UDFs — use Spark built-in functions instead |
| U004 | Avoid nested UDF calls — merge logic or use plain Python helpers |