S — Shuffle
Rules that flag patterns causing expensive data movement across the cluster — wide transformations, unoptimized joins, and over-partitioning.
| Rule | Title |
|---|---|
| S001 | Missing .coalesce() after .union() / .unionByName() |
| S002 | Join without a broadcast or merge hint |
| S003 | .groupBy() followed by .distinct() or .dropDuplicates() |
| S004 | Too many .distinct() operations in one file |
| S005 | .repartition() with fewer partitions than the Spark default |
| S006 | .repartition() with more partitions than the Spark default |
| S007 | Avoid repartition(1) or coalesce(1) |
| S008 | Overusing explode() / explode_outer() |
| S009 | Prefer mapPartitions() over map() for row-level transforms |
| S010 | Avoid crossJoin() — produces a Cartesian product |
| S011 | Join without join conditions causes a nested-loop scan |
| S012 | Avoid inner join followed by filter — prefer leftSemi join |
| S013 | Avoid reduceByKey() — use DataFrame groupBy().agg() instead |
| S014 | .distinct() or .dropDuplicates() called before .groupBy() — redundant shuffle |
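To illustrate how one of these rules might be detected, here is a minimal sketch of an S007-style check. It is a hypothetical regex-based implementation, not the linter's actual code: it flags `.repartition(1)` and `.coalesce(1)`, which collapse the DataFrame onto a single partition and serialize all downstream work onto one executor.

```python
import re

# Hypothetical S007 check: flag repartition(1) / coalesce(1).
# Matches the method call with exactly the literal argument 1.
S007_PATTERN = re.compile(r"\.(repartition|coalesce)\(\s*1\s*\)")

def check_s007(source: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_call) pairs for each S007 violation."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in S007_PATTERN.finditer(line):
            findings.append((lineno, match.group(0)))
    return findings

sample = '''df = spark.read.parquet("events")
out = df.filter(df.ok).coalesce(1)
big = df.repartition(200)
'''
print(check_s007(sample))  # flags line 2 only; repartition(200) is fine
```

A real rule would parse the AST rather than scan text (so that, e.g., `coalesce(1)` in a string literal or comment is not flagged), but the regex form shows the intent of the check compactly.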