
S — Shuffle

Rules in this group flag patterns that cause expensive data movement (shuffles) across the cluster: wide transformations, unoptimized joins, and over-partitioning.

| Rule | Title |
| ---- | ----- |
| S001 | Missing .coalesce() after .union() / .unionByName() |
| S002 | Join without a broadcast or merge hint |
| S003 | .groupBy() followed by .distinct() or .dropDuplicates() |
| S004 | Too many .distinct() operations in one file |
| S005 | .repartition() with fewer partitions than the Spark default |
| S006 | .repartition() with more partitions than the Spark default |
| S007 | Avoid repartition(1) or coalesce(1) |
| S008 | Overusing explode() / explode_outer() |
| S009 | Prefer mapPartitions() over map() for row-level transforms |
| S010 | Avoid crossJoin() — produces a Cartesian product |
| S011 | Join without join conditions causes a nested-loop scan |
| S012 | Avoid inner join followed by filter — prefer leftSemi join |
| S013 | Avoid reduceByKey() — use DataFrame groupBy().agg() instead |
| S014 | .distinct() or .dropDuplicates() called before .groupBy() — redundant shuffle |