
S — Shuffle

Rules in this group flag patterns that cause expensive data movement (shuffles) across the cluster: wide transformations, unoptimized joins, and over-partitioning.

| Rule | Title |
| ---- | ----- |
| S001 | Missing .coalesce() after .union() / .unionByName() |
| S002 | Join without a broadcast or merge hint |
| S003 | .groupBy() followed by .distinct() or .dropDuplicates() |
| S004 | Too many .distinct() operations in one file |
| S005 | .repartition() with fewer partitions than the Spark default |
| S006 | .repartition() with more partitions than the Spark default |
| S007 | Avoid repartition(1) or coalesce(1) |
| S008 | Overusing explode() / explode_outer() |
| S009 | Prefer mapPartitions() over map() for row-level transforms |
| S010 | Avoid crossJoin() — produces a Cartesian product |
| S011 | Join without join conditions causes a nested-loop scan |
| S012 | Avoid inner join followed by filter — prefer leftSemi join |
| S013 | Avoid reduceByKey() — use DataFrame groupBy().agg() instead |
| S014 | .distinct() or .dropDuplicates() called before .groupBy() — redundant shuffle |