Rule S002

Join without hint

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

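Compatible with PySpark 1.3 and later. Note that DataFrame.hint itself is only available from PySpark 2.2, and the "merge" join hint from PySpark 3.0.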

Information

Joining DataFrames in PySpark without a hint leaves the choice of join strategy entirely to Spark's optimizer, which relies on size estimates that can be missing or inaccurate. Spark may then choose a less efficient join strategy, resulting in:

  • Unnecessary shuffles
  • Increased execution time
  • Higher resource consumption
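
One way to see which strategy Spark actually picked is to read the physical plan that df.explain() prints. The sketch below is a small, hypothetical helper (the names JOIN_OPERATORS and join_strategies are not part of PySpark) that scans plan text for the usual physical join operator names. It assumes you have already captured the plan as a string, and it does plain substring matching, so treat it as a rough check rather than a parser.

```python
# Physical join operator names as they appear in Spark plan text.
# The helper below is illustrative only; it is not a PySpark API.
JOIN_OPERATORS = (
    "BroadcastHashJoin",
    "BroadcastNestedLoopJoin",
    "SortMergeJoin",
    "ShuffledHashJoin",
    "CartesianProduct",
)


def join_strategies(plan_text):
    """Return the join operator names that occur in a physical plan string."""
    return [op for op in JOIN_OPERATORS if op in plan_text]
```

If a join against a small lookup table reports SortMergeJoin here, that is a sign Spark shuffled both sides and a broadcast hint may be worth trying.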

Best practices

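  • Use .hint("broadcast") when one DataFrame is small enough to fit in memory on the driver and on every executor
  • Use .hint("merge") when both DataFrames are large and a sort-merge join is the better fit; Spark shuffles and sorts both sides itself, so the inputs do not need to be pre-sorted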
  • Analyze data size and distribution before choosing a join strategy
  • Apply the hint to the DataFrame passed to join, for example: df = df.join(df2.hint("broadcast"), ...)

Rule of thumb: Provide explicit join hints when you know the data characteristics to help Spark choose the most efficient plan.
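
The practices above can be sketched as a small wrapper that picks a hint from a size estimate you supply. Everything here is an assumption for illustration: pick_join_hint, join_with_hint, and BROADCAST_LIMIT_BYTES are hypothetical names, the 10 MB limit simply mirrors the default value of spark.sql.autoBroadcastJoinThreshold, and the size estimate has to come from your own knowledge of the data. The wrapper only calls .hint() and .join() on the objects it receives, which is how PySpark DataFrames are used in the examples on this page.

```python
# Assumption: 10 MB mirrors the default spark.sql.autoBroadcastJoinThreshold.
# Tune this for your cluster; it is a placeholder, not a recommendation.
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024


def pick_join_hint(estimated_bytes, broadcast_limit=BROADCAST_LIMIT_BYTES):
    """Return a join hint name for one side of a join, or None if size is unknown."""
    if estimated_bytes is None:
        return None  # no size information: leave the decision to Spark
    if estimated_bytes <= broadcast_limit:
        return "broadcast"
    return "merge"


def join_with_hint(left, right, on, estimated_right_bytes=None):
    """Join left with right, hinting the right side when its size is known."""
    hint = pick_join_hint(estimated_right_bytes)
    hinted = right.hint(hint) if hint else right
    return left.join(hinted, on)
```

For example, join_with_hint(df, df2, "id", estimated_right_bytes=2_000_000) would be equivalent to df.join(df2.hint("broadcast"), "id"). Keep in mind that a hint is a suggestion, and Spark can still fall back to another strategy when the hinted one is not applicable.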

Example

Bad:

df.join(df2, "id")

Good:

df.join(df2.hint("broadcast"), "id")