Rule S002
Join without hint
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
The rule applies to PySpark 1.3 and later. DataFrame.hint() itself is available from PySpark 2.2, and the "merge" hint from Spark 3.0.
Information
Joining DataFrames in PySpark without hints can lead to suboptimal execution plans. Spark picks the join strategy from its own size estimates, and when those estimates are missing or stale it may choose a less efficient strategy, resulting in:
- Unnecessary shuffles
- Increased execution time
- Higher resource consumption
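One way to check which strategy Spark actually picked is to read the physical plan. The sketch below is a rough, text-based heuristic, not an official API: the helper names are hypothetical, and it assumes df.explain() prints through Python's sys.stdout so that the output can be captured.

```python
import io
import re
from contextlib import redirect_stdout

# Join operator names as they appear in Spark physical plans.
JOIN_OPERATORS = (
    "BroadcastHashJoin",
    "SortMergeJoin",
    "ShuffledHashJoin",
    "BroadcastNestedLoopJoin",
    "CartesianProduct",
)


def physical_plan_text(df):
    """Capture what df.explain() prints (assumes it prints via sys.stdout)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def join_operators_in(plan_text):
    """Return the join operators named in the plan text, in JOIN_OPERATORS order."""
    return [op for op in JOIN_OPERATORS if op in plan_text]


def count_shuffles(plan_text):
    """Rough heuristic: count standalone 'Exchange' nodes, skipping BroadcastExchange."""
    return len(re.findall(r"\bExchange\b", plan_text))
```

For a joined DataFrame, join_operators_in(physical_plan_text(joined)) lists the join operators in its plan, and count_shuffles gives a rough idea of how many shuffles the plan contains.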
Best practices
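- Use .hint("broadcast") when one DataFrame is small enough to fit in the memory of the driver and of each executor
- Use .hint("merge") to request a sort-merge join between two large DataFrames; Spark adds the required sort itself, so the inputs do not need to be sorted beforehand
- Analyze data size and distribution before choosing a join strategy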
df = df.join(df2.hint('broadcast'), ...)
Rule of thumb: Provide explicit join hints when you know the data characteristics to help Spark choose the most efficient plan.
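The rule of thumb can be sketched as a small helper that turns a known size estimate into a hint. Everything here is an assumption for illustration: the function names are hypothetical, the size estimate must come from the caller, and the cutoff simply reuses Spark's default spark.sql.autoBroadcastJoinThreshold of 10 MB.

```python
# Spark's default spark.sql.autoBroadcastJoinThreshold is 10 MB; reusing it
# as the cutoff here is an assumption, not a recommendation.
DEFAULT_BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024


def pick_join_hint(right_size_bytes, broadcast_limit_bytes=DEFAULT_BROADCAST_LIMIT_BYTES):
    """Hypothetical helper: map an estimated size in bytes to a join hint name."""
    if right_size_bytes <= broadcast_limit_bytes:
        return "broadcast"
    return "merge"  # the "merge" hint needs Spark 3.0 or later


def join_with_hint(left, right, on, right_size_bytes):
    """Join left with right, hinting right according to its estimated size."""
    hint = pick_join_hint(right_size_bytes)
    return left.join(right.hint(hint), on=on)
```

Keeping the decision in one function makes the chosen cutoff explicit and easy to adjust for a given cluster's memory.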
Example
Bad:
Good:
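A minimal sketch of the contrast, assuming two hypothetical DataFrames: a large orders table and a small countries lookup table, joined on a hypothetical country_code column. The table and column names are illustrative only.

```python
def join_bad(orders, countries):
    # Bad: no hint, so Spark picks the join strategy from its own size estimates.
    return orders.join(countries, on="country_code", how="inner")


def join_good(orders, countries):
    # Good: countries is known to be small, so ask Spark to broadcast it.
    return orders.join(countries.hint("broadcast"), on="country_code", how="inner")


def run_demo():
    """Build tiny sample data and print both plans (requires PySpark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("s002-example").getOrCreate()
    orders = spark.createDataFrame(
        [(1, "DE"), (2, "FR"), (3, "DE")], ["order_id", "country_code"]
    )
    countries = spark.createDataFrame(
        [("DE", "Germany"), ("FR", "France")], ["country_code", "name"]
    )
    join_bad(orders, countries).explain()
    join_good(orders, countries).explain()
    spark.stop()
```

Calling run_demo() in an environment with PySpark installed prints both physical plans. On data this small Spark may broadcast in both cases on its own, so the difference is more likely to show up on tables whose size Spark cannot estimate well.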