Rule S011
Avoid nested loop joins without proper join conditions
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
A nested loop join occurs when the join condition contains no comparison operator (such as ==, !=, <>, >=, >, <=, <). Spark then has to check every pair of rows from the two sides, which can lead to:
- Full Cartesian-like scans of the tables, increasing execution time
- High memory usage and potential out-of-memory errors
- Poor scalability on large datasets
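To make the cost concrete, here is a minimal sketch in plain Python (not Spark's actual implementation) that counts the work done by a naive nested loop join and by a hash join on an equality key. The function names and the sample `orders` and `customers` data are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Compare every left row with every right row: len(left) * len(right) checks."""
    comparisons = 0
    matches = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if predicate(l_row, r_row):
                matches.append((l_row, r_row))
    return matches, comparisons


def hash_join(left, right, key):
    """Equality join: build a hash table on one side, then probe it with the other."""
    buckets = defaultdict(list)
    for r_row in right:
        buckets[r_row[key]].append(r_row)
    probes = 0
    matches = []
    for l_row in left:
        probes += 1  # one hash lookup per left row
        for r_row in buckets.get(l_row[key], []):
            matches.append((l_row, r_row))
    return matches, probes


# Hypothetical sample data: 1,000 orders spread over 100 customers.
orders = [{"id": i, "customer_id": i % 100} for i in range(1_000)]
customers = [{"customer_id": i} for i in range(100)]

nl_matches, nl_cost = nested_loop_join(
    orders, customers, lambda o, c: o["customer_id"] == c["customer_id"]
)
hj_matches, hj_cost = hash_join(orders, customers, "customer_id")
```

With this data the nested loop performs 100,000 comparisons, while the hash join performs 1,000 probes and returns the same matches. The gap widens as either side grows.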
Best practices
- Only use nested loop joins when absolutely necessary and the dataset is small enough to be broadcasted
- Use broadcast() on the smaller dataset to enable a BroadcastNestedLoopJoin, which is more efficient
- Prefer standard equality joins whenever possible for better performance and scalability
Rule of thumb: Avoid joins without proper comparison conditions on large datasets; use broadcasted nested loop joins only for small datasets.
Example
Bad:
Good: