Rule S011

Avoid nested loop joins without proper join conditions

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

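A nested loop join occurs when the join has no condition, or the condition does not contain a mathematical or equality comparison (such as ==, !=, <>, >=, >, <=, <). Without equality keys to hash or sort on, Spark has to compare rows of one side against every row of the other, which can lead to: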

Full Cartesian-like scans of the tables, so execution time grows with the product of the table sizes
High memory usage and potential out-of-memory errors
Poor scalability on large datasets
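To make the scalability cost concrete, here is a minimal pure-Python sketch that counts how much work each strategy does. It is not Spark's implementation; the functions `nested_loop_join` and `hash_join` and the sample rows are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Check every left row against every right row."""
    checks = 0
    out = []
    for l in left:
        for r in right:
            checks += 1
            if predicate(l, r):
                out.append((l, r))
    return out, checks


def hash_join(left, right, key):
    """Index the right side by key once, then probe it once per left row."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    probes = 0
    out = []
    for l in left:
        probes += 1
        for r in index.get(l[key], []):
            out.append((l, r))
    return out, probes


left = [{"id": i} for i in range(1000)]
right = [{"id": i} for i in range(1000)]

pairs_nl, checks = nested_loop_join(left, right, lambda l, r: l["id"] == r["id"])
pairs_hash, probes = hash_join(left, right, "id")

print(checks, probes)  # 1000000 1000
```

Both functions return the same 1,000 matched pairs, but the nested loop performs a million predicate checks while the hash-based version needs only a thousand probes. The hash-based version is only possible because the condition is an equality on a key.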

Best practices

Only use nested loop joins when absolutely necessary and the dataset is small enough to be broadcast
Use broadcast() on the smaller dataset to enable a BroadcastNestedLoopJoin, which is usually cheaper than a full Cartesian product
Prefer standard equality joins whenever possible for better performance and scalability

Rule of thumb: Avoid joins without proper comparison conditions on large datasets; use broadcast nested loop joins only for small datasets.

Example

Bad:

df.join(df2)  # no condition — Cartesian product

Good:

df.join(df2, "id")