Rule S011

Avoid nested loop joins without proper join conditions

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

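Spark falls back to a nested loop join when a join has no condition at all, or when the condition contains no comparison operator (such as ==, !=, <>, >=, >, <=, <). Only equality conditions let Spark plan a hash or sort-merge join; a condition built solely from inequality operators is still evaluated with a nested loop. Nested loop joins can lead to: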

  • Every row of one table being compared with every row of the other, so execution time grows with the product of the row counts
  • High memory usage and potential out-of-memory errors
  • Poor scalability on large datasets
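
To make the cost concrete, here is a minimal pure-Python sketch (not Spark code) that counts how many row comparisons a nested loop join performs compared with a hash join on an equality key. The function names and sample rows are hypothetical and exist only for illustration; Spark's actual operators are far more elaborate.

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Compare every left row with every right row."""
    comparisons = 0
    matches = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if predicate(l_row, r_row):
                matches.append((l_row, r_row))
    return matches, comparisons


def hash_join(left, right, key):
    """Index the right side by key, then probe the index once per left row."""
    index = defaultdict(list)
    for r_row in right:
        index[r_row[key]].append(r_row)
    probes = 0
    matches = []
    for l_row in left:
        probes += 1
        for r_row in index.get(l_row[key], []):
            matches.append((l_row, r_row))
    return matches, probes


# Hypothetical sample data: two tables of 1,000 rows sharing an "id" column.
left = [{"id": i, "x": i * 10} for i in range(1000)]
right = [{"id": i, "y": i * 2} for i in range(1000)]

nl_matches, nl_cost = nested_loop_join(left, right, lambda l, r: l["id"] == r["id"])
h_matches, h_cost = hash_join(left, right, "id")

print(nl_cost)  # 1000000 comparisons
print(h_cost)   # 1000 probes
```

Both functions return the same matches, but the nested loop does 1,000 × 1,000 comparisons while the hash join does one probe per left row. This is why the gap widens so quickly as tables grow.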

Best practices

  • Only use nested loop joins when absolutely necessary and one dataset is small enough to be broadcast
  • Wrap the smaller dataset in broadcast() so Spark can plan a BroadcastNestedLoopJoin, which is generally cheaper than a Cartesian product of the two datasets
  • Prefer using standard equality joins whenever possible for better performance and scalability

Rule of thumb: Avoid joins without proper comparison conditions on large datasets; use broadcast nested loop joins only when one side is small.

Example

Bad:

df.join(df2)  # no join condition: every row of df is paired with every row of df2

Good:

df.join(df2, on="id")  # equality join on the shared "id" column