Skip to content

Rule S012

Avoid inner join followed by a direct filter; prefer leftSemi join

Severity

🟢 LOW — Minor performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Applying an inner join and then immediately filtering the result can lead to:

  • Unnecessary shuffles and data movement
  • Larger intermediate DataFrames than required
  • Reduced performance and increased memory usage

A leftSemi join achieves the same result more efficiently by:

  • Returning only rows from the left DataFrame that match the join condition
  • Avoiding the creation of extra columns from the right DataFrame
  • Reducing shuffles and memory overhead

Best practices

  • Use leftSemi when you only need to check existence of matching rows
  • Avoid combining inner join with immediate filtering for existence checks

Rule of thumb: Replace inner join + filter patterns with leftSemi for more efficient and clear DataFrame operations.

Example

Bad:

df.join(df2, "id", "inner").filter(col("age") > 18)

Good:

df.join(df2.filter(col("age") > 18), "id", "leftSemi")