Rule S012
Avoid inner join followed by a direct filter; prefer leftSemi join
Severity
🟢 LOW — Minor performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Applying an inner join and then immediately filtering the result can lead to:
- Unnecessary shuffles and data movement
- Larger intermediate DataFrames than required
- Reduced performance and increased memory usage
A leftSemi join achieves the same result more efficiently by:
- Returning only rows from the left DataFrame that match the join condition
- Avoiding the creation of extra columns from the right DataFrame
- Reducing shuffles and memory overhead
Best practices
- Use
leftSemiwhen you only need to check existence of matching rows - Avoid combining inner join with immediate filtering for existence checks
Rule of thumb: Replace inner join + filter patterns with leftSemi for more efficient and clear DataFrame operations.
Example
Bad:
Good: