Rule S010
Avoid using crossJoin()
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 2.1 and later.
Information
crossJoin() performs a Cartesian product between two DataFrames, which can lead to:
- An output of n × m rows for inputs of n and m rows, causing memory and performance issues
- Potential out-of-memory errors on large datasets
- Hard-to-maintain transformations and unexpected results
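To make the row growth concrete, here is a small plain-Python sketch (no Spark needed) that mimics a Cartesian product with `itertools.product`. The helper name `cross_join_row_count` and the sample rows are hypothetical and used only for illustration.

```python
from itertools import product


def cross_join_row_count(left_rows: int, right_rows: int) -> int:
    """Number of rows a Cartesian product yields for the given input sizes."""
    return left_rows * right_rows


# Hypothetical tiny inputs: 3 rows on the left, 2 rows on the right.
left = [("a",), ("b",), ("c",)]
right = [(1,), (2,)]

# Every left row is paired with every right row.
pairs = [l + r for l, r in product(left, right)]

# The same arithmetic at a more realistic scale: one million rows on each
# side already yields 10**12 output rows.
large_count = cross_join_row_count(1_000_000, 1_000_000)
```

Even modest inputs multiply quickly, which is why the rule treats crossJoin() as a high-severity finding.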
Best practices
- Only use crossJoin() when absolutely necessary and the dataset size is small
- Prefer join() with explicit join keys to avoid Cartesian products
- Validate the join logic carefully to ensure correct results
Rule of thumb: Avoid crossJoin(); use it only for small datasets and when a true Cartesian product is required.
Example
Bad:
Good: