Rule S010

Avoid using crossJoin()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 2.1 and later.

Information

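crossJoin() returns the Cartesian product of two DataFrames: every row of the left DataFrame is paired with every row of the right one, so the result has n × m rows for inputs of n and m rows. This can lead to:

  • A multiplicative increase in the number of rows, which drives up memory use and run time
  • Out-of-memory errors on large datasets
  • Transformations that are hard to maintain and results that are easy to misread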
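
To make the growth concrete, here is a small plain-Python sketch (no Spark needed). The helper name cross_join_rows and the sample values are made up for illustration; the point is only that the output size is the product of the input sizes.

```python
from itertools import product


def cross_join_rows(left_rows: int, right_rows: int) -> int:
    """Row count of a Cartesian product: each left row pairs with each right row."""
    return left_rows * right_rows


# Small in-memory illustration: 3 x 2 inputs produce 6 pairs.
left = ["a", "b", "c"]
right = [1, 2]
pairs = list(product(left, right))

# Two moderately sized inputs already multiply into a very large result.
estimate = cross_join_rows(1_000_000, 10_000)
```

A one-million-row DataFrame crossed with a ten-thousand-row reference table would therefore hold ten billion rows before any filter is applied.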

Best practices

  • Use crossJoin() only when a Cartesian product is genuinely required and the inputs are small
  • Prefer join() with explicit join keys to avoid Cartesian products
  • Validate the join logic, for example by comparing row counts before and after the join, to confirm the result is what you expect

Rule of thumb: Avoid crossJoin(); use it only for small datasets and when a true Cartesian product is required.

Example

Bad:

df.crossJoin(ref)

Good (when both DataFrames share a key column such as id):

df.join(ref, "id")