skanderboudawara/pyspark-antipattern

Rule S010

Avoid using crossJoin()

Severity

🔴 HIGH — Major performance impact.

PySpark version

Compatible with PySpark 2.1 and later.

Information

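crossJoin() returns the Cartesian product of two DataFrames: every row of the left DataFrame is paired with every row of the right one, so the output has n × m rows for inputs of n and m rows. This can lead to: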

Massive increase in the number of rows, causing memory and performance issues
Potential out-of-memory errors on large datasets
Hard-to-maintain transformations and unexpected results
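
To make the row growth concrete, here is a small back-of-the-envelope sketch in plain Python (no Spark required). The helper names and the 100 bytes/row figure are illustrative assumptions, not part of the rule or of PySpark.

```python
def cross_join_rows(left_rows: int, right_rows: int) -> int:
    """Row count of a Cartesian product: each left row pairs with each right row."""
    return left_rows * right_rows


def estimated_size_gib(rows: int, bytes_per_row: int) -> float:
    """Rough uncompressed size in GiB; bytes_per_row is an assumed average."""
    return rows * bytes_per_row / 1024**3


# A 1M-row DataFrame cross joined with a 10k-row reference table.
rows = cross_join_rows(1_000_000, 10_000)
print(f"{rows:,} rows")  # 10,000,000,000 rows
print(f"{estimated_size_gib(rows, 100):.0f} GiB at an assumed 100 bytes/row")  # 931 GiB
```

Even a modest reference table turns a mid-sized DataFrame into billions of rows, which is why the rule is rated HIGH.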

Best practices

Only use crossJoin() when absolutely necessary and the dataset size is small
Prefer join() with explicit join keys to avoid Cartesian products
Validate the join logic carefully to ensure correct results

Rule of thumb: Avoid crossJoin(); use it only for small datasets and when a true Cartesian product is required.
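
One way to apply the rule of thumb, when a Cartesian product really is needed, is to check the expected output size before running it. The following is a minimal sketch, not something provided by PySpark or by this linter: guarded_cross_join and MAX_CROSS_JOIN_ROWS are hypothetical names, and the row budget is an arbitrary assumption to tune per cluster. It relies only on DataFrame.count() and DataFrame.crossJoin().

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # needed for type hints only, not at runtime
    from pyspark.sql import DataFrame

# Hypothetical budget for the output size; adjust for your cluster.
MAX_CROSS_JOIN_ROWS = 1_000_000


def guarded_cross_join(
    left: "DataFrame",
    right: "DataFrame",
    max_rows: int = MAX_CROSS_JOIN_ROWS,
) -> "DataFrame":
    """Cross join only if the Cartesian product stays within max_rows."""
    # count() runs a Spark job on each input, so this check is not free.
    expected = left.count() * right.count()
    if expected > max_rows:
        raise ValueError(
            f"crossJoin would produce {expected:,} rows (limit {max_rows:,})"
        )
    return left.crossJoin(right)
```

Calling guarded_cross_join(df, ref) then fails fast with a clear message instead of launching a job that may exhaust memory. The trade-off is the two extra count() jobs, so caching the inputs or using known sizes may be preferable.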

Example

Bad:

df.crossJoin(ref)  # pairs every row of df with every row of ref

Good:

df.join(ref, "id")  # matches rows on the shared "id" column only