Rule L003
Avoid calling withColumn() inside a loop
Severity
🔴 HIGH — Major performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Each withColumn() call introduces a new projection. Calling it in a loop to add many columns (typically as part of a metadata-driven framework) therefore stacks one projection per added column, leading to:
- Massive execution plans
- Performance degradation
- Possible StackOverflowException
Example of what not to do:
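A minimal sketch of the anti-pattern, assuming a hypothetical metadata mapping `new_columns` from output column name to a PySpark `Column` expression. The helper name `add_columns_in_loop` is illustrative and not part of PySpark.

```python
from typing import TYPE_CHECKING, Mapping

if TYPE_CHECKING:  # type-only import, so the helper has no import-time Spark dependency
    from pyspark.sql import Column, DataFrame


def add_columns_in_loop(
    df: "DataFrame", new_columns: Mapping[str, "Column"]
) -> "DataFrame":
    """Anti-pattern: one withColumn() call, and one extra projection, per column."""
    for name, expression in new_columns.items():
        # Each call returns a new DataFrame whose plan wraps the previous plan.
        df = df.withColumn(name, expression)
    return df
```

With `from pyspark.sql import functions as F`, a call might look like `add_columns_in_loop(df, {"amount_doubled": F.col("amount") * 2})`, where `amount` is an assumed column name. A mapping with N entries results in N withColumn() calls; `df.explain(extended=True)` is one way to inspect the resulting plan.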
Best practices
- Prefer using select() with multiple columns, selectExpr(), withColumns(), or programmatically building a SQL statement.
Example of the recommended approach:
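One possible shape for this, sketched under the assumption that the metadata is a mapping `new_columns` from output column name to a PySpark `Column` expression. The helper name `add_columns_with_select` is hypothetical. All new columns are added in a single select(), so only one projection is introduced.

```python
from typing import TYPE_CHECKING, Mapping

if TYPE_CHECKING:  # type-only import, so the helper has no import-time Spark dependency
    from pyspark.sql import Column, DataFrame


def add_columns_with_select(
    df: "DataFrame", new_columns: Mapping[str, "Column"]
) -> "DataFrame":
    """Add every new column in a single select(), i.e. a single projection."""
    aliased = [expression.alias(name) for name, expression in new_columns.items()]
    # "*" keeps the existing columns; the aliased expressions are appended after them.
    return df.select("*", *aliased)
```

Two differences from a withColumn() loop are worth keeping in mind. A single select() cannot reference a column that is created in the same call, so dependent columns need a second select(). And unlike withColumn(), `select("*", ...)` does not replace an existing column of the same name; it produces a second column with that name instead.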
Rule of thumb: Avoid repeated withColumn() calls in loops; build transformations in a single select or SQL statement to improve performance and maintainability.
Example
Bad:
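The loop is not always written as a `for` statement. The sketch below shows a common disguised variant that folds withColumn() over the metadata with `functools.reduce`; it still calls withColumn() once per column. The mapping `new_columns` (column name to `Column` expression) and the helper name `add_columns_with_reduce` are assumptions made for illustration.

```python
from functools import reduce
from typing import TYPE_CHECKING, Mapping

if TYPE_CHECKING:  # type-only import, so the helper has no import-time Spark dependency
    from pyspark.sql import Column, DataFrame


def add_columns_with_reduce(
    df: "DataFrame", new_columns: Mapping[str, "Column"]
) -> "DataFrame":
    """Anti-pattern in disguise: reduce() still calls withColumn() once per column."""
    return reduce(
        # acc is the DataFrame built so far; item is a (name, expression) pair.
        lambda acc, item: acc.withColumn(item[0], item[1]),
        new_columns.items(),
        df,
    )
```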
Good:
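Two single-call alternatives, sketched with hypothetical helper names. The first assumes the metadata holds SQL expression strings (`sql_by_name`) and passes them to selectExpr() in one call. The second assumes `Column` expressions (`new_columns`) and uses withColumns(), which was added in PySpark 3.3 and takes a dict of column name to `Column`.

```python
from typing import TYPE_CHECKING, List, Mapping

if TYPE_CHECKING:  # type-only import, so the helpers have no import-time Spark dependency
    from pyspark.sql import Column, DataFrame


def build_select_expressions(sql_by_name: Mapping[str, str]) -> List[str]:
    """Turn {"name": "sql expression"} metadata into "expr AS `name`" strings."""
    return [f"{sql} AS `{name}`" for name, sql in sql_by_name.items()]


def add_columns_with_select_expr(
    df: "DataFrame", sql_by_name: Mapping[str, str]
) -> "DataFrame":
    """Add all SQL-defined columns in one selectExpr() call."""
    # "*" keeps the existing columns; the generated expressions follow them.
    return df.selectExpr("*", *build_select_expressions(sql_by_name))


def add_columns_with_with_columns(
    df: "DataFrame", new_columns: Mapping[str, "Column"]
) -> "DataFrame":
    """Add or replace all columns in one withColumns() call (PySpark 3.3+)."""
    # withColumns() expects a plain dict, so other Mapping types are converted.
    return df.withColumns(dict(new_columns))
```

withColumns() keeps the add-or-replace behaviour of withColumn(), which makes it the closer drop-in replacement when some target names already exist in the DataFrame.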