Rule D003
Avoid using .show() in production code
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 1.3 and later.
Information
Using .show() in production is generally a bad practice because it:
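- Brings the displayed rows (20 by default) to the driver and prints them to stdout; a large n or very wide rows can strain driver memory
- Triggers an action, which forces Spark to execute the DataFrame's lineage and can cause unexpected or repeated computation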
- Is meant for debugging/inspection, not for production pipelines
Best practices
- Use .limit(n) with .collect() only for small samples if needed
- Log DataFrame statistics (.count(), .describe()) instead of showing full data
- Avoid .show() in scheduled jobs or ETL pipelines
Rule of thumb: Reserve .show() for local debugging; never use it in production code.
Example
Bad:
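A minimal sketch of the pattern this rule flags. The function name run_daily_job and the dropna() step are illustrative assumptions, and df is assumed to be a PySpark DataFrame.

```python
def run_daily_job(df):
    """Hypothetical job step; df is assumed to be a pyspark.sql.DataFrame."""
    cleaned = df.dropna()

    # Flagged by D003: show() triggers a Spark job and prints rows to stdout,
    # which is meant for interactive inspection rather than scheduled runs.
    cleaned.show()

    return cleaned
```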
Good:
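One possible alternative, sketched under the same assumptions: the job logs a row count and, only when debug logging is enabled, a small bounded sample. The function name run_daily_job, the logger name daily_job, and the sample_size parameter are hypothetical and not part of the rule itself.

```python
import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only; the caller supplies a real DataFrame.
    from pyspark.sql import DataFrame

logger = logging.getLogger("daily_job")  # hypothetical logger name


def run_daily_job(df: "DataFrame", sample_size: int = 5) -> "DataFrame":
    """Hypothetical job step: clean the data and log a bounded summary."""
    cleaned = df.dropna()

    # count() is also an action, but it returns a single number to the driver.
    logger.info("rows after cleaning: %d", cleaned.count())

    # Collect a sample only when DEBUG logging is on, and cap it with limit().
    if logger.isEnabledFor(logging.DEBUG):
        sample = cleaned.limit(sample_size).collect()
        logger.debug("sample rows: %s", sample)

    return cleaned
```

Because count() still runs a Spark job, it may be worth caching the DataFrame first or dropping the count entirely if the job is cost-sensitive.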