Rule D003
Avoid using .display() in production code
Severity
🟡 MEDIUM — Moderate performance impact.
PySpark version
Compatible with PySpark 3.0 and later.
Information
Using .display() in production is generally a bad practice because it:
- Collects data to the driver, risking memory issues on large datasets
- Triggers an action, potentially causing unexpected computation
- Is meant for debugging/inspection, not for production pipelines
Best practices
- Use .limit(n) with .collect() only for small samples if needed
- Log DataFrame statistics (.count(), .describe()) instead of displaying full data
- Avoid .display() in scheduled jobs or ETL pipelines
Rule of thumb: Reserve .display() for local debugging; never use it in production code.
Example
Bad:
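A minimal sketch of the anti-pattern, assuming a hypothetical scheduled job. The function name, paths, and column name are illustrative only, and the code assumes a runtime where DataFrames expose .display() (for example, a Databricks-style notebook environment).

```python
# Hypothetical nightly ETL step; paths and column names are assumptions.
def run_nightly_orders_job(spark):
    orders = spark.read.parquet("/data/raw/orders")
    cleaned = orders.dropna(subset=["order_id"])

    # Anti-pattern: triggers a Spark action and renders rows on the driver,
    # even though nobody is watching the output of a scheduled run.
    cleaned.display()

    cleaned.write.mode("overwrite").parquet("/data/clean/orders")
```

The .display() call adds an extra computation before the write and produces output that a scheduled job has no use for.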
Good:
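One possible alternative, sketched under assumptions: the helper name log_dataframe_summary, the logger name, and the default sample size are hypothetical. It logs a row count and the column names, and collects a bounded sample only when debug logging is enabled. Note that .count() is itself an action and scans the data once, so call it only where that cost is acceptable.

```python
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only; the helper works on any PySpark DataFrame.
    from pyspark.sql import DataFrame

logger = logging.getLogger("orders_job")  # hypothetical logger name


def log_dataframe_summary(df: DataFrame, name: str, sample_rows: int = 5) -> int:
    """Log basic statistics for df instead of displaying it."""
    row_count = df.count()  # action: one pass over the data
    logger.info("%s: %d rows, columns=%s", name, row_count, df.columns)

    # Collect a small sample only when someone has asked for debug output.
    # limit() caps the number of rows that collect() brings to the driver.
    if logger.isEnabledFor(logging.DEBUG):
        for row in df.limit(sample_rows).collect():
            logger.debug("%s sample: %s", name, row.asDict())

    return row_count
```

In a job, this would replace the .display() call, for example log_dataframe_summary(cleaned, "orders") just before the write step.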