
Rule D003

Avoid using .display() in production code

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 3.0 and later.

Information

Calling .display() in production code is generally bad practice because it:

  • Pulls rows to the driver for rendering, risking memory issues on large datasets
  • Triggers an action, so Spark computes the DataFrame's lineage at that point, and again on the next action unless the result is cached
  • Is meant for interactive debugging and inspection, not for production pipelines

Best practices

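  • If you need to inspect rows, call .limit(n) before .collect() so only a small sample reaches the driver
  • Log summary statistics (for example .count(), or the collected output of .describe()) instead of displaying full data; note that these are actions too, so use them deliberately
  • Avoid .display() in scheduled jobs or ETL pipelines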
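
The practices above can be sketched as a small logging helper. This is a minimal illustration, not part of the rule itself: the function name `log_dataframe_summary`, the logger name, and the default sample size are assumptions chosen for the example. It relies only on the standard `count()`, `columns`, `limit()`, `collect()` and `Row.asDict()` calls.

```python
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; not needed at runtime
    from pyspark.sql import DataFrame

logger = logging.getLogger("pipeline")  # hypothetical logger name


def log_dataframe_summary(df: DataFrame, name: str, sample_rows: int = 5) -> int:
    """Log a row count and a small bounded sample instead of calling df.display()."""
    # count() is itself an action, so it triggers one Spark job.
    row_count = df.count()
    logger.info("%s: %d rows, columns=%s", name, row_count, df.columns)

    # limit() bounds how many rows collect() brings back to the driver.
    for row in df.limit(sample_rows).collect():
        logger.debug("%s sample: %s", name, row.asDict())

    return row_count
```

In a job this might be called as `log_dataframe_summary(orders_df, "orders")`, where `orders_df` is a placeholder for any DataFrame. Keeping the sample at DEBUG level means it is skipped in normal production logging unless explicitly enabled.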

Rule of thumb: Reserve .display() for interactive debugging; never use it in production code.

Example

Bad:

df.display()

Good:

# Remove the .display() call; if visibility is needed, log df.count() or a small df.limit(n).collect() sample
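
For a pipeline step, one possible shape of the "Good" case is to persist the result and log that it was written, rather than rendering it. The sketch below is an assumption about what such a step might look like: the function name `publish_result`, the logger name, the Parquet sink, and the overwrite mode are all hypothetical choices for illustration.

```python
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; not needed at runtime
    from pyspark.sql import DataFrame

logger = logging.getLogger("etl")  # hypothetical logger name


def publish_result(df: DataFrame, output_path: str) -> None:
    """Persist the result instead of rendering it with df.display()."""
    # Hypothetical sink: overwrite a Parquet dataset at output_path.
    # The write is the single action that materializes the DataFrame.
    df.write.mode("overwrite").parquet(output_path)
    logger.info("wrote result to %s", output_path)
```

The write gives the job one deliberate action, and the log line leaves a trace of what happened without moving any rows to the driver.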