Rule D003

Avoid using .show() in production code

Severity

🟡 MEDIUM — Moderate performance impact.

PySpark version

Compatible with PySpark 1.3 and later.

Information

Using .show() in production is generally a bad practice because it:

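  • Pulls rows to the driver (the first 20 by default) and prints them to stdout instead of a logger; a large n risks driver memory issues on large datasets
  • Triggers an action, so Spark runs a job just to produce the preview, adding computation the pipeline does not need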
  • Is meant for debugging/inspection, not for production pipelines

Best practices

  • If a sample is genuinely needed, use .limit(n) with .collect() and keep n small
  • Log DataFrame statistics (.count(), .describe()) instead of printing rows; these also trigger Spark jobs, so run them deliberately
  • Avoid .show() in scheduled jobs or ETL pipelines

Rule of thumb: Reserve .show() for local debugging; never use it in production code.
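
The practices above could be combined into a small logging helper. This is only a sketch: the function name log_summary, the logger name "etl.summary", and the sample_size parameter are illustrative assumptions, not part of the rule or of PySpark. It expects a pyspark.sql.DataFrame and uses only df.schema.simpleString(), df.count(), df.limit(), df.collect() and Row.asDict().

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("etl.summary")  # hypothetical logger name


def log_summary(df: Any, name: str, sample_size: int = 5) -> Dict[str, Any]:
    """Log schema, row count and a bounded sample instead of calling df.show().

    `df` is expected to be a pyspark.sql.DataFrame.
    """
    schema = df.schema.simpleString()  # metadata only, no Spark job
    row_count = df.count()             # an action: runs a job over the data
    # limit() bounds how many rows can reach the driver before collect()
    sample = [row.asDict() for row in df.limit(sample_size).collect()]
    logger.info("%s schema=%s rows=%d sample=%s", name, schema, row_count, sample)
    return {"schema": schema, "rows": row_count, "sample": sample}
```

A call such as log_summary(orders_df, "orders") (with orders_df standing in for any DataFrame) sends the summary through the logging configuration rather than stdout. Because count() and collect() each start a job, caching the DataFrame first may be worthwhile if it is expensive to compute.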

Example

Bad:

df.show()

Good:

# Remove the show() call; if a preview is needed, take a small bounded sample and log it
rows = df.limit(5).collect()
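
If an occasional preview is still useful while debugging, one possible approach is to gate it on the log level so that no Spark job runs in production. The sketch below assumes a hypothetical helper debug_preview and logger name "etl.preview"; neither comes from the rule itself. It relies on the standard library's Logger.isEnabledFor and on DataFrame.limit(), DataFrame.collect() and Row.asDict().

```python
import logging

logger = logging.getLogger("etl.preview")  # hypothetical logger name


def debug_preview(df, n=5):
    """Log up to n rows of a DataFrame, but only when DEBUG logging is enabled.

    Returns True if a preview was logged, False if it was skipped.
    """
    if not logger.isEnabledFor(logging.DEBUG):
        return False  # skipped: no action is triggered on the DataFrame
    rows = df.limit(n).collect()  # at most n rows reach the driver
    logger.debug("preview (%d row(s)): %s", len(rows), [r.asDict() for r in rows])
    return True
```

With the logger at INFO or above, as is typical for scheduled jobs, the function returns before touching the DataFrame, so the preview costs nothing unless DEBUG is switched on.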