Skip to content

Rule F008

Avoid using print(); prefer logging

Severity

🟢 LOW — Minor performance impact.

PySpark version

Compatible with PySpark 1.0 and later.

Information

Using print() in production code can lead to:

  • High I/O overhead, especially in distributed environments
  • Large amounts of unstructured output that are hard to manage
  • Reduced performance due to unnecessary console writes
  • Difficulty in filtering and tracing important information

Best practices

  • Use Python’s logging module to control log levels and destinations
  • Configure logs appropriately for the environment (e.g., DEBUG, INFO, WARN, ERROR)
  • Avoid leaving print() statements in production pipelines

Rule of thumb: Replace print() with structured logging to reduce I/O overhead and improve maintainability.

Example

Bad:

print("row count:", df.count())

Good:

import logging
logging.info("row count: %d", df.count())