# Rule F018: Use Spark native datetime functions instead of Python datetime objects
## Severity
🟢 LOW — Minor performance impact.
## PySpark version
Compatible with PySpark 1.5 and later.
## Information
Passing Python `datetime`, `date`, or `timedelta` objects into Spark expressions
(`lit()`, `withColumn()`, `when()`, `filter()`, etc.) is an antipattern because:

- The value is evaluated on the driver at plan-construction time, not on the executors, so it cannot benefit from partition pruning or predicate pushdown
- The Catalyst optimizer sees an opaque constant that it cannot fold or simplify
- It produces subtle bugs when the driver and executor clocks differ (e.g. `datetime.now()` is captured once when the plan is built, not re-evaluated per partition or per run)
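The clock-capture pitfall in the last bullet can be shown with plain Python, no Spark required: the argument to `lit()` is an ordinary Python value, fixed the moment the expression is built. A minimal sketch:

```python
from datetime import datetime
import time

# What lit(datetime.now()) actually captures: a plain Python value,
# frozen at plan-construction time on the driver.
snapshot = datetime.now()

time.sleep(0.05)

# Any later wall-clock read has moved on; the captured value has not.
assert snapshot < datetime.now()
```

By contrast, `current_timestamp()` is a Spark expression resolved when the query executes, so every run of the plan sees a fresh value.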
## Best practices

Replace Python `datetime` calls with Spark's built-in date/time functions:
| Python (avoid) | Spark native (prefer) |
|---|---|
| `lit(datetime.now())` | `current_timestamp()` |
| `lit(date.today())` | `current_date()` |
| `lit(datetime(2024, 1, 1))` | `to_timestamp(lit("2024-01-01"))` |
| `lit(date(2024, 1, 1))` | `to_date(lit("2024-01-01"))` |
| `lit(timedelta(days=7))` | `expr("interval 7 days")` |
| `col("ts") > datetime.now()` | `col("ts") > current_timestamp()` |
See the full reference: PySpark date and timestamp functions
## Example
Bad:
```python
from datetime import datetime, date, timedelta
from pyspark.sql.functions import col, lit, when

# Each Python datetime value below is frozen on the driver when the plan is built
df.withColumn("snapshot", lit(datetime.now()))
df.filter(col("event_date") > date.today())
df.withColumn("cutoff", lit(datetime(2024, 6, 1)))
df.withColumn("expires", when(col("active"), lit(date.today() + timedelta(days=30))))
```
Good:
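A sketch of the equivalent logic using Spark native functions, assuming the same `df` as above; `date_add` is used here for the 30-day offset in place of `timedelta`:

```python
from pyspark.sql.functions import (
    col, current_date, current_timestamp, date_add, lit, to_timestamp, when,
)

# Each expression is now resolved by Spark at execution time and is
# visible to the Catalyst optimizer
df.withColumn("snapshot", current_timestamp())
df.filter(col("event_date") > current_date())
df.withColumn("cutoff", to_timestamp(lit("2024-06-01")))
df.withColumn("expires", when(col("active"), date_add(current_date(), 30)))
```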