Rule F018

Use Spark native datetime functions instead of Python datetime objects

Severity

🟢 LOW — Minor performance impact.

PySpark version

Compatible with PySpark 1.5 and later.

Information

Passing Python datetime, date, or timedelta objects into Spark expressions (lit(), withColumn(), when(), filter(), etc.) is an antipattern because:

  • The value is computed once on the driver at plan-construction time: datetime.now() is frozen into the plan and is not re-evaluated when the query runs again (on a retry or in a streaming query), whereas current_timestamp() is evaluated at each query execution
  • The result depends on the driver's local clock and timezone, which can silently differ from the Spark session timezone (spark.sql.session.timeZone), producing off-by-hours bugs
  • Keeping the expression in Spark SQL lets Catalyst treat it as a native date/time expression instead of a pre-computed Python constant, and keeps the logic visible in the query plan
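
The first pitfall can be seen without Spark at all. This is a pure-Python analogy (not Spark code) of how a value captured at expression-construction time stays frozen for every later execution:

```python
import time
from datetime import datetime

# Analogy only: lit(datetime.now()) evaluates datetime.now() immediately,
# while the "query" below runs later. The captured value never changes.
plan_time_snapshot = datetime.now()   # evaluated once, at "plan construction"

def run_query():
    # stands in for a Spark action executed later, possibly re-executed
    return plan_time_snapshot

first = run_query()
time.sleep(0.01)
second = run_query()
assert first == second   # both executions see the same frozen instant
```

With current_timestamp(), by contrast, Spark evaluates the value at the start of each query execution, so a re-run query sees a fresh timestamp.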

Best practices

Replace Python datetime calls with Spark built-in date/time functions:

Python (avoid)                      Spark native (prefer)
lit(datetime.now())                 current_timestamp()
lit(date.today())                   current_date()
lit(datetime(2024, 1, 1))           to_timestamp(lit("2024-01-01"))
lit(date(2024, 1, 1))               to_date(lit("2024-01-01"))
lit(timedelta(days=7))              expr("interval 7 days")
col("ts") > datetime.now()          col("ts") > current_timestamp()

See the full reference: PySpark date and timestamp functions

Example

Bad:

from datetime import datetime, date, timedelta
from pyspark.sql.functions import col, lit, when

df.withColumn("snapshot", lit(datetime.now()))        # frozen at plan time
df.filter(col("event_date") > date.today())           # driver's date, baked in
df.withColumn("cutoff", lit(datetime(2024, 6, 1)))
df.withColumn("expires", when(col("active"), lit(date.today() + timedelta(days=30))))

Good:

from pyspark.sql import functions as F

df.withColumn("snapshot", F.current_timestamp())
df.filter(F.col("event_date") > F.current_date())
df.withColumn("cutoff", F.to_timestamp(F.lit("2024-06-01")))
df.withColumn("expires", F.when(F.col("active"), F.date_add(F.current_date(), 30)))
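
A check for this rule can be sketched in a few lines with Python's ast module. The function name and the set of flagged calls below are illustrative, not the actual F018 implementation:

```python
import ast

# Hypothetical sketch: flag calls such as lit(datetime.now()) or
# lit(date.today()) by walking the parsed source tree.
DATETIME_CALLS = {("datetime", "now"), ("datetime", "utcnow"), ("date", "today")}

def find_f018_violations(source: str) -> list:
    """Return line numbers where a datetime factory call is passed to lit()."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        # look for a call to a bare name `lit(...)`
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "lit"):
            # search its arguments for datetime.now() / date.today() etc.
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Attribute)
                        and isinstance(inner.func.value, ast.Name)
                        and (inner.func.value.id, inner.func.attr) in DATETIME_CALLS):
                    violations.append(node.lineno)
    return violations
```

For example, find_f018_violations('df.withColumn("s", lit(datetime.now()))') reports line 1, while code using F.current_timestamp() passes cleanly.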