Skip to content

fix: reject Parquet TimestampLTZ as TimestampNTZ on Spark 3.x for native_datafusion#4356

Closed
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-4219-ltz-as-ntz-native-datafusion
Closed

fix: reject Parquet TimestampLTZ as TimestampNTZ on Spark 3.x for native_datafusion#4356
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-4219-ltz-as-ntz-native-datafusion

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented May 17, 2026

Which issue does this PR close?

Partial fix for #4219 (INT96 case tracked in the same issue, see below).

Rationale for this change

Pre-Spark-4 (SPARK-36182) rejects reading a Parquet TimestampLTZ column as TimestampNTZ because LTZ encodes UTC-adjusted instants that cannot be safely reinterpreted as timezone-free values. The native_datafusion scan did not enforce this, and silently returned the UTC instant as the NTZ value. That is a correctness divergence on Spark 3.x: queries Spark would have failed instead return values, and downstream filters / joins / aggregations on the column may produce different results than running the same query without Comet. Spark 4.0 (SPARK-47447) lifted the restriction.

The native_iceberg_compat path was already correct because its JVM-side TypeUtil.checkParquetType rejects the read before reaching native code.

What changes are included in this PR?

  • New per-Spark-version constant COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ in ShimCometConf (false on 3.x, true on 4.x).
  • New allow_timestamp_ltz_to_ntz field on the NativeScanCommon proto, set from the shim constant in CometNativeScan, plumbed into SparkParquetOptions via init_datasource_exec / get_options.
  • New rejection arm in SparkPhysicalExprAdapter (schema_adapter.rs) that emits reject_on_non_empty_expr when an Arrow Timestamp(_, Some(_)) column is read as Timestamp(_, None) and the flag forbids it. Deferred to runtime so empty files (SPARK-26709) continue to pass.
  • iceberg-compat JNI path passes true because TypeUtil.checkParquetType has already validated.
  • Compatibility guide updated (docs/source/user-guide/latest/compatibility/scans.md) to narrow the documented gap to INT96 and describe the correctness implications.

Known gap

INT96 → TimestampNTZ on Spark 3.x is still not rejected by native_datafusion. DataFusion's coerce_int96 produces Timestamp(unit, None) for INT96 columns, stripping the source timezone before Comet's schema adapter sees the column. At that point INT96 is indistinguishable from a true TIMESTAMP_NTZ source, so the new pattern does not fire. The annotated LTZ encodings (TIMESTAMP_MICROS / TIMESTAMP_MILLIS with isAdjustedToUTC=true) are covered. The INT96 + native_datafusion test case in the suite stays skipped with a link back to #4219.

How are these changes tested?

ParquetTimestampLtzAsNtzSuite already covered the pre-Spark-4 case for native_iceberg_compat and the Spark 4.0+ positive case. It has been extended to parametrize the pre-Spark-4 case across both scan implementations, so the three native_datafusion variants (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS) now run instead of being globally skipped. On Spark 3.5, the two annotated-encoding variants pass; the INT96 variant skips with a link to #4219. On Spark 4.0 the existing positive tests continue to pass.

Verified locally:

  • make (default Spark profile)
  • ./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite
  • ./mvnw test -Pspark-4.0 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite
  • cargo clippy --all-targets --workspace -- -D warnings

andygrove added 2 commits May 17, 2026 09:59
…ive_datafusion

Pre-Spark-4 (SPARK-36182) rejects reading a Parquet TimestampLTZ column as
TimestampNTZ; native_datafusion previously did not, and silently returned
the UTC instant. Plumb a per-Spark-version flag from ShimCometConf
through the NativeScan proto into SparkParquetOptions, and gate a new
rejection arm in the schema adapter on it.

INT96 remains a gap because DataFusion's coerce_int96 strips the source
timezone before the schema adapter runs, so it is indistinguishable from
a true TIMESTAMP_NTZ source. Compatibility guide updated to describe the
correctness implications.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant