fix: reject Parquet TimestampLTZ as TimestampNTZ on Spark 3.x for native_datafusion by andygrove · Pull Request #4356 · apache/datafusion-comet

andygrove · 2026-05-17T15:59:55Z

Which issue does this PR close?

Partial fix for #4219 (INT96 case tracked in the same issue, see below).

Rationale for this change

Spark versions before 4.0 (SPARK-36182) reject reading a Parquet TimestampLTZ column as TimestampNTZ, because LTZ encodes UTC-adjusted instants that cannot be safely reinterpreted as timezone-free values. The native_datafusion scan did not enforce this and silently returned the UTC instant as the NTZ value. That is a correctness divergence on Spark 3.x: queries that Spark would have failed instead return values, and downstream filters, joins, and aggregations on the column may produce different results than the same query run without Comet. Spark 4.0 (SPARK-47447) lifted the restriction.
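To make the hazard concrete, here is a minimal sketch (not code from this PR) of what happens to a single stored value. It assumes a fixed-offset session timezone and uses hypothetical helper names (`wall_clock`, `ltz_reading`, `naive_ntz_reinterpretation`); real Spark sessions use full timezone rules rather than a fixed offset.

```rust
const MICROS_PER_SEC: i64 = 1_000_000;
const SECS_PER_DAY: i64 = 86_400;

/// Returns (days since the Unix epoch, seconds within that day) for the wall
/// clock observed at a fixed UTC offset. Hypothetical helper for illustration.
fn wall_clock(utc_micros: i64, offset_secs: i64) -> (i64, i64) {
    let local_secs = utc_micros.div_euclid(MICROS_PER_SEC) + offset_secs;
    (
        local_secs.div_euclid(SECS_PER_DAY),
        local_secs.rem_euclid(SECS_PER_DAY),
    )
}

/// What an LTZ read means: the stored UTC instant, shown in the session zone.
fn ltz_reading(utc_micros: i64, session_offset_secs: i64) -> (i64, i64) {
    wall_clock(utc_micros, session_offset_secs)
}

/// What the unguarded scan effectively did: treat the UTC instant as if it
/// were already a timezone-free wall clock value.
fn naive_ntz_reinterpretation(utc_micros: i64) -> (i64, i64) {
    wall_clock(utc_micros, 0)
}
```

For the instant 2024-01-01T00:30:00Z (1_704_069_000_000_000 microseconds) and a session at UTC-8, the LTZ reading falls on the previous calendar day at 16:30, while the naive NTZ reinterpretation reports 00:30 on January 1. A date filter on the column would therefore select different rows.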

The native_iceberg_compat path was already correct because its JVM-side TypeUtil.checkParquetType rejects the read before reaching native code.

What changes are included in this PR?

New per-Spark-version constant COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ in ShimCometConf (false on 3.x, true on 4.x).
New allow_timestamp_ltz_to_ntz field on the NativeScanCommon proto, set from the shim constant in CometNativeScan, plumbed into SparkParquetOptions via init_datasource_exec / get_options.
New rejection arm in SparkPhysicalExprAdapter (schema_adapter.rs) that emits reject_on_non_empty_expr when an Arrow Timestamp(_, Some(_)) column is read as Timestamp(_, None) and the flag forbids it. The rejection is deferred to runtime so that empty files (SPARK-26709) continue to pass.
The iceberg-compat JNI path passes true because TypeUtil.checkParquetType has already validated the read.
Compatibility guide updated (docs/source/user-guide/latest/compatibility/scans.md) to narrow the documented gap to INT96 and describe the correctness implications.
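The decision made by the new rejection arm can be sketched roughly as follows. This is a simplified stand-in, not the code in schema_adapter.rs: `ColumnType`, `is_forbidden_ltz_to_ntz` and `check_batch` are hypothetical names, and the real adapter works on Arrow data types and physical expressions. Only the flag name `allow_timestamp_ltz_to_ntz` is taken from the PR.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Millis,
    Micros,
}

/// Simplified stand-in for an Arrow data type (assumption for this sketch).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    /// Timestamp(unit, timezone): Some(_) models LTZ, None models NTZ.
    Timestamp(TimeUnit, Option<String>),
    Other,
}

/// True when the file column carries a timezone, the requested type does not,
/// and the per-Spark-version flag forbids the conversion.
fn is_forbidden_ltz_to_ntz(
    file: &ColumnType,
    requested: &ColumnType,
    allow_timestamp_ltz_to_ntz: bool,
) -> bool {
    if allow_timestamp_ltz_to_ntz {
        return false;
    }
    matches!(
        (file, requested),
        (ColumnType::Timestamp(_, Some(_)), ColumnType::Timestamp(_, None))
    )
}

/// Models the deferred check: the error is raised only once a non-empty
/// batch is seen, so empty files still read successfully.
fn check_batch(
    num_rows: usize,
    file: &ColumnType,
    requested: &ColumnType,
    allow_timestamp_ltz_to_ntz: bool,
) -> Result<(), String> {
    if num_rows > 0 && is_forbidden_ltz_to_ntz(file, requested, allow_timestamp_ltz_to_ntz) {
        return Err(format!("cannot read {:?} as {:?}", file, requested));
    }
    Ok(())
}
```

With the flag set to false (Spark 3.x), a non-empty LTZ batch read as NTZ yields an error while a zero-row batch does not; with the flag set to true (Spark 4.x, or the iceberg-compat path) the same read is allowed.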

Known gap

INT96 → TimestampNTZ on Spark 3.x is still not rejected by native_datafusion. DataFusion's coerce_int96 produces Timestamp(unit, None) for INT96 columns, stripping the source timezone before Comet's schema adapter sees the column. At that point INT96 is indistinguishable from a true TIMESTAMP_NTZ source, so the new pattern does not fire. The annotated LTZ encodings (TIMESTAMP_MICROS / TIMESTAMP_MILLIS with isAdjustedToUTC=true) are covered. The INT96 + native_datafusion test case in the suite stays skipped with a link back to #4219.
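The reason the pattern cannot fire for INT96 can be illustrated with a small model of what the schema adapter gets to see. Everything here is hypothetical (`ParquetTimestamp`, `type_seen_by_adapter`, `adapter_can_detect_ltz`), and the unit chosen for the coerced INT96 column is an assumption; the only behavior taken from the description above is that the coerced INT96 type carries no timezone.

```rust
/// Simplified model of the Parquet timestamp encodings discussed above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetTimestamp {
    Int96,
    Micros { is_adjusted_to_utc: bool },
    Millis { is_adjusted_to_utc: bool },
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Unit {
    Millis,
    Micros,
}

/// (unit, timezone) pair standing in for Arrow's Timestamp(unit, tz).
type ArrowTimestamp = (Unit, Option<&'static str>);

/// The type visible to the schema adapter after the reader has decoded the file.
fn type_seen_by_adapter(src: ParquetTimestamp) -> ArrowTimestamp {
    match src {
        // INT96 has already been coerced to a timezone-free timestamp.
        // Using Micros as the unit is an assumption made for this sketch.
        ParquetTimestamp::Int96 => (Unit::Micros, None),
        ParquetTimestamp::Micros { is_adjusted_to_utc } => {
            (Unit::Micros, is_adjusted_to_utc.then_some("UTC"))
        }
        ParquetTimestamp::Millis { is_adjusted_to_utc } => {
            (Unit::Millis, is_adjusted_to_utc.then_some("UTC"))
        }
    }
}

/// The rejection pattern needs a timezone on the file-side type to fire.
fn adapter_can_detect_ltz(src: ParquetTimestamp) -> bool {
    type_seen_by_adapter(src).1.is_some()
}
```

In this model an INT96 column and a genuine TIMESTAMP_MICROS column with isAdjustedToUTC=false produce the same adapter-side type, so no check placed at that stage can tell them apart.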

How are these changes tested?

ParquetTimestampLtzAsNtzSuite already covered the pre-Spark-4 case for native_iceberg_compat and the Spark 4.0+ positive case. It has been extended to parametrize the pre-Spark-4 case across both scan implementations, so the three native_datafusion variants (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS) now run instead of being globally skipped. On Spark 3.5, the two annotated-encoding variants pass; the INT96 variant skips with a link to #4219. On Spark 4.0 the existing positive tests continue to pass.

Verified locally:

make (default Spark profile)
./mvnw test -Pspark-3.5 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite
./mvnw test -Pspark-4.0 -Dsuites=org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite
cargo clippy --all-targets --workspace -- -D warnings

…ive_datafusion

Pre-Spark-4 (SPARK-36182) rejects reading a Parquet TimestampLTZ column as TimestampNTZ; native_datafusion previously did not, and silently returned the UTC instant. Plumb a per-Spark-version flag from ShimCometConf through the NativeScan proto into SparkParquetOptions, and gate a new rejection arm in the schema adapter on it. INT96 remains a gap because DataFusion's coerce_int96 strips the source timezone before the schema adapter runs, so it is indistinguishable from a true TIMESTAMP_NTZ source. Compatibility guide updated to describe the correctness implications.

andygrove added 2 commits May 17, 2026 09:59

style: cargo fmt

94c97e3

andygrove marked this pull request as draft May 17, 2026 16:56

andygrove closed this May 17, 2026

andygrove mentioned this pull request May 17, 2026

fix: reject Parquet INT96 as TimestampNTZ on Spark 3.x for native_datafusion (depends on apache/datafusion#22318) #4357

Draft

andygrove wants to merge 2 commits into apache:main from andygrove:fix-4219-ltz-as-ntz-native-datafusion
