fix: configurable fallback when parquet vectorized reader is disabled (#4352)#4355
Open
andygrove wants to merge 2 commits into
Open
fix: configurable fallback when parquet vectorized reader is disabled (#4352)#4355andygrove wants to merge 2 commits into
andygrove wants to merge 2 commits into
Conversation
…apache#4352) Comet's native_datafusion scan rejects Parquet-to-Spark conversions that Spark's vectorized reader rejects, but Spark's parquet-mr (non-vectorized) path silently overflows / nulls. Disabling spark.sql.parquet.enableVectorizedReader opts into parquet-mr semantics that Comet has no equivalent for, so by default Comet now falls back to Spark in that case. Users who want Comet to handle the scan regardless can opt in. - New config spark.comet.scan.allowDisabledParquetVectorizedReader (default false: fall back to Spark when vectorized reader is disabled). - CometScanRule.nativeDataFusionScan skips itself when the vectorized reader is disabled and the opt-in flag is false. - CometTestBase sets the flag to true so existing Comet tests continue to exercise the native scan. - Re-enables (un-ignores) the affected ParquetTypeWideningSuite tests in the 4.0.2 and 4.1.1 diffs.
40e130e to
4bcb25c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #4352.
Rationale for this change
Comet's
native_datafusionscan rejects Parquet-to-Spark conversions that Spark's vectorized reader rejects, but Spark's parquet-mr (non-vectorized) path silently overflows / nulls. Disablingspark.sql.parquet.enableVectorizedReaderopts into parquet-mr semantics that Comet has no equivalent for, so by default Comet should fall back to Spark in that case. Users who want Comet to handle the scan regardless can opt in.What changes are included in this PR?
spark.comet.scan.allowDisabledParquetVectorizedReader(defaultfalse→ fall back to Spark when vectorized reader is disabled).CometScanRule.nativeDataFusionScanskips itself when the vectorized reader is disabled and the opt-in flag is false.CometTestBasesets the flag totrueso existing Comet tests continue to exercise the native scan.ParquetTypeWideningSuitetests in the 4.0.2 and 4.1.1 diffs.How are these changes tested?
Existing test suites — the previously ignored
ParquetTypeWideningSuitetests are now exercised on Spark 4.0 and 4.1 via the parquet-mr fallback path.