You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a staged change to make existing column and Arrow data validity checks available outside debug-only paths while keeping them disabled by default.
Root cause: ColumnString::sanity_check() was guarded by NDEBUG, so release builds could not enable the existing offset and chars consistency checks. Arrow conversion paths also accepted invalid Arrow arrays without a shared validation gate, and nested Arrow serde reads would repeat validation if validation were placed directly in every nested call.
This change adds two opt-in BE configs:
enable_column_sanity_check: enables column sanity checks in release builds.
enable_arrow_validate_full: enables Arrow ValidateFull() checks on Arrow read/write conversion paths.
The column path keeps the existing sanity_check() API name and semantics. Debug builds still run ColumnString sanity checks as before, while release builds return early unless enable_column_sanity_check is enabled. Expression evaluation calls sanity_check() after producing result columns when the config is enabled.
The Arrow read path now uses a non-virtual DataTypeSerDe::read_column_from_arrow() wrapper to run ValidateFull() once at the top-level entry, then delegates to read_column_from_arrow_impl(). Nested array/map/nullable/struct serde reads call the impl method directly to avoid repeated validation. The Arrow write path validates finished arrays and the final record batch when enable_arrow_validate_full is enabled, without adding another serde write implementation layer.
Release note
None
Check List (For Author)
Test
Regression test
Unit Test
Manual test (add detailed scripts or steps below)
No need to test or manual test. Explain why:
This is a refactor/code format and no logic has been changed.
The reason will be displayed to describe this comment to others. Learn more.
I found one correctness issue in the opt-in Arrow full-validation path. The final RecordBatch validation can reject the converter's existing large-string fallback because the generated array type no longer matches the unchanged schema.
The reason will be displayed to describe this comment to others. Learn more.
This full-batch validation conflicts with the existing large-string fallback above. For string fields the schema is still built as arrow::utf8(), but when column->byte_size() >= MAX_ARROW_UTF8 the converter switches only the builder/array to arrow::large_utf8(). RecordBatch::ValidateFull() then validates that LargeString array against the original UTF8 field, so enabling enable_arrow_validate_full can fail on the very fallback this converter uses for large string columns. Please update the output schema when the array is promoted, or avoid validating a batch whose schema no longer matches the arrays.
The reason will be displayed to describe this comment to others. Learn more.
这个默认是true吧,我们还是避免挂,而且这个影响不了多少性能
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: N/A
Problem Summary:
This is a staged change to make existing column and Arrow data validity checks available outside debug-only paths while keeping them disabled by default.
Root cause: ColumnString::sanity_check() was guarded by NDEBUG, so release builds could not enable the existing offset and chars consistency checks. Arrow conversion paths also accepted invalid Arrow arrays without a shared validation gate, and nested Arrow serde reads would repeat validation if validation were placed directly in every nested call.
This change adds two opt-in BE configs:
The column path keeps the existing sanity_check() API name and semantics. Debug builds still run ColumnString sanity checks as before, while release builds return early unless enable_column_sanity_check is enabled. Expression evaluation calls sanity_check() after producing result columns when the config is enabled.
The Arrow read path now uses a non-virtual DataTypeSerDe::read_column_from_arrow() wrapper to run ValidateFull() once at the top-level entry, then delegates to read_column_from_arrow_impl(). Nested array/map/nullable/struct serde reads call the impl method directly to avoid repeated validation. The Arrow write path validates finished arrays and the final record batch when enable_arrow_validate_full is enabled, without adding another serde write implementation layer.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)