Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93)#92
Closed
erratic-pattern wants to merge 12 commits intobase-df-upgrade-ver5210from
Closed
Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93)#92erratic-pattern wants to merge 12 commits intobase-df-upgrade-ver5210from
erratic-pattern wants to merge 12 commits intobase-df-upgrade-ver5210from
Conversation
Discovered this bug while working on apache#19724. TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted. - **`eq_properties()` in `FileScanConfig` blindly trusted `output_ordering`** (set from Parquet `sorting_columns` metadata) without verifying that files within a group are in the correct inter-file order - `EnforceSorting` then removed `SortExec` based on this unvalidated ordering, producing **wrong results** when filesystem order didn't match data order - Added `validated_output_ordering()` that filters orderings using `MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort order before reporting them to the optimizer - Added `validated_output_ordering()` method on `FileScanConfig` that validates each output ordering against actual file group statistics - Changed `eq_properties()` to call `self.validated_output_ordering()` instead of `self.output_ordering.clone()` Added 8 new regression tests (Tests 4-11): | Test | Scenario | Key assertion | |------|----------|---------------| | **4** | Reversed filesystem order (inferred ordering) | SortExec retained — wrong inter-file order detected | | **5** | Overlapping file ranges (inferred ordering) | SortExec retained — overlapping ranges detected | | **6** | `WITH ORDER` + reversed filesystem order | SortExec retained despite explicit ordering | | **7** | Correctly ordered multi-file group (positive) | SortExec eliminated — validation passes | | **8** | DESC ordering with wrong inter-file DESC order | SortExec retained for DESC direction | | **9** | Multi-column sort key (overlapping vs non-overlapping) | Conservative rejection with overlapping stats; passes with clean boundaries | | **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated — both ordering and stats agree | | **11** | Multiple partitions (one file per group) | `SortPreservingMergeExec` merges; no per-partition sort needed | - [x] `cargo test --test sqllogictests -- sort_pushdown` — all new + existing tests pass - [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc tests pass - [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still eliminates SortExec (no regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
apache#20441) - Closes apache#20437 `flatten_dictionary_array` returned only the unique values rather then the full expanded array when being called on a `DictionaryArray`. When building a `StructArray` this caused a length mismatch panic. Replaced `array.values()` with `arrow::compute::cast(array, value_type)` in `flatten_dictionary_array`, which properly expands the dictionary into a full length array matching the row count. Yes, both a new unit test aswell as a regression test were added. Nope --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Includes fix for FixedSizeBinary LEFT JOIN bug - apache/arrow-rs#8981 Cherry-picked test and API updates from - apache#19355
…he#20279) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Partially closes apache#20267 ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> Currently whenever we get a query with `min` or `max` we default to always pushing down the dynamic filter (when it's enabled). However if the query contains other aggregate functions such as `sum`, `avg` they require the full batch of rows. And because of the pruned rows we receive incorrect outputs for the query. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> return the `init_dynamic_filter()` early if it contains non min/max aggregates. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Tested locally for the same query mentioned in the issue with `hits_partitioned` and got the correct output. Will add the tests! ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Collaborator
|
I wonder what you think about using the datafusion 52.3.0 as a base (rather than 52.1.0 as is this PR)? Specifically we would base it on this: Maybe it would be better to wait until that is officially released 🤔 |
Author
|
Superseded — branch preserved as upgrade-df-ver5210-d. Will rebase onto new revision c when ready. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR contains a patched DataFusion 52.1.0 fork for InfluxDB IOx, based on 52.1.0 branch
Changes from revision b
Patches Included
6417671 - Skip order calculation - Fixes slow planning time for queries with Unions on many columns.
4277f8f - SanityCheck workaround - Skips ordering validation for UnionExec/SortExec children. Required because the previous patch "skip order calculation" produces incomplete ordering/equivalence information.
64dc11c - Physical schema check skip - Workaround for Internal error: Physical input schema should be the same as the one converted from logical input schema. apache/datafusion#18337
506dd6d - Query cancellation support - Wrap join operators with cooperative() for cancellation support
d615ad9 - Security: Update bytes - Update bytes to v1.11.1 to avoid security audit
d87aafd - Security: Update time - Update time crate to avoid rustsec error
a21b158 - Fix incorrect SortExec removal before AggregateExec
CAST(y AS BIGINT) % 2)b627864 - Fix inter-file ordering validation in eq_properties()
866a4f0 - Fix HashJoin panic with dictionary-encoded columns in multi-key joins
2059811 - Avoid flattening dictionaries in Join InLists
flatten_dictionary_arrayentirely, keeping Dictionary arrays as-is in InList values6a0206e - Bump arrow/parquet to 57.3.0
031c1e7 - Disable dynamic filter pushdown for non min/max aggregates