Skip to content

Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93)#92

Closed
erratic-pattern wants to merge 12 commits intobase-df-upgrade-ver5210from
upgrade-df-ver5210-c
Closed

Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93)#92
erratic-pattern wants to merge 12 commits intobase-df-upgrade-ver5210from
upgrade-df-ver5210-c

Conversation

@erratic-pattern
Copy link

@erratic-pattern erratic-pattern commented Mar 9, 2026

Summary

This PR contains a patched DataFusion 52.1.0 fork for InfluxDB IOx, based on 52.1.0 branch

Changes from revision b

Patches Included

alamb and others added 12 commits March 3, 2026 15:38
Discovered this bug while working on apache#19724.

TLDR: just because the files themselves are sorted doesn't mean the
partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted
`output_ordering`** (set from Parquet `sorting_columns` metadata)
without verifying that files within a group are in the correct
inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated
ordering, producing **wrong results** when filesystem order didn't match
data order
- Added `validated_output_ordering()` that filters orderings using
`MinMaxStatistics::new_from_files()` + `is_sorted()` to verify
inter-file sort order before reporting them to the optimizer

- Added `validated_output_ordering()` method on `FileScanConfig` that
validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()`
instead of `self.output_ordering.clone()`

Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec
retained — wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec
retained — overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained
despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec
eliminated — validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec
retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) |
Conservative rejection with overlapping stats; passes with clean
boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec
eliminated — both ordering and stats agree |
| **11** | Multiple partitions (one file per group) |
`SortPreservingMergeExec` merges; no per-partition sort needed |

- [x] `cargo test --test sqllogictests -- sort_pushdown` — all new +
existing tests pass
- [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc
tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`)
still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
apache#20441)

- Closes apache#20437

`flatten_dictionary_array` returned only the unique values rather then
the full expanded array when being called on a `DictionaryArray`. When
building a `StructArray` this caused a length mismatch panic.

Replaced `array.values()` with `arrow::compute::cast(array, value_type)`
in `flatten_dictionary_array`, which properly expands the dictionary
into a full length array matching the row count.

Yes, both a new unit test aswell as a regression test were added.

Nope

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Includes fix for FixedSizeBinary LEFT JOIN bug
- apache/arrow-rs#8981

Cherry-picked test and API updates from
- apache#19355
…he#20279)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Partially closes apache#20267 

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
Currently whenever we get a query with `min` or `max` we default to
always pushing down the dynamic filter (when it's enabled). However if
the query contains other aggregate functions such as `sum`, `avg` they
require the full batch of rows. And because of the pruned rows we
receive incorrect outputs for the query.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

return the `init_dynamic_filter()` early if it contains non min/max
aggregates.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Tested locally for the same query mentioned in the issue with
`hits_partitioned` and got the correct output. Will add the tests!

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
@alamb
Copy link
Collaborator

alamb commented Mar 9, 2026

I wonder what you think about using the datafusion 52.3.0 as a base (rather than 52.1.0 as is this PR)?

Specifically we would base it on this:

Maybe it would be better to wait until that is officially released 🤔

@erratic-pattern erratic-pattern changed the title Patched DF 52.1.0 (revision c) Patched DF 52.1.0 (revision d) Mar 9, 2026
@erratic-pattern
Copy link
Author

Superseded — branch preserved as upgrade-df-ver5210-d. Will rebase onto new revision c when ready.

@erratic-pattern erratic-pattern changed the title Patched DF 52.1.0 (revision d) ~Patched DF 52.1.0 (revision c)~ — superseded by #93 Mar 9, 2026
@erratic-pattern erratic-pattern changed the title ~Patched DF 52.1.0 (revision c)~ — superseded by #93 ~~Patched DF 52.1.0 (revision c)~~ — superseded by #93 Mar 11, 2026
@erratic-pattern erratic-pattern changed the title ~~Patched DF 52.1.0 (revision c)~~ — superseded by #93 ~~Patched DF 52.1.0 (revision c)~~ - superseded by #93 Mar 11, 2026
@erratic-pattern erratic-pattern changed the title ~~Patched DF 52.1.0 (revision c)~~ - superseded by #93 Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93) Mar 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants