Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93) by erratic-pattern · Pull Request #92 · influxdata/arrow-datafusion

erratic-pattern · 2026-03-09T15:33:00Z

Summary

This PR contains a patched DataFusion 52.1.0 fork for InfluxDB IOx, based on the 52.1.0 branch.

Changes from revision b

Added 031c1e7 - cherry-picked fix from apache/datafusion#20279 to disable dynamic filter pushdown for non min/max aggregates

Patches Included

6417671 - Skip order calculation - Fixes slow planning time for queries with Unions on many columns.
- Related Issues: apache/datafusion#17261, influxdata/influxdb_iox#13038
4277f8f - SanityCheck workaround - Skips ordering validation for UnionExec/SortExec children. Required because the previous patch "skip order calculation" produces incomplete ordering/equivalence information.
- Related Issues: apache/datafusion#11492
64dc11c - Physical schema check skip - Workaround for Internal error: Physical input schema should be the same as the one converted from logical input schema. apache/datafusion#18337
506dd6d - Query cancellation support - Wrap join operators with cooperative() for cancellation support
- Related: apache/datafusion#19360
d615ad9 - Security: Update bytes - Update bytes to v1.11.1 to avoid security audit
d87aafd - Security: Update time - Update time crate to avoid rustsec error
a21b158 - Fix incorrect SortExec removal before AggregateExec
- Fixes wrong results when ORDER BY uses expressions that transform ordering (e.g., CAST(y AS BIGINT) % 2)
- Related: apache/datafusion#20244, apache/datafusion#20247
b627864 - Fix inter-file ordering validation in eq_properties()
- Cherry-picked from apache/datafusion#20329
- Restores validation that files within file groups are sorted relative to each other
- Without this fix, SortExec was incorrectly removed when multiple files exist in a file group
866a4f0 - Fix HashJoin panic with dictionary-encoded columns in multi-key joins
- Cherry-picked from apache/datafusion#20441
- Fixes cast handling when comparing dictionary-encoded arrays in join predicates
2059811 - Avoid flattening dictionaries in Join InLists
- Cherry-picked from apache/datafusion#20505
- Removes flatten_dictionary_array entirely, keeping Dictionary arrays as-is in InList values
- Fixes "Can't compare arrays of different types" error when HashJoin InList builder flattens Dictionary arrays to Utf8 but the probe-side Parquet scan retains Dictionary encoding
- Fixes https://github.com/influxdata/influxdb_iox/issues/16245
6a0206e - Bump arrow/parquet to 57.3.0
- Includes fix for FixedSizeBinary LEFT JOIN bug
  - Take fsb null indices apache/arrow-rs#8981
- Cherry-picked test and API updates from
  - Upgrade DataFusion to arrow-rs/parquet 57.2.0 apache/datafusion#19355
031c1e7 - Disable dynamic filter pushdown for non min/max aggregates
- Cherry-picked from apache/datafusion#20279
- Fixes incorrect results when dynamic filter pushdown is applied to aggregates other than MIN/MAX (e.g., SUM, COUNT), which could silently filter out rows that should be included
- Discovered via EOD mirroring: Enterprise (DF 50) returned correct results while IOx (DF 52) did not
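
The a21b158 entry above concerns sort keys that do not preserve the ordering of their input column. As a minimal sketch (plain Rust, not DataFusion code; `is_sorted_asc` and `parity_key` are hypothetical helpers introduced only for illustration), rows already sorted by `y` are generally not sorted by `y % 2`, so a sort on the derived key cannot be dropped just because the input is ordered by `y`:

```rust
/// Returns true when `values` is in non-decreasing order.
fn is_sorted_asc(values: &[i64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// Applies the illustrative sort key `y % 2` to every row.
/// (Hypothetical helper; stands in for an expression like CAST(y AS BIGINT) % 2.)
fn parity_key(ys: &[i64]) -> Vec<i64> {
    ys.iter().map(|y| y % 2).collect()
}
```

For the input `[1, 2, 3, 4]`, the column itself is sorted, but the derived key is `[1, 0, 1, 0]`, which is not.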

Discovered this bug while working on apache#19724. TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted `output_ordering`** (set from Parquet `sorting_columns` metadata) without verifying that files within a group are in the correct inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated ordering, producing **wrong results** when filesystem order didn't match data order
- Added `validated_output_ordering()` that filters orderings using `MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort order before reporting them to the optimizer
- Added `validated_output_ordering()` method on `FileScanConfig` that validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()` instead of `self.output_ordering.clone()`

Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec retained: wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec retained: overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec eliminated: validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) | Conservative rejection with overlapping stats; passes with clean boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated: both ordering and stats agree |
| **11** | Multiple partitions (one file per group) | `SortPreservingMergeExec` merges; no per-partition sort needed |

- [x] `cargo test --test sqllogictests -- sort_pushdown`: all new + existing tests pass
- [x] `cargo test -p datafusion-datasource`: 97 unit tests + 6 doc tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
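
The idea behind the inter-file validation can be sketched in plain Rust. This is not the DataFusion implementation: `FileRange` and `files_sorted_asc` are hypothetical names, the sketch handles a single ascending integer sort column only, and whether the real `is_sorted()` uses a strict or non-strict comparison at file boundaries is not stated here, so the `<=` below is an assumption.

```rust
/// Hypothetical per-file min/max statistics for one ascending sort column.
#[derive(Debug, Clone, Copy)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Returns true when the files, read in the listed order, form one sorted
/// stream: each file must end at or before the point where the next begins.
/// Reversed or overlapping files fail the check, so the ordering should not
/// be reported to the optimizer and the sort must be kept.
fn files_sorted_asc(group: &[FileRange]) -> bool {
    group.windows(2).all(|w| w[0].max <= w[1].min)
}
```

A group with a single file trivially passes, which matches the scenario in which the single-file case still eliminates the sort.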

apache#20441) - Closes apache#20437

`flatten_dictionary_array` returned only the unique values rather than the full expanded array when called on a `DictionaryArray`. When building a `StructArray`, this caused a length-mismatch panic.

Replaced `array.values()` with `arrow::compute::cast(array, value_type)` in `flatten_dictionary_array`, which properly expands the dictionary into a full-length array matching the row count.

Yes, both a new unit test and a regression test were added.

Nope

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
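
The length mismatch described above can be illustrated with a simplified model of dictionary encoding. This is a sketch in plain Rust, not the arrow-rs API: `DictColumn`, `unique_values`, and `expand` are hypothetical names, and nulls are ignored.

```rust
/// Minimal stand-in for a dictionary-encoded string column:
/// each row stores an index (key) into the `values` dictionary.
struct DictColumn {
    keys: Vec<usize>,
    values: Vec<String>,
}

impl DictColumn {
    /// Number of rows in the column.
    fn len(&self) -> usize {
        self.keys.len()
    }

    /// Models the buggy behavior: returns only the dictionary's unique
    /// values, whose length is unrelated to the row count.
    fn unique_values(&self) -> Vec<String> {
        self.values.clone()
    }

    /// Models the fixed behavior: materializes one value per row,
    /// so the result has exactly `len()` entries.
    fn expand(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}
```

With keys `[0, 1, 0, 0]` over the dictionary `["a", "b"]`, `unique_values()` has 2 entries while the column has 4 rows; `expand()` yields 4 entries and can sit next to other 4-row columns.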

Includes fix for FixedSizeBinary LEFT JOIN bug - apache/arrow-rs#8981 Cherry-picked test and API updates from - apache#19355

…he#20279)

## Which issue does this PR close?

- Partially closes apache#20267

## Rationale for this change

Currently, whenever we get a query with `min` or `max`, we default to always pushing down the dynamic filter (when it's enabled). However, if the query contains other aggregate functions such as `sum` or `avg`, those require the full batch of rows. Because of the pruned rows, we receive incorrect outputs for the query.

## What changes are included in this PR?

Return early from `init_dynamic_filter()` if the aggregation contains non min/max aggregates.

## Are these changes tested?

Tested locally for the same query mentioned in the issue with `hits_partitioned` and got the correct output. Will add the tests!

## Are there any user-facing changes?

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
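
Why pruning is safe for `min`/`max` but not for `sum` can be shown with a toy model. This is a sketch in plain Rust under simplifying assumptions, not DataFusion code: `min_and_sum` and `only_min_max` are hypothetical helpers, and the "dynamic filter" is modeled as skipping any row that cannot lower the minimum seen so far.

```rust
/// Computes (min, sum) over `rows`. When `prune` is true, rows that cannot
/// improve the running minimum are skipped, mimicking a MIN-driven dynamic
/// filter. The minimum stays correct; the sum silently loses rows.
fn min_and_sum(rows: &[i64], prune: bool) -> (Option<i64>, i64) {
    let mut min: Option<i64> = None;
    let mut sum = 0i64;
    for &v in rows {
        if prune {
            if let Some(m) = min {
                if v >= m {
                    continue; // filtered out before reaching the aggregates
                }
            }
        }
        min = Some(min.map_or(v, |m| m.min(v)));
        sum += v;
    }
    (min, sum)
}

/// Illustrative guard: pruning is only allowed when every aggregate
/// is `min` or `max` (names compared as lowercase strings here).
fn only_min_max(aggregates: &[&str]) -> bool {
    aggregates.iter().all(|a| matches!(*a, "min" | "max"))
}
```

For the rows `[5, 7, 3, 9, 1]`, the unpruned result is `(Some(1), 25)`, while the pruned run still finds the minimum `1` but reports a sum of `9`.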

alamb · 2026-03-09T17:29:48Z

I wonder what you think about using DataFusion 52.3.0 as a base (rather than 52.1.0, as in this PR)?

Specifically we would base it on this:

https://github.com/apache/datafusion/releases/tag/52.3.0-rc1

Maybe it would be better to wait until that is officially released 🤔

Release DataFusion 52.3.0 (minor/) Release (Mar 2026) apache/datafusion#20681

erratic-pattern · 2026-03-09T20:37:01Z

Superseded — branch preserved as upgrade-df-ver5210-d. Will rebase onto new revision c when ready.

alamb and others added 12 commits March 3, 2026 15:38

chore: skip order calculation / exponential planning

6417671

(New) Test + workaround for SanityCheck plan

4277f8f

chore: add debug logging and skip error on physical schema check

64dc11c

fix: wrap join operators with cooperative() for cancellation support

506dd6d

Update bytes to v1.11.1 to avoid security audit

d615ad9

chore: Update time to avoid rustsec error

d87aafd

Fix incorrect SortExec removal before AggregateExec

a21b158

Avoid flattening dictionaries in Join InLists

2059811

chore: bump arrow/parquet to 57.3.0

6a0206e

Includes fix for FixedSizeBinary LEFT JOIN bug - apache/arrow-rs#8981 Cherry-picked test and API updates from - apache#19355

github-actions bot added optimizer sqllogictest core physical-expr common functions proto datasource physical-plan labels Mar 9, 2026

erratic-pattern changed the title ~~Patched DF 52.1.0 (revision c)~~ Patched DF 52.1.0 (revision d) Mar 9, 2026

erratic-pattern closed this Mar 9, 2026

erratic-pattern changed the title ~~Patched DF 52.1.0 (revision d)~~ ~Patched DF 52.1.0 (revision c)~ — superseded by #93 Mar 9, 2026

erratic-pattern changed the title ~~~Patched DF 52.1.0 (revision c)~ — superseded by #93~~ ~~Patched DF 52.1.0 (revision c)~~ — superseded by #93 Mar 11, 2026

erratic-pattern changed the title ~~~~Patched DF 52.1.0 (revision c)~~ — superseded by #93~~ ~~Patched DF 52.1.0 (revision c)~~ - superseded by #93 Mar 11, 2026

erratic-pattern changed the title ~~~~Patched DF 52.1.0 (revision c)~~ - superseded by #93~~ Patched DF 52.1.0 (revision c) - (NOTE: superseded by #93) Mar 11, 2026

erratic-pattern wants to merge 12 commits into base-df-upgrade-ver5210 from upgrade-df-ver5210-c

5 participants
