Summary
Today, page-level pruning in Parquet (opener/mod.rs:1314 → PagePruningPredicate::prune_plan_with_page_index_and_metrics) runs once at file open with the static query predicate. #22450 added dynamic row-group (RG) level pruning at every RG boundary (should_prune in push_decoder.rs:183), but its rebuild path never re-evaluates the page-level predicate.
This issue extends #22450's "refresh at RG boundary" pattern to also refresh the PagePruningPredicate, so the page-level RowSelection of upcoming RGs is tightened by the latest TopK threshold.
Current state (source-confirmed)
RG-level pruning: push_decoder.rs:183, should_prune (every RG boundary)
Page-level pruning: opener/mod.rs:1314 (file open only)
Gap: after #22450, RG-level pruning is dynamic but page-level pruning is still static. If the TopK heap tightens after file open, surviving RGs keep their initial (loose) page-level RowSelection, so pages whose min/max no longer survive the new threshold are still fetched, decompressed, and decoded for filter-column evaluation.
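To make the gap concrete, here is a minimal sketch of how a tightened TopK threshold turns previously live pages into dead ones. It is a toy model, not DataFusion code: `PageStats`, `page_survives`, and `rows_to_decode` are hypothetical names, and the filter is assumed to have the simple form `col < threshold` (ascending sort) or `col > threshold` (descending sort).

```rust
/// Hypothetical per-page min/max statistics for the filter column.
/// This is an illustration type, not a DataFusion or arrow-rs type.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: i64,
    max: i64,
    rows: usize,
}

/// Assumed filter shape: `col < threshold` for an ascending TopK,
/// `col > threshold` for a descending TopK.
/// A page can still contribute a row only if its best value beats the threshold.
fn page_survives(page: &PageStats, threshold: i64, descending: bool) -> bool {
    if descending {
        page.max > threshold
    } else {
        page.min < threshold
    }
}

/// Rows that still have to be fetched, decompressed, and decoded
/// when the page-level selection is computed against `threshold`.
fn rows_to_decode(pages: &[PageStats], threshold: i64, descending: bool) -> usize {
    pages
        .iter()
        .filter(|p| page_survives(p, threshold, descending))
        .map(|p| p.rows)
        .sum()
}
```

With three 100-row pages covering 0..=99, 100..=199, and 200..=299, a selection computed at file open (threshold effectively unbounded) keeps all 300 rows, while one recomputed after the ascending threshold tightens to 150 keeps only the first two pages.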
Proposal
At every RG boundary (PushDecoderStreamState::transition):
Check tracker.changed() (the same single atomic load #22450 uses)
If changed: rebuild a fresh PagePruningPredicate from the latest filter
Walk the remaining RGs in the access plan; refine each RowSelection via prune_plan_with_page_index_and_metrics
Apply via the existing into_builder() → with_row_groups(...) → build()
Errors fall back to "keep current selection" (mirrors should_prune).
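The control flow of the proposal can be sketched with toy types. Everything below is hypothetical and stands in for the real pieces: `DynamicThreshold` for the TopK dynamic filter, `Tracker` for the change tracker, `RowGroupPlan` for a per-RG entry in the access plan, and a `Vec<bool>` page mask for RowSelection. The sketch only shows the shape of "check, rebuild, refine, fall back on error"; it does not use the DataFusion or arrow-rs APIs.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical stand-in for a TopK dynamic filter of the form `col < threshold`.
struct DynamicThreshold {
    generation: AtomicU64,
    value: AtomicI64,
}

impl DynamicThreshold {
    fn new(value: i64) -> Self {
        Self { generation: AtomicU64::new(0), value: AtomicI64::new(value) }
    }

    /// Called by the (hypothetical) TopK operator when its heap tightens.
    fn tighten(&self, value: i64) {
        self.value.store(value, Ordering::Release);
        self.generation.fetch_add(1, Ordering::Release);
    }
}

/// Remembers the last generation seen; `changed` costs one atomic load.
struct Tracker {
    last_seen: u64,
}

impl Tracker {
    fn changed(&mut self, filter: &DynamicThreshold) -> bool {
        let current = filter.generation.load(Ordering::Acquire);
        let changed = current != self.last_seen;
        self.last_seen = current;
        changed
    }
}

/// Hypothetical page statistics: only the minimum matters for `col < threshold`.
struct PageStats {
    min: i64,
}

/// One not-yet-decoded row group: page stats plus the current per-page keep mask.
struct RowGroupPlan {
    pages: Vec<PageStats>,
    selection: Vec<bool>,
}

/// Recompute the keep mask. It can only narrow the existing mask, never widen it.
fn refine_selection(rg: &RowGroupPlan, threshold: i64) -> Result<Vec<bool>, String> {
    if rg.pages.len() != rg.selection.len() {
        return Err(format!(
            "page index covers {} pages but selection has {}",
            rg.pages.len(),
            rg.selection.len()
        ));
    }
    Ok(rg
        .pages
        .iter()
        .zip(&rg.selection)
        .map(|(page, &keep)| keep && page.min < threshold)
        .collect())
}

/// Runs at each row-group boundary. Returns how many remaining RGs were refined.
fn refresh_at_boundary(
    tracker: &mut Tracker,
    filter: &DynamicThreshold,
    remaining: &mut [RowGroupPlan],
) -> usize {
    if !tracker.changed(filter) {
        return 0; // filter unchanged: keep every selection as-is
    }
    let threshold = filter.value.load(Ordering::Acquire);
    let mut refined = 0;
    for rg in remaining.iter_mut() {
        let result = refine_selection(rg, threshold);
        match result {
            Ok(selection) => {
                rg.selection = selection;
                refined += 1;
            }
            Err(_) => {} // error: fall back to the current selection
        }
    }
    refined
}
```

The design choice worth noting is that refinement intersects with the existing mask, so a failed or skipped refresh can never make a selection looser than the one computed at file open.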
Expected wins
Saves filter-column IO, decompression, and decoding for individual dead pages, extending #22450's "chip away Layer B residue" philosophy from RG to page granularity.
Most useful when:
RGs are large (many pages each)
Threshold tightens significantly mid-scan (e.g. after the first few RGs fill the heap)
Prerequisites
Page index is enabled: datafusion.execution.parquet.enable_page_index = true (without it, the refresh is a no-op)
Predicate chain contains a DynamicFilter (TopK source)
Open design questions
Refresh frequency: every RG boundary, or only when tracker.changed() returns true?
Granularity: refresh access plan for all surviving RGs, or only the next one to be touched?
arrow-rs API gap: does the existing with_row_groups(...) path accept an updated per-RG RowSelection, or do we need a new arrow-rs API hook? (May overlap with arrow-rs#10158 territory.)
Stretch goal · mid-RG refresh: refresh between pages of the same RG, not just at RG boundary. Needs a brand-new arrow-rs "mid-RG predicate adapt" callback hook.
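The granularity question can be framed as a cost comparison. The sketch below counts page-statistics evaluations under two hypothetical policies in a toy model: "eager" re-evaluates every remaining RG whenever the filter changed at a boundary, and "lazy" re-evaluates only the RG about to be decoded. The function names and the cost model are assumptions for illustration; they say nothing about what the arrow-rs builder path actually permits.

```rust
/// `pages_per_rg[i]` is the page count of row group `i`.
/// `changed_at[i]` is whether the dynamic filter changed when the scan
/// reached the boundary just before row group `i`.
/// Both slices are assumed to have the same length.

/// Eager policy: on each change, re-evaluate all pages of every remaining RG.
fn eager_evaluations(pages_per_rg: &[usize], changed_at: &[bool]) -> usize {
    let mut total = 0;
    for (i, &changed) in changed_at.iter().enumerate() {
        if changed {
            total += pages_per_rg[i..].iter().sum::<usize>();
        }
    }
    total
}

/// Lazy policy: re-evaluate only the next RG. Because that RG's selection
/// dates from file open, it must be refreshed if the filter changed at ANY
/// earlier boundary, not just the current one.
fn lazy_evaluations(pages_per_rg: &[usize], changed_at: &[bool]) -> usize {
    let mut total = 0;
    let mut changed_since_open = false;
    for (i, &changed) in changed_at.iter().enumerate() {
        changed_since_open |= changed;
        if changed_since_open {
            total += pages_per_rg[i];
        }
    }
    total
}
```

In this model the lazy policy never evaluates a page more than once, but it has to compare against the filter state at file open rather than rely on a "changed since last boundary" flag alone, which is a slightly different contract from the one tracker.changed() offers in the proposal.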
Related
#23067: Per-RG fully_matched RowFilter skip (needs arrow-rs#10158, peek_next_row_group)
Part of the Sort Pushdown EPIC #23036, future direction.