Skip to content

perf: emit WindowAggExec output in batch_size chunks#23206

Draft
Dandandan wants to merge 1 commit into
apache:mainfrom
Dandandan:perf/window-agg-exec-batch-size
Draft

perf: emit WindowAggExec output in batch_size chunks#23206
Dandandan wants to merge 1 commit into
apache:mainfrom
Dandandan:perf/window-agg-exec-batch-size

Conversation

@Dandandan

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

WindowAggExec (the non-streaming window operator, used when a frame ends in UNBOUNDED FOLLOWING, a UDWF lacks bounded execution, etc.) buffers the entire input, computes all window columns, and then emits the result as one RecordBatch sized to the whole input. That forces every downstream operator which doesn't internally coalesce (sort ingest, joins, the client, …) to hold a single batch covering all rows at once — unlike AggregateExec (row_hash.rs) and BoundedWindowAggExec, which both honor batch_size.

What changes are included in this PR?

  • WindowAggStream now stores the fully-computed result and emits it in batch_size-row slices across polls. Slicing is zero-copy (RecordBatch::slice adjusts offset/length over shared buffers), so this adds no per-row work and no extra copy — it only bounds the batch each downstream operator must hold.
  • batch_size is read from the session config in execute (before context is moved into the child) and clamped to at least 1.

Scope note: the window computation itself is unchanged, so this does not reduce WindowAggExec's own peak memory (it still buffers all input + the concatenated copy). That is a separate, larger concern; this PR only stops forcing a mega-batch onto downstream consumers.

Are these changes tested?

Yes:

  • A new unit test asserts a 10-row result is emitted as 4/4/2-row chunks with batch_size = 4, and that the running-count window column is unaffected by chunking.
  • The full window sqllogictest suite passes (all 6 files). Plan-shape tests are unaffected (only output batching changes, not the plan).

Are there any user-facing changes?

No behavior change beyond output batch sizing (results and ordering are identical); WindowAggExec now honors the configured batch_size like other operators.

`WindowAggExec` (the non-streaming window operator used when a frame ends
in UNBOUNDED FOLLOWING, etc.) buffers the entire input, computes all
window columns, and then emits the result as a single `RecordBatch`
sized to the whole input. This forces every downstream operator that
doesn't internally coalesce (sort ingest, joins, the client, ...) to
hold one batch covering all rows at once, unlike `AggregateExec` and
`BoundedWindowAggExec`, which honor `batch_size`.

This emits the computed result in `batch_size`-row slices across polls.
Slicing is zero-copy (`RecordBatch::slice` adjusts offset/length over
shared buffers), so it adds no per-row work and no extra copy; it only
bounds the batch each downstream operator must hold. The window
computation itself is unchanged, so this does not reduce WindowAggExec's
own peak (it still buffers all input) — a separate, larger concern.

`batch_size` is read from the session config in `execute` and clamped to
at least 1. A unit test asserts a 10-row result is emitted as 4/4/2-row
chunks with `batch_size = 4` and that the running-count column is
unaffected; the `window` sqllogictest suite passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant