
fix: track join_arrays memory in reservation after SMJ spill #21962

Open

SubhamSinghal wants to merge 5 commits into apache:main from SubhamSinghal:smj-spill-join-arrays-memory-accounting

Conversation

@SubhamSinghal
Contributor

Which issue does this PR close?

Related to the TODO at materializing_stream.rs:283 (from #17429): spilled BufferedBatch join key arrays are not tracked in the memory reservation.

Rationale for this change

When a BufferedBatch is spilled to disk in Sort Merge Join, only the RecordBatch data is written to the IPC file. The join_arrays (evaluated join key columns) remain in memory because the merge-scan comparator needs them to detect key group boundaries.
Before this fix, these in-memory join_arrays were invisible to the memory pool:

allocate_reservation():
    try_grow(size_estimation) → FAILS (pool full)
    spill batch to disk
    → join_arrays still in memory, but the reservation was never grown
    → pool thinks 0 bytes are used for this batch

free_reservation():
    if InMemory → shrink(size_estimation)
    if Spilled  → no-op ← correct (nothing was grown), but the join_arrays are invisible

With many spilled batches for a skewed key (e.g., millions of rows sharing the same join key), the untracked join_arrays memory accumulates. The memory pool cannot account for this when making spill decisions for concurrent operators.
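
For a rough sense of scale, here is an illustrative calculation with Arrow (the single Int64 key column and the 8192-row batch are hypothetical, not taken from this PR):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int64Array};

fn main() {
    // Hypothetical: the evaluated join key column of one 8192-row batch.
    let join_key: ArrayRef = Arc::new(Int64Array::from_iter_values(0..8192));

    // Roughly 64 KB of buffer data that stays resident after the RecordBatch
    // itself has been spilled to disk.
    println!("join_arrays bytes: {}", join_key.get_array_memory_size());

    // Pre-fix, this memory was never grown in the reservation, so thousands
    // of spilled batches for a skewed key leave it invisible to the pool.
}
```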

What changes are included in this PR?

Memory accounting fix (materializing_stream.rs):

  • Add a reserved_amount field to BufferedBatch — tracks how much memory was actually reserved in the pool for this batch
  • Add a join_arrays_mem() helper — computes the total memory of the join key arrays
  • allocate_reservation(): after spilling, calls try_grow(join_arrays_mem) to track the remaining in-memory data. If the pool is too tight for even that, reserved_amount stays 0 (best-effort, safe)
  • free_reservation(): shrinks by reserved_amount instead of checking the InMemory variant. Invariant: only shrink by what was actually grown — no underflow risk (a simplified sketch of this flow follows the table below)
  Scenario              try_grow              reserved_amount    try_shrink        Safe?
  InMemory              Ok(size_estimation)   size_estimation    size_estimation   Yes
  Spilled, tracked      Ok(join_arrays_mem)   join_arrays_mem    join_arrays_mem   Yes
  Spilled, pool tight   Err                   0                  0 (no-op)         Yes
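
A minimal sketch of this accounting flow, using simplified stand-ins for the real types (Reservation here is not DataFusion's MemoryReservation, and the spill itself is elided; field and helper names follow the description above):

```rust
/// Simplified stand-in for DataFusion's MemoryReservation (sketch only).
struct Reservation {
    used: usize,
    limit: usize,
}

impl Reservation {
    fn try_grow(&mut self, bytes: usize) -> Result<(), ()> {
        if self.used + bytes <= self.limit {
            self.used += bytes;
            Ok(())
        } else {
            Err(())
        }
    }
    fn shrink(&mut self, bytes: usize) {
        self.used -= bytes;
    }
}

struct BufferedBatch {
    size_estimation: usize, // estimate for the full batch (RecordBatch + join_arrays)
    join_arrays_mem: usize, // join key arrays that stay in memory even after a spill
    reserved_amount: usize, // how much was actually grown in the pool for this batch
}

fn allocate_reservation(res: &mut Reservation, batch: &mut BufferedBatch) {
    if res.try_grow(batch.size_estimation).is_ok() {
        // In-memory case: the whole estimate is tracked.
        batch.reserved_amount = batch.size_estimation;
    } else {
        // Spill the RecordBatch to disk (elided here), then best-effort track
        // the join_arrays that remain resident for the merge-scan comparator.
        batch.reserved_amount = if res.try_grow(batch.join_arrays_mem).is_ok() {
            batch.join_arrays_mem
        } else {
            0 // pool too tight even for that: nothing grown, nothing to shrink later
        };
    }
}

fn free_reservation(res: &mut Reservation, batch: &BufferedBatch) {
    // Invariant: shrink by exactly what was grown, so no underflow in any row
    // of the table above.
    res.shrink(batch.reserved_amount);
}

fn main() {
    let mut pool = Reservation { used: 0, limit: 500 };
    let mut batch = BufferedBatch {
        size_estimation: 800, // exceeds the limit, so this batch takes the spill path
        join_arrays_mem: 64,
        reserved_amount: 0,
    };
    allocate_reservation(&mut pool, &mut batch);
    assert_eq!(batch.reserved_amount, 64); // join_arrays tracked after the spill
    free_reservation(&mut pool, &batch);
    assert_eq!(pool.used, 0); // pool fully released
}
```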

Tests (tests.rs):

  • spill_many_batches_same_key — 10+5 batches all sharing key=1, verifies correctness under heavy spilling
  • spill_string_join_keys — Utf8 join keys to exercise larger join_arrays footprint
  • spill_mixed_keys_some_match — multiple distinct keys with partial matching, tests Full outer join NULL rows from spilled batches
  • spill_join_arrays_memory_accounting — verifies memory pool is fully released after join completes (memory_pool.reserved() == 0) and peak_mem_used > 0
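
The core assertion pattern of that last test, sketched in isolation with DataFusion's public memory-pool API (the SortMergeJoin plumbing and the concrete byte counts here are illustrative, not the actual test code):

```rust
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

fn main() -> Result<()> {
    // Stand-in for the tight pool handed to the join in the real test.
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(500));
    let mut reservation = MemoryConsumer::new("smj_buffered").register(&pool);

    // Grow for the join_arrays of a spilled batch (illustrative size)...
    reservation.try_grow(128)?;
    assert!(pool.reserved() > 0); // usage is visible while the join runs

    // ...and shrink by exactly reserved_amount when the batch is released.
    reservation.shrink(128);
    assert_eq!(pool.reserved(), 0); // pool fully released once the join completes
    Ok(())
}
```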

Are these changes tested?

Yes. Four new tests added covering heavy spilling with same-key batches, string join keys, mixed keys with partial matching, and memory pool accounting verification.

Are there any user-facing changes?

No.

@github-actions bot added the physical-plan (Changes to the physical-plan crate) label Apr 30, 2026
@mbutrovich self-requested a review April 30, 2026 15:38
@mbutrovich
Contributor

mbutrovich commented Apr 30, 2026

Thanks for picking this up. The accounting fix itself reads cleanly, and the reserved_amount invariant (only shrink by what you grew) is easy to follow.

A few things I wanted to ask about:

Tests that fail without the fix?

overallocation_multi_batch_spill already covers the same shape as spill_many_batches_same_key: N+M batches all on one key, 500-byte limit, same join types, same [1, 50] batch sizes, same spilled-vs-non-spilled equality check. The new tests assert the same kinds of things, so I think they pass on main too.

spill_join_arrays_memory_accounting asserts pool.reserved() == 0 after the join, but that was already true before this PR: pre-fix, spilled batches never grew the reservation, so there was nothing left to release either.

Since this is a pure accounting fix with no observable behavior change, could you point to a test here that fails on main and passes on this branch? If one is hard to construct, I wonder if we'd be better off with a single targeted assertion added to an existing spill test (something like a mid-execution check that pool.reserved() reflects the spilled batches' join_arrays), rather than ~440 lines of new tests. Happy to hear if I'm missing what these catch.

peak_mem_used on the spill path

In allocate_reservation, the Ok(_) arm calls peak_mem_used().set_max(self.reservation.size()), but the new try_grow(join_arrays_mem) after a spill doesn't. For spill-heavy workloads that metric will under-report now that the pool is actually being grown on that path. Worth mirroring the set_max call there?
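
A toy illustration of the under-reporting (plain Rust, not code from this branch or the DataFusion API):

```rust
fn main() {
    // Reservation size as the pool sees it vs. the peak_mem_used metric.
    let mut reservation_size = 0usize;
    let mut peak_mem_used = 0usize;

    // In-memory arm: grow by the full estimate and record the high-water mark.
    reservation_size += 300;
    peak_mem_used = peak_mem_used.max(reservation_size); // the existing set_max

    // Spill path after this PR: the pool is grown by join_arrays_mem, but
    // without a mirrored set_max the metric keeps its earlier value.
    reservation_size += 120;
    // peak_mem_used = peak_mem_used.max(reservation_size); // the missing mirror

    assert!(peak_mem_used < reservation_size); // metric under-reports by 120
}
```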

Test duplication

The three spill_* tests are structurally very similar (spill run + baseline, nested batch_size and join_type loops). Would a small helper parameterized over (left, right, on, memory_limit) be worth it, or do you think the explicit form reads better here?

Minor

  • The inline comment reserved_amount: 0, // set by allocate_reservation() is already covered by the doc comment on the field itself; it could probably be dropped.
  • The field doc listing the three cases (InMemory / Spilled-tracked / Spilled-untracked) is really helpful, thanks for including it.

Overall the change looks safe and the invariant is nice. Mainly curious about the test angle before signing off.

@SubhamSinghal
Contributor Author

@mbutrovich Thanks for reviewing this PR. I have removed the redundant tests and added a unit test that always hits the spill path; with it, peak_memory is 0 on main and non-zero on this branch.

Added the set_max call on the spill path.

