fix: track join_arrays memory in reservation after SMJ spill #21962
SubhamSinghal wants to merge 5 commits into apache:main
Thanks for picking this up, the accounting fix itself reads cleanly and the
A few things I wanted to ask about:
Tests that fail without the fix?
Since this is a pure accounting fix with no observable behavior change, could you point to a test here that fails on
In
Test duplication
The three
Minor
Overall the change looks safe and the invariant is nice. Mainly curious about the test angle before signing off.
@mbutrovich Thanks for reviewing this PR. I have removed the redundant tests. I have added a unit test that always hits the spill path, with this Added
Which issue does this PR close?
Related to the TODO at materializing_stream.rs:283 (from #17429): spilled BufferedBatch join key arrays are not tracked in memory reservation.
Rationale for this change
When a BufferedBatch is spilled to disk in Sort Merge Join, only the RecordBatch data is written to the IPC file. The join_arrays (evaluated join key columns) remain in memory because the merge-scan comparator needs them to detect key group boundaries.
Before this fix, these in-memory join_arrays were invisible to the memory pool:
allocate_reservation():
    try_grow(size_estimation) → FAILS (pool full)
    spill batch to disk
    → join_arrays still in memory, but reservation was never grown
    → pool thinks 0 bytes are used for this batch
free_reservation():
    if InMemory → shrink(size_estimation)
    if Spilled → no-op ← correct (nothing was grown), but join_arrays are invisible
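The pre-fix accounting outlined above can be modeled with a toy pool. This is only a rough sketch: `ToyPool` and `pre_fix_accounting` are hypothetical names, not DataFusion's API, and it assumes `size_estimation` is simply the batch bytes plus the join key bytes.

```rust
/// Hypothetical stand-in for a bounded memory pool (not DataFusion's API).
struct ToyPool {
    limit: usize,
    reserved: usize,
}

impl ToyPool {
    /// Reserves `bytes` if they fit under the limit, otherwise returns an error.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.reserved + bytes > self.limit {
            return Err(format!("cannot reserve {} bytes", bytes));
        }
        self.reserved += bytes;
        Ok(())
    }
}

/// Simulates the pre-fix accounting for `n` equal batches and returns
/// (bytes the pool believes are reserved, join key bytes resident but untracked).
fn pre_fix_accounting(
    limit: usize,
    n: usize,
    batch_bytes: usize,
    key_bytes: usize,
) -> (usize, usize) {
    let mut pool = ToyPool { limit, reserved: 0 };
    let mut untracked = 0;
    for _ in 0..n {
        // Assumption: the estimate covers the batch data plus its key arrays.
        let size_estimation = batch_bytes + key_bytes;
        if pool.try_grow(size_estimation).is_err() {
            // Spill: batch data goes to disk, key arrays stay in memory,
            // and nothing is recorded in the pool for them.
            untracked += key_bytes;
        }
    }
    (pool.reserved, untracked)
}
```

In this model, `pre_fix_accounting(1000, 10, 400, 50)` returns `(900, 400)`: the pool reports 900 bytes while another 400 bytes of key arrays are resident and unaccounted for.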
With many spilled batches for a skewed key (e.g., millions of rows sharing the same join key), the untracked join_arrays memory accumulates. The memory pool cannot account for this when making spill decisions for concurrent operators.
What changes are included in this PR?
Memory accounting fix (materializing_stream.rs):
- Adds a reserved_amount field to BufferedBatch: tracks how much memory was actually reserved in the pool for this batch
- Adds a join_arrays_mem() helper: computes the total memory of the join key arrays
- allocate_reservation(): after spilling, calls try_grow(join_arrays_mem) to track the remaining in-memory data. If the pool is too tight for even that, reserved_amount stays 0 (best-effort, safe)
- free_reservation(): shrinks by reserved_amount instead of checking the InMemory variant. Invariant: only shrink by what was actually grown, so there is no underflow risk
Tests (tests.rs):
- spill_many_batches_same_key: 10+5 batches all sharing key=1, verifies correctness under heavy spilling
- spill_string_join_keys: Utf8 join keys to exercise a larger join_arrays footprint
- spill_mixed_keys_some_match: multiple distinct keys with partial matching, tests Full outer join NULL rows from spilled batches
- spill_join_arrays_memory_accounting: verifies the memory pool is fully released after the join completes (memory_pool.reserved() == 0) and peak_mem_used > 0
Are these changes tested?
Yes. Four new tests added covering heavy spilling with same-key batches, string join keys, mixed keys with partial matching, and memory pool accounting verification.
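The reserved_amount invariant and the "pool fully released" check can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the PR's code: `SketchPool`, `SketchBatch`, and `run_sketch` are hypothetical, and only byte counts are modeled.

```rust
/// Hypothetical bounded pool that also records its peak (not DataFusion's API).
struct SketchPool {
    limit: usize,
    reserved: usize,
    peak: usize,
}

impl SketchPool {
    /// Reserves `bytes` if they fit under the limit; returns whether it succeeded.
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.reserved + bytes > self.limit {
            return false;
        }
        self.reserved += bytes;
        self.peak = self.peak.max(self.reserved);
        true
    }

    /// Releases `bytes`; panics if more is released than was reserved.
    fn shrink(&mut self, bytes: usize) {
        self.reserved = self.reserved.checked_sub(bytes).expect("shrink underflow");
    }
}

/// Illustrative batch: only sizes are modeled, not the data itself.
struct SketchBatch {
    size_estimation: usize,
    join_arrays_mem: usize,
    spilled: bool,
    reserved_amount: usize,
}

fn allocate_reservation(pool: &mut SketchPool, batch: &mut SketchBatch) {
    if pool.try_grow(batch.size_estimation) {
        batch.reserved_amount = batch.size_estimation;
        return;
    }
    // Pool is full: the batch data is spilled, the key arrays stay resident.
    batch.spilled = true;
    // Best effort: track the key arrays if the pool has room, else leave 0.
    if pool.try_grow(batch.join_arrays_mem) {
        batch.reserved_amount = batch.join_arrays_mem;
    }
}

fn free_reservation(pool: &mut SketchPool, batch: &mut SketchBatch) {
    // Shrink by exactly what was grown, whether or not the batch was spilled.
    pool.shrink(batch.reserved_amount);
    batch.reserved_amount = 0;
}

/// Buffers `n` equal batches, frees them all, and returns
/// (peak reserved bytes, reserved bytes at the end, number of spilled batches).
fn run_sketch(limit: usize, n: usize, batch_bytes: usize, key_bytes: usize) -> (usize, usize, usize) {
    let mut pool = SketchPool { limit, reserved: 0, peak: 0 };
    let mut batches: Vec<SketchBatch> = (0..n)
        .map(|_| SketchBatch {
            size_estimation: batch_bytes + key_bytes,
            join_arrays_mem: key_bytes,
            spilled: false,
            reserved_amount: 0,
        })
        .collect();
    for batch in batches.iter_mut() {
        allocate_reservation(&mut pool, batch);
    }
    let spilled = batches.iter().filter(|b| b.spilled).count();
    for batch in batches.iter_mut() {
        free_reservation(&mut pool, batch);
    }
    (pool.peak, pool.reserved, spilled)
}
```

With a 1000-byte limit, `run_sketch(1000, 10, 400, 50)` returns `(1000, 0, 8)` in this model: two batches stay in memory, two spilled batches get their key arrays tracked, six spilled batches keep reserved_amount at 0, and the pool ends at 0 bytes with no underflow.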
Are there any user-facing changes?
No.