feat: multi-RG output at metric_name boundaries (PR-6c)#6410
Draft
g-talbot wants to merge 1 commit intogtt/streaming-merge-engine-mergerfrom
Draft
feat: multi-RG output at metric_name boundaries (PR-6c)#6410g-talbot wants to merge 1 commit intogtt/streaming-merge-engine-mergerfrom
g-talbot wants to merge 1 commit intogtt/streaming-merge-engine-mergerfrom
Conversation
6226032 to
add52f0
Compare
bcaa08b to
5d486d5
Compare
add52f0 to
e1990b2
Compare
5d486d5 to
3ee8ef9
Compare
e1990b2 to
2af2fa8
Compare
3ee8ef9 to
307a981
Compare
2af2fa8 to
b2eee32
Compare
307a981 to
0f6892a
Compare
b2eee32 to
85b679a
Compare
Slices each merge output into row groups at metric_name transitions, greedily packing adjacent metric_name runs until adding the next would exceed `MergeConfig::writer_config.row_group_size`. Outputs that fit in one RG remain single-RG (vacuously sort-prefix-aligned); outputs that exceed the cap roll into multiple RGs, every boundary aligned to a metric_name transition. A single metric_name run that on its own exceeds the cap stays as one RG — splitting it would break sort-prefix alignment, which the slicer prefers to preserve over the row-count budget. Implementation: `pack_into_metric_aligned_rgs` finds metric_name run boundaries (handling both Utf8 and Dictionary(_, Utf8), since the post-`align_inputs_to_union_schema` schema is Utf8 and `optimize_output_batch` may dict-encode if cardinality is low), then applies a greedy packer. `column_major_write` issues one `start_row_group` / column writes / `finish` cycle per RG range, with zero-copy `Array::slice` per column. Tests (6 new, all passing): RG boundaries align with metric_name transitions when row_group_size forces splits; runs pack into one RG when total ≤ threshold; runs pack-then-roll across the threshold; a single run larger than threshold becomes one RG; row count preservation across multi-RG outputs (MC-1); multi-RG output drainable by `StreamDecoder` round-trip. Note: `qh.rg_partition_prefix_len` propagation from input/to output is left to PR-7 (wiring) since the marker constant lives on PR-1's branch and isn't yet in the streaming-base. PR-6c produces metric-aligned output structurally; PR-7 wires the metadata claim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0f6892a to
4ce95e5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
metric_nametransitions, greedily packing adjacent runs until adding the next would exceedMergeConfig::writer_config.row_group_size.Implementation
pack_into_metric_aligned_rgs(batch, max_rg_rows)finds metric_name run boundaries (handles bothUtf8andDictionary(_, Utf8)since post-align_inputs_to_union_schemaisUtf8andoptimize_output_batchmay dict-encode for low-cardinality strings) and runs a greedy packer.column_major_writeissues onestart_row_group/ column writes /finishcycle per RG range; columns are zero-copyArray::sliceviews into the parent sorted batch.metric_name(defensive — production schemas always include it), falls back to single-RG with adebug_assert!(false, ...).What's not in this PR
qh.rg_partition_prefix_lenmetadata propagation. The marker constant lives on PR-1's branch (gtt/parquet-page-stats-and-marker, not yet merged); PR-6c produces metric-aligned output structurally and PR-7 (wiring) will add the metadata claim once PR-1 lands. Until then PR-6c outputs look prefix=0/absent to readers — they'll route through the legacy adapter unnecessarily but correctly.Stack
Test plan
test_multi_rg_at_metric_name_boundaries— three metric_name runs ≤ row_group_size produces 3 RGs each containing one metric_nametest_metric_name_runs_packed_into_single_rg_when_under_threshold— three small runs total ≤ threshold pack into 1 RGtest_metric_name_runs_pack_then_roll— pack until the next run pushes past threshold, then rolltest_single_run_larger_than_threshold_stays_one_rg— 250 rows, threshold 50 → still 1 RG (alignment preserved)test_multi_rg_total_rows_preserved— MC-1 across 4 metric_names with mixed sizestest_multi_rg_output_drainable_by_stream_decoder— round-trip output →StreamDecoderyields one batch per RGCI gates locally green: clippy
--workspace --all-features --testswith-Dwarnings, nightlyfmt --check,cargo doc --no-deps,cargo machete, license headers, log format, typos. 417/417 crate tests pass (16 streaming tests, +6 new from PR-6c).🤖 Generated with Claude Code