feat: column-major streaming Parquet writer primitive #6384
Conversation
Force-pushed from e014bbf to e134191
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 844bdc23c2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
https://github.com/quickwit-oss/quickwit/blob/1949915db885d4a47381645ab327c4ada353223e/quickwit-parquet-engine/src/storage/streaming_writer.rs#L432-L438
Write UInt32 columns with the INT32 physical writer
When write_next_column_arrays handles a DataType::UInt32 field, this branch selects write_int64_compatible, but try_new builds the Parquet schema with ArrowSchemaConverter, where Arrow UInt32 is an INT32 physical column with unsigned logical annotation. The direct SerializedColumnWriter path must use the physical writer type, so any UInt32 column streamed through this API will be written through an Int64 writer against an INT32 column and fail at runtime; this affects existing schemas such as the DDSketch flags field (quickwit-parquet-engine/src/schema/sketch_fields.rs:68).
`ArrowSchemaConverter` maps Arrow `UInt32` to a parquet INT32 *physical* type with an unsigned logical annotation, not INT64. The earlier dispatch routed `DataType::UInt32` through the Int64 column writer (`write_int64_compatible`), which fails at runtime with a physical-type mismatch when the row group is finalised — affecting any production schema with a `UInt32` field, e.g. the sketch `flags` column in `quickwit-parquet-engine/src/schema/sketch_fields.rs`.

Switches `UInt32` to `write_int32_compatible` with `value(i) as i32` (a bit-reinterpreting cast); the unsigned logical annotation tells readers to interpret the 32 bits back as `u32`, so the round trip is bit-exact. Updates the doc-comment limitation list to put `UInt32` in the INT32 group.

Adds `test_uint32_round_trip_through_array_stream` covering the high-bit range (`0x8000_0000`, `0xFFFF_FFFF`, etc.) — values that would clip if the writer used a signed `i32` interpretation without the unsigned logical annotation.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
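The cast at the heart of this fix can be checked in isolation. A minimal sketch (the helper names here are illustrative, not the production code in `streaming_writer.rs`):

```rust
// Writing u32 values through an INT32 physical column relies on
// `u32 as i32` being a pure two's-complement bit reinterpretation:
// no clipping, no sign extension. A reader honoring the unsigned
// logical annotation casts the 32 bits back with `i32 as u32`.
fn encode_u32_as_i32(v: u32) -> i32 {
    v as i32 // bit reinterpretation, never lossy
}

fn decode_i32_as_u32(v: i32) -> u32 {
    v as u32 // exact inverse of the cast above
}

fn main() {
    // High-bit values are exactly the ones a signed interpretation
    // would clip if the writer saturated instead of reinterpreting.
    for v in [0u32, 1, 0x7FFF_FFFF, 0x8000_0000, 0xFFFF_FFFF] {
        assert_eq!(decode_i32_as_u32(encode_u32_as_i32(v)), v);
    }
    // 0x8000_0000 lands on i32::MIN on the wire, but round-trips.
    assert_eq!(encode_u32_as_i32(0x8000_0000), i32::MIN);
    println!("u32 <-> i32 round trip is bit-exact");
}
```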
Thanks for the review — fix pushed.
@codex review |
Wraps SerializedFileWriter directly to expose a per-column write API
that flushes one column chunk at a time. Peak memory per row group
is bounded by a single column chunk plus small bookkeeping (bloom
filters + page indexes), not by the whole row group.
PR-2 of the streaming-merge stack. No production callers in this PR;
PR-3 cuts ingest over to a single-RG writer built on this primitive,
PR-6 cuts the merge engine over.
Two write APIs on RowGroupBuilder:
- `write_next_column(&ArrayRef)` — one whole arrow column per call.
Used by ingest (PR-3), where the data is already a complete batch.
- `write_next_column_arrays(impl IntoIterator<Item = ArrayRef>)` —
next column as a stream of arrow arrays, each typically one input
page's worth of values. Used by PR-6's merge engine, which
natively works with arrow arrays for sort-key comparison and
produces an array per input page as it drives the merge. Output
page boundaries are determined by the writer's data_page_size_limit
and need not match the input partition; logical values round-trip.
Memory: bounded by one input array plus the writer's internal page
accumulator (~one output-page worth) — bounded by *page* size, not
column-chunk size.
The arrays-input path was originally planned as a raw-page-bytes
input (`InputPage { header, bytes }`) to enable direct page copy in
PR-6 (no decompress + recompress). Direct page copy turns out to
require parquet 58's `next_column_with_factory` to be public, which
it is not (`pub(crate)`). That optimization is deferred to a future
arrow-rs upstream PR plus a follow-up here. The arrays-input shape
delivers the same per-page memory bound today using only the public
parquet API.
The caller contract is a 4-step state machine (start_row_group →
{write_next_column | write_next_column_arrays} × N → finish →
repeat → close). Out-of-order calls return a structured error
rather than panicking. Top-level arrow fields must each map to
exactly one parquet leaf, which the metrics schema satisfies;
nested types are rejected at start_row_group with a clear message.
21 tests cover: round-trip (single RG, multi RG, with nulls),
metadata identity vs ArrowWriter (single + multi RG, with KV
entries and sorting_columns populated), per-column statistics
enabled propagation, bloom filter functional round-trip,
bounded-memory contract, empty row group, four caller-contract
violations, plus 8 new array-stream tests covering page-stream
round-trip with nulls, varied chunk sizes, single-row-per-array,
empty-iterator edge case, row-count-mismatch error, too-many-
columns error, and bounded-memory under array streaming.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bounded)

Reviewer caught that the previous `write_next_column_arrays` implementation, despite taking arrays one at a time, still buffered the entire encoded column chunk in `ArrowColumnWriter`'s internal `SharedColumnChunk` until `close + append_to_row_group`. Memory scaled with column-chunk size, not page size — silently violating the page-bounded contract the API documented.

Switch to `row_group_writer.next_column()`, which returns a `SerializedColumnWriter` whose `SerializedPageWriter` flushes pages directly through `TrackedWrite` to the user's sink. The Arrow → typed-values dispatch that `ArrowColumnWriter` provided internally is reimplemented at the call site for the metrics schema's physical types: Boolean, Int8/16/32, UInt8/16, Int64, UInt32/64, Float32/64, Utf8/LargeUtf8, Binary/LargeBinary, and Dictionary over those (materialized via `take`).

Tests:
- Renamed `test_array_stream_bounded_memory_per_column` to `test_array_stream_pending_writers_drained_per_column`. The previous name overstated the test — `pending_writers_memory_size` only sees pre-allocated writers for *future* columns, never the in-flight column's `SharedColumnChunk`. The renamed version documents what it actually tests.
- Added `test_array_stream_pages_flush_incrementally`, which configures uncompressed UInt64 with pages larger than the BufWriter capacity. It asserts the per-column-write sink-write count is consistent with one write per page (SerializedColumnWriter path) and inconsistent with io::copy-at-end (ArrowColumnWriter path).

Direct page copy (input bytes → output bytes, no decompress + recompress) for columns whose pages all come from one input remains a deferred follow-up; arrow-rs's `SerializedRowGroupWriter::next_column_with_factory` is `pub(crate)`, blocking that optimization until an upstream PR lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
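The incremental-flush assertion boils down to counting writes into the sink. A std-only sketch of such a counting sink (`CountingSink` is a hypothetical name; the real test wires an equivalent under the writer's `TrackedWrite`):

```rust
use std::io::{self, Write};

// Counts how many times the producer flushes into the sink. A
// page-at-a-time flusher (the SerializedColumnWriter path) makes
// roughly one write per page; a copy-the-chunk-at-the-end path
// (ArrowColumnWriter) makes far fewer, larger writes.
struct CountingSink {
    writes: usize,
    bytes: Vec<u8>,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writes += 1;
        self.bytes.extend_from_slice(buf);
        Ok(buf.len()) // consume the whole buffer in one call
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut sink = CountingSink { writes: 0, bytes: Vec::new() };
    // Simulate a page-at-a-time writer emitting 8 pages of 16 bytes.
    for page in 0..8u8 {
        sink.write_all(&[page; 16]).unwrap();
    }
    assert_eq!(sink.writes, 8); // one sink write per page
    assert_eq!(sink.bytes.len(), 128);
}
```

The real test compares this count against what a chunk-buffered writer would produce, which is how it distinguishes the two code paths without inspecting writer internals.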
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a96342addd
PR-1 (#6377) merged today and switched the default `ParquetWriterConfig::statistics_enabled` from `Chunk` to `Page`. `test_statistics_enabled_propagates` was asserting the old `Chunk` value, so it failed under PR-2's CI once GitHub auto-merged PR-1's main into PR-2. Updates the assertion to expect `Page` and extends the test to also verify that the column index and offset index land in the file — the new behaviour PR-1 brings to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from a96342a to 396f1e0
The DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`) columns are top-level arrow fields with exactly one parquet leaf, so they passed the leaf-count check at `start_row_group` but then hit the unmatched `DataType::List(_)` arm in `write_array_via_serialized_column_writer` and panicked deep in the column write — leaving the row-group builder on an unrecoverable error path. Codex review on #6384 caught the gap; sketches must work end-to-end through the streaming engine.

Adds an `is_nested_arrow_type` predicate. The page-bounded direct `SerializedColumnWriter` dispatch handles flat physical types as before (`Boolean`, `Int*/UInt*`, `Float*`, byte-array, `Dictionary`). Nested fields (single-leaf `List`/`LargeList`/`FixedSizeList`/`Struct`/`Map`) fall back to the column-chunk-buffered `ArrowColumnWriter` path: every input array is fed via `compute_leaves` → `writer.write`, then the chunk is finalised + appended at end-of-column. Memory for nested columns grows with column-chunk size, not page size; primitive columns remain page-bounded. Multi-leaf nested types (e.g. `Struct` with multiple primitive children) are still rejected up front at `start_row_group`, with the error message updated to match.

Adds `test_list_uint64_round_trip_through_array_stream` covering the sketch shape: feeds two input arrays with variable-length lists (including the empty list and `u64::MAX`) through `write_next_column_arrays`, reads back via the standard reader, and asserts row-by-row value equality. Renames the multi-leaf rejection test to `test_multi_leaf_nested_type_rejected_at_start_row_group` to reflect the narrower contract.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rrays

The DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`) columns are top-level arrow fields with one parquet leaf, so they passed the leaf-count check at `start_row_group` but then hit the unmatched `DataType::List(_)` arm in `write_array_via_serialized_column_writer` and panicked deep in the column write. Codex review on #6384 caught the gap; sketches must work end-to-end through the streaming engine, page-bounded.

Implements page-bounded support for `List<T>` / `LargeList<T>` where the outer field is non-nullable and the inner field is non-nullable and a flat numeric primitive (Int8/16/32/64, UInt8/16/32/64, Float32/64). Memory stays bounded by one in-flight page — the same `SerializedColumnWriter::write_batch` call the flat path uses, just with hand-computed Dremel def/rep levels (max_def = 1, max_rep = 1). `write_non_nullable_list_via_serialized_column_writer` walks per-row to compute (def, rep) pairs (empty list -> def=0, rep=0; non-empty list -> def=1, rep=0 for first / rep=1 for rest), gathers present values, and dispatches the inner type through `write_list_inner_values` — the same shape as the flat-primitive arms. Other nested shapes (nullable inner, nullable outer, Struct, Map, FixedSizeList, multi-leaf nested) return SchemaValidation. Multi-leaf nested is also rejected up front by `start_row_group`, with the error message updated.

Tests:
- `test_list_uint64_round_trip_through_array_stream`: DDSketch `counts` shape with multi-array input, variable-length lists, edge values (`u64::MAX`, empty list), row-by-row value equality.
- `test_list_uint64_pages_flush_incrementally`: pages flush during the list-column write through the row-count limit (32 768 levels / 4096 = 8 pages). Confirms the page-bounded contract held by the flat path also holds for the list path; a column-chunk-buffered fallback would not pass this assertion.
- `test_multi_leaf_nested_type_rejected_at_start_row_group` (renamed from `test_nested_type_rejected_at_start_row_group`).

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
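The def/rep computation for the non-nullable-list case is independent of the parquet API and easy to sketch. `list_levels` below is an illustrative helper under the commit's stated assumptions (non-nullable outer list, non-nullable primitive inner), not the production function:

```rust
// Dremel levels for a non-nullable List<non-nullable T> column
// (max_def = 1, max_rep = 1). Per row:
//   empty list     -> one (def=0, rep=0) pair, no values
//   non-empty list -> (def=1, rep=0) for the first element,
//                     (def=1, rep=1) for each subsequent element.
fn list_levels(row_lengths: &[usize]) -> (Vec<i16>, Vec<i16>) {
    let mut def = Vec::new();
    let mut rep = Vec::new();
    for &len in row_lengths {
        if len == 0 {
            // Row is present but its list is empty: the value is
            // "not defined" past the list level, and a new record
            // always starts at rep=0.
            def.push(0);
            rep.push(0);
        } else {
            for i in 0..len {
                def.push(1);
                rep.push(if i == 0 { 0 } else { 1 });
            }
        }
    }
    (def, rep)
}

fn main() {
    // Rows: [10, 20], [], [7]
    let (def, rep) = list_levels(&[2, 0, 1]);
    assert_eq!(def, vec![1, 1, 0, 1]);
    assert_eq!(rep, vec![0, 1, 0, 0]);
}
```

These are the level vectors a single `write_batch` call would take alongside the gathered present values, which is why the list path stays page-bounded like the flat path.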
Force-pushed from 8c5482e to 258012c
Summary
PR-2 of the streaming-merge stack. Lands a primitive in `quickwit-parquet-engine/src/storage/streaming_writer.rs` that wraps `SerializedFileWriter` directly (not `ArrowWriter`) and exposes a per-column write API. Each column's chunk is appended and flushed before the next column is written, so peak memory per row group is bounded by one column's encoded chunk plus small bookkeeping (bloom filters + page indexes) — not by the entire row group.
No production callers in this PR. PR-6 will cut the merge engine over (primary consumer of `write_next_column_arrays`); PR-3 (last in the stack) will cut ingest over to a single-RG writer using `write_next_column`.
Two write APIs
- `write_next_column(&ArrayRef)` — one whole arrow column per call. Used by ingest (PR-3), where the input is already a complete batch in memory.
- `write_next_column_arrays(impl IntoIterator<Item = ArrayRef>)` — next column as a stream of arrow arrays, each typically one input page's worth of values. Used by PR-6's merge engine, which natively works with arrow arrays for sort-key comparison and produces an array per input page as it drives the merge.

Output page boundaries are determined by the writer's `data_page_size_limit` / `data_page_row_count_limit` and need not match the input partition; logical values round-trip. Memory is bounded by one input array plus the writer's internal page accumulator (~one output-page worth) — by page size, not column-chunk size.
Why arrays, not raw page bytes
The arrays-input path was originally planned as a raw-page-bytes input (`InputPage { header, bytes }`) so PR-6 could enable direct page copy (no decompress + recompress) for columns whose pages all come from one input. Direct page copy requires parquet 58's `SerializedRowGroupWriter::next_column_with_factory` to be public, which it is not (`pub(crate)`). That optimization is deferred to a future arrow-rs upstream PR plus a follow-up here. The arrays-input shape delivers the same per-page memory bound today using only the public parquet API; PR-6 decodes input pages to arrow arrays (which it would do anyway for sort-key comparison) and feeds them directly through `write_next_column_arrays`.
Caller contract
Out-of-order calls (too many columns, finish before all columns, mismatched row counts) return a structured `ParquetWriteError::SchemaValidation` rather than panicking. The compiler enforces single-RG-open via the lifetime on `RowGroupBuilder`.
Limitations (PR-2)
Top-level arrow fields must each map to exactly one parquet leaf — "flat" schemas of primitive, byte-array, and dictionary types. The metrics schema is flat in this sense. Nested types (Struct, List, Map) are rejected at `start_row_group` with a clear error message.
The streaming writer does not implicitly add the `ARROW:schema` IPC entry that `ArrowWriter` writes by default. Callers wanting that self-describing arrow round-trip pass their `WriterProperties` through `parquet::arrow::add_encoded_arrow_schema_to_metadata` first.
Tests (21)
- Round-trip: single RG, multi RG, nulls preserved.
- Metadata identity vs `ArrowWriter` under identical `WriterProperties`: single RG and multi RG. Asserts schema descriptor, per-RG `sorting_columns`, all `qh.*` KV entries, per-column compression, per-column bloom filter presence, per-column `statistics_enabled` level, num row groups, and per-RG num_rows.
- Functional contracts: chunk-level statistics propagate, bloom filter round-trip (present + absent values), bounded memory (pending-writers' total `memory_size` is monotone non-increasing).
- Caller-contract violations: nested type rejected at `start_row_group`; writing past last field; row-count mismatch across columns; finish before all columns; empty row group still produces a readable file.
- Array-stream-specific (8 new): page-stream round-trip; with nulls; varied chunk sizes per column; single-row-per-array; empty iterator → zero-row column; row-count-mismatch error; too-many-columns error; bounded memory under array streaming.
Position in stack (execution order)
PR labels (PR-1..PR-7) are stable IDs that match opened PR
descriptions. The execution order below is updated 2026-05-05 to
put PR-3 (ingest cutover) last — the OOM concern lives in
compaction, not ingest.
(Stack ordering table lost in extraction; recoverable entries include the `qh.rg_partition_prefix_len` marker and `ParquetMergeExecutor`; delete downloader.)
PR-2 is independent of PR-1 review — it branches off `main`.
Test plan
- `cargo +nightly fmt --all -- --check` (per files touched)
- `RUSTFLAGS="-Dwarnings --cfg tokio_unstable" cargo clippy -p quickwit-parquet-engine --tests`
- `cargo doc -p quickwit-parquet-engine --no-deps`
- `cargo machete`
- `bash quickwit/scripts/check_license_headers.sh`
- `bash quickwit/scripts/check_log_format.sh`
- `cargo nextest run -p quickwit-parquet-engine --all-features` — 377 tests, all pass

🤖 Generated with Claude Code