feat: column-major streaming Parquet writer primitive#6384

Open
g-talbot wants to merge 5 commits into main from gtt/streaming-parquet-writer

Conversation


@g-talbot g-talbot commented May 5, 2026

Summary

PR-2 of the streaming-merge stack. Lands a primitive in
quickwit-parquet-engine/src/storage/streaming_writer.rs that wraps
SerializedFileWriter directly (not ArrowWriter) and exposes a
per-column write API. Each column's chunk is appended and flushed
before the next column is written, so peak memory per row group is
bounded by one column's encoded chunk plus small bookkeeping (bloom
filters + page indexes) — not by the entire row group.

No production callers in this PR. PR-6 will cut the merge engine over
(primary consumer of write_next_column_arrays); PR-3 (last in the
stack) will cut ingest over to a single-RG writer using
write_next_column.

Two write APIs

  • write_next_column(&ArrayRef) — one whole arrow column per
    call. Used by ingest (PR-3), where the input is already a complete
    batch in memory.

  • write_next_column_arrays(impl IntoIterator<Item = ArrayRef>) —
    the next column as a stream of arrow arrays, each typically one input
    page's worth of values. Used by PR-6's merge engine, which natively
    works with arrow arrays for sort-key comparison and produces an
    array per input page as it drives the merge.

    Output page boundaries are determined by the writer's
    data_page_size_limit / data_page_row_count_limit and need not
    match the input partition; logical values round-trip. Memory is
    bounded by one input array plus the writer's internal page
    accumulator (~one output-page worth) — by page size, not
    column-chunk size.
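The re-chunking behavior described above — output page boundaries cut by the writer's limits, independent of input array boundaries — can be modeled in a stdlib-only sketch. `PageAccumulator`, `push_array`, and the `i64` values are illustrative stand-ins, not the real parquet types:

```rust
// Stdlib-only model of output-page re-bucketing: input arrays of
// arbitrary sizes stream in, and pages are cut purely by the writer's
// row-count limit (standing in for data_page_row_count_limit).
// No real encoding happens here — this only shows the boundary logic.

pub struct PageAccumulator {
    row_limit: usize,
    pending: Vec<i64>,
    pub pages: Vec<Vec<i64>>,
}

impl PageAccumulator {
    pub fn new(row_limit: usize) -> Self {
        Self { row_limit, pending: Vec::new(), pages: Vec::new() }
    }

    /// Feed one input array; flush full pages as they fill.
    /// Memory held here is at most one output page's worth of
    /// pending values, regardless of column-chunk size.
    pub fn push_array(&mut self, array: &[i64]) {
        for &v in array {
            self.pending.push(v);
            if self.pending.len() == self.row_limit {
                self.pages.push(std::mem::take(&mut self.pending));
            }
        }
    }

    /// End of column: flush the final partial page.
    pub fn finish(&mut self) {
        if !self.pending.is_empty() {
            self.pages.push(std::mem::take(&mut self.pending));
        }
    }
}
```

Feeding arrays of sizes 3, 5, and 1 with a row limit of 4 yields pages of 4, 4, and 1 rows — the logical values round-trip while the partition changes.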

Why arrays, not raw page bytes

The arrays-input path was originally planned as a raw-page-bytes
input (InputPage { header, bytes }) so PR-6 could enable direct
page copy (no decompress + recompress) for columns whose pages all
come from one input. Direct page copy requires parquet 58's
SerializedRowGroupWriter::next_column_with_factory to be public,
which it is not (pub(crate)). That optimization is deferred to a
future arrow-rs upstream PR plus a follow-up here. The arrays-input
shape delivers the same per-page memory bound today using only the
public parquet API; PR-6 decodes input pages to arrow arrays
(which it would do anyway for sort-key comparison) and feeds them
directly through write_next_column_arrays.

Caller contract

1. start_row_group()                                       → RowGroupBuilder<'_, W>
2. RowGroupBuilder::{write_next_column | write_next_column_arrays}
   once per top-level arrow field, in arrow schema order
3. RowGroupBuilder::finish()
4. repeat (1)–(3) for additional row groups
5. StreamingParquetWriter::close()                          → ParquetMetaData

Out-of-order calls (too many columns, finish before all columns,
mismatched row counts) return a structured ParquetWriteError::SchemaValidation
rather than panicking. The compiler enforces single-RG-open via the
lifetime on RowGroupBuilder.
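The contract's error behavior can be modeled in a stdlib-only sketch. The `RowGroupBuilder`, `WriteError`, and row-count arguments below are illustrative stand-ins for the PR's parquet-backed types, not the real API:

```rust
// Toy model of the caller contract: out-of-order calls return a
// structured error instead of panicking. Columns are reduced to a
// row count; the real writer takes arrow arrays.

#[derive(Debug, PartialEq)]
pub enum WriteError {
    SchemaValidation(String),
}

pub struct RowGroupBuilder {
    expected_columns: usize,
    written: usize,
    row_count: Option<usize>,
}

impl RowGroupBuilder {
    pub fn new(expected_columns: usize) -> Self {
        Self { expected_columns, written: 0, row_count: None }
    }

    /// One column per call, in schema order.
    pub fn write_next_column(&mut self, rows: usize) -> Result<(), WriteError> {
        if self.written == self.expected_columns {
            return Err(WriteError::SchemaValidation(
                "too many columns for schema".into(),
            ));
        }
        if let Some(expected) = self.row_count {
            if rows != expected {
                return Err(WriteError::SchemaValidation(format!(
                    "row-count mismatch: expected {}, got {}", expected, rows
                )));
            }
        } else {
            self.row_count = Some(rows);
        }
        self.written += 1;
        Ok(())
    }

    /// Fails unless every column was written.
    pub fn finish(self) -> Result<usize, WriteError> {
        if self.written != self.expected_columns {
            return Err(WriteError::SchemaValidation(
                "finish() called before all columns were written".into(),
            ));
        }
        Ok(self.row_count.unwrap_or(0))
    }
}
```

A mismatched row count, an extra column, or an early `finish()` each surface as `SchemaValidation` — mirroring the violations the PR's tests exercise.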

Limitations (PR-2)

Top-level arrow fields must each map to exactly one parquet leaf —
"flat" schemas of primitive, byte-array, and dictionary types. The
metrics schema is flat in this sense. Nested types (Struct, List,
Map) are rejected at start_row_group with a clear error message.

The streaming writer does not implicitly add the ARROW:schema
IPC entry that ArrowWriter writes by default. Callers wanting that
self-describing arrow round-trip pass their WriterProperties
through parquet::arrow::add_encoded_arrow_schema_to_metadata first.

Tests (21)

Round-trip: single RG, multi RG, nulls preserved.

Metadata identity vs ArrowWriter under identical WriterProperties:
single RG and multi RG. Asserts schema descriptor, per-RG
sorting_columns, all qh.* KV entries, per-column compression,
per-column bloom filter presence, per-column statistics_enabled
level, num row groups, and per-RG num_rows.

Functional contracts: chunk-level statistics propagate, bloom
filter round-trip (present + absent values), bounded memory
(pending-writers' total memory_size is monotone non-increasing).

Caller-contract violations: nested type rejected at start_row_group;
writing past last field; row-count mismatch across columns; finish
before all columns; empty row group still produces a readable file.

Array-stream-specific (8 new): page-stream round-trip; with nulls;
varied chunk sizes per column; single-row-per-array; empty iterator
→ zero-row column; row-count-mismatch error; too-many-columns error;
bounded memory under array streaming.

Position in stack (execution order)

PR labels (PR-1..PR-7) are stable IDs that match opened PR
descriptions. The execution order below is updated 2026-05-05 to
put PR-3 (ingest cutover) last — the OOM concern lives in
compaction, not ingest.

Order  PR    Status        Description
1      PR-1  open (#6377)  Page-level stats + qh.rg_partition_prefix_len marker
2      PR-2  this PR       Streaming column-major writer primitive
3      PR-4  open (#6386)  Streaming column-major reader, page-level
4      PR-5  next          Legacy multi-RG input adapter
5      PR-6                Streaming merge engine; consumes PR-2 + PR-4
6      PR-7                Wire ParquetMergeExecutor; delete downloader
7      PR-3  last          Ingest writer cuts over to single-RG (least critical — splits already small)

PR-2 is independent of PR-1 review — it branches off main.

Test plan

  • cargo +nightly fmt --all -- --check (per files touched)
  • RUSTFLAGS="-Dwarnings --cfg tokio_unstable" cargo clippy -p quickwit-parquet-engine --tests
  • cargo doc -p quickwit-parquet-engine --no-deps
  • cargo machete
  • bash quickwit/scripts/check_license_headers.sh
  • bash quickwit/scripts/check_log_format.sh
  • cargo nextest run -p quickwit-parquet-engine --all-features — 377 tests, all pass

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/streaming-parquet-writer branch from e014bbf to e134191 Compare May 7, 2026 15:57
@g-talbot g-talbot requested review from a team as code owners May 7, 2026 15:57

g-talbot commented May 7, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 844bdc23c2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs Outdated

g-talbot commented May 7, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

https://github.com/quickwit-oss/quickwit/blob/1949915db885d4a47381645ab327c4ada353223e/quickwit-parquet-engine/src/storage/streaming_writer.rs#L432-L438
P2: Write UInt32 columns with the INT32 physical writer

When write_next_column_arrays handles a DataType::UInt32 field, this branch selects write_int64_compatible, but try_new builds the Parquet schema with ArrowSchemaConverter, where Arrow UInt32 is an INT32 physical column with unsigned logical annotation. The direct SerializedColumnWriter path must use the physical writer type, so any UInt32 column streamed through this API will be written through an Int64 writer against an INT32 column and fail at runtime; this affects existing schemas such as the DDSketch flags field (quickwit-parquet-engine/src/schema/sketch_fields.rs:68).


g-talbot added a commit that referenced this pull request May 8, 2026
`ArrowSchemaConverter` maps Arrow `UInt32` to a parquet INT32
*physical* type with an unsigned logical annotation, not INT64.
The earlier dispatch routed `DataType::UInt32` through the Int64
column writer (`write_int64_compatible`), which fails at runtime
with a physical-type mismatch when the row group is finalised —
affecting any production schema with a `UInt32` field, e.g. the
sketch `flags` column in
`quickwit-parquet-engine/src/schema/sketch_fields.rs`.

Switches `UInt32` to `write_int32_compatible` with `value(i) as i32`
(bit-reinterpreting cast); the unsigned logical annotation tells
readers to interpret the 32 bits back as `u32`, so the round trip
is bit-exact. Updates the doc-comment limitation list to put
`UInt32` in the INT32 group.

Adds `test_uint32_round_trip_through_array_stream` covering the
high-bit range (`0x8000_0000`, `0xFFFF_FFFF`, etc.) — values that
would clip if the writer used a signed `i32` interpretation without
the unsigned logical annotation.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

g-talbot commented May 8, 2026

Thanks for the review — fix pushed as a96342add. UInt32 now goes through write_int32_compatible with a bit-reinterpreting u32 -> i32 cast (the unsigned logical annotation handles round-trip on read). Added test_uint32_round_trip_through_array_stream covering the high-bit range (0x8000_0000..0xFFFF_FFFF) so the regression is caught next time. Rebased the downstream stack (PR-5a/PR-5/PR-6a/PR-6b/PR-6c) on the updated base.
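The bit-reinterpreting cast at the heart of this fix can be demonstrated in plain Rust. The two helper functions are hypothetical names for illustration; the parquet writer/reader machinery is elided, and only the cast semantics are shown:

```rust
// Why `u32 as i32` is lossless here: `as` between same-width integer
// types reinterprets the bits, so high-bit values survive the trip
// into the INT32 physical column and back out.

/// What the writer stores in the INT32 physical column.
pub fn encode_u32_as_int32(v: u32) -> i32 {
    v as i32
}

/// What a reader computes when the column's unsigned logical
/// annotation tells it to reinterpret the 32 bits as u32.
pub fn decode_int32_as_u32(stored: i32) -> u32 {
    stored as u32
}
```

Values in the high-bit range (`0x8000_0000..=0xFFFF_FFFF`) are negative in the signed view — exactly why a value-preserving (clipping) conversion would fail where the bit-reinterpreting cast round-trips exactly.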


g-talbot commented May 8, 2026

@codex review

g-talbot and others added 3 commits May 8, 2026 17:23
Wraps SerializedFileWriter directly to expose a per-column write API
that flushes one column chunk at a time. Peak memory per row group
is bounded by a single column chunk plus small bookkeeping (bloom
filters + page indexes), not by the whole row group.

PR-2 of the streaming-merge stack. No production callers in this PR;
PR-3 cuts ingest over to a single-RG writer built on this primitive,
PR-6 cuts the merge engine over.

Two write APIs on RowGroupBuilder:

- `write_next_column(&ArrayRef)` — one whole arrow column per call.
  Used by ingest (PR-3), where the data is already a complete batch.

- `write_next_column_arrays(impl IntoIterator<Item = ArrayRef>)` —
  next column as a stream of arrow arrays, each typically one input
  page's worth of values. Used by PR-6's merge engine, which
  natively works with arrow arrays for sort-key comparison and
  produces an array per input page as it drives the merge. Output
  page boundaries are determined by the writer's data_page_size_limit
  and need not match the input partition; logical values round-trip.
  Memory: bounded by one input array plus the writer's internal page
  accumulator (~one output-page worth) — bounded by *page* size, not
  column-chunk size.

The arrays-input path was originally planned as a raw-page-bytes
input (`InputPage { header, bytes }`) to enable direct page copy in
PR-6 (no decompress + recompress). Direct page copy turns out to
require parquet 58's `next_column_with_factory` to be public, which
it is not (`pub(crate)`). That optimization is deferred to a future
arrow-rs upstream PR plus a follow-up here. The arrays-input shape
delivers the same per-page memory bound today using only the public
parquet API.

The caller contract is a 4-step state machine (start_row_group →
{write_next_column | write_next_column_arrays} × N → finish →
repeat → close). Out-of-order calls return a structured error
rather than panicking. Top-level arrow fields must each map to
exactly one parquet leaf, which the metrics schema satisfies;
nested types are rejected at start_row_group with a clear message.

21 tests cover: round-trip (single RG, multi RG, with nulls),
metadata identity vs ArrowWriter (single + multi RG, with KV
entries and sorting_columns populated), per-column statistics
enabled propagation, bloom filter functional round-trip,
bounded-memory contract, empty row group, four caller-contract
violations, plus 8 new array-stream tests covering page-stream
round-trip with nulls, varied chunk sizes, single-row-per-array,
empty-iterator edge case, row-count-mismatch error, too-many-
columns error, and bounded-memory under array streaming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bounded)

Reviewer caught that the previous `write_next_column_arrays`
implementation, despite taking arrays one at a time, still buffered
the entire encoded column chunk in `ArrowColumnWriter`'s internal
`SharedColumnChunk` until `close + append_to_row_group`. Memory
scaled with column-chunk size, not page size — silently violating
the page-bounded contract the API documented.

Switch to `row_group_writer.next_column()` which returns a
`SerializedColumnWriter` whose `SerializedPageWriter` flushes pages
directly through `TrackedWrite` to the user's sink. The Arrow →
typed-values dispatch that `ArrowColumnWriter` provided internally
is reimplemented at the call site for the metrics schema's
physical types: Boolean, Int8/16/32, UInt8/16, Int64, UInt32/64,
Float32/64, Utf8/LargeUtf8, Binary/LargeBinary, and Dictionary
over those (materialized via `take`).

Tests:

- Renamed `test_array_stream_bounded_memory_per_column` to
  `test_array_stream_pending_writers_drained_per_column`. The
  previous name overstated the test — `pending_writers_memory_size`
  only sees pre-allocated writers for *future* columns, never the
  in-flight column's `SharedColumnChunk`. The renamed version
  documents what it actually tests.
- Added `test_array_stream_pages_flush_incrementally` that
  configures uncompressed UInt64 with pages > BufWriter capacity.
  Asserts the per-column-write sink-write count is consistent with
  one-write-per-page (SerializedColumnWriter path) and inconsistent
  with io::copy-at-end (ArrowColumnWriter path).

Direct page copy (input bytes → output bytes, no decompress +
recompress) for columns whose pages all come from one input
remains a deferred follow-up; arrow-rs's
`SerializedRowGroupWriter::next_column_with_factory` is
`pub(crate)`, blocking that optimization until an upstream PR
lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ArrowSchemaConverter` maps Arrow `UInt32` to a parquet INT32
*physical* type with an unsigned logical annotation, not INT64.
The earlier dispatch routed `DataType::UInt32` through the Int64
column writer (`write_int64_compatible`), which fails at runtime
with a physical-type mismatch when the row group is finalised —
affecting any production schema with a `UInt32` field, e.g. the
sketch `flags` column in
`quickwit-parquet-engine/src/schema/sketch_fields.rs`.

Switches `UInt32` to `write_int32_compatible` with `value(i) as i32`
(bit-reinterpreting cast); the unsigned logical annotation tells
readers to interpret the 32 bits back as `u32`, so the round trip
is bit-exact. Updates the doc-comment limitation list to put
`UInt32` in the INT32 group.

Adds `test_uint32_round_trip_through_array_stream` covering the
high-bit range (`0x8000_0000`, `0xFFFF_FFFF`, etc.) — values that
would clip if the writer used a signed `i32` interpretation without
the unsigned logical annotation.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a96342addd


Comment thread quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs
PR-1 (#6377) merged today and switched the default
`ParquetWriterConfig::statistics_enabled` from `Chunk` to `Page`.
`test_statistics_enabled_propagates` was asserting the old
`Chunk` value, so it failed under PR-2's CI once GitHub auto-merged
PR-1's main into PR-2.

Updates the assertion to expect `Page` and extends the test to also
verify that the column index and offset index land in the file —
the new behaviour PR-1 brings to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/streaming-parquet-writer branch from a96342a to 396f1e0 Compare May 8, 2026 21:25
g-talbot added a commit that referenced this pull request May 8, 2026
The DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`)
columns are top-level arrow fields with exactly one parquet leaf, so
they passed the leaf-count check at `start_row_group` but then hit
the unmatched `DataType::List(_)` arm in
`write_array_via_serialized_column_writer` and panicked deep in the
column write — leaving the row-group builder on an unrecoverable error
path. Codex review on #6384 caught the gap; sketches must work
end-to-end through the streaming engine.

Adds an `is_nested_arrow_type` predicate. The page-bounded direct
`SerializedColumnWriter` dispatch handles flat physical types as
before (`Boolean`, `Int*/UInt*`, `Float*`, byte-array,
`Dictionary`). Nested fields (single-leaf
`List`/`LargeList`/`FixedSizeList`/`Struct`/`Map`) fall back to
the column-chunk-buffered `ArrowColumnWriter` path: every input array
is fed via `compute_leaves` → `writer.write`, then the chunk is
finalised + appended at end-of-column. Memory for nested columns
grows with column-chunk size, not page size; primitive columns remain
page-bounded. Multi-leaf nested types (e.g. `Struct` with multiple
primitive children) are still rejected up front at
`start_row_group`, with the error message updated to match.

Adds `test_list_uint64_round_trip_through_array_stream` covering the
sketch shape: feeds two input arrays with variable-length lists
(including the empty list and `u64::MAX`) through
`write_next_column_arrays`, reads back via the standard reader, and
asserts row-by-row value equality. Renames the multi-leaf rejection
test to `test_multi_leaf_nested_type_rejected_at_start_row_group` to
reflect the narrower contract.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rrays

The DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`)
columns are top-level arrow fields with one parquet leaf, so they
passed the leaf-count check at `start_row_group` but then hit the
unmatched `DataType::List(_)` arm in
`write_array_via_serialized_column_writer` and panicked deep in the
column write. Codex review on #6384 caught the gap; sketches must work
end-to-end through the streaming engine, page-bounded.

Implements page-bounded support for `List<T>` / `LargeList<T>` where
the outer field is non-nullable and the inner field is non-nullable
and a flat numeric primitive (Int8/16/32/64, UInt8/16/32/64,
Float32/64). Memory stays bounded by one in-flight page — the same
`SerializedColumnWriter::write_batch` call the flat path uses, just
with hand-computed Dremel def/rep levels (max_def = 1, max_rep = 1).
`write_non_nullable_list_via_serialized_column_writer` walks per-row
to compute (def, rep) pairs (empty list -> def=0, rep=0; non-empty
list -> def=1, rep=0 for first / rep=1 for rest), gathers present
values, and dispatches the inner type through `write_list_inner_values`
— same shape as the flat-primitive arms.

Other nested shapes (nullable inner, nullable outer, Struct, Map,
FixedSizeList, multi-leaf nested) return SchemaValidation. Multi-leaf
nested is also rejected up front by `start_row_group` with the error
message updated.

Tests:
- `test_list_uint64_round_trip_through_array_stream`: DDSketch
  `counts` shape with multi-array input, variable-length lists,
  edge values (`u64::MAX`, empty list), row-by-row value equality.
- `test_list_uint64_pages_flush_incrementally`: pages flush during
  the list-column write through the row-count limit (32 768 levels
  / 4096 = 8 pages). Confirms the page-bounded contract held by the
  flat path also holds for the list path; a column-chunk-buffered
  fallback would not pass this assertion.
- `test_multi_leaf_nested_type_rejected_at_start_row_group` (renamed
  from `test_nested_type_rejected_at_start_row_group`).

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
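The per-row (def, rep) computation described in the commit above can be sketched in plain Rust. `list_levels` is a hypothetical helper simplified to `i64` values and `Vec<Vec<i64>>` rows; the real code walks arrow list arrays:

```rust
// Hand-computed Dremel levels for a non-nullable List<T> of
// non-nullable primitives (max_def = 1, max_rep = 1):
//   empty list     -> one (def=0, rep=0) pair, no value
//   non-empty list -> (def=1, rep=0) for the first element,
//                     (def=1, rep=1) for the rest

/// Returns (def_levels, rep_levels, present_values) for one batch
/// of list rows.
pub fn list_levels(rows: &[Vec<i64>]) -> (Vec<i16>, Vec<i16>, Vec<i64>) {
    let mut def = Vec::new();
    let mut rep = Vec::new();
    let mut values = Vec::new();
    for row in rows {
        if row.is_empty() {
            // Empty list: a level pair with no accompanying value.
            def.push(0);
            rep.push(0);
        } else {
            for (i, &v) in row.iter().enumerate() {
                def.push(1);
                // rep = 0 starts a new row; rep = 1 continues it.
                rep.push(if i == 0 { 0 } else { 1 });
                values.push(v);
            }
        }
    }
    (def, rep, values)
}
```

For rows `[[1, 2, 3], [], [4]]` this yields def levels `[1, 1, 1, 0, 1]`, rep levels `[0, 1, 1, 0, 0]`, and present values `[1, 2, 3, 4]` — the shape `write_batch` consumes alongside the gathered values.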
@g-talbot g-talbot force-pushed the gtt/streaming-parquet-writer branch from 8c5482e to 258012c Compare May 9, 2026 00:07