feat: column-major streaming Parquet writer primitive#6384

Open
g-talbot wants to merge 5 commits into main from gtt/streaming-parquet-writer

Conversation


@g-talbot g-talbot commented May 5, 2026

Summary

PR-2 of the streaming-merge stack. Lands a primitive in
quickwit-parquet-engine/src/storage/streaming_writer.rs that wraps
SerializedFileWriter directly (not ArrowWriter) and exposes a
per-column write API. Each column's chunk is appended and flushed
before the next column is written, so peak memory per row group is
bounded by one column's encoded chunk plus small bookkeeping (bloom
filters + page indexes) — not by the entire row group.

No production callers in this PR. PR-6 will cut the merge engine over
(primary consumer of write_next_column_arrays); PR-3 (last in the
stack) will cut ingest over to a single-RG writer using
write_next_column.

Two write APIs

  • write_next_column(&ArrayRef) — one whole arrow column per
    call. Used by ingest (PR-3), where the input is already a complete
    batch in memory.

  • write_next_column_arrays(impl IntoIterator<Item = ArrayRef>) —
    the next column as a stream of arrow arrays, each typically one input
    page's worth of values. Used by PR-6's merge engine, which natively
    works with arrow arrays for sort-key comparison and produces an
    array per input page as it drives the merge.

    Output page boundaries are determined by the writer's
    data_page_size_limit / data_page_row_count_limit and need not
    match the input partition; logical values round-trip. Memory is
    bounded by one input array plus the writer's internal page
    accumulator (~one output-page worth) — by page size, not
    column-chunk size.
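The re-chunking behavior described above — output page boundaries cut by the writer's limits, independent of input array boundaries — can be modeled in a stdlib-only sketch. `PageAccumulator`, `push_array`, and the `i64` values are illustrative stand-ins, not the real parquet types:

```rust
// Stdlib-only model of output-page re-bucketing: input arrays of
// arbitrary sizes stream in, and pages are cut purely by the writer's
// row-count limit (standing in for data_page_row_count_limit).
// No real encoding happens here — this only shows the boundary logic.

pub struct PageAccumulator {
    row_limit: usize,
    pending: Vec<i64>,
    pub pages: Vec<Vec<i64>>,
}

impl PageAccumulator {
    pub fn new(row_limit: usize) -> Self {
        Self { row_limit, pending: Vec::new(), pages: Vec::new() }
    }

    /// Feed one input array; flush full pages as they fill.
    /// Memory held here is at most one output page's worth of
    /// pending values, regardless of column-chunk size.
    pub fn push_array(&mut self, array: &[i64]) {
        for &v in array {
            self.pending.push(v);
            if self.pending.len() == self.row_limit {
                self.pages.push(std::mem::take(&mut self.pending));
            }
        }
    }

    /// End of column: flush the final partial page.
    pub fn finish(&mut self) {
        if !self.pending.is_empty() {
            self.pages.push(std::mem::take(&mut self.pending));
        }
    }
}
```

Feeding arrays of sizes 3, 5, and 1 with a row limit of 4 yields pages of 4, 4, and 1 rows — the logical values round-trip while the partition changes.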

Why arrays, not raw page bytes

The arrays-input path was originally planned as a raw-page-bytes
input (InputPage { header, bytes }) so PR-6 could enable direct
page copy (no decompress + recompress) for columns whose pages all
come from one input. Direct page copy requires parquet 58's
SerializedRowGroupWriter::next_column_with_factory to be public,
which it is not (pub(crate)). That optimization is deferred to a
future arrow-rs upstream PR plus a follow-up here. The arrays-input
shape delivers the same per-page memory bound today using only the
public parquet API; PR-6 decodes input pages to arrow arrays
(which it would do anyway for sort-key comparison) and feeds them
directly through write_next_column_arrays.

Caller contract

1. start_row_group()                                       → RowGroupBuilder<'_, W>
2. RowGroupBuilder::{write_next_column | write_next_column_arrays}
   once per top-level arrow field, in arrow schema order
3. RowGroupBuilder::finish()
4. repeat (1)–(3) for additional row groups
5. StreamingParquetWriter::close()                          → ParquetMetaData

Out-of-order calls (too many columns, finish before all columns,
mismatched row counts) return a structured ParquetWriteError::SchemaValidation
rather than panicking. The compiler enforces single-RG-open via the
lifetime on RowGroupBuilder.
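The contract's error behavior can be modeled in a stdlib-only sketch. The `RowGroupBuilder`, `WriteError`, and row-count arguments below are illustrative stand-ins for the PR's parquet-backed types, not the real API:

```rust
// Toy model of the caller contract: out-of-order calls return a
// structured error instead of panicking. Columns are reduced to a
// row count; the real writer takes arrow arrays.

#[derive(Debug, PartialEq)]
pub enum WriteError {
    SchemaValidation(String),
}

pub struct RowGroupBuilder {
    expected_columns: usize,
    written: usize,
    row_count: Option<usize>,
}

impl RowGroupBuilder {
    pub fn new(expected_columns: usize) -> Self {
        Self { expected_columns, written: 0, row_count: None }
    }

    /// One column per call, in schema order.
    pub fn write_next_column(&mut self, rows: usize) -> Result<(), WriteError> {
        if self.written == self.expected_columns {
            return Err(WriteError::SchemaValidation(
                "too many columns for schema".into(),
            ));
        }
        if let Some(expected) = self.row_count {
            if rows != expected {
                return Err(WriteError::SchemaValidation(format!(
                    "row-count mismatch: expected {}, got {}", expected, rows
                )));
            }
        } else {
            self.row_count = Some(rows);
        }
        self.written += 1;
        Ok(())
    }

    /// Fails unless every column was written.
    pub fn finish(self) -> Result<usize, WriteError> {
        if self.written != self.expected_columns {
            return Err(WriteError::SchemaValidation(
                "finish() called before all columns were written".into(),
            ));
        }
        Ok(self.row_count.unwrap_or(0))
    }
}
```

A mismatched row count, an extra column, or an early `finish()` each surface as `SchemaValidation` — mirroring the violations the PR's tests exercise.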

Limitations (PR-2)

Top-level arrow fields must each map to exactly one parquet leaf —
"flat" schemas of primitive, byte-array, and dictionary types. The
metrics schema is flat in this sense. Nested types (Struct, List,
Map) are rejected at start_row_group with a clear error message.

The streaming writer does not implicitly add the ARROW:schema
IPC entry that ArrowWriter writes by default. Callers wanting that
self-describing arrow round-trip pass their WriterProperties
through parquet::arrow::add_encoded_arrow_schema_to_metadata first.

Tests (21)

Round-trip: single RG, multi RG, nulls preserved.

Metadata identity vs ArrowWriter under identical WriterProperties:
single RG and multi RG. Asserts schema descriptor, per-RG
sorting_columns, all qh.* KV entries, per-column compression,
per-column bloom filter presence, per-column statistics_enabled
level, num row groups, and per-RG num_rows.

Functional contracts: chunk-level statistics propagate, bloom
filter round-trip (present + absent values), bounded memory
(pending-writers' total memory_size is monotone non-increasing).

Caller-contract violations: nested type rejected at start_row_group;
writing past last field; row-count mismatch across columns; finish
before all columns; empty row group still produces a readable file.

Array-stream-specific (8 new): page-stream round-trip; with nulls;
varied chunk sizes per column; single-row-per-array; empty iterator
→ zero-row column; row-count-mismatch error; too-many-columns error;
bounded memory under array streaming.

Position in stack (execution order)

PR labels (PR-1..PR-7) are stable IDs that match opened PR
descriptions. The execution order below is updated 2026-05-05 to
put PR-3 (ingest cutover) last — the OOM concern lives in
compaction, not ingest.

Order  PR    Status        Description
1      PR-1  open (#6377)  Page-level stats + qh.rg_partition_prefix_len marker
2      PR-2  this PR       Streaming column-major writer primitive
3      PR-4  open (#6386)  Streaming column-major reader, page-level
4      PR-5  next          Legacy multi-RG input adapter
5      PR-6                Streaming merge engine; consumes PR-2 + PR-4
6      PR-7                Wire ParquetMergeExecutor; delete downloader
7      PR-3  last          Ingest writer cuts over to single-RG (least critical — splits already small)

PR-2 is independent of PR-1 review — it branches off main.

Test plan

  • cargo +nightly fmt --all -- --check (per files touched)
  • RUSTFLAGS="-Dwarnings --cfg tokio_unstable" cargo clippy -p quickwit-parquet-engine --tests
  • cargo doc -p quickwit-parquet-engine --no-deps
  • cargo machete
  • bash quickwit/scripts/check_license_headers.sh
  • bash quickwit/scripts/check_log_format.sh
  • cargo nextest run -p quickwit-parquet-engine --all-features — 377 tests, all pass

🤖 Generated with Claude Code

@g-talbot g-talbot force-pushed the gtt/streaming-parquet-writer branch from e014bbf to e134191 Compare May 7, 2026 15:57
@g-talbot g-talbot requested review from a team as code owners May 7, 2026 15:57

g-talbot commented May 7, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 844bdc23c2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs Outdated

g-talbot commented May 7, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

https://github.com/quickwit-oss/quickwit/blob/1949915db885d4a47381645ab327c4ada353223e/quickwit-parquet-engine/src/storage/streaming_writer.rs#L432-L438
P2: Write UInt32 columns with the INT32 physical writer

When write_next_column_arrays handles a DataType::UInt32 field, this branch selects write_int64_compatible, but try_new builds the Parquet schema with ArrowSchemaConverter, where Arrow UInt32 is an INT32 physical column with unsigned logical annotation. The direct SerializedColumnWriter path must use the physical writer type, so any UInt32 column streamed through this API will be written through an Int64 writer against an INT32 column and fail at runtime; this affects existing schemas such as the DDSketch flags field (quickwit-parquet-engine/src/schema/sketch_fields.rs:68).


g-talbot added a commit that referenced this pull request May 8, 2026
`ArrowSchemaConverter` maps Arrow `UInt32` to a parquet INT32
*physical* type with an unsigned logical annotation, not INT64.
The earlier dispatch routed `DataType::UInt32` through the Int64
column writer (`write_int64_compatible`), which fails at runtime
with a physical-type mismatch when the row group is finalised —
affecting any production schema with a `UInt32` field, e.g. the
sketch `flags` column in
`quickwit-parquet-engine/src/schema/sketch_fields.rs`.

Switches `UInt32` to `write_int32_compatible` with `value(i) as i32`
(bit-reinterpreting cast); the unsigned logical annotation tells
readers to interpret the 32 bits back as `u32`, so the round trip
is bit-exact. Updates the doc-comment limitation list to put
`UInt32` in the INT32 group.

Adds `test_uint32_round_trip_through_array_stream` covering the
high-bit range (`0x8000_0000`, `0xFFFF_FFFF`, etc.) — values that
would clip if the writer used a signed `i32` interpretation without
the unsigned logical annotation.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

g-talbot commented May 8, 2026

Thanks for the review — fix pushed as a96342add. UInt32 now goes through write_int32_compatible with a bit-reinterpreting u32 -> i32 cast (the unsigned logical annotation handles round-trip on read). Added test_uint32_round_trip_through_array_stream covering the high-bit range (0x8000_0000..0xFFFF_FFFF) so the regression is caught next time. Rebased the downstream stack (PR-5a/PR-5/PR-6a/PR-6b/PR-6c) on the updated base.
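The bit-reinterpreting cast at the heart of this fix can be demonstrated in plain Rust. The two helper functions are hypothetical names for illustration; the parquet writer/reader machinery is elided, and only the cast semantics are shown:

```rust
// Why `u32 as i32` is lossless here: `as` between same-width integer
// types reinterprets the bits, so high-bit values survive the trip
// into the INT32 physical column and back out.

/// What the writer stores in the INT32 physical column.
pub fn encode_u32_as_int32(v: u32) -> i32 {
    v as i32
}

/// What a reader computes when the column's unsigned logical
/// annotation tells it to reinterpret the 32 bits as u32.
pub fn decode_int32_as_u32(stored: i32) -> u32 {
    stored as u32
}
```

Values in the high-bit range (`0x8000_0000..=0xFFFF_FFFF`) are negative in the signed view — exactly why a value-preserving (clipping) conversion would fail where the bit-reinterpreting cast round-trips exactly.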


g-talbot commented May 8, 2026

@codex review

g-talbot and others added 3 commits May 8, 2026 17:23
Wraps SerializedFileWriter directly to expose a per-column write API
that flushes one column chunk at a time. Peak memory per row group
is bounded by a single column chunk plus small bookkeeping (bloom
filters + page indexes), not by the whole row group.

PR-2 of the streaming-merge stack. No production callers in this PR;
PR-3 cuts ingest over to a single-RG writer built on this primitive,
PR-6 cuts the merge engine over.

Two write APIs on RowGroupBuilder:

- `write_next_column(&ArrayRef)` — one whole arrow column per call.
  Used by ingest (PR-3), where the data is already a complete batch.

- `write_next_column_arrays(impl IntoIterator<Item = ArrayRef>)` —
  next column as a stream of arrow arrays, each typically one input
  page's worth of values. Used by PR-6's merge engine, which
  natively works with arrow arrays for sort-key comparison and
  produces an array per input page as it drives the merge. Output
  page boundaries are determined by the writer's data_page_size_limit
  and need not match the input partition; logical values round-trip.
  Memory: bounded by one input array plus the writer's internal page
  accumulator (~one output-page worth) — bounded by *page* size, not
  column-chunk size.

The arrays-input path was originally planned as a raw-page-bytes
input (`InputPage { header, bytes }`) to enable direct page copy in
PR-6 (no decompress + recompress). Direct page copy turns out to
require parquet 58's `next_column_with_factory` to be public, which
it is not (`pub(crate)`). That optimization is deferred to a future
arrow-rs upstream PR plus a follow-up here. The arrays-input shape
delivers the same per-page memory bound today using only the public
parquet API.

The caller contract is a 4-step state machine (start_row_group →
{write_next_column | write_next_column_arrays} × N → finish →
repeat → close). Out-of-order calls return a structured error
rather than panicking. Top-level arrow fields must each map to
exactly one parquet leaf, which the metrics schema satisfies;
nested types are rejected at start_row_group with a clear message.

21 tests cover: round-trip (single RG, multi RG, with nulls),
metadata identity vs ArrowWriter (single + multi RG, with KV
entries and sorting_columns populated), per-column statistics
enabled propagation, bloom filter functional round-trip,
bounded-memory contract, empty row group, four caller-contract
violations, plus 8 new array-stream tests covering page-stream
round-trip with nulls, varied chunk sizes, single-row-per-array,
empty-iterator edge case, row-count-mismatch error, too-many-
columns error, and bounded-memory under array streaming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bounded)

Reviewer caught that the previous `write_next_column_arrays`
implementation, despite taking arrays one at a time, still buffered
the entire encoded column chunk in `ArrowColumnWriter`'s internal
`SharedColumnChunk` until `close + append_to_row_group`. Memory
scaled with column-chunk size, not page size — silently violating
the page-bounded contract the API documented.

Switch to `row_group_writer.next_column()` which returns a
`SerializedColumnWriter` whose `SerializedPageWriter` flushes pages
directly through `TrackedWrite` to the user's sink. The Arrow →
typed-values dispatch that `ArrowColumnWriter` provided internally
is reimplemented at the call site for the metrics schema's
physical types: Boolean, Int8/16/32, UInt8/16, Int64, UInt32/64,
Float32/64, Utf8/LargeUtf8, Binary/LargeBinary, and Dictionary
over those (materialized via `take`).

Tests:

- Renamed `test_array_stream_bounded_memory_per_column` to
  `test_array_stream_pending_writers_drained_per_column`. The
  previous name overstated the test — `pending_writers_memory_size`
  only sees pre-allocated writers for *future* columns, never the
  in-flight column's `SharedColumnChunk`. The renamed version
  documents what it actually tests.
- Added `test_array_stream_pages_flush_incrementally` that
  configures uncompressed UInt64 with pages > BufWriter capacity.
  Asserts the per-column-write sink-write count is consistent with
  one-write-per-page (SerializedColumnWriter path) and inconsistent
  with io::copy-at-end (ArrowColumnWriter path).

Direct page copy (input bytes → output bytes, no decompress +
recompress) for columns whose pages all come from one input
remains a deferred follow-up; arrow-rs's
`SerializedRowGroupWriter::next_column_with_factory` is
`pub(crate)`, blocking that optimization until an upstream PR
lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ArrowSchemaConverter` maps Arrow `UInt32` to a parquet INT32
*physical* type with an unsigned logical annotation, not INT64.
The earlier dispatch routed `DataType::UInt32` through the Int64
column writer (`write_int64_compatible`), which fails at runtime
with a physical-type mismatch when the row group is finalised —
affecting any production schema with a `UInt32` field, e.g. the
sketch `flags` column in
`quickwit-parquet-engine/src/schema/sketch_fields.rs`.

Switches `UInt32` to `write_int32_compatible` with `value(i) as i32`
(bit-reinterpreting cast); the unsigned logical annotation tells
readers to interpret the 32 bits back as `u32`, so the round trip
is bit-exact. Updates the doc-comment limitation list to put
`UInt32` in the INT32 group.

Adds `test_uint32_round_trip_through_array_stream` covering the
high-bit range (`0x8000_0000`, `0xFFFF_FFFF`, etc.) — values that
would clip if the writer used a signed `i32` interpretation without
the unsigned logical annotation.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a96342addd


Comment thread quickwit/quickwit-parquet-engine/src/storage/streaming_writer.rs
PR-1 (#6377) merged today and switched the default
`ParquetWriterConfig::statistics_enabled` from `Chunk` to `Page`.
`test_statistics_enabled_propagates` was asserting the old
`Chunk` value, so it failed under PR-2's CI once GitHub auto-merged
PR-1's main into PR-2.

Updates the assertion to expect `Page` and extends the test to also
verify that the column index and offset index land in the file —
the new behaviour PR-1 brings to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot force-pushed the gtt/streaming-parquet-writer branch from a96342a to 396f1e0 Compare May 8, 2026 21:25
g-talbot added a commit that referenced this pull request May 8, 2026
The DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`)
columns are top-level arrow fields with exactly one parquet leaf, so
they passed the leaf-count check at `start_row_group` but then hit
the unmatched `DataType::List(_)` arm in
`write_array_via_serialized_column_writer` and panicked deep in the
column write — leaving the row-group builder on an unrecoverable error
path. Codex review on #6384 caught the gap; sketches must work
end-to-end through the streaming engine.

Adds an `is_nested_arrow_type` predicate. The page-bounded direct
`SerializedColumnWriter` dispatch handles flat physical types as
before (`Boolean`, `Int*/UInt*`, `Float*`, byte-array,
`Dictionary`). Nested fields (single-leaf
`List`/`LargeList`/`FixedSizeList`/`Struct`/`Map`) fall back to
the column-chunk-buffered `ArrowColumnWriter` path: every input array
is fed via `compute_leaves` → `writer.write`, then the chunk is
finalised + appended at end-of-column. Memory for nested columns
grows with column-chunk size, not page size; primitive columns remain
page-bounded. Multi-leaf nested types (e.g. `Struct` with multiple
primitive children) are still rejected up front at
`start_row_group`, with the error message updated to match.

Adds `test_list_uint64_round_trip_through_array_stream` covering the
sketch shape: feeds two input arrays with variable-length lists
(including the empty list and `u64::MAX`) through
`write_next_column_arrays`, reads back via the standard reader, and
asserts row-by-row value equality. Renames the multi-leaf rejection
test to `test_multi_leaf_nested_type_rejected_at_start_row_group` to
reflect the narrower contract.

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rrays

The DDSketch `keys` (`List<Int16>`) and `counts` (`List<UInt64>`)
columns are top-level arrow fields with one parquet leaf, so they
passed the leaf-count check at `start_row_group` but then hit the
unmatched `DataType::List(_)` arm in
`write_array_via_serialized_column_writer` and panicked deep in the
column write. Codex review on #6384 caught the gap; sketches must work
end-to-end through the streaming engine, page-bounded.

Implements page-bounded support for `List<T>` / `LargeList<T>` where
the outer field is non-nullable and the inner field is non-nullable
and a flat numeric primitive (Int8/16/32/64, UInt8/16/32/64,
Float32/64). Memory stays bounded by one in-flight page — the same
`SerializedColumnWriter::write_batch` call the flat path uses, just
with hand-computed Dremel def/rep levels (max_def = 1, max_rep = 1).
`write_non_nullable_list_via_serialized_column_writer` walks per-row
to compute (def, rep) pairs (empty list -> def=0, rep=0; non-empty
list -> def=1, rep=0 for first / rep=1 for rest), gathers present
values, and dispatches the inner type through `write_list_inner_values`
— same shape as the flat-primitive arms.

Other nested shapes (nullable inner, nullable outer, Struct, Map,
FixedSizeList, multi-leaf nested) return SchemaValidation. Multi-leaf
nested is also rejected up front by `start_row_group` with the error
message updated.

Tests:
- `test_list_uint64_round_trip_through_array_stream`: DDSketch
  `counts` shape with multi-array input, variable-length lists,
  edge values (`u64::MAX`, empty list), row-by-row value equality.
- `test_list_uint64_pages_flush_incrementally`: pages flush during
  the list-column write through the row-count limit (32 768 levels
  / 4096 = 8 pages). Confirms the page-bounded contract held by the
  flat path also holds for the list path; a column-chunk-buffered
  fallback would not pass this assertion.
- `test_multi_leaf_nested_type_rejected_at_start_row_group` (renamed
  from `test_nested_type_rejected_at_start_row_group`).

Reported by Codex review on #6384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
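The per-row (def, rep) computation described in the commit above can be sketched in plain Rust. `list_levels` is a hypothetical helper simplified to `i64` values and `Vec<Vec<i64>>` rows; the real code walks arrow list arrays:

```rust
// Hand-computed Dremel levels for a non-nullable List<T> of
// non-nullable primitives (max_def = 1, max_rep = 1):
//   empty list     -> one (def=0, rep=0) pair, no value
//   non-empty list -> (def=1, rep=0) for the first element,
//                     (def=1, rep=1) for the rest

/// Returns (def_levels, rep_levels, present_values) for one batch
/// of list rows.
pub fn list_levels(rows: &[Vec<i64>]) -> (Vec<i16>, Vec<i16>, Vec<i64>) {
    let mut def = Vec::new();
    let mut rep = Vec::new();
    let mut values = Vec::new();
    for row in rows {
        if row.is_empty() {
            // Empty list: a level pair with no accompanying value.
            def.push(0);
            rep.push(0);
        } else {
            for (i, &v) in row.iter().enumerate() {
                def.push(1);
                // rep = 0 starts a new row; rep = 1 continues it.
                rep.push(if i == 0 { 0 } else { 1 });
                values.push(v);
            }
        }
    }
    (def, rep, values)
}
```

For rows `[[1, 2, 3], [], [4]]` this yields def levels `[1, 1, 1, 0, 1]`, rep levels `[0, 1, 1, 0, 0]`, and present values `[1, 2, 3, 4]` — the shape `write_batch` consumes alongside the gathered values.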
@g-talbot g-talbot force-pushed the gtt/streaming-parquet-writer branch from 8c5482e to 258012c Compare May 9, 2026 00:07