GH-45847: [C++] Optimize Parquet column reader by fusing decoding and counting #48549
Conversation
This PR implements the optimization to fuse definition level decoding with counting in the Parquet column reader, addressing the TODO in column_reader.cc.
Thanks for creating the PR! I'm just confused about its relationship with the linked issue.
Sorry for the confusion. I stumbled onto this optimization while investigating #45847, but you're right that they're not really related. This is more of a standalone Parquet reader improvement. I can create a separate issue for it and update the reference if you'd like.
Two things:
cc @AntoinePrv FYI
Add ReadLevels_RleCountSeparate and ReadLevels_RleCountFused benchmarks to compare the old approach (Decode + std::count) with the new fused DecodeAndCount approach. Results show a ~12% speedup for RLE-heavy data (high repeat counts), where counting is O(1) for entire runs.
Clarify that the fused counting approach is only beneficial for RLE runs, where counting is O(1). For bit-packed runs, std::count after GetBatch is used, since it is highly optimized (SIMD) by modern compilers.
AntoinePrv
left a comment
I don't know enough about Parquet internals to know whether decode-and-count is a much-needed operation. I agree that a benchmark is needed.
From a design perspective: the split of RleBitPackingDecoder into an RleRunDecoder and a BitPackingRunDecoder is meant to allow modularity and avoid adding too many specialized operations there. If only LevelDecoder::DecodeAndCount needs GetBatchWithCount, perhaps all that logic could be implemented there by using the parser and run decoders.
```cpp
[[nodiscard]] rle_size_t GetBatchWithCount(value_type* out, rle_size_t batch_size,
                                           rle_size_t value_bit_width,
                                           value_type match_value, int64_t* out_count) {
  if (ARROW_PREDICT_FALSE(remaining_count_ == 0)) {
    return 0;
  }

  const auto to_read = std::min(remaining_count_, batch_size);
  std::fill(out, out + to_read, value_);
  if (value_ == match_value) {
    *out_count += to_read;
  }
  remaining_count_ -= to_read;
  return to_read;
}
```
Could this call RleRunDecoder::GetBatch to avoid duplicating the logic?
```cpp
const auto steps = GetBatch(out, batch_size, value_bit_width);
// std::count is highly optimized (SIMD) by modern compilers
*out_count += std::count(out, out + steps, match_value);
```
I have been working on the unpack function used in GetBatch, and my intuition is also that it could not easily be extended to count at the same time as it extracts (not impossible, but it would require heavy changes).
Still, this could provide better data locality when working run by run.
The typical batch size for levels is probably small, so it would fit at least in L2 cache and perhaps L1. Not sure it's worth trying to do it while decoding.
They do not.
Address review feedback by calling GetBatch instead of duplicating the fill logic. For RLE runs, counting remains O(1) since all values in the run are identical.
Rationale for this change
This PR implements the optimization to fuse definition level decoding with counting in the Parquet column reader, addressing the TODO in cpp/src/parquet/column_reader.cc.

What changes are included in this PR?
- Add GetBatchWithCount to RleBitPackedDecoder, RleRunDecoder, and BitPackedRunDecoder in cpp/src/arrow/util/rle_encoding_internal.h.
- Add DecodeAndCount to LevelDecoder in cpp/src/parquet/column_reader.h and cpp/src/parquet/column_reader.cc.
- Update TypedColumnReaderImpl::ReadLevels in cpp/src/parquet/column_reader.cc to use DecodeAndCount.

Are these changes tested?
Yes. Added a Rle.GetBatchWithCount test in cpp/src/arrow/util/rle_encoding_test.cc, run as part of arrow-bit-utility-test.

Are there any user-facing changes?
No, this is an internal performance optimization.