Conversation

@tennisleng tennisleng commented Dec 15, 2025

Rationale for this change

This PR implements the optimization to fuse definition level decoding with counting in the Parquet column reader, addressing the TODO in cpp/src/parquet/column_reader.cc.

What changes are included in this PR?

  1. Added GetBatchWithCount to RleBitPackedDecoder, RleRunDecoder, and BitPackedRunDecoder in cpp/src/arrow/util/rle_encoding_internal.h.
  2. Added DecodeAndCount to LevelDecoder in cpp/src/parquet/column_reader.h and cpp/src/parquet/column_reader.cc.
  3. Updated TypedColumnReaderImpl::ReadLevels in cpp/src/parquet/column_reader.cc to use DecodeAndCount.
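For context, the fusion being described can be sketched in a minimal standalone form. This is an illustrative toy model (the `Run` struct and function names are assumptions, not Arrow's actual code): definition levels are modeled as RLE runs of `(value, length)`, and the fused path counts matches while decoding, so an RLE run contributes to the count in O(1) instead of requiring a second O(n) pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of RLE-encoded definition levels: `value` repeated `length` times.
struct Run { int16_t value; int64_t length; };

// Old shape: decode everything, then count matches in a second pass.
int64_t DecodeThenCount(const std::vector<Run>& runs, int16_t match,
                        std::vector<int16_t>* out) {
  for (const auto& r : runs)
    out->insert(out->end(), static_cast<size_t>(r.length), r.value);
  return std::count(out->begin(), out->end(), match);
}

// New shape: count while decoding. An RLE run contributes either 0 or its
// whole length, so the counting work per run is O(1).
int64_t DecodeAndCount(const std::vector<Run>& runs, int16_t match,
                       std::vector<int16_t>* out) {
  int64_t count = 0;
  for (const auto& r : runs) {
    out->insert(out->end(), static_cast<size_t>(r.length), r.value);
    if (r.value == match) count += r.length;  // no per-element comparison
  }
  return count;
}
```

Both paths decode the same values and return the same count; the fused path simply avoids re-scanning the output buffer.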

Are these changes tested?

Yes.

  1. Added a new unit test Rle.GetBatchWithCount in cpp/src/arrow/util/rle_encoding_test.cc.
  2. Verified with arrow-bit-utility-test.

Are there any user-facing changes?

No, this is an internal performance optimization.

…ng and counting

This PR implements the optimization to fuse definition level decoding with counting in the Parquet column reader, addressing the TODO in column_reader.cc.
@tennisleng tennisleng requested a review from wgtmac as a code owner December 15, 2025 23:51
@github-actions

⚠️ GitHub issue #45847 has been automatically assigned in GitHub to PR creator.

wgtmac commented Dec 16, 2025

Thanks for creating the PR! I'm just confused about its relationship with the linked issue.

@tennisleng (Contributor, Author)

Sorry for the confusion. I stumbled onto this optimization while investigating #45847, but you're right that they're not really related; this is more of a standalone Parquet reader improvement. I can create a separate issue for it and update the reference if you'd like.

pitrou commented Dec 17, 2025

Two things:

  1. You should show benchmark numbers to make sure that this is actually an optimization (and that the benefits are large enough).
  2. Some work on level decoding is also being done in #47994 (GH-48277: [C++][Parquet] unpack with shuffle algorithm); hopefully the two PRs won't conflict.

cc @AntoinePrv FYI

Add ReadLevels_RleCountSeparate and ReadLevels_RleCountFused benchmarks
to compare the old approach (Decode + std::count) vs the new fused
DecodeAndCount approach.

Results show ~12% speedup for RLE-heavy data (high repeat counts)
where counting is O(1) for entire runs.
…hCount design

Clarify that the fused counting approach is only beneficial for RLE runs
where counting is O(1). For bit-packed runs, std::count after GetBatch is
used since it's highly optimized by modern compilers (SIMD).
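The bit-packed half of that design decision can be sketched as follows (an illustrative model with assumed names, not Arrow's actual types): a bit-packed run carries arbitrary values, so there is no O(1) shortcut, and decoding followed by `std::count` over the just-decoded, cache-hot buffer is already close to optimal because compilers vectorize the compare-and-sum loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a bit-packed run: arbitrary values, already unpacked here.
// Appends the run to `out` and returns how many values equal `match`.
int64_t CountBitPackedRun(const std::vector<int16_t>& run_values, int16_t match,
                          std::vector<int16_t>* out) {
  out->insert(out->end(), run_values.begin(), run_values.end());
  // No per-run shortcut exists for arbitrary values; std::count over the
  // cache-hot buffer is SIMD-friendly, so a fused variant buys little.
  return std::count(run_values.begin(), run_values.end(), match);
}
```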
@AntoinePrv AntoinePrv left a comment


I don't know enough about Parquet internals to know whether decode-and-count is a much-needed operation. I agree that a benchmark is needed.

From a design perspective: the split of RleBitPackedDecoder into a RleRunDecoder and a BitPackedRunDecoder is meant to allow modularity and avoid adding too many specialized operations there. If only LevelDecoder::DecodeAndCount needs GetBatchWithCount, perhaps all that logic could be implemented there by using the parser and the run decoders.

Comment on lines 312 to 327
[[nodiscard]] rle_size_t GetBatchWithCount(value_type* out, rle_size_t batch_size,
                                           rle_size_t value_bit_width,
                                           value_type match_value, int64_t* out_count) {
  if (ARROW_PREDICT_FALSE(remaining_count_ == 0)) {
    return 0;
  }

  const auto to_read = std::min(remaining_count_, batch_size);
  std::fill(out, out + to_read, value_);
  if (value_ == match_value) {
    *out_count += to_read;
  }
  remaining_count_ -= to_read;
  return to_read;
}

Contributor

Could this call RleRunDecoder::GetBatch to avoid duplicating the logic?

Comment on lines +404 to +406
const auto steps = GetBatch(out, batch_size, value_bit_width);
// std::count is highly optimized (SIMD) by modern compilers
*out_count += std::count(out, out + steps, match_value);
Contributor

I have been working on the unpack function used in GetBatch, and my intuition is also that this function could not easily be extended to count at the same time as it extracts (not impossible, but it would require heavy changes).

Still, this could provide better data locality when processing run by run.

Member

The typical batch size for levels is probably small, so it would fit at least in L2 cache and perhaps L1. Not sure it's worth trying to do it while decoding.

@AntoinePrv

> 2. Some work on level decoding is also being done in

They do not.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Dec 18, 2025
Address review feedback by calling GetBatch instead of duplicating the
fill logic. For RLE runs, counting remains O(1) since all values in the
run are identical.
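The post-review shape can be sketched with a self-contained toy decoder (an illustrative model; the class name matches the PR's RleRunDecoder but the members and signatures here are assumptions): GetBatchWithCount delegates the fill to GetBatch and then adds the match contribution in O(1), since every value in an RLE run is identical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy RLE run decoder: one run of `value` repeated `count` times.
class RleRunDecoder {
 public:
  RleRunDecoder(int16_t value, int64_t count) : value_(value), remaining_(count) {}

  // Fills `out` with up to `batch_size` values from the run.
  int64_t GetBatch(int16_t* out, int64_t batch_size) {
    const int64_t to_read = std::min(remaining_, batch_size);
    std::fill(out, out + to_read, value_);
    remaining_ -= to_read;
    return to_read;
  }

  // Delegates the fill to GetBatch (no duplicated logic), then counts in
  // O(1): the run's single value either matches or it doesn't.
  int64_t GetBatchWithCount(int16_t* out, int64_t batch_size,
                            int16_t match_value, int64_t* out_count) {
    const int64_t read = GetBatch(out, batch_size);
    if (value_ == match_value) *out_count += read;
    return read;
  }

 private:
  int16_t value_;
  int64_t remaining_;
};
```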