GH-45847: [C++] Optimize Parquet column reader by fusing decoding and counting #48549
Conversation
This PR implements the optimization to fuse definition level decoding with counting in the Parquet column reader, addressing the TODO in column_reader.cc.
Thanks for creating the PR! I'm just confused about its relationship with the linked issue.
Sorry for the confusion. I stumbled onto this optimization while investigating #45847, but you're right that they're not really related. This is more of a standalone Parquet reader improvement. I can create a separate issue for it and update the reference if you'd like.
Two things:
cc @AntoinePrv FYI
Add ReadLevels_RleCountSeparate and ReadLevels_RleCountFused benchmarks to compare the old approach (Decode + std::count) with the new fused DecodeAndCount approach. Results show a ~12% speedup for RLE-heavy data (high repeat counts), where counting is O(1) for entire runs.
Clarify that the fused counting approach is only beneficial for RLE runs, where counting is O(1). For bit-packed runs, std::count after GetBatch is used, since it is highly optimized (SIMD) by modern compilers.
AntoinePrv
left a comment
I don't know enough about Parquet internals to know whether decode-and-count is a much-needed operation. I agree that a benchmark is needed.
From a design perspective: the split of RleBitPackingDecoder into an RleRunDecoder and a BitPackingRunDecoder is meant to allow modularity and avoid adding too many specialized operations there. If only LevelDecoder::DecodeAndCount needs GetBatchWithCount, perhaps all that logic could be implemented there by using the parser and run decoders.
```cpp
[[nodiscard]] rle_size_t GetBatchWithCount(value_type* out, rle_size_t batch_size,
                                           rle_size_t value_bit_width,
                                           value_type match_value, int64_t* out_count) {
  if (ARROW_PREDICT_FALSE(remaining_count_ == 0)) {
    return 0;
  }

  const auto to_read = std::min(remaining_count_, batch_size);
  std::fill(out, out + to_read, value_);
  if (value_ == match_value) {
    *out_count += to_read;
  }
  remaining_count_ -= to_read;
  return to_read;
}
```
Could this call RleRunDecoder::GetBatch to avoid duplicating the logic?
```cpp
const auto steps = GetBatch(out, batch_size, value_bit_width);
// std::count is highly optimized (SIMD) by modern compilers
*out_count += std::count(out, out + steps, match_value);
```
I have been working on the unpack function used in GetBatch, and my intuition is also that it could not easily be extended to count at the same time as it extracts (not impossible, but it would require heavy changes).
Still, this could provide better data locality when working run by run.
The typical batch size for levels is probably small, so it would fit at least in L2 cache and perhaps L1. Not sure it's worth trying to do it while decoding.
They do not.
Address review feedback by calling GetBatch instead of duplicating the fill logic. For RLE runs, counting remains O(1) since all values in the run are identical.
Rationale for this change
This PR implements the optimization to fuse definition level decoding with counting in the Parquet column reader, addressing the TODO in cpp/src/parquet/column_reader.cc.

What changes are included in this PR?
- Add GetBatchWithCount to RleBitPackedDecoder, RleRunDecoder, and BitPackedRunDecoder in cpp/src/arrow/util/rle_encoding_internal.h.
- Add DecodeAndCount to LevelDecoder in cpp/src/parquet/column_reader.h and cpp/src/parquet/column_reader.cc.
- Update TypedColumnReaderImpl::ReadLevels in cpp/src/parquet/column_reader.cc to use DecodeAndCount.

Are these changes tested?
Yes. Added a Rle.GetBatchWithCount test in cpp/src/arrow/util/rle_encoding_test.cc, run as part of arrow-bit-utility-test.

Are there any user-facing changes?
No, this is an internal performance optimization.