GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) #3523
Open
iemejia wants to merge 4 commits into apache:master from
Conversation
Fokko approved these changes on Apr 28, 2026
int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
bytesToRead = Math.min(bytesToRead, in.available());
- new DataInputStream(in).readFully(bytes, 0, bytesToRead);
+ new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
Contributor
I think this resource is intentionally left unclosed; should we add a comment saying so?
valueIndex < currentCount;
valueIndex += 8, byteIndex += bitWidth) {
- packer.unpack8Values(bytes, byteIndex, currentBuffer, valueIndex);
+ packer.unpack8Values(packedBytesBuffer, byteIndex, currentBuffer, valueIndex);
Contributor
I noticed that unpack8Values(final byte[] input, ...) is deprecated; should we switch to the ByteBuffer overload while we're reworking this?
diff --git a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
index da9db00c7..14ec1a87d 100644
--- a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
+++ b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
@@ -21,6 +21,7 @@ package org.apache.parquet.column.values.rle;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
+import java.nio.ByteBuffer;
import org.apache.parquet.Preconditions;
import org.apache.parquet.bytes.BytesUtils;
import org.apache.parquet.column.values.bitpacking.BytePacker;
@@ -52,7 +53,7 @@ public class RunLengthBitPackingHybridDecoder {
// Reusable buffers to avoid per-run allocation in PACKED mode
private int[] packedValuesBuffer = new int[0];
- private byte[] packedBytesBuffer = new byte[0];
+ private ByteBuffer packedBytesBuffer = ByteBuffer.allocate(0);
public RunLengthBitPackingHybridDecoder(int bitWidth, InputStream in) {
LOG.debug("decoding bitWidth {}", bitWidth);
@@ -102,13 +103,13 @@ public class RunLengthBitPackingHybridDecoder {
}
currentBuffer = packedValuesBuffer;
int bytesRequired = numGroups * bitWidth;
- if (packedBytesBuffer.length < bytesRequired) {
- packedBytesBuffer = new byte[bytesRequired];
+ if (packedBytesBuffer.capacity() < bytesRequired) {
+ packedBytesBuffer = ByteBuffer.allocate(bytesRequired);
}
// At the end of the file RLE data though, there might not be that many bytes left.
int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
bytesToRead = Math.min(bytesToRead, in.available());
- new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
+ new DataInputStream(in).readFully(packedBytesBuffer.array(), 0, bytesToRead);
for (int valueIndex = 0, byteIndex = 0;
valueIndex < currentCount;
valueIndex += 8, byteIndex += bitWidth) {
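The buffer-reuse pattern the diff above introduces can be sketched in isolation. This is a minimal, hypothetical class (the names `LazyGrowBuffers`, `prepareForRun`, `capacity`, and `logicalLength` are illustrative, not Parquet's API); it shows only the lazy-grow idea: buffers survive across runs and grow but never shrink, so steady-state decoding allocates nothing per PACKED run.

```java
// Sketch of the lazy-grow buffer reuse used by the PACKED path.
// currentBufferLength tracks the logical active region, since the backing
// array may be larger than the current run after an earlier, bigger run.
public class LazyGrowBuffers {
    private int[] packedValuesBuffer = new int[0];
    private byte[] packedBytesBuffer = new byte[0];
    private int currentBufferLength;

    public void prepareForRun(int numGroups, int bitWidth) {
        int valuesRequired = numGroups * 8;       // 8 values per packed group
        int bytesRequired = numGroups * bitWidth; // bitWidth bytes per group
        if (packedValuesBuffer.length < valuesRequired) {
            packedValuesBuffer = new int[valuesRequired]; // grow, never shrink
        }
        if (packedBytesBuffer.length < bytesRequired) {
            packedBytesBuffer = new byte[bytesRequired];
        }
        currentBufferLength = valuesRequired;     // logical length of this run
    }

    public int capacity()      { return packedValuesBuffer.length; }
    public int logicalLength() { return currentBufferLength; }
}
```

After a 4-group run, a following 1-group run reuses the same 32-slot array with a logical length of 8, which is exactly why the real decoder needs the separate `currentBufferLength` field.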
…dDecoder PACKED path

Allocate the int[] values buffer and byte[] read-staging buffer once per decoder and grow them lazily, instead of allocating fresh arrays on every PACKED run. Resolves the existing "TODO: reuse a buffer" comment.

A new currentBufferLength field tracks the logical length of the active region in packedValuesBuffer (which may now exceed the current run's size after a prior larger run grew it).

Benchmark (RleDictionaryIndexDecodingBenchmark, 100k INT32, BIT_WIDTH=10, JMH -wi 5 -i 10 -f 2):

Pattern         | master ops/s | optimized ops/s | Improvement
SEQUENTIAL      | 93,061,521   | 113,856,860     | +22.3%
RANDOM          | 92,929,824   | 114,238,638     | +22.9%
LOW_CARDINALITY | 92,813,229   | 115,271,347     | +24.2%

End-to-end FileReadBenchmark sees ~2% improvement (RLE decoding is a small fraction of full file reads).

Validation: 573 parquet-column tests pass. Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
…eam to direct ByteBuffer

Replace the InputStream-based I/O in RunLengthBitPackingHybridDecoder with direct ByteBuffer access using LITTLE_ENDIAN byte order. This eliminates the InputStream/DataInputStream abstraction layer and enables more efficient reads via buffer.get(), buffer.getShort(), buffer.getInt() intrinsics.

Changes:
- Add ByteBuffer overloads for BytesUtils.readUnsignedVarInt() and readIntLittleEndianPaddedOnBitWidth()
- Update all callers (DictionaryValuesReader, RunLengthBitPackingHybridValuesReader, ColumnReaderBase) to pass ByteBuffer instead of InputStream
- Remove IOException from readInt() since ByteBuffer operations don't throw checked exceptions, simplifying all call sites
- Update tests to use ByteBuffer API directly

All 573 parquet-column tests pass.
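The two reads named in the commit above can be sketched against a plain ByteBuffer. The bodies below are illustrative reimplementations, not Parquet's actual BytesUtils code: readUnsignedVarInt decodes a ULEB128 varint (the encoding the RLE run header uses), and readIntLittleEndianPaddedOnBitWidth reads ceil(bitWidth/8) bytes as a little-endian int. Note that neither throws a checked exception, which is what lets the PR drop IOException from call sites.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of ByteBuffer-based reads replacing the InputStream versions.
public class ByteBufferReads {
    static int readUnsignedVarInt(ByteBuffer in) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;           // advances position, no IOException
            value |= (b & 0x7F) << shift;  // low 7 bits carry the payload
            shift += 7;
        } while ((b & 0x80) != 0);         // high bit set = more bytes follow
        return value;
    }

    static int readIntLittleEndianPaddedOnBitWidth(ByteBuffer in, int bitWidth) {
        int bytes = (bitWidth + 7) / 8;    // ceil(bitWidth / 8)
        int value = 0;
        for (int i = 0; i < bytes; i++) {
            value |= (in.get() & 0xFF) << (8 * i);  // little-endian order
        }
        return value;
    }

    // Demo: 0xAC 0x02 is the varint for 300; with bitWidth=10 the next two
    // bytes 0x2A 0x01 read as the little-endian int 298.
    static int[] demo() {
        ByteBuffer buf = ByteBuffer
            .wrap(new byte[] {(byte) 0xAC, 0x02, 0x2A, 0x01})
            .order(ByteOrder.LITTLE_ENDIAN);
        return new int[] {
            readUnsignedVarInt(buf),
            readIntLittleEndianPaddedOnBitWidth(buf, 10)
        };
    }
}
```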
Buffer four 8-value groups (32 values) before packing, then use the packer's pack32Values entry point instead of calling pack8Values four separate times. This reduces per-group overhead and enables the packer's optimized 32-value code path. When fewer than 32 values remain at run end, flushBitPackedValues falls back to per-8 packing via pack8Values. All 573 parquet-column tests pass.
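The grouping-and-flush control flow described above can be sketched without any real bit manipulation. The stub packer below only counts calls (class and method names are stand-ins; Parquet's real BytePacker does the packing); the point is the dispatch: every 4th buffered group triggers one 32-value call, and the sub-4-group tail falls back to per-8 packing.

```java
// Control-flow sketch of the encoder change: buffer values in groups of 8,
// and once 4 groups (32 values) are pending, emit a single pack32 call
// instead of four pack8 calls.
public class Pack32Sketch {
    int pack8Calls, pack32Calls;

    void pack8Values()  { pack8Calls++;  }   // stub: real packer writes bits
    void pack32Values() { pack32Calls++; }

    // totalGroups complete 8-value groups arrive one at a time.
    void encode(int totalGroups) {
        int pendingGroups = 0;
        for (int g = 0; g < totalGroups; g++) {
            pendingGroups++;
            if (pendingGroups == 4) {        // 32 values buffered: fast path
                pack32Values();
                pendingGroups = 0;
            }
        }
        // Residual <4-group tail at run end: per-8 fallback (flush).
        for (int i = 0; i < pendingGroups; i++) {
            pack8Values();
        }
    }
}
```

For example, 11 groups (88 values) would produce two pack32 calls and three pack8 calls.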
Force-pushed from ba5d027 to 38a48ed
…KED branch

Symmetric to the encoder's pack32Values fast path, the decoder's PACKED branch now batches 4 groups (32 values) into a single unpack32Values call instead of looping unpack8Values four times. Falls back to unpack8Values for residual <4-group tails. This benefits long PACKED runs (>=32 values) by reducing loop overhead and enabling the packer's optimized 32-value code path.

Combined benchmark for the full par9 branch (all 4 commits: buffer reuse + ByteBuffer conversion + pack32Values encoder + unpack32Values decoder):

RleDictionaryIndexDecodingBenchmark (100k dictionary IDs, JMH -wi 3 -i 5 -f 1):

Pattern         | Before (ops/s) | After (ops/s) | Improvement
SEQUENTIAL      | 603,445,362    | 698,066,810   | +16% (1.16x)
RANDOM          | 613,691,096    | 681,685,407   | +11% (1.11x)
LOW_CARDINALITY | 611,963,736    | 686,200,341   | +12% (1.12x)

IntEncodingBenchmark.decodeDictionary (100k INT32 values, full dictionary decode path including RLE index decode):

Pattern          | Before (ops/s) | After (ops/s) | Improvement
SEQUENTIAL       | 418,357,276    | 539,458,940   | +29% (1.29x)
RANDOM           | 417,041,197    | 527,231,831   | +26% (1.26x)
LOW_CARDINALITY  | 605,354,083    | 628,283,691   | +4%
HIGH_CARDINALITY | 416,731,808    | 535,763,242   | +29% (1.29x)

All 573 parquet-column tests pass.
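The index arithmetic of the batched decode loop is easy to get wrong, so here is a sketch that records only the offsets each (stubbed) unpack call would receive: full 32-value blocks advance byteIndex by 4*bitWidth, and the residual tail advances by bitWidth per 8-value group. The class is hypothetical; the loop structure mirrors what the commit describes.

```java
import java.util.ArrayList;
import java.util.List;

// Index-arithmetic sketch of the decoder's batched PACKED loop. Each entry
// is {valueIndex, byteIndex, valuesPerCall}; 32 marks a (stubbed)
// unpack32Values call, 8 a fallback unpack8Values call.
public class Unpack32Sketch {
    static List<int[]> decode(int currentCount, int bitWidth) {
        List<int[]> calls = new ArrayList<>();
        int full = (currentCount / 32) * 32;  // values covered by 32-blocks
        int valueIndex = 0, byteIndex = 0;
        for (; valueIndex < full; valueIndex += 32, byteIndex += 4 * bitWidth) {
            calls.add(new int[] {valueIndex, byteIndex, 32});
        }
        // Residual <4-group tail: per-8 fallback (currentCount is always a
        // multiple of 8 in a PACKED run).
        for (; valueIndex < currentCount; valueIndex += 8, byteIndex += bitWidth) {
            calls.add(new int[] {valueIndex, byteIndex, 8});
        }
        return calls;
    }
}
```

With currentCount=40 and bitWidth=10, this yields one 32-value call at byte offset 0 and one 8-value call at value offset 32, byte offset 40.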
Force-pushed from 38a48ed to af02aef
Summary
Closes #3522.
`RunLengthBitPackingHybridDecoder` allocates a new `int[]` and `byte[]` on every PACKED run during decode. The code itself flagged this with a `// TODO: reuse a buffer` comment. This PR resolves the TODO by reusing the buffers across runs within the same decoder instance, growing them lazily only when a larger run is encountered.

Also adds a `currentBufferLength` field to track the logical active-region length in `packedValuesBuffer` (since `packedValuesBuffer.length` may now exceed the current run's size after a prior larger run grew it).

Benchmark
`RleDictionaryIndexDecodingBenchmark` (added in #3512) isolates the RLE/bit-packed dictionary-id decode path. 100k INT32 dictionary IDs, BIT_WIDTH=10, JMH `-wi 5 -i 10 -f 2` (20 measurement iterations):

Pattern         | master ops/s | optimized ops/s | Improvement
SEQUENTIAL      | 93,061,521   | 113,856,860     | +22.3%
RANDOM          | 92,929,824   | 114,238,638     | +22.9%
LOW_CARDINALITY | 92,813,229   | 115,271,347     | +24.2%

End-to-end `FileReadBenchmark` sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages; the isolated micro-benchmark shows the true magnitude on the affected code path.

Validation
- parquet-column: 573 tests pass
- TestRunLengthBitPackingHybridEncoder: 9 tests pass (these round-trip values through the decoder)
- Built with `-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true`

Scope
A 17-LOC change to a single file. Self-contained and low-risk (it resolves the existing TODO).
Related
Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit (`ba52f82c3`) have already been submitted as #3504 and #3506.
The benchmark is added in #3512. Once that lands, reproduce with:
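The exact command is not given here. Assuming #3512 lands the benchmark in the usual parquet-benchmarks JMH harness, the invocation would look roughly like the sketch below (the module path, jar name, and JMH regex are assumptions; the `-wi 5 -i 10 -f 2` settings match the ones quoted in the Benchmark section):

```shell
# Hypothetical reproduction steps; adjust module/jar names to how the
# benchmark actually lands in #3512.
mvn -pl parquet-benchmarks -am package -DskipTests
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
  RleDictionaryIndexDecodingBenchmark -wi 5 -i 10 -f 2
```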