Core: Add a read-only Mumbling bitmap implementation #16747

Open
rdblue wants to merge 5 commits into apache:main from rdblue:mumbling

rdblue commented Jun 9, 2026

Contributor

This is a simple implementation of Mumbling bitmap that has been proposed for embedded bitmaps in v4 metadata.

This implementation includes 3 main classes:

  • BitPacking: bit packing and unpacking implementations for specific widths (1-7) used for descriptor encoding
  • PFOREncoding: patched frame-of-reference encoding and decoding for descriptor bytes
  • MumblingBitmap: a read-only implementation of Mumbling bitmap
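
For readers unfamiliar with fixed-width bit packing, here is a minimal sketch of the idea behind a class like BitPacking. It is not the code in this PR: the class name `BitPackingSketch`, the method signatures, and the least-significant-bit-first layout are all assumptions made for illustration, and a real implementation would likely use unrolled per-width routines rather than a bit-by-bit loop.

```java
final class BitPackingSketch {
  private BitPackingSketch() {}

  // Packs each value into `width` bits, least-significant bit first (assumed layout).
  static byte[] pack(int[] values, int width) {
    if (width < 1 || width > 7) {
      throw new IllegalArgumentException("width must be in [1, 7]: " + width);
    }

    byte[] out = new byte[(values.length * width + 7) / 8];
    int bitPos = 0;
    for (int value : values) {
      if (value < 0 || value >= (1 << width)) {
        throw new IllegalArgumentException("value does not fit in " + width + " bits: " + value);
      }

      for (int b = 0; b < width; b++) {
        if (((value >>> b) & 1) != 0) {
          out[bitPos >>> 3] |= (byte) (1 << (bitPos & 7));
        }
        bitPos++;
      }
    }

    return out;
  }

  // Reverses pack: reads `count` values of `width` bits each.
  static int[] unpack(byte[] packed, int width, int count) {
    int[] out = new int[count];
    int bitPos = 0;
    for (int i = 0; i < count; i++) {
      int value = 0;
      for (int b = 0; b < width; b++) {
        int bit = (packed[bitPos >>> 3] >>> (bitPos & 7)) & 1;
        value |= bit << b;
        bitPos++;
      }
      out[i] = value;
    }

    return out;
  }
}
```

With a width of 3, five values occupy 15 bits and therefore fit in 2 bytes instead of 5.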

Support for creating and modifying bitmaps will be added in later PRs, similar to the approach for variant where SerializedValue implementations were added first as a building block for mutable implementations.

This also includes a benchmark that compares the PFOR implementation to JavaFastPFOR. The comparison is not fair: JavaFastPFOR is intended for large arrays and vectorization, while the cases tested here are very small arrays that don't benefit from vectorization and incur high overhead. The benchmark is there to show that it doesn't make sense to delegate to JavaFastPFOR for a small descriptor array. It probably won't be committed in its current form, but I wanted to make it available for reviewers.
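
As background for the comparison, the following is a minimal sketch of the general patched frame-of-reference technique, assuming small non-negative inputs such as descriptor bytes. It does not reflect the byte layout or API of PFOREncoding in this PR: `PforSketch`, `Encoded`, and the choice of the minimum as the reference are hypothetical, and the low bits are left in an `int[]` rather than bit-packed to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class PforSketch {
  private PforSketch() {}

  // Hypothetical in-memory form: a reference value, low bits per value, and patches.
  static final class Encoded {
    final int reference;
    final int width;
    final int[] lowBits;
    final int[] exceptionPositions;
    final int[] exceptionHighBits;

    Encoded(int reference, int width, int[] lowBits, int[] positions, int[] highBits) {
      this.reference = reference;
      this.width = width;
      this.lowBits = lowBits;
      this.exceptionPositions = positions;
      this.exceptionHighBits = highBits;
    }
  }

  // Stores (value - reference) in `width` bits; larger deltas become exceptions.
  static Encoded encode(int[] values, int width) {
    int reference = 0;
    if (values.length > 0) {
      reference = values[0];
      for (int value : values) {
        reference = Math.min(reference, value);
      }
    }

    int mask = (1 << width) - 1;
    int[] lowBits = new int[values.length];
    List<Integer> positions = new ArrayList<>();
    List<Integer> highBits = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      int delta = values[i] - reference;
      lowBits[i] = delta & mask;
      if (delta > mask) {
        // The bits that did not fit are kept aside and patched in on decode.
        positions.add(i);
        highBits.add(delta >>> width);
      }
    }

    return new Encoded(reference, width, lowBits, toArray(positions), toArray(highBits));
  }

  static int[] decode(Encoded encoded) {
    int[] out = new int[encoded.lowBits.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = encoded.reference + encoded.lowBits[i];
    }

    // Patch step: add back the high bits of each exception.
    for (int j = 0; j < encoded.exceptionPositions.length; j++) {
      out[encoded.exceptionPositions[j]] += encoded.exceptionHighBits[j] << encoded.width;
    }

    return out;
  }

  private static int[] toArray(List<Integer> list) {
    int[] out = new int[list.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = list.get(i);
    }
    return out;
  }
}
```

The point of the patching is that a few large outliers do not force every value to use a wide bit width: most values pay only `width` bits, and each exception pays for its position and high bits.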

Co-Authored-By: Claude Code (Opus 4.7, 1M context) <noreply@anthropic.com>

Comment thread gradle/libs.versions.toml
httpcomponents-httpclient5 = { module = "org.apache.httpcomponents.client5:httpclient5", version.ref = "httpcomponents-httpclient5" }
immutables-value = { module = "org.immutables:value", version.ref = "immutables-value" }
jackson-bom = { module = "com.fasterxml.jackson:jackson-bom", version.ref = "jackson-bom" }
javafastpfor = { module = "me.lemire.integercompression:JavaFastPFOR", version.ref = "javafastpfor" }

Contributor Author

This is for the benchmark and doesn't need to be included in the committed version.

this.cardinality =
(data.get(data.position() + 1) & 0xFF)
| ((data.get(data.position() + 2) & 0xFF) << 8)
| ((data.get(data.position() + 3) & 0xFF) << 16);

Contributor Author

This and other direct reads will be refactored to use utils in ByteBuffers after #16748 moves them out of VariantUtil.
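
As a rough illustration of what such a refactor could look like, the three-byte little-endian read above might move into a helper like the one below. The class name `ByteBufferReads` and the method `readUnsignedInt24LE` are hypothetical; the actual utility names and signatures that #16748 will expose are not known here.

```java
import java.nio.ByteBuffer;

final class ByteBufferReads {
  private ByteBufferReads() {}

  // Hypothetical helper: reads 3 bytes at an absolute offset as an unsigned
  // little-endian integer, without changing the buffer's position.
  static int readUnsignedInt24LE(ByteBuffer buffer, int offset) {
    return (buffer.get(offset) & 0xFF)
        | ((buffer.get(offset + 1) & 0xFF) << 8)
        | ((buffer.get(offset + 2) & 0xFF) << 16);
  }
}
```

Under that assumption, the assignment would reduce to `this.cardinality = ByteBufferReads.readUnsignedInt24LE(data, data.position() + 1);`, with the same behavior as the inline version.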

Random random = new Random(1938745);

// 256-value descriptor-like data: mostly [0,31] with ~5% [0,255] outliers
descriptorValues = PFORRandomData.exceptions(random, 256, 0.5f);

Contributor

0.05?

pvary commented Jun 10, 2026

Contributor

What pattern should I test?
This is what I tried, and what I have found:

pattern                 set positions       Mumbling        Roaring    Roaring + RLE     Parquet v2
last 50k only                   50000           6491          16408               25            631
random 1% + last 50k            53495          10472          22980             7509           4628
random 10%                      40000          40636          50552            50552          27227
random 20%                      80000          50111          51862            51862          42340
random 5%                       20000          20871          40064            40064          16634

amogh-jahagirdar self-requested a review June 10, 2026 19:27