-
Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: Optimize ArrowBytesViewMap with direct view access #19975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Optimize ArrowBytesViewMap with direct view access #19975
Conversation
|
Hi @Dandandan This PR implements the optimization you suggested in #19961. Changes made:
All 8 existing tests pass. Ready for review when you have time! Could you also approve the CI workflows to run? Thanks! |
|
run benchmarks |
|
🤖 |
| let value: &[u8] = value.as_ref(); | ||
| // Get the input value - Arrow's value() method is already optimized | ||
| // to handle inline strings efficiently | ||
| let input_value: &[u8] = values.value(i).as_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is slow for inlined bytes (u128) as it converts it back to &[u8] (the same as iter() does, so it will not really speed things up. We can also move it inside the equality check,
We should avoid it if we know either:
- all views are inlined (e.g. no buffer)
- this view is inlined (length < 12)
|
|
||
| v == value | ||
| // Compare stored value with input value | ||
| let stored_value = self.builder.get_value(header.view_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slow for inlined values - we should avoid self.builder.get_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we either have to avoid the Builder API (store views/buffer/nulls here ourself or add some methods (e.g. .views() append_inline_view, etc.) to make it much faster
| } else { | ||
| // no existing value, make a new one. | ||
| // Only dereference bytes here when we actually need to insert | ||
| let value: &[u8] = values.value(i).as_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid this for inlined values (and it's the same as above input_value, so now it does it twice.
| let value: &[u8] = values.value(i).as_ref(); | ||
| let payload = make_payload_fn(Some(value)); | ||
|
|
||
| let inner_view_idx = self.builder.len(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can optimize the part of adding a new view by only adjusting the new buffer index, instead of creating a new view from scratch (slow)
|
Thanks @Tushar7012 for you PR, I left some notes what needs to be changed. |
2739149 to
45135f8
Compare
|
🤖: Benchmark completed Details
|
45135f8 to
4010560
Compare
- Use values.views() instead of values.iter() for direct u128 access - Use is_valid(i) for efficient null checking via validity bitmap - Avoid dereferencing overhead for inline strings - No additional memory overhead in Entry struct Closes apache#19961
4010560 to
51768bf
Compare
|
run benchmarks |
|
🤖 |
| return header.view == view_u128; | ||
| } | ||
|
|
||
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those other paths, while OK will probably not improve that much, as header.hash != hash already will filter out nearly 100% of the false positives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insight! That makes sense - the hash comparison is doing most of the heavy lifting. I'll keep the code as-is since it's still slightly better and doesn't add complexity.
|
🤖: Benchmark completed Details
|
Starting to look good! (I think we can go further when specializing more for inline cases). |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
The speedup of those queries (q1 /2 in extended benchmarks) seems reproducible... if we save some of the builder overhead I think it will also speed up the other cases. |
|
Thanks for reviewing the benchmarks! Great to hear the speedup is reproducible. Regarding the builder overhead optimization for other cases - would you like me to explore that in a follow-up PR, or should we address it in this one before merging? Let me know if there's anything else needed for approval! |
I think it would be better to avoid the extra 16 bytes by either
|
|
Thanks for the detailed suggestions! To make sure I understand correctly: Option 1: Remove the builder usage entirely and store only the Option 2: Keep the current struct without the I'm leaning towards Option 1 as it seems more self-contained, but I want to confirm - for Option 1, would we be modifying the source Let me know which approach you'd prefer and I'll implement it! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Optimizes ArrowBytesViewMap’s hot insertion/lookup path for StringViewArray/BinaryViewArray by iterating over raw u128 views and using validity-bit checks to avoid per-element Option/dereference overhead.
Changes:
- Iterate over
values.views()and usevalues.is_valid(i)for null handling. - Add fast-path equality for inlined (≤12B) values via direct
u128view comparison. - Add prefix-based precheck for non-inlined values to reduce full byte dereferencing/comparison.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// The original u128 view for fast comparison of inline strings (<=12 bytes) | ||
| /// and prefix comparison for larger strings | ||
| view: u128, | ||
|
|
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR description claims "No additional memory overhead in Entry struct", but this change adds view: u128 to Entry, increasing per-entry size (and thus memory usage / map_size) by at least 16 bytes plus padding. If avoiding additional overhead is a requirement, consider storing only the needed metadata (e.g., len: u32 + prefix: u32) or deriving these from the builder’s stored view via view_idx; otherwise, update the PR description to match the implementation.
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) | ||
| // The prefix is stored in the next 4 bytes after length | ||
| // Only dereference full bytes if prefixes match | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | ||
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | ||
| if stored_prefix != input_prefix { | ||
| return false; | ||
| } | ||
|
|
||
| // Prefix matched - must compare full bytes |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The non-inline comparison path skips a cheap length equality check before doing prefix/full-byte comparisons. Since both input and stored views encode length in the low 32 bits, checking header.view as u32 == len (or equivalent) first would avoid calling builder.get_value(...) / values.value(i) when lengths differ but hashes (and possibly prefixes) match, improving the hot path.
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) | |
| // The prefix is stored in the next 4 bytes after length | |
| // Only dereference full bytes if prefixes match | |
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | |
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | |
| if stored_prefix != input_prefix { | |
| return false; | |
| } | |
| // Prefix matched - must compare full bytes | |
| // For larger strings: first compare length using the low 32 bits. | |
| // If lengths differ, the values cannot be equal, so we can skip | |
| // prefix and full-byte comparisons and avoid dereferencing bytes. | |
| if (header.view as u32) != len { | |
| return false; | |
| } | |
| // Next compare the 4-byte prefix (bytes 4-7 of u128). | |
| // The prefix is stored in the next 4 bytes after length. | |
| // Only dereference full bytes if prefixes match. | |
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | |
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | |
| if stored_prefix != input_prefix { | |
| return false; | |
| } | |
| // Prefix and length matched - now compare full bytes. |
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | ||
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix extraction via manual bit shifts ((view >> 32) & 0xFFFFFFFF) duplicates knowledge of the ByteView layout and is harder to maintain. There is already a pattern elsewhere to extract the prefix using Arrow helpers (e.g. GenericByteViewArray::<B>::inline_value(&view, 4) in physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs:271-275), which avoids hardcoding masks/shifts and documents intent more clearly.
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | |
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; | |
| let stored_bytes = header.view.to_le_bytes(); | |
| let stored_prefix = | |
| u32::from_le_bytes(stored_bytes[4..8].try_into().unwrap()); | |
| let input_bytes = view_u128.to_le_bytes(); | |
| let input_prefix = | |
| u32::from_le_bytes(input_bytes[4..8].try_into().unwrap()); |
| // Fast path: for inline strings (<=12 bytes), the entire value | ||
| // is stored in the u128 view, so we can compare directly | ||
| // This avoids the expensive conversion back to bytes | ||
| if len <= 12 { | ||
| return header.view == view_u128; | ||
| } | ||
|
|
||
| // For larger strings: first compare the 4-byte prefix (bytes 4-7 of u128) | ||
| // The prefix is stored in the next 4 bytes after length | ||
| // Only dereference full bytes if prefixes match | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; |
Copilot
AI
Jan 25, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several new comments describe behavior in terms of "strings" (e.g. "inline strings" / "larger strings"), but this function is generic over ByteViewType and is used for both Utf8View and BinaryView. Consider adjusting wording to "values"/"byte sequences" to avoid confusion for the binary case.
| /// Completed buffers containing string data | ||
| completed: Vec<Buffer>, | ||
| /// Tracks null values (true = null) | ||
| nulls: Vec<bool>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
run benchmarks |
|
🤖 |
Some of those unfortunately are noise, but the clickbench_extended are real improvements! |
Per reviewer feedback, replaced Vec<bool> with BooleanBufferBuilder for tracking null values. This uses bit-packed storage (1 bit per value) instead of byte-per-value, reducing memory usage by 8x for the null bitmap. Also fixed clippy warnings for mem_replace_with_default.
|
Fixed! Replaced Changes:
Ready for review! |
|
🤖: Benchmark completed Details
|
|
run benchmarks |
|
🤖 |
|
Nice, starts to show up as improvement for normal clickbench as well. |
|
🤖: Benchmark completed Details
|
| // get the value as bytes | ||
| let value: &[u8] = value.as_ref(); | ||
| // Extract length from the view (first 4 bytes of u128 in little-endian) | ||
| let len = (view_u128 & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
view_u128 as u32 does the same thing
| } | ||
|
|
||
| // For larger strings: first compare the 4-byte prefix | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
& 0xFFFFFFFF is not needed
|
|
||
| // For larger strings: first compare the 4-byte prefix | ||
| let stored_prefix = ((header.view >> 32) & 0xFFFFFFFF) as u32; | ||
| let input_prefix = ((view_u128 >> 32) & 0xFFFFFFFF) as u32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
& 0xFFFFFFFF is not needed
Dandandan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small suggestions, but I think we're good to go then.
| let meta_fetch_concurrency = | ||
| ctx.config_options().execution.meta_fetch_concurrency; | ||
| let file_list = stream::iter(file_list).flatten_unordered(meta_fetch_concurrency); | ||
| let file_list = stream::iter(self.table_paths.iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you commit this by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right - this was committed by accident. I've reverted the table.rs changes in commit 9302c3b. These changes belong in a separate PR for parallelizing list_files_for_scan. Thanks for catching it
|
Thank you @Tushar7012 ! |
Closes #19961
Which issue does this PR close?
Closes #19961
Rationale for this change
The ArrowBytesViewMap was using
values.iter()which creates unnecessaryOptionwrappers and extra overhead when iterating over byte view arrays. For ClickBench query 5, >50% CPU was spent during the intern operation.This PR optimizes the hot path in insert_if_new_inner by using direct view access methods that avoid the iteration overhead.
What changes are included in this PR?
values.iter()withvalues.views(): Access the raw&[u128]view buffer directly instead of creating Option wrappers for each valueis_valid(i)for null checking: Check validity via the bitmap instead of pattern matching onOptionvalues.value(i)only when the value is needed, avoiding unnecessary dereferencingAre these changes tested?
Yes, all existing tests pass:
binary_view_map::tests::string_view_set_emptybinary_view_map::tests::string_view_set_one_nullbinary_view_map::tests::string_view_set_many_nullbinary_view_map::tests::test_string_view_set_basicbinary_view_map::tests::test_string_set_non_utf8binary_view_map::tests::test_binary_setbinary_view_map::tests::test_mapbinary_view_map::tests::test_string_set_memory_usageAre there any user-facing changes?
No, this is an internal performance optimization with no changes to public APIs.