Optimize take_fixed_size_binary For Predefined Value Lengths#9535
Optimize take_fixed_size_binary For Predefined Value Lengths#9535tobixdev wants to merge 2 commits intoapache:mainfrom
take_fixed_size_binary For Predefined Value Lengths#9535Conversation
|
run benchmark take_kernels |
|
🤖 Hi @tobixdev, thanks for the request (#9535 (comment)). |
|
I assumed that it would not work but it was worth a shot. Would appreciate someone running the benchmark. I saw approximately -80% on my machine. |
|
run benchmark take_kernels |
|
🤖 |
|
🤖: Benchmark completed Details
|
It's not quite 80% but still significant. Maybe this difference is due to Aarch64 vs. x86 (my PC).
We're also close to the primitive kernel! Parts of the gap could be explained by the entry size (4 vs. 12 bytes). |
|
run benchmark take_kernels |
|
Will run once more to ensure we can reproduce the results. They look good |
|
🤖 |
| } | ||
| } | ||
| let result_buffer = match size_usize { | ||
| 1 => take_fixed_size::<IndexType, 1>(values.values(), indices), |
There was a problem hiding this comment.
as I read this it results in 7 copies of the code which is probably ok here but we do have to be careful in general to avoid too much bloat
There was a problem hiding this comment.
Yes that's definetely a drawback (for compile time and binary size).
|
🤖: Benchmark completed Details
|
🚀 |
Which issue does this PR close?
Rationale for this change
The
takekernel is very important for many operations (e.g.,HashJoinin DataFusion IIRC). Currently, there is a gap between the performance of the take kernel for primitive arrays (e.g.,DataType::UInt32) and fixed size binary arrays of the same length (e.g.,FixedSizeBinary<4>).In our case this lead to a performance reduction when moving from an integer-based id column to a fixed-size-binary-based id column. This PR aims to address parts of this gap.
The 16-bytes case would especially benefit operations on UUID columns.
What changes are included in this PR?
take_fixed_sizethat can be called for set of predefined fsb-lengths that we want to support. This is a "flat buffer" version of thetake_nativekernel.Are these changes tested?
I've added another test that still exercises the non-optimized code path.
Are there any user-facing changes?
No