-
Notifications
You must be signed in to change notification settings - Fork 270
perf: optimize CometColumnarToRow for complex types #3213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
andygrove
wants to merge
9
commits into
apache:main
Choose a base branch
from
andygrove:perf/optimize-complex-type-codegen
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
perf: optimize CometColumnarToRow for complex types #3213
andygrove
wants to merge
9
commits into
apache:main
from
andygrove:perf/optimize-complex-type-codegen
+1,225
−13
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is a work-in-progress optimization for complex types (arrays and maps) in CometColumnarToRowExec. The changes include: Phase 1: Cache offset buffer addresses in Comet vectors - CometListVector and CometMapVector now cache the offset buffer memory address at construction time - getArray()/getMap() use Platform.getInt() for direct memory access instead of getOffsetBuffer().getInt() - This eliminates virtual method calls per-row Phase 2: Custom code generation for complex types - CometColumnarToRowExec now generates optimized code for ArrayType and MapType columns - Per-batch caching of offset buffer addresses and child vectors - Uses Platform.getInt() directly in generated code for offset access TODO: - Phase 3: Reusable wrapper objects (MutableColumnarArray/Map) - Benchmarking to measure performance improvements - Additional test coverage Co-Authored-By: Claude Opus 4.5 <[email protected]>
Phase 3 of CometColumnarToRow optimization: - Add CometColumnarArray: mutable ArrayData impl that allows updating offset/length without creating new objects - Add CometColumnarMap: mutable MapData impl with same reusability - Update code generation to create wrapper once per batch and reuse across all rows via update() calls This eliminates object allocation per-row for array and map columns, reducing GC pressure. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Phase 4 of CometColumnarToRow optimization: - Add CometColumnarRow: mutable InternalRow impl that allows updating the rowId without creating new objects - Update code generation to detect StructType columns and use CometColumnarRow wrappers created once per batch This eliminates object allocation per-row for struct columns, reducing GC pressure. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3213 +/- ##
============================================
+ Coverage 56.12% 56.60% +0.48%
- Complexity 976 1376 +400
============================================
Files 119 171 +52
Lines 11743 15945 +4202
Branches 2251 2616 +365
============================================
+ Hits 6591 9026 +2435
- Misses 4012 5632 +1620
- Partials 1140 1287 +147 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add getVariant() method to CometColumnarArray and CometColumnarRow without @OverRide annotation to maintain compatibility with both Spark 3.x (which doesn't have this method) and Spark 4.x (which requires it). Co-Authored-By: Claude Opus 4.5 <[email protected]>
… dirs Move these classes to spark-3.x and spark-4.0 directories to handle the getVariant() method difference: - Spark 3.x: SpecializedGetters doesn't have getVariant() - Spark 4.x: SpecializedGetters.getVariant() returns VariantVal This fixes the CI build failures for Spark 4.x. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Use Platform.getXxx() for direct memory access to value buffers for fixed-width primitive types (byte, short, int, long, float, double, date, timestamp). This caches the value buffer address per-batch and uses JIT-intrinsified memory operations. Co-Authored-By: Claude Opus 4.5 <[email protected]>
When accessing nested arrays, maps, or structs from within complex type wrappers (CometColumnarArray, CometColumnarRow), reuse wrapper objects instead of creating new ones per-element. This eliminates per-element allocations for deeply nested types like Array(Array(Int)) or Struct(Array, Map). Co-Authored-By: Claude Opus 4.5 <[email protected]>
This reverts commit 564bf61.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Optimizes
CometColumnarToRowExeccode generation to reduce allocations and improve performance when converting columnar batches to rows, particularly for complex types (arrays, maps, structs) and fixed-width primitives.Optimizations
1. Reusable wrapper objects for complex types
Instead of creating new
ColumnarArray/ColumnarMap/ColumnarRowobjects per-row, we now create reusableCometColumnarArray,CometColumnarMap, andCometColumnarRowwrappers once per-batch and update their internal pointers for each row:CometColumnarArray: MutableArrayDataimplementation withupdate(offset, length)methodCometColumnarMap: MutableMapDataimplementation wrapping key/value arraysCometColumnarRow: MutableInternalRowimplementation withupdate(rowId)method2. Direct offset buffer access
For
ArrayTypeandMapTypecolumns backed byCometListVector/CometMapVector, we cache the offset buffer memory address per-batch and usePlatform.getInt()for direct memory access instead of going through the Arrow API.3. Direct memory access for fixed-width primitives
For fixed-width primitive types (
ByteType,ShortType,IntegerType,LongType,FloatType,DoubleType,DateType,TimestampType), we cache the value buffer address per-batch and generatePlatform.getXxx()calls for direct memory reads, bypassing theColumnVectorabstraction.Files Changed
CometColumnarArray.java- New mutable ArrayData wrapper (spark-3.x and spark-4.0 versions)CometColumnarMap.java- New mutable MapData wrapperCometColumnarRow.java- New mutable InternalRow wrapper (spark-3.x and spark-4.0 versions)CometPlainVector.java- AddedgetValueBufferAddress()for direct memory accessCometColumnarToRowExec.scala- Updated code generation with optimized pathsTest Plan
benchmarks/pyspark/run_all_benchmarks.sh🤖 Generated with Claude Code