geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950) #1955
Merged
brendancol merged 2 commits on May 15, 2026
brendancol added a commit that referenced this pull request on May 15, 2026
`_try_nvcomp_batch_decompress` computed its per-tile host offsets via
a Python `for` loop:
```
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
for i in range(1, n_tiles):
    comp_offsets_h[i] = comp_offsets_h[i - 1] + comp_sizes_list[i - 1]
```
The sibling helper `_batched_d2h_to_bytes` at ~L924 and the
post-compress prefix sum in `_nvcomp_batch_compress` at ~L2572 already
use `np.cumsum(sizes, out=offsets[1:])`. Aligning the decompress side
keeps the codebase consistent and trims interpreter overhead.
Replace the loop with `np.fromiter` + `np.cumsum(..., out=...)` and
materialise the per-tile sizes once as `comp_sizes_arr`; subsequent
slicing and the `cupy.asarray` upload use the array directly. The
microbench on 1024 tiles drops the prefix sum from 84us to 21us
(~3.9x).
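A minimal sketch of the replacement described above, not the code from the patch itself. The helper name `host_prefix_offsets` is hypothetical, and the sketch assumes the offsets array keeps the same length as the tile list (as in the original loop), so the cumulative sum runs over all sizes except the last:

```python
import numpy as np


def host_prefix_offsets(comp_sizes_list):
    """Hypothetical helper: return (sizes, offsets) for a list of tile sizes.

    offsets[i] is the byte offset at which tile i starts, i.e. an
    exclusive prefix sum of the sizes.
    """
    n_tiles = len(comp_sizes_list)
    # Materialise the per-tile sizes once as an int64 array.
    comp_sizes_arr = np.fromiter(comp_sizes_list, dtype=np.int64, count=n_tiles)
    # offsets[0] stays 0; offsets[1:] receives the running sum of sizes[:-1].
    comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
    if n_tiles > 1:
        np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])
    return comp_sizes_arr, comp_offsets_h


sizes, offsets = host_prefix_offsets([5, 3, 7, 2])
print(offsets.tolist())  # [0, 5, 8, 15]
```

Writing into `out=comp_offsets_h[1:]` avoids a temporary array and leaves the leading zero in place, which is what makes the result an exclusive rather than inclusive prefix sum.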
Tests cover:
* structural -- the decompress upload block uses `np.cumsum` and no
  longer contains the `for i in range(1, n_tiles)` loop,
* equivalence -- the cumsum-based offsets match the prior loop
  pointwise on a 1024-element random array,
* round-trip -- a 1024x1024 float32 deflate-tiled raster still decodes
  through the GPU path and matches the source byte-for-byte.
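The equivalence check could look roughly like the sketch below. The function names, the seed, and the size range are assumptions for illustration; the actual test in the PR may be structured differently.

```python
import numpy as np


def offsets_loop(sizes):
    # Reference: the original Python-loop prefix sum.
    n = len(sizes)
    out = np.zeros(n, dtype=np.int64)
    for i in range(1, n):
        out[i] = out[i - 1] + sizes[i - 1]
    return out


def offsets_cumsum(sizes):
    # Candidate: vectorised exclusive prefix sum.
    arr = np.fromiter(sizes, dtype=np.int64, count=len(sizes))
    out = np.zeros(len(arr), dtype=np.int64)
    if len(arr) > 1:
        np.cumsum(arr[:-1], out=out[1:])
    return out


# 1024 random "compressed tile sizes"; seed and range are arbitrary.
rng = np.random.default_rng(1950)
random_sizes = rng.integers(1, 1 << 20, size=1024).tolist()

assert np.array_equal(offsets_loop(random_sizes), offsets_cumsum(random_sizes))
```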
Rebased onto current main at |
Closes #1950.
Summary
`_try_nvcomp_batch_decompress` computed its per-tile host offsets via
a Python `for` loop. The sibling helper `_batched_d2h_to_bytes` at
~L924 and the post-compress prefix sum in `_nvcomp_batch_compress` at
~L2572 already use `np.cumsum(sizes, out=offsets[1:])`. Aligning the
decompress side keeps the codebase consistent and trims interpreter
overhead.

Replace the loop with `np.fromiter` + `np.cumsum(..., out=...)`. The
microbench on 1024 tiles drops the prefix sum from 84us to 21us
(~3.9x); the absolute saving is below 0.1% of the nvCOMP kernel
runtime, but the change keeps the three batched-transfer helpers in
the codebase using one prefix-sum idiom.
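A microbenchmark along these lines can be reproduced with `timeit`. This is a sketch under assumed parameters (uniform 4096-byte tiles, 200 calls per repeat), not the harness used for the numbers above, and absolute timings will differ by machine and NumPy version.

```python
import timeit

import numpy as np

N_TILES = 1024
comp_sizes_list = [4096] * N_TILES  # assumed per-tile sizes, for timing only


def prefix_loop():
    offsets = np.zeros(N_TILES, dtype=np.int64)
    for i in range(1, N_TILES):
        offsets[i] = offsets[i - 1] + comp_sizes_list[i - 1]
    return offsets


def prefix_cumsum():
    sizes = np.fromiter(comp_sizes_list, dtype=np.int64, count=N_TILES)
    offsets = np.zeros(N_TILES, dtype=np.int64)
    np.cumsum(sizes[:-1], out=offsets[1:])
    return offsets


def best_us(fn, number=200, repeat=5):
    # Best-of-N per-call time in microseconds.
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number * 1e6


loop_us = best_us(prefix_loop)
cumsum_us = best_us(prefix_cumsum)
print(f"loop:   {loop_us:.1f} us")
print(f"cumsum: {cumsum_us:.1f} us")
```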
Test plan
* `test_nvcomp_decompress_uses_cumsum_for_offsets_1950` -- structural
  guard against reverting to the Python loop.
* `test_cumsum_matches_loop_prefix_sum_1950` -- equivalence on a
  1024-element random sizes array.
* `test_nvcomp_batch_decompress_roundtrip_1950` -- end-to-end GPU read
  of a 1024x1024 float32 deflate-tiled raster still matches the source
  byte-for-byte.
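To illustrate what the round-trip test depends on, here is a CPU-only stand-in that uses `zlib` in place of nvCOMP. It is an assumption-laden sketch, not the GPU path: it only shows how the per-tile sizes and prefix-sum offsets locate each compressed tile inside one packed buffer.

```python
import zlib

import numpy as np

# Three small float32 "tiles"; real tiles would come from the raster.
tiles = [np.full((4, 4), i, dtype=np.float32).tobytes() for i in range(3)]
compressed = [zlib.compress(t) for t in tiles]

# Per-tile compressed sizes and their exclusive prefix sum.
comp_sizes_arr = np.fromiter(
    (len(c) for c in compressed), dtype=np.int64, count=len(compressed)
)
comp_offsets_h = np.zeros(len(compressed), dtype=np.int64)
np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])

# Pack all tiles back to back, then recover each one via offset + size.
packed = b"".join(compressed)
decoded = [
    zlib.decompress(packed[off:off + size])
    for off, size in zip(comp_offsets_h.tolist(), comp_sizes_arr.tolist())
]

assert decoded == tiles  # byte-for-byte match with the source tiles
```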
Filed under the performance sweep run from `/deep-sweep performance`
on the geotiff module.