
geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950)#1955

Merged

brendancol merged 2 commits into main from deep-sweep-performance-geotiff-2026-05-15-pass2-02 on May 15, 2026

Conversation

@brendancol
Contributor

Closes #1950.

Summary

`_try_nvcomp_batch_decompress` computed its per-tile host offsets via
a Python `for` loop:

```
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
for i in range(1, n_tiles):
    comp_offsets_h[i] = comp_offsets_h[i - 1] + comp_sizes_list[i - 1]
```

The sibling helper `_batched_d2h_to_bytes` at ~L924 and the
post-compress prefix sum in `_nvcomp_batch_compress` at ~L2572 already
use `np.cumsum(sizes, out=offsets[1:])`. Aligning the decompress side
keeps the codebase consistent and trims interpreter overhead.

Replace the loop with `np.fromiter` + `np.cumsum(..., out=...)`. The
microbench on 1024 tiles drops the prefix sum from 84us to 21us
(~3.9x); the absolute saving is below 0.1% of the nvCOMP kernel
runtime, but the change keeps the three batched-transfer helpers in
the codebase using one prefix-sum idiom.
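For readers outside the diff, the vectorised idiom can be sketched standalone. The sizes below are illustrative; `comp_sizes_list`, `comp_offsets_h`, and the `n_tiles > 1` guard follow the names used in this PR:

```python
import numpy as np

comp_sizes_list = [512, 1024, 256, 2048]  # per-tile compressed sizes (illustrative)
n_tiles = len(comp_sizes_list)

# Materialise the sizes once, then fill offsets[1:] with a prefix sum
# of all sizes except the last -- an exclusive scan.
comp_sizes_arr = np.fromiter(comp_sizes_list, dtype=np.int64, count=n_tiles)
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])

total_comp = int(comp_sizes_arr.sum())
print(comp_offsets_h.tolist(), total_comp)  # [0, 512, 1536, 1792] 3840
```

The `count=n_tiles` argument lets `np.fromiter` allocate the array in one shot instead of growing it incrementally.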

Test plan

  • `test_nvcomp_decompress_uses_cumsum_for_offsets_1950` --
    structural guard against reverting to the Python loop.
  • `test_cumsum_matches_loop_prefix_sum_1950` -- equivalence on a
    1024-element random sizes array.
  • `test_nvcomp_batch_decompress_roundtrip_1950` -- end-to-end
    GPU read of a 1024x1024 float32 deflate-tiled raster still matches
    the source byte-for-byte.
  • Existing nvCOMP / gpu_decode suite (18 tests) passes unchanged.
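A minimal self-contained version of the equivalence check might look like this (a sketch, not the repo's actual test file; the 1024-element shape follows the description above, and the seed is arbitrary):

```python
import numpy as np


def test_cumsum_matches_loop_prefix_sum():
    rng = np.random.default_rng(1950)
    sizes = rng.integers(1, 1 << 20, size=1024, dtype=np.int64)

    # Reference: the Python loop the PR removes.
    loop_offsets = np.zeros(1024, dtype=np.int64)
    for i in range(1, 1024):
        loop_offsets[i] = loop_offsets[i - 1] + sizes[i - 1]

    # Vectorised form under test: exclusive prefix sum into offsets[1:].
    vec_offsets = np.zeros(1024, dtype=np.int64)
    np.cumsum(sizes[:-1], out=vec_offsets[1:])

    np.testing.assert_array_equal(loop_offsets, vec_offsets)
```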

Filed under the performance sweep run from /deep-sweep performance
on the geotiff module.

Copilot AI review requested due to automatic review settings May 15, 2026 15:30
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 15, 2026
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-02 branch from 060efe1 to d28530e on May 15, 2026 15:46
@brendancol
Contributor Author

PR Review: geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950)

Blockers (must fix before merge)

  • None

Suggestions (should fix, not blocking)

  • None

Nits (optional improvements)

  • `xrspatial/geotiff/_gpu_decode.py:1209` still does `comp_offsets_h.astype(np.uint64)` even though `comp_sizes_arr` is now materialised once. The offsets array could likewise be allocated as `np.uint64` up front to skip the `astype` copy; the current code is correct, just one extra small allocation.
  • `xrspatial/geotiff/tests/test_nvcomp_decompress_cumsum_offsets_1950.py:60` -- the 1500-char window after the anchor comment is a brittle structural assertion. If the comment block grows, the window may miss the prefix-sum site. A regex over the whole `_try_nvcomp_batch_decompress` body would be sturdier, but the current form is fine for a regression guard.
  • `xrspatial/geotiff/tests/test_nvcomp_decompress_cumsum_offsets_1950.py:120` -- the round-trip test docstring notes that without nvCOMP the GPU path falls back to a CPU codec and the prefix-sum site is bypassed; consider gating on `XRSPATIAL_GEOTIFF_STRICT_GPU=1` or skipping when nvCOMP is unavailable so the test name matches what it exercises.
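The first nit above -- allocating both host arrays as `np.uint64` up front so no `.astype` copy is needed before the device upload -- would look roughly like this (a sketch with illustrative sizes, not the merged code):

```python
import numpy as np

n_tiles = 4
comp_sizes_arr = np.fromiter([512, 1024, 256, 2048], dtype=np.uint64, count=n_tiles)

# uint64 from the start: the later device upload (e.g. cupy.asarray) can
# consume this directly, with no intermediate .astype(np.uint64) copy.
comp_offsets_h = np.zeros(n_tiles, dtype=np.uint64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])
```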

What looks good

  • Vectorised prefix sum matches the idiom already used in `_batched_d2h_to_bytes` (~L924) and `_nvcomp_batch_compress` (~L2572).
  • `n_tiles == 0` and `n_tiles == 1` edge cases are handled via the `if n_tiles > 1` guard; `np.fromiter(..., count=n_tiles)` allocates exactly once.
  • `total_comp = int(comp_sizes_arr.sum())` keeps the int64 accumulator and matches the prior `sum(comp_sizes_list)` semantics.
  • The equivalence test on a 1024-element random sizes array directly compares `np.cumsum` against the prior loop.

Checklist

  • Algorithm matches reference/paper (prefix sum identity preserved)
  • All implemented backends produce consistent results (GPU-only code path; behaviour unchanged for non-GPU)
  • NaN handling is correct (not applicable, integer offsets)
  • Edge cases are covered by tests (n_tiles == 0 / 1 handled by guard; round-trip on 1024x1024 raster)
  • Dask chunk boundaries handled correctly (not applicable, single-process decode)
  • No premature materialization or unnecessary copies (one fromiter + one astype; comp_sizes_list redundant copy removed)
  • Benchmark exists or is not needed (microbench numbers cited in PR body, 84us -> 21us on 1024 tiles)
  • README feature matrix updated (not applicable)
  • Docstrings present and accurate (inline comment block updated with rationale and issue ref)

brendancol added a commit that referenced this pull request May 15, 2026
brendancol added a commit that referenced this pull request May 15, 2026
@brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-02 branch 2 times, most recently from 2a417f7 to 8fdb402 on May 15, 2026 17:27
@brendancol
Contributor Author

PR Review: geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950) (re-review after rebase)

Blockers (must fix before merge)

  • None

Suggestions (should fix, not blocking)

  • None

Nits (optional improvements)

  • `xrspatial/geotiff/_gpu_decode.py:1210` -- `total_comp = int(comp_sizes_arr.sum())` performs a second pass over the size array. `total_comp = int(comp_offsets_h[-1]) + int(comp_sizes_arr[-1])` reuses the already-computed prefix sum, though the difference is microseconds at most.
  • `xrspatial/geotiff/tests/test_nvcomp_decompress_cumsum_offsets_1950.py:54` -- the regex anchors on the `comp_sizes_arr` / `comp_offsets_h` literal names. Future renames will fail the structural guard before the runtime guard catches anything. Anchoring on the `np.cumsum(...out=...[1:])` shape alone would be looser but more refactor-tolerant.
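The `total_comp` nit amounts to reading the total off the prefix sum already in hand rather than re-summing; with illustrative sizes the identity is easy to verify:

```python
import numpy as np

comp_sizes_arr = np.fromiter([512, 1024, 256, 2048], dtype=np.uint64, count=4)
comp_offsets_h = np.zeros(4, dtype=np.uint64)
np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])

# Second pass over the sizes array:
total_via_sum = int(comp_sizes_arr.sum())
# Reuses the exclusive prefix sum: last offset + last size == total.
total_via_offsets = int(comp_offsets_h[-1]) + int(comp_sizes_arr[-1])
assert total_via_sum == total_via_offsets == 3840
```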

What looks good

  • Prefix-sum vectorisation matches the sibling `_batched_d2h_to_bytes` and `_nvcomp_batch_compress` sites, so the three nvcomp helpers now share the same pattern.
  • Allocating `comp_sizes_arr` and `comp_offsets_h` as uint64 up front drops the `.astype(np.uint64)` copies at the `cupy.asarray` call sites.
  • Tests cover both the source-level guard (regex) and numeric equivalence against the explicit loop. The strict-GPU round-trip is correctly gated on `XRSPATIAL_GEOTIFF_STRICT_GPU=1`, so a CPU fallback cannot produce a misleading pass.
  • Microbench numbers (84us -> 21us at 1024 tiles) recorded in the comment for future readers.
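The looser structural guard suggested in the nits -- anchoring on the `np.cumsum(... out=...[1:])` shape rather than on variable names -- could be sketched like this (the `src` excerpt is hypothetical, standing in for the decompress helper's offset-building block):

```python
import re

# Hypothetical excerpt of the decompress helper's offset-building block.
src = """
comp_sizes_arr = np.fromiter(comp_sizes_list, dtype=np.uint64, count=n_tiles)
comp_offsets_h = np.zeros(n_tiles, dtype=np.uint64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])
"""

# Match the cumsum-into-offsets shape without binding to specific names,
# so a rename does not break the guard before the runtime tests would.
cumsum_shape = re.compile(r"np\.cumsum\([^)]*out=\w+\[1:\]\)")
assert cumsum_shape.search(src)
assert "for i in range(1," not in src  # the old loop must stay gone
```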

Checklist

  • Algorithm matches reference/paper (np.cumsum equivalent to explicit loop, asserted in test)
  • All implemented backends produce consistent results (GPU-only path)
  • NaN handling is correct (not applicable, byte-level offsets)
  • Edge cases are covered by tests (n_tiles > 1 guard, single-tile path)
  • Dask chunk boundaries handled correctly (not applicable)
  • No premature materialization or unnecessary copies (uint64 up front removes astype copies)
  • Benchmark exists or is not needed (microbench inlined in comment)
  • README feature matrix updated (if applicable) (not applicable)
  • Docstrings present and accurate

`_try_nvcomp_batch_decompress` computed its per-tile host offsets via
a Python `for` loop:

```
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
for i in range(1, n_tiles):
    comp_offsets_h[i] = comp_offsets_h[i - 1] + comp_sizes_list[i - 1]
```

The sibling helper `_batched_d2h_to_bytes` at ~L924 and the
post-compress prefix sum in `_nvcomp_batch_compress` at ~L2572 already
use `np.cumsum(sizes, out=offsets[1:])`. Aligning the decompress side
keeps the codebase consistent and trims interpreter overhead.

Replace the loop with `np.fromiter` + `np.cumsum(..., out=...)` and
materialise the per-tile sizes once as `comp_sizes_arr`; subsequent
slicing and the `cupy.asarray` upload use the array directly. The
microbench on 1024 tiles drops the prefix sum from 84us to 21us
(~3.9x).

Tests cover:
* structural -- the decompress upload block uses `np.cumsum` and no
  longer contains the `for i in range(1, n_tiles)` loop,
* equivalence -- the cumsum-based offsets match the prior loop
  pointwise on a 1024-element random array,
* round-trip -- a 1024x1024 float32 deflate-tiled raster still decodes
  through the GPU path and matches the source byte-for-byte.
@brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-02 branch from 8fdb402 to 84b7f9a on May 15, 2026 17:52
@brendancol
Contributor Author

Rebased onto current main at 84b7f9a to pick up the Python 3.14 CI matrix limit (b1579f8).

@brendancol merged commit 8dc2a17 into main on May 15, 2026. 5 checks passed.

Successfully merging this pull request may close these issues.

nvCOMP decompress prefix-sum offsets: replace Python for loop with np.cumsum
