
geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950)#1955

Merged

brendancol merged 2 commits into main from deep-sweep-performance-geotiff-2026-05-15-pass2-02 on May 15, 2026

Conversation

@brendancol
Contributor

Closes #1950.

Summary

`_try_nvcomp_batch_decompress` computed its per-tile host offsets via
a Python `for` loop:

```
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
for i in range(1, n_tiles):
    comp_offsets_h[i] = comp_offsets_h[i - 1] + comp_sizes_list[i - 1]
```

The sibling helper `_batched_d2h_to_bytes` at ~L924 and the
post-compress prefix sum in `_nvcomp_batch_compress` at ~L2572 already
use `np.cumsum(sizes, out=offsets[1:])`. Aligning the decompress side
keeps the codebase consistent and trims interpreter overhead.

Replace the loop with `np.fromiter` + `np.cumsum(..., out=...)`. The
microbench on 1024 tiles drops the prefix sum from 84us to 21us
(~3.9x); the absolute saving is below 0.1% of the nvCOMP kernel
runtime, but the change keeps the three batched-transfer helpers in
the codebase using one prefix-sum idiom.
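For readers outside the diff, the vectorised idiom can be sketched standalone. The sizes below are illustrative; `comp_sizes_list`, `comp_offsets_h`, and the `n_tiles > 1` guard follow the names used in this PR:

```python
import numpy as np

comp_sizes_list = [512, 1024, 256, 2048]  # per-tile compressed sizes (illustrative)
n_tiles = len(comp_sizes_list)

# Materialise the sizes once, then fill offsets[1:] with a prefix sum
# of all sizes except the last -- an exclusive scan.
comp_sizes_arr = np.fromiter(comp_sizes_list, dtype=np.int64, count=n_tiles)
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])

total_comp = int(comp_sizes_arr.sum())
print(comp_offsets_h.tolist(), total_comp)  # [0, 512, 1536, 1792] 3840
```

The `count=n_tiles` argument lets `np.fromiter` allocate the array in one shot instead of growing it incrementally.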

Test plan

  • `test_nvcomp_decompress_uses_cumsum_for_offsets_1950` --
    structural guard against reverting to the Python loop.
  • `test_cumsum_matches_loop_prefix_sum_1950` -- equivalence on a
    1024-element random sizes array.
  • `test_nvcomp_batch_decompress_roundtrip_1950` -- end-to-end
    GPU read of a 1024x1024 float32 deflate-tiled raster still matches
    the source byte-for-byte.
  • Existing nvCOMP / gpu_decode suite (18 tests) passes unchanged.
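A minimal self-contained version of the equivalence check might look like this (a sketch, not the repo's actual test file; the 1024-element shape follows the description above, and the seed is arbitrary):

```python
import numpy as np


def test_cumsum_matches_loop_prefix_sum():
    rng = np.random.default_rng(1950)
    sizes = rng.integers(1, 1 << 20, size=1024, dtype=np.int64)

    # Reference: the Python loop the PR removes.
    loop_offsets = np.zeros(1024, dtype=np.int64)
    for i in range(1, 1024):
        loop_offsets[i] = loop_offsets[i - 1] + sizes[i - 1]

    # Vectorised form under test: exclusive prefix sum into offsets[1:].
    vec_offsets = np.zeros(1024, dtype=np.int64)
    np.cumsum(sizes[:-1], out=vec_offsets[1:])

    np.testing.assert_array_equal(loop_offsets, vec_offsets)
```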

Filed under the performance sweep run from /deep-sweep performance
on the geotiff module.

Copilot AI review requested due to automatic review settings May 15, 2026 15:30
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 15, 2026
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-02 branch from 060efe1 to d28530e on May 15, 2026 15:46
@brendancol
Contributor Author

PR Review: geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950)

Blockers (must fix before merge)

  • None

Suggestions (should fix, not blocking)

  • None

Nits (optional improvements)

  • `xrspatial/geotiff/_gpu_decode.py:1209` still does `comp_offsets_h.astype(np.uint64)` even though `comp_sizes_arr` is now materialised once. The offsets array could likewise be allocated as `np.uint64` up front to skip the `astype` copy; the current code is correct, just one extra small allocation.
  • `xrspatial/geotiff/tests/test_nvcomp_decompress_cumsum_offsets_1950.py:60` -- the 1500-char window after the anchor comment is a brittle structural assertion. If the comment block grows, the window may miss the prefix-sum site. A regex over the whole `_try_nvcomp_batch_decompress` body would be sturdier, but the current form is fine for a regression guard.
  • `xrspatial/geotiff/tests/test_nvcomp_decompress_cumsum_offsets_1950.py:120` -- the round-trip test docstring notes that without nvCOMP the GPU path falls back to a CPU codec and the prefix-sum site is bypassed; consider gating on `XRSPATIAL_GEOTIFF_STRICT_GPU=1` or skipping when nvCOMP is unavailable so the test name matches what it exercises.
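The first nit above -- allocating both host arrays as `np.uint64` up front so no `.astype` copy is needed before the device upload -- would look roughly like this (a sketch with illustrative sizes, not the merged code):

```python
import numpy as np

n_tiles = 4
comp_sizes_arr = np.fromiter([512, 1024, 256, 2048], dtype=np.uint64, count=n_tiles)

# uint64 from the start: the later device upload (e.g. cupy.asarray) can
# consume this directly, with no intermediate .astype(np.uint64) copy.
comp_offsets_h = np.zeros(n_tiles, dtype=np.uint64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])
```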

What looks good

  • Vectorised prefix sum matches the idiom already used in `_batched_d2h_to_bytes` (~L924) and `_nvcomp_batch_compress` (~L2572).
  • `n_tiles == 0` and `n_tiles == 1` edge cases are handled via the `if n_tiles > 1` guard; `np.fromiter(..., count=n_tiles)` allocates exactly once.
  • `total_comp = int(comp_sizes_arr.sum())` keeps the int64 accumulator and matches the prior `sum(comp_sizes_list)` semantics.
  • The equivalence test on a 1024-element random sizes array directly compares `np.cumsum` against the prior loop.

Checklist

  • Algorithm matches reference/paper (prefix sum identity preserved)
  • All implemented backends produce consistent results (GPU-only code path; behaviour unchanged for non-GPU)
  • NaN handling is correct (not applicable, integer offsets)
  • Edge cases are covered by tests (n_tiles == 0 / 1 handled by guard; round-trip on 1024x1024 raster)
  • Dask chunk boundaries handled correctly (not applicable, single-process decode)
  • No premature materialization or unnecessary copies (one fromiter + one astype; comp_sizes_list redundant copy removed)
  • Benchmark exists or is not needed (microbench numbers cited in PR body, 84us -> 21us on 1024 tiles)
  • README feature matrix updated (not applicable)
  • Docstrings present and accurate (inline comment block updated with rationale and issue ref)

brendancol added a commit that referenced this pull request May 15, 2026
brendancol added a commit that referenced this pull request May 15, 2026
@brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-02 branch 2 times, most recently from 2a417f7 to 8fdb402 on May 15, 2026 17:27
@brendancol
Contributor Author

PR Review: geotiff: nvcomp decompress prefix-sum offsets via np.cumsum (#1950) (re-review after rebase)

Blockers (must fix before merge)

  • None

Suggestions (should fix, not blocking)

  • None

Nits (optional improvements)

  • `xrspatial/geotiff/_gpu_decode.py:1210` -- `total_comp = int(comp_sizes_arr.sum())` performs a second pass over the size array. `total_comp = int(comp_offsets_h[-1]) + int(comp_sizes_arr[-1])` reuses the already-computed prefix sum, though the difference is microseconds at most.
  • `xrspatial/geotiff/tests/test_nvcomp_decompress_cumsum_offsets_1950.py:54` -- the regex anchors on the `comp_sizes_arr` / `comp_offsets_h` literal names. Future renames will fail the structural guard before the runtime guard catches anything. Anchoring on the `np.cumsum(...out=...[1:])` shape alone would be looser but more refactor-tolerant.
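The `total_comp` nit amounts to reading the total off the prefix sum already in hand rather than re-summing; with illustrative sizes the identity is easy to verify:

```python
import numpy as np

comp_sizes_arr = np.fromiter([512, 1024, 256, 2048], dtype=np.uint64, count=4)
comp_offsets_h = np.zeros(4, dtype=np.uint64)
np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])

# Second pass over the sizes array:
total_via_sum = int(comp_sizes_arr.sum())
# Reuses the exclusive prefix sum: last offset + last size == total.
total_via_offsets = int(comp_offsets_h[-1]) + int(comp_sizes_arr[-1])
assert total_via_sum == total_via_offsets == 3840
```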

What looks good

  • Prefix-sum vectorisation matches the sibling `_batched_d2h_to_bytes` and `_nvcomp_batch_compress` sites, so the three nvcomp helpers now share the same pattern.
  • Allocating `comp_sizes_arr` and `comp_offsets_h` as uint64 up front drops the `.astype(np.uint64)` copies at the `cupy.asarray` call sites.
  • Tests cover both the source-level guard (regex) and numeric equivalence against the explicit loop. The strict-GPU round-trip is correctly gated on `XRSPATIAL_GEOTIFF_STRICT_GPU=1`, so a CPU fallback cannot produce a misleading pass.
  • Microbench numbers (84us -> 21us at 1024 tiles) recorded in the comment for future readers.
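The looser structural guard suggested in the nits -- anchoring on the `np.cumsum(... out=...[1:])` shape rather than on variable names -- could be sketched like this (the `src` excerpt is hypothetical, standing in for the decompress helper's offset-building block):

```python
import re

# Hypothetical excerpt of the decompress helper's offset-building block.
src = """
comp_sizes_arr = np.fromiter(comp_sizes_list, dtype=np.uint64, count=n_tiles)
comp_offsets_h = np.zeros(n_tiles, dtype=np.uint64)
if n_tiles > 1:
    np.cumsum(comp_sizes_arr[:-1], out=comp_offsets_h[1:])
"""

# Match the cumsum-into-offsets shape without binding to specific names,
# so a rename does not break the guard before the runtime tests would.
cumsum_shape = re.compile(r"np\.cumsum\([^)]*out=\w+\[1:\]\)")
assert cumsum_shape.search(src)
assert "for i in range(1," not in src  # the old loop must stay gone
```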

Checklist

  • Algorithm matches reference/paper (np.cumsum equivalent to explicit loop, asserted in test)
  • All implemented backends produce consistent results (GPU-only path)
  • NaN handling is correct (not applicable, byte-level offsets)
  • Edge cases are covered by tests (n_tiles > 1 guard, single-tile path)
  • Dask chunk boundaries handled correctly (not applicable)
  • No premature materialization or unnecessary copies (uint64 up front removes astype copies)
  • Benchmark exists or is not needed (microbench inlined in comment)
  • README feature matrix updated (if applicable) (not applicable)
  • Docstrings present and accurate

`_try_nvcomp_batch_decompress` computed its per-tile host offsets via
a Python `for` loop:

```
comp_offsets_h = np.zeros(n_tiles, dtype=np.int64)
for i in range(1, n_tiles):
    comp_offsets_h[i] = comp_offsets_h[i - 1] + comp_sizes_list[i - 1]
```

The sibling helper `_batched_d2h_to_bytes` at ~L924 and the
post-compress prefix sum in `_nvcomp_batch_compress` at ~L2572 already
use `np.cumsum(sizes, out=offsets[1:])`. Aligning the decompress side
keeps the codebase consistent and trims interpreter overhead.

Replace the loop with `np.fromiter` + `np.cumsum(..., out=...)` and
materialise the per-tile sizes once as `comp_sizes_arr`; subsequent
slicing and the `cupy.asarray` upload use the array directly. The
microbench on 1024 tiles drops the prefix sum from 84us to 21us
(~3.9x).

Tests cover:
* structural -- the decompress upload block uses `np.cumsum` and no
  longer contains the `for i in range(1, n_tiles)` loop,
* equivalence -- the cumsum-based offsets match the prior loop
  pointwise on a 1024-element random array,
* round-trip -- a 1024x1024 float32 deflate-tiled raster still decodes
  through the GPU path and matches the source byte-for-byte.
@brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-02 branch from 8fdb402 to 84b7f9a on May 15, 2026 17:52
@brendancol
Contributor Author

Rebased onto current main at 84b7f9a to pick up the Python 3.14 CI matrix limit (b1579f8).

@brendancol merged commit 8dc2a17 into main on May 15, 2026. 5 checks passed.

Successfully merging this pull request may close these issues.

nvCOMP decompress prefix-sum offsets: replace Python for loop with np.cumsum
