Skip to content

geotiff: GPU writer overview loop uses cupy.putmask for in-place NaN rewrite (#1948)#1952

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-15-pass2-01
May 15, 2026
Merged

geotiff: GPU writer overview loop uses cupy.putmask for in-place NaN rewrite (#1948)#1952
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-15-pass2-01

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #1948.

Summary

The COG overview loop inside write_geotiff_gpu used to allocate a
fresh current.copy() before rewriting NaN cells back to the nodata
sentinel:

current = make_overview_gpu(current, ...)
...
nan_mask = cupy.isnan(current)
if bool(nan_mask.any().item()):
    current = current.copy()
    current[nan_mask] = sentinel

make_overview_gpu returns a freshly allocated cupy buffer at every
call site (the 2-D path ends in cupy.nan* / cupy.around(...).astype(...)
/ cropped[::2, ::2].copy(); the 3-D path ends in cupy.stack), so
nothing aliased the buffer between the return and the in-place rewrite.
The current.copy() allocated a second chunk-sized GPU buffer per
overview level for no semantic gain.

The fix replaces the two-line rewrite with
cupy.putmask(current, nan_mask, sentinel) so the existing buffer is
mutated in place. Mirrors the in-place sentinel rewrite
_apply_nodata_mask_gpu adopted in #1934.

For an 8192x8192 float32 raster with 4 auto-generated overview levels,
the extra allocations sum to roughly 21 MB per write; the saving is
modest but the pattern aligns with #1934 and removes a redundant
device allocation per overview level.

Test plan

  • test_gpu_writer_overview_loop_uses_putmask_1948 -- structural
    guard against reintroducing the redundant copy.
  • test_gpu_writer_cog_overview_sentinel_roundtrip_1948 -- COG
    write -> CPU read preserves the sentinel through every overview level
    (uses an all-sentinel quadrant so every 2x reduction still hits a
    fully-sentinel 2x2 block and triggers the rewrite branch).
  • test_gpu_writer_overview_uses_make_overview_gpu_fresh_buffer_1948
    -- contract: every overview method in _block_reduce_2d_gpu returns
    a cupy buffer whose data.ptr differs from the input.
  • Existing GPU writer + overview + COG suite (137 tests) passes
    unchanged.

Filed under the performance sweep run from /deep-sweep performance
on the geotiff module.

Copilot AI review requested due to automatic review settings May 15, 2026 15:28
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@brendancol brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-01 branch from 179d83c to 38d5b2f Compare May 15, 2026 15:46
@brendancol
Copy link
Copy Markdown
Contributor Author

PR Review: geotiff: GPU writer overview loop uses cupy.putmask for in-place NaN rewrite (#1948)

Blockers (must fix before merge)

  • None

Suggestions (should fix, not blocking)

  • None

Nits (optional improvements)

  • xrspatial/geotiff/_writers/gpu.py:499 -- the scalar cast np.dtype(str(current.dtype)).type(nodata) could be hoisted out of the inner while so it runs once per target_factor instead of once per halving step. The cost is negligible (host-side scalar construction) so this is purely cosmetic.
  • xrspatial/geotiff/tests/test_gpu_writer_overview_inplace_1948.py:79 -- the source-text guard scans a fixed 1500-char window after make_overview_gpu(current. If the surrounding code grows past that window the structural assertion will silently stop covering the right region. Anchoring on the next parts.append( line would be more robust.

What looks good

  • The aliasing argument holds: make_overview_gpu returns a fresh buffer on every reducer path, and the new test_gpu_writer_overview_uses_make_overview_gpu_fresh_buffer_1948 codifies that contract across mean/nearest/min/max/median, so the in-place mutation is safe.
  • The rewrite branch stays gated on kind == 'f', so integer overviews (which can't carry NaN) never reach putmask and dtype semantics are preserved.
  • The sentinel value is still cast to current.dtype before the write, matching the prior indexed-assignment semantics — no upcast surprise from putmask's broadcasting.
  • Correctness test uses an all-sentinel quadrant so every 2x reduction still hits a fully-sentinel 2x2 block and exercises the rewrite branch at every level, including a real COG round-trip through the CPU reader.

Checklist

  • Algorithm matches reference/paper
  • All implemented backends produce consistent results
  • NaN handling is correct
  • Edge cases are covered by tests
  • Dask chunk boundaries handled correctly
  • No premature materialization or unnecessary copies
  • Benchmark exists or is not needed
  • README feature matrix updated (if applicable)
  • Docstrings present and accurate

brendancol added a commit that referenced this pull request May 15, 2026
brendancol added a commit that referenced this pull request May 15, 2026
@brendancol brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-01 branch from 136661a to 288ff83 Compare May 15, 2026 17:29
@brendancol
Copy link
Copy Markdown
Contributor Author

PR Review: geotiff: GPU writer overview loop uses cupy.putmask for in-place NaN rewrite (#1948) (re-review after rebase)

Re-reviewing branch at 288ff83 after rebase onto current main. Two commits: the original bbec25c putmask swap and the follow-up 288ff83 that hoists the loop-invariant sentinel cast and anchors the test slice on parts.append(. CPU-side tests (test_gpu_writer_overview_loop_uses_putmask_1948) pass locally; the two GPU-gated tests skip without a device.

Blockers

  • None

Suggestions

  • None

Nits

  • None

What looks good

  • The hoist of rewrite_nodata and sentinel_scalar out of the inner while is safe. make_overview_gpu preserves dtype (the comment cites the four buffer-producing call sites inside _block_reduce_2d_gpu and the 3-D cupy.stack path), and np_dtype is captured once on line 399 from the source arr. The cast and the float/finite gate no longer repeat per level.
  • The cupy.putmask(current, nan_mask, sentinel_scalar) call is correct against the same buffer make_overview_gpu just returned, with bool(nan_mask.any().item()) still gating the rewrite so the kernel is skipped on overview levels that have no all-sentinel cells.
  • The source-level test now anchors its window on the next parts.append( instead of a fixed character count, so the guard tracks the loop body even if surrounding code shifts.
  • The third GPU test (make_overview_gpu_fresh_buffer_1948) covers all five reducer methods (mean, nearest, min, max, median) and asserts the returned device pointer differs from the input. That is the contract the in-place rewrite depends on, and it is now under test.
  • The round-trip COG test stays green: the bottom-right quadrant stays all-sentinel so each 2x reduction still produces at least one NaN block that exercises the rewrite branch, and both overview levels read back with the sentinel preserved.

Checklist

  • Rebased cleanly on origin/main
  • CPU test passes locally (3 passed in 0.94s; two GPU tests skip without a device)
  • No changes outside _writers/gpu.py and the new regression test
  • No public API or signature changes
  • Sentinel-rewrite semantics preserved (float dtype gate, finite nodata gate, dtype-cast sentinel)

…rewrite (#1948)

The COG overview loop inside `write_geotiff_gpu` used to allocate a
fresh `current.copy()` before rewriting NaN cells back to the nodata
sentinel:

```
current = make_overview_gpu(current, ...)
...
nan_mask = cupy.isnan(current)
if bool(nan_mask.any().item()):
    current = current.copy()
    current[nan_mask] = sentinel
```

`make_overview_gpu` returns a freshly allocated cupy buffer at every
call site (the 2-D path ends in `cupy.nan*` / `cupy.around(...).astype(...)`
/ `cropped[::2, ::2].copy()`; the 3-D path ends in `cupy.stack`), so
nothing aliased the buffer between the return and the in-place rewrite.
The `current.copy()` allocated a second chunk-sized GPU buffer per
overview level for no semantic gain.

Replace the two-line rewrite with `cupy.putmask(current, nan_mask, sentinel)`
so the existing buffer is mutated in place. Mirrors the in-place
sentinel rewrite `_apply_nodata_mask_gpu` adopted in #1934.

Tests cover:
* structural -- the overview branch uses `cupy.putmask` and no longer
  contains `current = current.copy()`,
* correctness -- a COG write of a float32 raster with an all-sentinel
  quadrant round-trips through the GPU writer and back to the CPU
  reader with NaN preserved on every overview level,
* contract -- every overview method in `_block_reduce_2d_gpu` returns
  a cupy buffer whose `data.ptr` differs from the input, so the
  in-place mutation is safe.
@brendancol brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-15-pass2-01 branch from 288ff83 to 19fd38f Compare May 15, 2026 17:53
@brendancol
Copy link
Copy Markdown
Contributor Author

Rebased onto current main at 19fd38f to pick up the Python 3.14 CI matrix limit (b1579f8).

@brendancol brendancol merged commit add6529 into main May 15, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GPU writer overview loop: redundant cupy.copy() before in-place NaN rewrite

2 participants