geotiff: preflight CUDA in read_geotiff_gpu so broken-driver errors are clean #1904

Merged
brendancol merged 2 commits into main from issue-1903 on May 15, 2026

Conversation

@brendancol
Contributor

Summary

  • Add _preflight_cuda_runtime(cupy) to read_geotiff_gpu. Called right after the import cupy step, it runs cupy.cuda.runtime.getDeviceCount() once and raises a clean RuntimeError naming the underlying CUDA error if the driver is unusable or no device is visible.
  • Before this, the failure surfaced as a low-level cudaErrorInsufficientDriver from cupy.asarray(...) deep in the CPU-fallback path (see xrspatial/geotiff/_backends/gpu.py:520), which gave the user no way to tell whether the problem lay in the file, the GPU pipeline, or the driver.
  • Update the docstring lines that previously admitted the upload failure propagates unchanged in both modes.

The preflight is intentionally a one-time call. It catches the common "CuPy installed, driver broken" case (a host driver older than the CuPy build expects, a suspended VM, or a removed device). Transient errors inside a later cupy.asarray(...), such as device OOM during upload, still propagate; the docstring now spells that out.
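For readers who want the shape of the check without opening the diff, here is a minimal sketch of what such a preflight could look like. It is not the merged code: the function name `preflight_cuda_runtime`, the message wording, and the `make_stub` and `broken_driver` helpers are assumptions made for illustration. The only real CuPy call it relies on is `cupy.cuda.runtime.getDeviceCount()`.

```python
from types import SimpleNamespace


def preflight_cuda_runtime(cupy):
    """Hypothetical stand-in for _preflight_cuda_runtime; not the merged code."""
    try:
        # One cheap runtime call; it fails early if the driver is unusable.
        count = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        # Wrap whatever was raised and keep the original error in the message.
        raise RuntimeError(
            f"CUDA runtime is unusable ({type(exc).__name__}: {exc}); "
            "check the GPU driver before calling read_geotiff_gpu."
        ) from exc
    if count < 1:
        raise RuntimeError("CUDA runtime reports no visible devices.")
    return count


def make_stub(get_device_count):
    """Build a minimal object exposing cupy.cuda.runtime.getDeviceCount."""
    runtime = SimpleNamespace(getDeviceCount=get_device_count)
    return SimpleNamespace(cuda=SimpleNamespace(runtime=runtime))


def broken_driver():
    """Simulate a driver failure; the exception type is an arbitrary choice."""
    raise OSError("cudaErrorInsufficientDriver")
```

Because the sketch takes the module as an argument, it can be driven with `make_stub(lambda: 1)` for the success path or `make_stub(broken_driver)` for the failure path, with no GPU involved. Catching the broad `Exception` is a deliberate choice here so the underlying error is always named in one place.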

Tests stub cupy.cuda.runtime.getDeviceCount so they exercise the preflight branch without requiring a real GPU. One test additionally verifies the monkeypatch composes with a real cupy import, gated on cupy being installed.
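The stubbing approach can be sketched roughly as follows. This is not the content of test_gpu_cuda_preflight_1903.py: `device_count_or_raise` is a hypothetical miniature preflight that only gives the checks a target, and the check functions and their messages are likewise made up for illustration. The sketch uses `unittest.mock` from the standard library rather than pytest's `monkeypatch` so it runs on its own.

```python
import importlib.util
from types import SimpleNamespace
from unittest import mock


def device_count_or_raise(cupy):
    """Hypothetical miniature preflight, used only to give the checks a target."""
    try:
        count = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        raise RuntimeError(f"CUDA preflight failed: {exc}") from exc
    if count < 1:
        raise RuntimeError("CUDA preflight failed: no visible devices")
    return count


def check_stubbed_preflight():
    """Exercise both raise paths against a stub, so no GPU is needed."""
    runtime = SimpleNamespace(getDeviceCount=lambda: 1)
    stub = SimpleNamespace(cuda=SimpleNamespace(runtime=runtime))
    messages = []
    # side_effect makes the patched call raise; return_value=0 hides all devices.
    for patch_kwargs in ({"side_effect": OSError("driver too old")},
                         {"return_value": 0}):
        with mock.patch.object(runtime, "getDeviceCount", **patch_kwargs):
            try:
                device_count_or_raise(stub)
            except RuntimeError as exc:
                messages.append(str(exc))
    # Outside the patch the original stub function is back in place.
    return messages, device_count_or_raise(stub)


def check_real_cupy_patch():
    """Patch the real module when CuPy is importable; otherwise skip."""
    if importlib.util.find_spec("cupy") is None:
        return None  # CuPy not installed: nothing to check
    try:
        import cupy
    except ImportError:
        return None  # installed but not importable on this machine
    with mock.patch.object(cupy.cuda.runtime, "getDeviceCount", return_value=0):
        try:
            device_count_or_raise(cupy)
        except RuntimeError as exc:
            return str(exc)
    return "no error raised"
```

The second function mirrors the idea of a test gated on CuPy being installed: it returns `None` instead of failing when the import is unavailable.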

Closes #1903.

Test plan

  • pytest xrspatial/geotiff/tests/test_gpu_cuda_preflight_1903.py — 5 new tests pass (stubbed cupy + real cupy)
  • pytest xrspatial/geotiff/tests/ -k "gpu or strict or fallback" — 523 passed, no regressions
  • Full xrspatial/geotiff/tests/ suite passes (2867 tests; 2 pre-existing failures on main are unrelated)

Call ``cupy.cuda.runtime.getDeviceCount()`` once after importing
cupy. If the driver is older than the build expects, the GPU is
offline, or no device is visible, raise a clean RuntimeError
naming the underlying CUDA error. Previously this surfaced as a
``cudaErrorInsufficientDriver`` from a ``cupy.asarray(...)`` call
deep in the CPU-fallback path.

Closes #1903.
Copilot AI review requested due to automatic review settings May 15, 2026 05:20
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 15, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a CUDA preflight check at the start of read_geotiff_gpu so that broken-driver scenarios surface as a clear RuntimeError instead of a low-level cudaErrorInsufficientDriver from deep inside a cupy.asarray(...) call. Closes #1903.

Changes:

  • New _preflight_cuda_runtime(cupy) helper that calls cupy.cuda.runtime.getDeviceCount() and raises a diagnostic RuntimeError on failure or zero devices.
  • Calls the preflight right after the import cupy step in read_geotiff_gpu, and updates the docstring to describe the new behavior.
  • Adds five regression tests using a stubbed cupy module (plus one gated on real cupy availability).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| xrspatial/geotiff/_backends/gpu.py | Adds _preflight_cuda_runtime helper, invokes it after the cupy import, and refreshes the docstring. |
| xrspatial/geotiff/tests/test_gpu_cuda_preflight_1903.py | New tests covering preflight raise paths, success path, end-to-end surface, and real-cupy monkeypatch composition. |

CI failed on macOS 3.12 (and would have failed on every other matrix
job once fail-fast lifted) because the new
``_preflight_cuda_runtime`` call lands before any of the test's
monkeypatched decoders fire. The stub installed by
``_ensure_cupy_stub`` in ``test_gpu_strict_fallback_1516.py``
provided only ``cupy.cuda.is_available``; it never carried a
``cupy.cuda.runtime`` submodule, so
``cupy.cuda.runtime.getDeviceCount()`` raised ``AttributeError``,
which the preflight then converted to its own ``RuntimeError``. Every
``pytest.raises(RuntimeError, match='simulated GPU failure')``
assertion in the file then failed because it saw the preflight
message instead.

Extend the stub to install ``cupy.cuda.runtime`` with a
``getDeviceCount`` that returns 1 so the preflight lets the test
through to the simulated-failure paths it actually wants to exercise.
Track the new submodule in ``_cupy_cuda_runtime_saved`` so
``_restore_cupy`` cleans it up the same way it cleans the existing
two modules.

The real preflight semantics keep their dedicated coverage in
``test_gpu_cuda_preflight_1903.py``; this change is only the test
plumbing that lets the unrelated strict/fallback tests still reach
their assertions.
@brendancol brendancol merged commit b4f0eee into main May 15, 2026
11 checks passed
brendancol added a commit that referenced this pull request May 15, 2026
…t source)

The 10-PR sequence that split __init__.py landed without a single
overview artefact in-tree; the only visual reference was an ASCII
diagram in the wrap-up GitHub comment. Adds:

- geotiff_refactor_topology_1813.png: 1000x1000 rendering of the
  parallel-windows topology (PR1 parity tests -> PR2 _runtime ->
  PR3/4/5 leaves -> PR6 GPU helpers -> PR7/8/9 backends -> PR10
  writers).
- geotiff_refactor_topology_1813.dot: graphviz source so the figure
  can be re-rendered if labels or colours need to change later.

Co-located with the existing ``_static/img/`` assets so Sphinx
discovers them automatically; no .rst include yet -- a follow-up can
embed the figure where it fits in the user guide.

Pushed alongside the CUDA preflight PR (#1904) since both touch the
same broader area and the user asked to bundle.
Labels

performance PR touches performance-sensitive code

2 participants