geotiff: preflight CUDA in read_geotiff_gpu so broken-driver errors are clean #1904
Merged
Call ``cupy.cuda.runtime.getDeviceCount()`` once after importing cupy. If the driver is older than the build expects, the GPU is offline, or no device is visible, raise a clean RuntimeError naming the underlying CUDA error. Previously this surfaced as a ``cudaErrorInsufficientDriver`` from a ``cupy.asarray(...)`` call deep in the CPU-fallback path. Closes #1903.
Pull request overview
Adds a CUDA preflight check at the start of read_geotiff_gpu so that broken-driver scenarios surface as a clear RuntimeError instead of a low-level cudaErrorInsufficientDriver from deep inside a cupy.asarray(...) call. Closes #1903.
Changes:
- New `_preflight_cuda_runtime(cupy)` helper that calls `cupy.cuda.runtime.getDeviceCount()` and raises a diagnostic `RuntimeError` on failure or zero devices.
- Calls the preflight right after the `import cupy` step in `read_geotiff_gpu`, and updates the docstring to describe the new behavior.
- Adds five regression tests using a stubbed `cupy` module (plus one gated on real `cupy` availability).
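The helper's shape can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged implementation: the function name `preflight_cuda_runtime_sketch`, the `make_stub` helper, and the error message wording are all hypothetical. The only real CuPy call it relies on is `cupy.cuda.runtime.getDeviceCount()`.

```python
from types import SimpleNamespace


def preflight_cuda_runtime_sketch(cupy):
    """Probe the CUDA runtime once; raise RuntimeError if it is unusable.

    Hypothetical stand-in for the PR's helper; message text is an assumption.
    """
    try:
        count = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        # Covers driver/runtime mismatches, an offline GPU, or a cupy
        # object that lacks the runtime submodule altogether.
        raise RuntimeError(
            f"CUDA runtime is not usable: {type(exc).__name__}: {exc}"
        ) from exc
    if count < 1:
        raise RuntimeError("CUDA runtime reports no visible devices")
    return count


def make_stub(get_device_count):
    """Build a tiny cupy look-alike exposing only cuda.runtime.getDeviceCount."""
    runtime = SimpleNamespace(getDeviceCount=get_device_count)
    return SimpleNamespace(cuda=SimpleNamespace(runtime=runtime))
```

Passing the module in as an argument, as the PR's signature does, is what lets a stub like `make_stub(lambda: 0)` drive the zero-device branch without a GPU.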
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| xrspatial/geotiff/_backends/gpu.py | Adds _preflight_cuda_runtime helper, invokes it after the cupy import, and refreshes the docstring. |
| xrspatial/geotiff/tests/test_gpu_cuda_preflight_1903.py | New tests covering preflight raise paths, success path, end-to-end surface, and real-cupy monkeypatch composition. |
CI failed on macOS 3.12 (and would have failed on every other matrix job once fail-fast lifted) because the new ``_preflight_cuda_runtime`` call lands before any of the test's monkeypatched decoders fire. The stub installed by ``_ensure_cupy_stub`` in ``test_gpu_strict_fallback_1516.py`` provided only ``cupy.cuda.is_available``; it never carried a ``cupy.cuda.runtime`` submodule, so ``cupy.cuda.runtime.getDeviceCount()`` raised ``AttributeError``, which the preflight then converted to its own ``RuntimeError``. Every ``pytest.raises(RuntimeError, match='simulated GPU failure')`` assertion in the file then failed because it saw the preflight message instead.

Extend the stub to install ``cupy.cuda.runtime`` with a ``getDeviceCount`` that returns 1 so the preflight lets the test through to the simulated-failure paths it actually wants to exercise. Track the new submodule in ``_cupy_cuda_runtime_saved`` so ``_restore_cupy`` cleans it up the same way it cleans the existing two modules.

The real preflight semantics keep their dedicated coverage in ``test_gpu_cuda_preflight_1903.py``; this change is only the test plumbing that lets the unrelated strict/fallback tests still reach their assertions.
brendancol added a commit that referenced this pull request May 15, 2026
…t source)

The 10-PR sequence that split __init__.py landed without a single overview artefact in-tree; the only visual reference was an ASCII diagram in the wrap-up GitHub comment. Adds:

- geotiff_refactor_topology_1813.png: 1000x1000 rendering of the parallel-windows topology (PR1 parity tests -> PR2 _runtime -> PR3/4/5 leaves -> PR6 GPU helpers -> PR7/8/9 backends -> PR10 writers).
- geotiff_refactor_topology_1813.dot: graphviz source so the figure can be re-rendered if labels or colours need to change later.

Co-located with the existing ``_static/img/`` assets so Sphinx discovers them automatically; no .rst include yet -- a follow-up can embed the figure where it fits in the user guide.

Pushed alongside the CUDA preflight PR (#1904) since both touch the same broader area and the user asked to bundle.
Summary
- Adds `_preflight_cuda_runtime(cupy)` to `read_geotiff_gpu`. Called right after the `import cupy` step, it runs `cupy.cuda.runtime.getDeviceCount()` once and raises a clean `RuntimeError` naming the underlying CUDA error if the driver is unusable or no device is visible.
- Previously this surfaced as a `cudaErrorInsufficientDriver` from `cupy.asarray(...)` deep in the CPU-fallback path (see `xrspatial/geotiff/_backends/gpu.py:520`). The user had no clue whether the issue was the file, the GPU pipeline, or the driver.

The preflight is intentionally a one-time call. It catches the common "CuPy installed, driver broken" case (older host driver than the CuPy build expects, suspended VM, removed device). Transient errors inside a later `cupy.asarray(...)` (device OOM during upload) still propagate; the docstring now spells that out.

Tests stub `cupy.cuda.runtime.getDeviceCount` so they exercise the preflight branch without requiring a real GPU. One test additionally verifies the monkeypatch composes with a real cupy import, gated on `cupy` being installed.

Closes #1903.
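To illustrate the "one-time call" point, here is a hedged sketch of the call ordering: the runtime is probed once up front, and anything raised later by the upload is left untouched. `check_runtime_once`, `read_sketch`, and `SimulatedOutOfMemory` are hypothetical names invented for this example; the real reader does far more than a single `cupy.asarray(...)`.

```python
from types import SimpleNamespace


class SimulatedOutOfMemory(Exception):
    """Stand-in for a transient device error raised during upload."""


def check_runtime_once(cupy):
    # One-time probe: failures here become a RuntimeError.
    try:
        count = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        raise RuntimeError(f"CUDA runtime unusable: {exc}") from exc
    if count < 1:
        raise RuntimeError("no CUDA device visible")


def read_sketch(cupy, host_data):
    check_runtime_once(cupy)
    # Deliberately not wrapped: errors raised after the probe keep
    # their original type and message.
    return cupy.asarray(host_data)


def failing_asarray(data):
    raise SimulatedOutOfMemory("device OOM during upload")


# A stub whose driver looks healthy but whose upload fails.
healthy_but_full = SimpleNamespace(
    cuda=SimpleNamespace(runtime=SimpleNamespace(getDeviceCount=lambda: 1)),
    asarray=failing_asarray,
)
```

With this stub, `read_sketch(healthy_but_full, [1, 2, 3])` raises `SimulatedOutOfMemory` rather than the preflight's `RuntimeError`, so a caller can still tell a broken driver apart from a failed upload.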
Test plan
- `pytest xrspatial/geotiff/tests/test_gpu_cuda_preflight_1903.py`: 5 new tests pass (stubbed cupy + real cupy)
- `pytest xrspatial/geotiff/tests/ -k "gpu or strict or fallback"`: 523 passed, no regressions
- `xrspatial/geotiff/tests/` suite passes (2867 tests; 2 pre-existing failures on main are unrelated)