geotiff: parse GDAL_NODATA as int first to keep 64-bit sentinels exact (#1847) #1852
`extract_geo_info`, `_resolve_masked_fill`, and `_sparse_fill_value` parsed the GDAL_NODATA tag via `float()` unconditionally. For uint64 max (`2**64 - 1`) and int64 max (`2**63 - 1`), the nearest float64 rounds up to `2**64` and `2**63` respectively, just above the dtype's max, so the downstream `info.min <= int(nodata) <= info.max` gate rejected the cast and the sentinel pixel survived as a literal valid integer rather than being masked to NaN. `write_vrt` then stringified the float-parsed sentinel into XML as `1.8446744073709552e+19`, which the VRT reader rejected too -- the bug propagated through the entire 64-bit integer pipeline.

`_vrt._parse_band_nodata` (PR #1833) already fixed this on the VRT XML parse path. Lift the int-first parse into a shared helper `_parse_nodata_str` in `_geotags.py` and route the three TIFF-side sites through it. The helper tries `int(text)` first and falls back to `float(text)` for NaN / Inf / scientific notation / fractional values. Downstream gates already handle integer values transparently because `np.isfinite(int)` works and `int(int)` is a no-op.

25 regression tests in `test_nodata_int64_precision_1847.py`:

- Unit-level `_parse_nodata_str` matrix covering the int branch (uint64 max, int64 max, int64 min, uint16 max, negative int, whitespace) and the float branch (NaN, Inf, -Inf, scientific notation, fractional); plus empty / None / garbage return None.
- Eager `open_geotiff`: uint64 max / int64 max masked to NaN; int64 min / uint16 max / int32 -9999 / float -9999.0 regression guards.
- `read_geotiff_dask` masks 64-bit sentinels via the windowed reader.
- `write_vrt -> read_vrt` round-trip preserves the integer XML literal and masks the sentinel.
- GPU parity test confirms the GPU path also masks correctly (the fix is upstream of the per-backend code).

Closes #1847
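The int-first parse described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual helper: the name `parse_nodata_sketch` is hypothetical, and the exact branch order and error handling in `_parse_nodata_str` may differ.

```python
from __future__ import annotations


def parse_nodata_sketch(text: str | None) -> int | float | None:
    """Hypothetical stand-in for an int-first nodata parse (not the PR's code)."""
    if text is None:
        return None
    s = text.strip()
    if not s:
        return None
    try:
        # Plain integer literals stay exact, including 2**64 - 1.
        return int(s)
    except ValueError:
        pass
    try:
        # NaN, Inf, scientific notation, and fractional values land here.
        return float(s)
    except ValueError:
        # Not a number at all.
        return None


print(parse_nodata_sketch("18446744073709551615"))
print(parse_nodata_sketch("-9999.5"))
print(parse_nodata_sketch("garbage"))
```

Trying `int()` before `float()` matters only for the order of the two branches: every string that `int()` accepts would also parse as a float, but the float result can no longer distinguish `2**64 - 1` from `2**64`.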
Pull request overview
This PR fixes precision loss when parsing the GeoTIFF GDAL_NODATA tag for 64-bit integer sentinels by parsing as int first (preserving exact uint64/int64 max), with a float fallback for NaN/Inf/scientific/fractional values. This aligns the TIFF-side nodata parsing with the existing VRT XML path and prevents downstream masking + VRT round-trip failures for 64-bit nodata.
Changes:
- Add shared `_parse_nodata_str` helper in `_geotags.py` and use it in `extract_geo_info`.
- Route `_reader._resolve_masked_fill` and `_reader._sparse_fill_value` through the shared int-first nodata parser.
- Add a comprehensive regression test suite covering eager, dask, VRT round-trip, and GPU paths for 64-bit nodata sentinels.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/_geotags.py` | Introduces `_parse_nodata_str` and uses it for GDAL_NODATA parsing in `extract_geo_info`. |
| `xrspatial/geotiff/_reader.py` | Updates LERC masked-fill and sparse-fill nodata parsing to use the shared helper. |
| `xrspatial/geotiff/tests/test_nodata_int64_precision_1847.py` | Adds regression coverage for uint64/int64 sentinel masking + VRT round-trips + dask/GPU parity. |
| `.claude/sweep-accuracy-state.csv` | Updates audit tracking entry for the geotiff module and this fix. |
```python
from __future__ import annotations

import os
import tempfile
```
```python
@pytest.fixture
def cupy_or_skip():
    try:
        import cupy  # noqa: F401
        from numba import cuda
        if not cuda.is_available():
            pytest.skip("CUDA not available")
    except ImportError:
        pytest.skip("cupy not installed")


class TestGpuPathParity:
    def test_uint64_max_masked_via_gpu(self, tmp_path, cupy_or_skip):
```
```python
def _parse_nodata_str(text: str) -> int | float | None:
    """Parse a GDAL_NODATA tag string at full integer precision when possible.

    Returns a Python ``int`` for plain integer literals (so 64-bit
    sentinels survive without the float64 round-trip that pushes them one
    ULP past the dtype max), a ``float`` for NaN / Inf / scientific
    notation / fractional values, and ``None`` when the string is not a
    valid number.

    Mirrors :func:`xrspatial.geotiff._vrt._parse_band_nodata` (issue
    #1833) which addressed the same problem on the VRT XML path. See
    issue #1847.
    """
    if text is None:
        return None
    s = text.strip()
    if not s:
```
Address three copilot reviews on #1852:

- Drop unused `tempfile` import in `test_nodata_int64_precision_1847.py`.
- Switch GPU gating to the project-standard `_gpu_only` / `_HAS_GPU` helper used by `test_gpu_nodata_1542.py` for uniform skip reasons.
- Widen `_parse_nodata_str` signature from `text: str` to `text: str | None` to match the runtime None-handling path, and widen `GeoInfo.nodata` from `float | None` to `int | float | None` since the helper now returns int for integer literals.
Addressed all three: unused `tempfile` import removed, GPU gating switched to the `_gpu_only` / `_HAS_GPU` helper used by `test_gpu_nodata_1542.py`, and `_parse_nodata_str` signature widened to `str | None`. I also widened `GeoInfo.nodata` from `float | None` to `int | float | None` since that's the one downstream type that now actually receives int values from the helper. The `_resolve_masked_fill` / `_sparse_fill_value` callers don't carry return annotations and their `nodata_str` parameter is already `str | None`, so no further churn needed there.
Summary
Closes #1847.
`extract_geo_info`, `_resolve_masked_fill`, and `_sparse_fill_value` parsed the GDAL_NODATA tag through `float()` unconditionally. uint64 max (`2**64 - 1`) and int64 max (`2**63 - 1`) are not exactly representable in float64; the nearest float rounds up to `2**64` and `2**63` respectively, just above the dtype's max, so the downstream `info.min <= int(nodata) <= info.max` gate rejected the cast and the sentinel pixel survived as a literal valid integer rather than being masked to NaN. `write_vrt` then stringified the float-parsed sentinel into the VRT XML as `1.8446744073709552e+19`, which the VRT reader subsequently rejected as out-of-range -- so the bug propagated through the whole 64-bit pipeline rather than just the eager TIFF read.

The VRT XML parse path was already int-aware via `_vrt._parse_band_nodata` (PR #1833). This change lifts the int-first parse into a shared helper `_parse_nodata_str` in `_geotags.py` and routes the three TIFF-side sites through it. The helper tries `int(text)` first to preserve full precision and falls back to `float(text)` for NaN / Inf / scientific notation / fractional values. Downstream gates already handle integer values transparently because `np.isfinite(int)` works and `int(int)` is a no-op, so no call sites change.

Categories: Cat 2 (NaN propagation: sentinel pixel survived as a literal valid number on all four backends) + Cat 5 (backend inconsistency: VRT XML parse handled 64-bit sentinels but the TIFF parse fed by `write_vrt` did not).

Repro before fix
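As a minimal standard-library sketch of the failure mode (not the PR's own repro), the snippet below parses the uint64 max literal both ways and runs each result through a range check. `in_range` is a hypothetical stand-in for the downstream `info.min <= int(nodata) <= info.max` gate, with the uint64 bounds written out by hand instead of taken from `np.iinfo`.

```python
UINT64_MAX = 2**64 - 1  # assumed equal to np.iinfo(np.uint64).max
text = "18446744073709551615"  # GDAL_NODATA string for uint64 max

as_float = float(text)  # old behaviour: float-only parse
as_int = int(text)      # new behaviour: int-first parse


def in_range(value, lo=0, hi=UINT64_MAX):
    """Hypothetical stand-in for the downstream dtype range gate."""
    return lo <= int(value) <= hi


# The float parse rounds up to 2**64, one past the dtype's max.
print(int(as_float) - UINT64_MAX)  # 1
print(in_range(as_float))          # False: sentinel is rejected, pixel not masked
print(in_range(as_int))            # True: sentinel passes the gate
# The float-parsed value is also what ends up stringified into the VRT XML.
print(str(as_float))               # 1.8446744073709552e+19
```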
Test plan

- `_parse_nodata_str` matrix (int branch + float fallback + edge cases)
- `open_geotiff` masks uint64 max / int64 max to NaN
- `read_geotiff_dask(chunks=)` masks 64-bit sentinels via the windowed reader
- `write_vrt -> read_vrt` round-trip preserves the integer XML literal and masks the sentinel
- `test_size_param_validation_gpu_vrt_1776.test_tile_size_positive_works` asserts pre-to_geotiff accepts tile_size values that are not multiples of 16 #1767 tile_size=4 behaviour)