Skip to content

geotiff: parse GDAL_NODATA as int first to keep 64-bit sentinels exact (#1847)#1852

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-accuracy-geotiff-2026-05-14
May 14, 2026
Merged

geotiff: parse GDAL_NODATA as int first to keep 64-bit sentinels exact (#1847)#1852
brendancol merged 2 commits into
mainfrom
deep-sweep-accuracy-geotiff-2026-05-14

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Closes #1847.

extract_geo_info, _resolve_masked_fill, and _sparse_fill_value parsed the GDAL_NODATA tag through float() unconditionally. uint64 max (2**64 - 1) and int64 max (2**63 - 1) are not exactly representable in float64; the nearest float sits one ULP above the dtype's max so the downstream info.min <= int(nodata) <= info.max gate rejected the cast and the sentinel pixel survived as a literal valid integer rather than being masked to NaN. write_vrt then stringified the float-parsed sentinel into the VRT XML as 1.8446744073709552e+19, which the VRT reader subsequently rejected as out-of-range -- so the bug propagated through the whole 64-bit pipeline rather than just the eager TIFF read.

The VRT XML parse path was already int-aware via _vrt._parse_band_nodata (PR #1833). This change lifts the int-first parse into a shared helper _parse_nodata_str in _geotags.py and routes the three TIFF-side sites through it. The helper tries int(text) first to preserve full precision, falls back to float(text) for NaN / Inf / scientific notation / fractional values. Downstream gates already handle integer values transparently because np.isfinite(int) works and int(int) is a no-op, so no call sites change.

Categories: Cat 2 (NaN propagation: sentinel pixel survived as a literal valid number on all four backends) + Cat 5 (backend inconsistency: VRT XML parse handled 64-bit sentinels but the TIFF parse fed by write_vrt did not).

Repro before fix

import numpy as np, xarray as xr, tempfile, os
import xrspatial.geotiff as g

with tempfile.TemporaryDirectory() as d:
    arr = np.full((16, 16), 100, dtype=np.uint64)
    arr[0, 0] = 2**64 - 1
    da = xr.DataArray(arr, dims=['y','x'])
    p = os.path.join(d, 't.tif')
    g.to_geotiff(da, p, nodata=2**64-1)
    out = g.open_geotiff(p)
    print(out.dtype, out.values[0, 0])
    # Before: uint64 18446744073709551615  (sentinel survives unmasked)
    # After:  float64 nan

Test plan

  • Unit-level _parse_nodata_str matrix (int branch + float fallback + edge cases)
  • Eager open_geotiff masks uint64 max / int64 max to NaN
  • int64 min / uint16 max / int32 -9999 / float -9999.0 regression guards still pass
  • read_geotiff_dask(chunks=) masks 64-bit sentinels via the windowed reader
  • write_vrt -> read_vrt round-trip preserves the integer XML literal and masks the sentinel
  • GPU path masks 64-bit sentinels (fix is upstream of per-backend code)
  • All 2434 non-stale geotiff tests pass; 1 pre-existing failure unrelated (test_size_param_validation_gpu_vrt_1776.test_tile_size_positive_works asserts pre-to_geotiff accepts tile_size values that are not multiples of 16 #1767 tile_size=4 behaviour)

#1847)

`extract_geo_info`, `_resolve_masked_fill`, and `_sparse_fill_value`
parsed the GDAL_NODATA tag via `float()` unconditionally. For uint64
max (`2**64 - 1`) and int64 max (`2**63 - 1`), the float64
representation sits one ULP above the dtype's max, so the downstream
`info.min <= int(nodata) <= info.max` gate rejected the cast and the
sentinel pixel survived as a literal valid integer rather than being
masked to NaN. `write_vrt` then stringified the float-parsed sentinel
into XML as `1.8446744073709552e+19`, which the VRT reader rejected
too -- the bug propagated through the entire 64-bit integer pipeline.

`_vrt._parse_band_nodata` (PR #1833) already fixed this on the VRT XML
parse path. Lift the int-first parse into a shared helper
`_parse_nodata_str` in `_geotags.py` and route the three TIFF-side
sites through it. The helper tries `int(text)` first, falls back to
`float(text)` for NaN / Inf / scientific notation / fractional values.
Downstream gates already handle integer values transparently because
`np.isfinite(int)` works and `int(int)` is a no-op.

25 regression tests in `test_nodata_int64_precision_1847.py`:

- Unit-level `_parse_nodata_str` matrix covering int branch (uint64
  max, int64 max, int64 min, uint16 max, negative int, whitespace)
  and float branch (NaN, Inf, -Inf, scientific notation, fractional);
  plus empty / None / garbage return None.
- Eager `open_geotiff`: uint64 max / int64 max masked to NaN; int64
  min / uint16 max / int32 -9999 / float -9999.0 regression guards.
- `read_geotiff_dask` masks 64-bit sentinels via the windowed reader.
- `write_vrt -> read_vrt` round-trip preserves the integer XML
  literal and masks the sentinel.
- GPU parity test confirms the GPU path also masks correctly (the
  fix is upstream of the per-backend code).

Closes #1847
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 14, 2026
@brendancol brendancol requested a review from Copilot May 14, 2026 17:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes precision loss when parsing the GeoTIFF GDAL_NODATA tag for 64-bit integer sentinels by parsing as int first (preserving exact uint64/int64 max), with a float fallback for NaN/Inf/scientific/fractional values. This aligns the TIFF-side nodata parsing with the existing VRT XML path and prevents downstream masking + VRT round-trip failures for 64-bit nodata.

Changes:

  • Add shared _parse_nodata_str helper in _geotags.py and use it in extract_geo_info.
  • Route _reader._resolve_masked_fill and _reader._sparse_fill_value through the shared int-first nodata parser.
  • Add a comprehensive regression test suite covering eager, dask, VRT round-trip, and GPU paths for 64-bit nodata sentinels.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
xrspatial/geotiff/_geotags.py Introduces _parse_nodata_str and uses it for GDAL_NODATA parsing in extract_geo_info.
xrspatial/geotiff/_reader.py Updates LERC masked-fill and sparse-fill nodata parsing to use the shared helper.
xrspatial/geotiff/tests/test_nodata_int64_precision_1847.py Adds regression coverage for uint64/int64 sentinel masking + VRT round-trips + dask/GPU parity.
.claude/sweep-accuracy-state.csv Updates audit tracking entry for the geotiff module and this fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

from __future__ import annotations

import os
import tempfile
Comment on lines +271 to +283
@pytest.fixture
def cupy_or_skip():
try:
import cupy # noqa: F401
from numba import cuda
if not cuda.is_available():
pytest.skip("CUDA not available")
except ImportError:
pytest.skip("cupy not installed")


class TestGpuPathParity:
def test_uint64_max_masked_via_gpu(self, tmp_path, cupy_or_skip):
Comment thread xrspatial/geotiff/_geotags.py Outdated
Comment on lines +500 to +516
def _parse_nodata_str(text: str) -> int | float | None:
"""Parse a GDAL_NODATA tag string at full integer precision when possible.

Returns a Python ``int`` for plain integer literals (so 64-bit
sentinels survive without the float64 round-trip that pushes them one
ULP past the dtype max), a ``float`` for NaN / Inf / scientific
notation / fractional values, and ``None`` when the string is not a
valid number.

Mirrors :func:`xrspatial.geotiff._vrt._parse_band_nodata` (issue
#1833) which addressed the same problem on the VRT XML path. See
issue #1847.
"""
if text is None:
return None
s = text.strip()
if not s:
Address three copilot reviews on #1852:
- Drop unused tempfile import in test_nodata_int64_precision_1847.py.
- Switch GPU gating to the project-standard _gpu_only / _HAS_GPU
  helper used by test_gpu_nodata_1542.py for uniform skip reasons.
- Widen _parse_nodata_str signature from `text: str` to `text: str | None`
  to match the runtime None-handling path, and widen GeoInfo.nodata
  from `float | None` to `int | float | None` since the helper now
  returns int for integer literals.
@brendancol
Copy link
Copy Markdown
Contributor Author

Addressed all three: unused tempfile import removed, GPU gating switched to the _gpu_only / _HAS_GPU helper used by test_gpu_nodata_1542.py, and _parse_nodata_str signature widened to str | None. I also widened GeoInfo.nodata from float | None to int | float | None since that's the one downstream type that now actually receives int values from the helper. The _resolve_masked_fill / _sparse_fill_value callers don't carry return annotations and their nodata_str parameter is already str | None, so no further churn needed there.

@brendancol brendancol merged commit e951c18 into main May 14, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: extract_geo_info float-only nodata parse loses uint64 max / int64 max sentinels

2 participants