Skip to content

geotiff: share VRT chunked-path logic with eager, parse XML once (#1825)#1840

Merged
brendancol merged 2 commits into
mainfrom
issue-1825
May 14, 2026
Merged

geotiff: share VRT chunked-path logic with eager, parse XML once (#1825)#1840
brendancol merged 2 commits into
mainfrom
issue-1825

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Consolidates four pieces of duplicated logic between the eager and chunked VRT read paths and removes the per-task XML parse the chunked path was paying.

  • _sentinel_for_dtype lifted out of the read_vrt closure (and merged with the module-level twin _vrt_sentinel_for_dtype from the chunked path) into a single helper in _vrt.py.
  • _effective_dtype_for_bands collapses the ComplexSource ScaleRatio / ScaleOffset dtype rule that both paths re-implemented inline.
  • _apply_integer_sentinel_mask replaces the per-band integer-sentinel NaN-masking branch both paths duplicated.
  • read_vrt in _vrt.py accepts a pre-parsed VRTDataset via a new parsed= kwarg. The chunked dispatcher parses the VRT once and threads the parsed object through every per-chunk task in the dask graph; an N-chunk read now pays one XML parse and one source-path allowlist validation, not N+1.

Public signatures of read_vrt, open_geotiff, etc. are unchanged. VRTDataset is a dataclass of plain dataclasses + dtypes and round-trips through stdlib pickle and cloudpickle, so no serialise-and-reparse fallback was needed.

Fixes #1825.

Test plan

  • pytest xrspatial/geotiff/tests/ -k vrt -- 375 passed (one pre-existing failure unrelated to VRT excluded: test_tile_size_positive_works)
  • pytest xrspatial/geotiff/tests/test_vrt_single_parse_1825.py -v -- new tests cover single XML parse, single file read, picklable parsed VRT, eager/chunked parity, and no per-block re-parse
  • pytest xrspatial/geotiff/tests/ -- the only failures are pre-existing (palette/plot and big-endian GPU predictor tests, all failing on main as well)
  • Manual cloudpickle round-trip of the chunked dask graph + .compute() returns the same array

Collapses four pieces of duplicate logic between the eager and chunked
VRT read paths:

- ``_sentinel_for_dtype`` lifted from a closure inside ``read_vrt``
  (and the module-level twin ``_vrt_sentinel_for_dtype`` in the chunked
  path) into a single helper in ``_vrt.py``.
- ``_effective_dtype_for_bands`` consolidates the ComplexSource
  scale/offset dtype rule that both paths re-implemented inline.
- ``_apply_integer_sentinel_mask`` replaces the per-band integer-sentinel
  masking branch that both paths duplicated.
- ``read_vrt`` in ``_vrt.py`` now accepts a pre-parsed ``VRTDataset`` via
  ``parsed=``; the chunked dispatcher parses the VRT once and threads it
  through every per-chunk task in the dask graph. An N-chunk read now
  pays one XML parse and one source-path allowlist validation, not N+1.

Pickling: ``VRTDataset`` is a dataclass of plain dataclasses + dtypes
and round-trips through stdlib ``pickle`` (and cloudpickle), so no
serialise-and-reparse fallback is needed.

Fixes #1825
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 14, 2026
@brendancol brendancol requested a review from Copilot May 14, 2026 16:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the GeoTIFF VRT reader to remove duplicated eager/chunked logic and to avoid re-parsing VRT XML for every dask chunk task by parsing once in the dispatcher and threading a pre-parsed VRTDataset through the task graph (fixes #1825).

Changes:

  • Centralizes shared VRT helpers in xrspatial/geotiff/_vrt.py (sentinel casting, effective dtype computation, integer-sentinel NaN masking).
  • Extends internal _vrt.read_vrt to accept a pre-parsed VRT (parsed=) and updates the chunked dispatcher/tasks to reuse it (single XML read/parse + single containment validation).
  • Adds targeted tests asserting single-parse behavior, picklability of the parsed VRT, and eager/chunked numerical parity.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
xrspatial/geotiff/_vrt.py Adds shared helper functions and a parsed= fast-path to avoid repeated XML parsing in chunked reads.
xrspatial/geotiff/__init__.py Updates eager and chunked VRT paths to use shared helpers and to pass a single parsed VRT into all per-chunk tasks.
xrspatial/geotiff/tests/test_vrt_single_parse_1825.py Adds regression tests ensuring chunked reads parse/read XML once, parsed VRT is picklable, and results match eager reads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/_vrt.py
Comment on lines +890 to +895
if parsed is not None:
vrt = parsed
else:
xml_str = _read_vrt_xml(vrt_path)
vrt_dir = os.path.dirname(os.path.abspath(vrt_path))
vrt = parse_vrt(xml_str, vrt_dir)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in c3b4316. read_vrt now does vrt = dataclasses.replace(parsed, holes=[]) when parsed= is supplied, so the per-call appends stay on a per-task copy and never touch the dispatcher's shared VRTDataset. Bands/sources/dtypes are still shared by reference (the parsed object is otherwise immutable in the hot path).

In practice the chunked dispatcher does its own static os.path.exists sweep for attrs['vrt_holes'] and never reads vrt.holes post-compute, so the only observable consequence today would be cross-task accumulation under the threaded scheduler. But the moment a caller reused the parsed object across reads, or a future change started surfacing parsed_vrt.holes, the leakage would be a real bug — so the defensive copy is worth it.

Added test_parsed_kwarg_does_not_mutate_caller_holes to pin the no-mutation contract.

Resolve __init__.py conflict between the chunked-dispatcher refactor in
this branch (#1825) and the chunked XML-cap routing that landed on main
(#1831): keep the four helper imports plus the security cap comment, and
merge the parsed-VRT plumbing note alongside it.

Also harden ``_vrt.read_vrt`` against caller-visible mutation of the
shared ``VRTDataset``: when ``parsed=`` is supplied, work on a
``dataclasses.replace(parsed, holes=[])`` so the per-call ``holes``
appends do not leak onto the dispatcher's object across per-chunk
tasks. Adds a regression test covering the no-mutation contract.
@brendancol brendancol merged commit daf1ecd into main May 14, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: refactor read_vrt chunked path to share logic with eager

2 participants