geotiff: share VRT chunked-path logic with eager, parse XML once (#1825)#1840
Conversation
Collapses four pieces of duplicate logic between the eager and chunked VRT read paths: - ``_sentinel_for_dtype`` lifted from a closure inside ``read_vrt`` (and the module-level twin ``_vrt_sentinel_for_dtype`` in the chunked path) into a single helper in ``_vrt.py``. - ``_effective_dtype_for_bands`` consolidates the ComplexSource scale/offset dtype rule that both paths re-implemented inline. - ``_apply_integer_sentinel_mask`` replaces the per-band integer-sentinel masking branch that both paths duplicated. - ``read_vrt`` in ``_vrt.py`` now accepts a pre-parsed ``VRTDataset`` via ``parsed=``; the chunked dispatcher parses the VRT once and threads it through every per-chunk task in the dask graph. An N-chunk read now pays one XML parse and one source-path allowlist validation, not N+1. Pickling: ``VRTDataset`` is a dataclass of plain dataclasses + dtypes and round-trips through stdlib ``pickle`` (and cloudpickle), so no serialise-and-reparse fallback is needed. Fixes #1825
There was a problem hiding this comment.
Pull request overview
Refactors the GeoTIFF VRT reader to remove duplicated eager/chunked logic and to avoid re-parsing VRT XML for every dask chunk task by parsing once in the dispatcher and threading a pre-parsed VRTDataset through the task graph (fixes #1825).
Changes:
- Centralizes shared VRT helpers in
xrspatial/geotiff/_vrt.py(sentinel casting, effective dtype computation, integer-sentinel NaN masking). - Extends internal
_vrt.read_vrtto accept a pre-parsed VRT (parsed=) and updates the chunked dispatcher/tasks to reuse it (single XML read/parse + single containment validation). - Adds targeted tests asserting single-parse behavior, picklability of the parsed VRT, and eager/chunked numerical parity.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
xrspatial/geotiff/_vrt.py |
Adds shared helper functions and a parsed= fast-path to avoid repeated XML parsing in chunked reads. |
xrspatial/geotiff/__init__.py |
Updates eager and chunked VRT paths to use shared helpers and to pass a single parsed VRT into all per-chunk tasks. |
xrspatial/geotiff/tests/test_vrt_single_parse_1825.py |
Adds regression tests ensuring chunked reads parse/read XML once, parsed VRT is picklable, and results match eager reads. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if parsed is not None: | ||
| vrt = parsed | ||
| else: | ||
| xml_str = _read_vrt_xml(vrt_path) | ||
| vrt_dir = os.path.dirname(os.path.abspath(vrt_path)) | ||
| vrt = parse_vrt(xml_str, vrt_dir) |
There was a problem hiding this comment.
Good catch — fixed in c3b4316. read_vrt now does vrt = dataclasses.replace(parsed, holes=[]) when parsed= is supplied, so the per-call appends stay on a per-task copy and never touch the dispatcher's shared VRTDataset. Bands/sources/dtypes are still shared by reference (the parsed object is otherwise immutable in the hot path).
In practice the chunked dispatcher does its own static os.path.exists sweep for attrs['vrt_holes'] and never reads vrt.holes post-compute, so the only observable consequence today would be cross-task accumulation under the threaded scheduler. But the moment a caller reused the parsed object across reads, or a future change started surfacing parsed_vrt.holes, the leakage would be a real bug — so the defensive copy is worth it.
Added test_parsed_kwarg_does_not_mutate_caller_holes to pin the no-mutation contract.
Resolve __init__.py conflict between the chunked-dispatcher refactor in this branch (#1825) and the chunked XML-cap routing that landed on main (#1831): keep the four helper imports plus the security cap comment, and merge the parsed-VRT plumbing note alongside it. Also harden ``_vrt.read_vrt`` against caller-visible mutation of the shared ``VRTDataset``: when ``parsed=`` is supplied, work on a ``dataclasses.replace(parsed, holes=[])`` so the per-call ``holes`` appends do not leak onto the dispatcher's object across per-chunk tasks. Adds a regression test covering the no-mutation contract.
Summary
Consolidates four pieces of duplicated logic between the eager and chunked VRT read paths and removes the per-task XML parse the chunked path was paying.
_sentinel_for_dtypelifted out of theread_vrtclosure (and merged with the module-level twin_vrt_sentinel_for_dtypefrom the chunked path) into a single helper in_vrt.py._effective_dtype_for_bandscollapses the ComplexSourceScaleRatio/ScaleOffsetdtype rule that both paths re-implemented inline._apply_integer_sentinel_maskreplaces the per-band integer-sentinel NaN-masking branch both paths duplicated.read_vrtin_vrt.pyaccepts a pre-parsedVRTDatasetvia a newparsed=kwarg. The chunked dispatcher parses the VRT once and threads the parsed object through every per-chunk task in the dask graph; an N-chunk read now pays one XML parse and one source-path allowlist validation, not N+1.Public signatures of
read_vrt,open_geotiff, etc. are unchanged.VRTDatasetis a dataclass of plain dataclasses + dtypes and round-trips through stdlibpickleand cloudpickle, so no serialise-and-reparse fallback was needed.Fixes #1825.
Test plan
pytest xrspatial/geotiff/tests/ -k vrt-- 375 passed (one pre-existing failure unrelated to VRT excluded:test_tile_size_positive_works)pytest xrspatial/geotiff/tests/test_vrt_single_parse_1825.py -v-- new tests cover single XML parse, single file read, picklable parsed VRT, eager/chunked parity, and no per-block re-parsepytest xrspatial/geotiff/tests/-- the only failures are pre-existing (palette/plot and big-endian GPU predictor tests, all failing onmainas well).compute()returns the same array