Add public synchronous API for concurrent group/metadata operations #3859
Description
Motivation
From the discussion in #3835, downstream libraries like xarray need to open multiple zarr groups concurrently for performance. Currently, this requires either:
- Using zarr's internal zarr.core.sync.sync() to bridge async→sync (not public API)
- Managing a separate event loop (risky, conflicts with zarr's internal loop)
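For readers unfamiliar with the pattern, here is a minimal sketch of what a sync-over-async bridge generally looks like, using only the standard library. It is not zarr's implementation: `run_sync`, `fetch_metadata`, and `fetch_all` are hypothetical names, and the sleep stands in for a real metadata read. A downstream library that builds its own copy of this ends up with a second loop alongside zarr's internal one, which is the risk noted above.

```python
import asyncio
import threading
from collections.abc import Coroutine
from typing import Any, TypeVar

T = TypeVar("T")

# A dedicated event loop running on a daemon thread (illustrative only).
_loop = asyncio.new_event_loop()
_thread = threading.Thread(target=_loop.run_forever, daemon=True)
_thread.start()


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Block the calling thread until `coro` finishes on the background loop."""
    future = asyncio.run_coroutine_threadsafe(coro, _loop)
    return future.result()


async def fetch_metadata(path: str) -> dict[str, str]:
    """Hypothetical stand-in for an async metadata read."""
    await asyncio.sleep(0.01)
    return {"path": path, "node_type": "group"}


async def fetch_all(paths: list[str]) -> list[dict[str, str]]:
    # gather() schedules all reads at once on the same loop.
    return await asyncio.gather(*(fetch_metadata(p) for p in paths))


# The caller stays synchronous; the concurrency happens on the loop thread.
results = run_sync(fetch_all(["sweep_0", "sweep_1", "sweep_2"]))
```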
The consensus from the discussion is that zarr should provide proper public APIs that handle concurrency internally, rather than exposing sync() to downstream consumers.
Proposed APIs
1. Public synchronous store API
Zarr currently has no public synchronous API for store classes. PR #3638 added _get_bytes_sync and _get_json_sync as private methods on the Store ABC, with the intent to make them public once the API design matures (specifically, once the prototype parameter question is settled). Promoting these to public would give downstream libraries a supported way to run store operations concurrently without touching the event loop.
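As a rough illustration of how a downstream library might use such methods, the sketch below fans blocking metadata reads out over a thread pool. `ToyStore` is an in-memory stand-in, not zarr's Store ABC, and the public names `get_bytes_sync` / `get_json_sync` are assumptions: the real methods are currently private, and their final signatures (including the prototype parameter, omitted here) are undecided.

```python
import json
from concurrent.futures import ThreadPoolExecutor


class ToyStore:
    """In-memory stand-in for a store; not zarr's Store ABC."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data

    def get_bytes_sync(self, key: str) -> bytes:
        # Hypothetical public counterpart of the private _get_bytes_sync.
        return self._data[key]

    def get_json_sync(self, key: str) -> dict:
        # Hypothetical public counterpart of the private _get_json_sync.
        return json.loads(self.get_bytes_sync(key))


store = ToyStore(
    {
        f"sweep_{i}/zarr.json": json.dumps({"node_type": "group", "sweep": i}).encode()
        for i in range(3)
    }
)

keys = [f"sweep_{i}/zarr.json" for i in range(3)]

# Fan the blocking reads out over threads; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    metadata = list(pool.map(store.get_json_sync, keys))
```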
2. Higher-level concurrent group opening
As @d-v-b suggested in #3835, zarr could provide a synchronous function that opens multiple groups concurrently and is async internally:
    # Opens all child groups/arrays concurrently, returns sync objects
    children: dict[str, zarr.Array | zarr.Group] = zarr.open_members(group)

This directly addresses the xarray/xradar open_datatree use case: fetching metadata and coordinate arrays for N groups in parallel. Internally it would use AsyncGroup.members() and the async store APIs, but callers wouldn't need to touch the event loop.
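The shape of the proposed function can be sketched with toy classes. Everything here is an assumption for illustration: `ToyGroup` and `ToyNode` are not zarr classes, the sleep simulates a metadata round trip, and a real implementation would presumably run on zarr's own loop rather than call `asyncio.run()`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Stand-in for an opened array or group (not a zarr class)."""
    name: str
    kind: str  # "array" or "group"


@dataclass
class ToyGroup:
    """Stand-in for a group whose children are opened asynchronously."""
    child_kinds: dict[str, str]
    in_flight: int = 0
    max_in_flight: int = 0

    async def open_child(self, name: str) -> ToyNode:
        # Track how many opens overlap, to make the concurrency visible.
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulated metadata round trip
        self.in_flight -= 1
        return ToyNode(name, self.child_kinds[name])


def open_members(group: ToyGroup) -> dict[str, ToyNode]:
    """Open every child concurrently and return plain (sync) objects."""

    async def _open_all() -> dict[str, ToyNode]:
        names = list(group.child_kinds)
        nodes = await asyncio.gather(*(group.open_child(n) for n in names))
        return dict(zip(names, nodes))

    # asyncio.run() is used for brevity; it fails inside a running loop.
    return asyncio.run(_open_all())


root = ToyGroup({"sweep_0": "group", "sweep_1": "group", "time": "array"})
children = open_members(root)
```

The caller sees only a blocking call and a dict of results. All three child opens are in flight at once, so total latency is roughly one round trip rather than three.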
3. Concurrent data access (future)
@ilan-gold raised more advanced use cases in #3835 involving concurrent reads across multiple arrays with interleaved computation. @d-v-b's lazy indexing prototype (#3678) may address some of these patterns. This is a separate concern from (1) and (2) and can be tracked independently.
Use cases
xarray / xradar (@TomNicholas, @aladinor):
- Fetch/set metadata of arbitrary groups/arrays concurrently
- Concurrently fetch/set data for coordinate arrays
- Mostly metadata operations and reading entire arrays, not lazy indexing
- Opening hierarchical group structures (e.g., multi-sweep radar volumes) concurrently is critical for cloud performance
General downstream (@BorisTheBrave, xarray PR #11171):
- Any library wanting zarr's async performance benefits through a sync API
References
- Discussion: Stabilize zarr.core.sync.sync() for downstream async integration #3835
- Prior sync store API (private methods): Add methods for getting bytes + json to store abc #3638
- Lazy indexing prototype: feat/lazy indexing #3678
- xarray async datatree PR: Implement async support for open_datatree pydata/xarray#10742
- xarray async strategy: How should Xarray control asynchronous calls? pydata/xarray#10622
- xarray sync-via-zarr-loop PR: PoC Set variables async pydata/xarray#11171
cc @d-v-b @ilan-gold @TomNicholas @shoyer @keewis @dcherian @BorisTheBrave
This is my understanding of the discussion so far. If I've misrepresented anything or missed important context, please correct me. Additional ideas and suggestions are very welcome.