
Partial zarr download/upload (stage: Design + Implementation) #1816

Draft
yarikoptic wants to merge 3 commits into master from partial-zarr

@yarikoptic (Member) commented Mar 2, 2026

Summary

Design document for partial zarr download and upload support, addressing #1462, #1474, and related archive issues.

The design covers five areas:

  1. --zarr TYPE:PATTERN filtering for dandi download — glob, path, and regex filters for selecting entries within zarr assets, with a metadata alias for common zarr metadata files
  2. URL parsing with zarr boundary detection (AssetZarrEntryURL) to handle URLs like dandi://dandi/000108/.../file.ome.zarr/0/0/0
  3. --zarr-mode {full, patch} for dandi upload — patch mode uploads changed files without deleting remote files absent locally
  4. Checksums and manifests — documents that per-directory checksums are computed hierarchically by the zarr_checksum library but are NOT persisted (only the root digest is stored in the DB); legacy .checksum files exist on S3 at zarr-checksums/ for ~72% of older zarrs but are orphaned since Dec 2022
  5. dandi ls for zarr contents — listing files within a zarr asset
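
To make the --zarr TYPE:PATTERN idea concrete, here is a minimal sketch of how such filter specs could be parsed and applied to entry paths inside a zarr. It is not the code in dandi/zarr_filter.py. The function names, the OR composition of multiple filters, the prefix semantics of path, and the use of search for regex are all assumptions made for illustration; only the three filter types and the metadata alias expansion come from this description.

```python
import re
from typing import Callable, List

# Alias expansion as listed in the review checklist of this PR.
METADATA_ALIAS = ("glob:**/.z*", "glob:**/zarr.json", "glob:**/.zmetadata")


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob into a regex where '*' stays inside one path segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            while pattern.startswith("**/", i):  # collapse repeated '**/'
                i += 3
            out.append("(?:[^/]+/)*")  # zero or more whole directories
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def parse_zarr_filter(spec: str) -> List[Callable[[str], bool]]:
    """Hypothetical parser: returns one predicate per expanded filter."""
    if spec == "metadata":
        return [p for alias in METADATA_ALIAS for p in parse_zarr_filter(alias)]
    kind, sep, pattern = spec.partition(":")
    if not sep or not pattern:
        raise ValueError(f"expected TYPE:PATTERN, got {spec!r}")
    if kind == "glob":
        rx = glob_to_regex(pattern)
        return [lambda path: rx.fullmatch(path) is not None]
    if kind == "path":
        prefix = pattern.strip("/")  # assumed: exact entry or anything below it
        return [lambda path: path == prefix or path.startswith(prefix + "/")]
    if kind == "regex":
        rx = re.compile(pattern)  # compiled once, so bad patterns fail early
        return [lambda path: rx.search(path) is not None]
    raise ValueError(f"unknown filter type {kind!r}")


def select_entries(entries: List[str], specs: List[str]) -> List[str]:
    """Keep entries matched by ANY filter (OR composition is an assumption)."""
    predicates = [p for spec in specs for p in parse_zarr_filter(spec)]
    return [e for e in entries if any(p(e) for p in predicates)]


if __name__ == "__main__":
    entries = [".zattrs", "0/.zarray", "0/0/0", "0/0/1", "1/zarr.json"]
    print(select_entries(entries, ["metadata"]))
    print(select_entries(entries, ["path:0/0"]))
```

Whether multiple filters should combine with AND or OR is one of the open questions in the design document, so the any() call above is only one possible choice.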

Key findings from investigation

  • The zarr_checksum algorithm IS hierarchical (Merkle tree, bottom-up via ZarrChecksumTree)
  • The archive's ingest_zarr_archive task computes checksums entirely in memory and stores only the root digest
  • Per-directory .checksum files on S3 (zarr-checksums/ prefix) were written by ZarrChecksumFileUpdater, which was removed in dandi-archive#1390, "Always clear checksum files during zarr ingestion" (Dec 2022). Legacy files remain for older zarrs, but no API exposes them
  • Subtree checksum verification is not possible today without recomputation from file ETags
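
The bottom-up, Merkle-style structure mentioned in these findings can be illustrated with a deliberately simplified sketch. It does not reproduce the digest format of the zarr_checksum library or the ZarrChecksumTree API. The function names and the way child entries are fed into MD5 are assumptions chosen only to show why a subtree digest can be recomputed from file ETags, and why keeping only the root digest loses the intermediate values.

```python
import hashlib
from typing import Dict


def build_tree(files: Dict[str, str]) -> dict:
    """Turn {'0/0/0': etag, ...} into nested dicts; leaves hold the ETag."""
    root: dict = {}
    for path, etag in files.items():
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = etag
    return root


def directory_digest(node: dict, path: str, out: Dict[str, str]) -> str:
    """Digest a directory from its children, recording every directory in out."""
    h = hashlib.md5()
    for name in sorted(node):  # sorted so the digest is order independent
        child = node[name]
        if isinstance(child, dict):
            kind, value = "d", directory_digest(child, f"{path}{name}/", out)
        else:
            kind, value = "f", child
        h.update(f"{kind}:{name}:{value}\n".encode())
    out[path] = h.hexdigest()
    return out[path]


def all_digests(files: Dict[str, str]) -> Dict[str, str]:
    """Return a digest per directory; the key '' is the root."""
    out: Dict[str, str] = {}
    directory_digest(build_tree(files), "", out)
    return out


if __name__ == "__main__":
    before = all_digests({"0/0/0": "etag-a", "0/0/1": "etag-b", "1/0/0": "etag-c"})
    after = all_digests({"0/0/0": "etag-a", "0/0/1": "CHANGED", "1/0/0": "etag-c"})
    # Only the directories on the path to the changed file get new digests.
    print(sorted(k for k in before if before[k] != after[k]))
```

In this toy model, verifying the subtree under 0/ needs only the ETags below 0/, but without a stored per-directory digest there is nothing on the server side to compare the result against.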

Review checklist

Please review the design at doc/design/partial-zarr.md and comment on:

  • --zarr TYPE:PATTERN syntax — is the filter approach right? Are glob/path/regex the right types?
  • metadata alias expansion — does glob:**/.z* + glob:**/zarr.json + glob:**/.zmetadata cover all cases?
  • --zarr-mode patch semantics — is "upload without deleting" the right default for patch? Should subtree cleanup happen?
  • URL parsing — is AssetZarrEntryURL with zarr boundary detection the right approach?
  • Checksum strategy — relying on per-file ETags for partial ops, deferring subtree checksums to future manifests
  • Open questions in the doc (AND vs OR composition, --sync interaction, server-side glob)
  • Should legacy zarr-checksums/ files on S3 be cleaned up as part of this or separately?
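
For reviewers weighing the zarr boundary detection question, the following sketch shows one way a path could be split into the zarr asset path and the entry path inside it. It is an illustration only and not the split_zarr_location() implemented in this PR. The boundary rule (first path component ending in .zarr), the return shape, and the example paths are assumptions.

```python
from typing import Optional, Tuple


def split_zarr_location(
    path: str, suffixes: Tuple[str, ...] = (".zarr",)
) -> Tuple[str, Optional[str]]:
    """Split 'dir/file.ome.zarr/0/0/0' into ('dir/file.ome.zarr', '0/0/0').

    Assumed rule: the first component whose name ends with one of `suffixes`
    is the zarr asset; everything after it is an entry inside the zarr.
    Returns (path, None) when there is no boundary or nothing follows it.
    """
    parts = path.strip("/").split("/")
    for i, part in enumerate(parts):
        if part.endswith(suffixes):
            asset = "/".join(parts[: i + 1])
            entry = "/".join(parts[i + 1 :])
            return asset, entry or None
    return "/".join(parts), None


if __name__ == "__main__":
    # Hypothetical asset paths, used only to exercise the function.
    print(split_zarr_location("sub-01/sub-01_sample.ome.zarr/0/0/0"))
    print(split_zarr_location("sub-01/sub-01_sample.ome.zarr"))
    print(split_zarr_location("sub-01/sub-01_ecephys.nwb"))
```

A purely name-based rule like this cannot tell a zarr asset from an ordinary folder that happens to end in .zarr, which is part of why the boundary detection approach is listed for review.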

TODO (post-review)

  • Implement dandi/zarr_filter.py — filter parsing and matching
  • Implement AssetZarrEntryURL and split_zarr_location() in dandi/dandiarchive.py
  • Add --zarr option to dandi download CLI
  • Modify _download_zarr() for partial download support
  • Add --zarr-mode option to dandi upload CLI
  • Implement patch mode in iter_upload() (dandi/files/zarr.py)
  • Thread zarr_mode through dandi/upload.py
  • dandi ls zarr contents support (may be separate PR)
  • Tests for all of the above
  • Coordinate with dandi-archive on manifest design (#2702) for future subtree checksum support
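
The difference between the two --zarr-mode values can be sketched as a small planning step that compares local and remote entries. This is not the iter_upload() logic from dandi/files/zarr.py. The function name, the digest-keyed dictionaries, and the idea that a local MD5 is compared against the remote ETag are assumptions; only the rule that patch mode never deletes remote-only files comes from this PR.

```python
from typing import Dict, List, Literal, Tuple


def plan_zarr_upload(
    local: Dict[str, str],
    remote: Dict[str, str],
    zarr_mode: Literal["full", "patch"] = "full",
) -> Tuple[List[str], List[str]]:
    """Return (entries to upload, remote entries to delete).

    `local` and `remote` map entry paths to digests (assumed comparable,
    e.g. local MD5 versus remote ETag).
    """
    if zarr_mode not in ("full", "patch"):
        # Guard at runtime too, since Literal is only checked statically.
        raise ValueError(f"invalid zarr_mode: {zarr_mode!r}")
    # New or changed entries are uploaded in both modes.
    to_upload = sorted(p for p, digest in local.items() if remote.get(p) != digest)
    # Only full mode removes remote entries that are absent locally.
    to_delete = sorted(set(remote) - set(local)) if zarr_mode == "full" else []
    return to_upload, to_delete


if __name__ == "__main__":
    local = {"0/0/0": "aaa", "0/0/1": "bbb-new", "0/0/2": "ccc"}
    remote = {"0/0/0": "aaa", "0/0/1": "bbb", "0/1/0": "ddd"}
    print(plan_zarr_upload(local, remote, "full"))
    print(plan_zarr_upload(local, remote, "patch"))
```

Because patch mode leaves remote-only entries in place, the resulting remote tree is generally not identical to the local one, which is consistent with the commit note that patch mode skips client-side checksum verification.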

🤖 Generated with Claude Code

@yarikoptic marked this pull request as draft March 2, 2026 21:37
@yarikoptic requested review from kabilar and satra March 2, 2026 21:38
@yarikoptic changed the title from "Design: partial zarr download/upload" to "Partial zarr download/upload (stage: Design)" Mar 2, 2026
codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 92.59259% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.59%. Comparing base (5f03d9b) to head (16b25ec).

Files with missing lines        Patch %   Lines
dandi/dandiarchive.py           64.28%    10 Missing ⚠️
dandi/download.py               86.44%     8 Missing ⚠️
dandi/zarr_filter.py            95.31%     3 Missing ⚠️
dandi/cli/cmd_ls.py             60.00%     2 Missing ⚠️
dandi/files/zarr.py             90.00%     2 Missing ⚠️
dandi/tests/test_download.py    96.15%     2 Missing ⚠️
dandi/upload.py                 75.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1816      +/-   ##
==========================================
+ Coverage   75.12%   75.59%   +0.46%     
==========================================
  Files          84       86       +2     
  Lines       11930    12225     +295     
==========================================
+ Hits         8962     9241     +279     
- Misses       2968     2984      +16     
Flag        Coverage Δ
unittests   75.59% <92.59%> (+0.46%) ⬆️


@yarikoptic added the enhancement (New feature or request) and minor (Increment the minor version when merged) labels Mar 3, 2026
yarikoptic and others added 3 commits March 6, 2026 08:10
Covers five areas:
- --zarr TYPE:PATTERN filtering for download (glob, path, regex)
- URL parsing with zarr boundary detection (AssetZarrEntryURL)
- --zarr-mode {full, patch} for upload
- Checksums and manifests (per-directory checksums are NOT
  persisted on the archive; legacy .checksum files exist on S3
  under zarr-checksums/ for ~72% of older zarrs but are orphaned)
- dandi ls for zarr contents

Related: #1462, #1474

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add --zarr CLI option for download to filter entries within zarr assets
(glob/path/regex patterns with predefined 'metadata' alias), and
--zarr-mode option for upload to support 'patch' mode (upload/update
without deleting remote-only files).

Key changes:
- New dandi/zarr_filter.py: filter parsing, matching, and aliases
- URL parsing: AssetZarrEntryURL for URLs pointing into zarr assets
- Download pipeline: thread zarr_entry_filter through Downloader and
  _download_zarr, skip deletion and checksum when filter active
- Upload pipeline: zarr_mode='patch' skips remote file deletion and
  client-side checksum verification
- dandi ls: list zarr entries when URL points into a zarr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Pre-compile regex patterns at ZarrFilter construction time, catching
  invalid patterns early instead of on every matches() call (B1)
- Remove redundant get_asset_download_path override in AssetZarrEntryURL
  that was identical to the inherited SingleAssetURL method (H1)
- Use Literal["full", "patch"] for zarr_mode parameter instead of bare
  str to prevent silent misbehavior on invalid values (H2)
- Collapse consecutive ** glob segments to avoid exponential
  backtracking in _glob_match_parts (H3)
- Simplify split_zarr_location to use str.split instead of
  PurePosixPath (M1)
- Add explanatory comment for type: ignore in parse_zarr_filter (M2)
- Yield {"status": "done"} when zarr filter matches zero entries (M4)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
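
The point about collapsing consecutive ** segments (H3) can be illustrated with a small segment-wise matcher. This is a sketch under assumptions and not the _glob_match_parts function from this PR: the function names and the recursive strategy are hypothetical, and fnmatchcase from the standard library stands in for per-segment matching.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def collapse_double_stars(parts: Sequence[str]) -> List[str]:
    """Drop a '**' that directly follows another '**'; the meaning is unchanged."""
    out: List[str] = []
    for part in parts:
        if part == "**" and out and out[-1] == "**":
            continue
        out.append(part)
    return out


def glob_match_parts(pattern_parts: Sequence[str], path_parts: Sequence[str]) -> bool:
    """Match path segments against pattern segments, one segment at a time."""
    if not pattern_parts:
        return not path_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # '**' may absorb zero or more leading path segments. Each extra
        # adjacent '**' would multiply the number of ways to split the path,
        # which is why repeats are collapsed before matching.
        return any(
            glob_match_parts(rest, path_parts[i:])
            for i in range(len(path_parts) + 1)
        )
    if not path_parts:
        return False
    return fnmatchcase(path_parts[0], head) and glob_match_parts(rest, path_parts[1:])


def glob_match(pattern: str, path: str) -> bool:
    """Entry point: split on '/', collapse repeated '**', then match."""
    return glob_match_parts(collapse_double_stars(pattern.split("/")), path.split("/"))


if __name__ == "__main__":
    print(collapse_double_stars("**/**/**/.zarray".split("/")))
    print(glob_match("**/**/**/.zarray", "0/1/.zarray"))
```

Collapsing first means a pattern such as **/**/**/x is matched exactly like **/x, so the number of recursive splits depends on the path length rather than on how many times ** was repeated.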
@yarikoptic changed the title from "Partial zarr download/upload (stage: Design)" to "Partial zarr download/upload (stage: Design + Implementation)" Mar 9, 2026