Skip to content

feat(dag/walker): opt-in BloomTracker to avoid duplicated walks#1124

Open
lidel wants to merge 20 commits intomainfrom
feat/provide-entity-roots-with-dedup
Open

feat(dag/walker): opt-in BloomTracker to avoid duplicated walks#1124
lidel wants to merge 20 commits intomainfrom
feat/provide-entity-roots-with-dedup

Conversation

@lidel
Copy link
Copy Markdown
Member

@lidel lidel commented Mar 20, 2026

Warning

not ready for review, this is a sandbox for running CI

Summary

New dag/walker package for memory-efficient DAG traversal with bloom filter deduplication, plus new pinned-provider strategies that use it to avoid re-announcing duplicate blocks across pins.

dag/walker (new package)

  • VisitedTracker interface with two implementations:
    • BloomTracker -- auto-scaling bloom filter chain (~4 bytes/CID vs ~75 for a map), with uncorrelated false positives across nodes (unique random SipHash keys per instance)
    • MapTracker -- exact dedup for tests and small datasets
  • WalkDAG -- iterative DFS traversal with integrated dedup, codec-agnostic link extraction (dag-pb, dag-cbor, raw, and any registered codec). ~2x faster than the legacy go-ipld-prime selector-based walker.
  • WalkEntityRoots -- entity-aware traversal that emits only file/directory/HAMT shard roots instead of every block, skipping internal file chunks.

pinner

  • NewUniquePinnedProvider -- emits all blocks reachable from pins with cross-pin bloom dedup (recursive DAGs first, then direct pins).
  • NewPinnedEntityRootsProvider -- same but emits only entity roots via WalkEntityRoots.
  • Both log and skip corrupted pin entries instead of aborting the entire provide cycle.

provider

  • NewPrioritizedProvider now continues to the next stream when one fails instead of stopping all streams.
  • NewConcatProvider added for pre-deduplicated streams that don't need the cidutil.StreamingSet overhead.

Other

  • Identity CIDs are filtered out during walks (they are inlined data, not real blocks).
  • Siblings are visited in left-to-right link order for deterministic traversal.

VisitedTracker interface for memory-efficient DAG traversal dedup.
BloomTracker uses a scalable bloom filter chain (~4 bytes/CID vs ~75
for a map), enabling dedup on repos with tens of millions of CIDs.

- BloomTracker: auto-scaling chain, configurable FP rate via BloomParams,
  unique random SipHash keys per instance (uncorrelated FPs across nodes)
- MapTracker: exact dedup for tests and small datasets
- *cid.Set satisfies the interface for drop-in compatibility
- go.mod: update ipfs/bbloom to master (for NewWithKeys)
@lidel lidel force-pushed the feat/provide-entity-roots-with-dedup branch from 4dad6c1 to c8962fc Compare March 21, 2026 03:15
iterative DFS walker that integrates VisitedTracker dedup directly
into the traversal loop, skipping entire subtrees in O(1).

- LinksFetcherFromBlockstore: extracts links from any codec registered
  in the global multicodec registry (dag-pb, dag-cbor, raw, etc.)
- ~2x faster than legacy go-ipld-prime selector traversal (no selector
  machinery, simpler decoding, fewer allocations)
- WithLocality option for MFS providers to skip non-local blocks
- best-effort error handling: fetch failures log and skip, do not mark
  the CID as visited (allows retry via another pin or next cycle)
- benchmarks comparing BlockAll vs WalkDAG across dag-pb, dag-cbor,
  and mixed-codec DAGs
@lidel lidel force-pushed the feat/provide-entity-roots-with-dedup branch from 19bf557 to 224c2ae Compare March 21, 2026 03:46
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 78.46154% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.77%. Comparing base (43c30ce) to head (c8920a2).

Files with missing lines Patch % Lines
pinning/pinner/dspinner/uniquepinprovider.go 55.17% 17 Missing and 9 partials ⚠️
dag/walker/walker.go 80.95% 10 Missing and 6 partials ⚠️
dag/walker/entity.go 77.94% 11 Missing and 4 partials ⚠️
dag/walker/visited.go 87.77% 7 Missing and 4 partials ⚠️
provider/provider.go 92.00% 2 Missing ⚠️

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1124      +/-   ##
==========================================
+ Coverage   62.56%   62.77%   +0.21%     
==========================================
  Files         261      265       +4     
  Lines       26216    26539     +323     
==========================================
+ Hits        16402    16660     +258     
- Misses       8125     8170      +45     
- Partials     1689     1709      +20     
Files with missing lines Coverage Δ
provider/provider.go 90.16% <92.00%> (+19.11%) ⬆️
dag/walker/visited.go 87.77% <87.77%> (ø)
dag/walker/entity.go 77.94% <77.94%> (ø)
dag/walker/walker.go 80.95% <80.95%> (ø)
pinning/pinner/dspinner/uniquepinprovider.go 55.17% <55.17%> (ø)

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

lidel added 14 commits March 21, 2026 05:23
emits entity roots (files, directories, HAMT shards) skipping
internal file chunks. core of the +entities provide strategy.

- NodeFetcherFromBlockstore: detects UnixFS entity type from the
  ipld-prime decoded node's Data field
- directories and HAMT shards: emit and recurse into children
- non-UnixFS codecs (dag-cbor, dag-json): emit and follow links
- same options as WalkDAG: WithVisitedTracker, WithLocality
- tests: dag-pb, raw, dag-cbor, mixed codecs, HAMT, dedup,
  error handling, stop conditions
catch unexpected regressions in ipfs/bbloom behavior or BloomParams
derivation that would silently degrade the false positive rate.

- measurable rate (1/1000): 100K probes produce observable FPs,
  asserts rate is within 5x of target
- default rate (1/4.75M): 100K probes must produce exactly 0 FPs
- NewPrioritizedProvider: stream init error no longer stops remaining
  streams (e.g. MFS flush error does not prevent pinned content from
  being provided)
- NewConcatProvider: concatenates pre-deduplicated streams without
  its own visited set, for use with shared VisitedTracker
…vider

NewUniquePinnedProvider: emits all pinned blocks with cross-pin
dedup via shared VisitedTracker (bloom or map). walks recursive pin
DAGs first, then direct pins.

NewPinnedEntityRootsProvider: same structure but uses WalkEntityRoots,
emitting only entity roots and skipping internal file chunks.

existing NewPinnedProvider is unchanged.
- remove unused daggen variable in uniquepinprovider_test.go
…tency

match the defensive read-side ctx.Done select pattern already used by
NewPrioritizedProvider in the same file
- deduplicate LinkSystem construction used by both
  LinksFetcherFromBlockstore and NodeFetcherFromBlockstore
- wrap blockstore with NewIdStore so identity CIDs (multihash 0x00,
  data inline in the CID) are decoded without a datastore lookup
identity CIDs (multihash 0x00) embed data inline, so providing them
to the DHT is wasteful. the walker now traverses through identity
CIDs (following their links) but never emits them.

- add isIdentityCID check to WalkDAG and WalkEntityRoots
- simplify WalkEntityRoots emit/descend logic
- tests for identity raw leaf, identity dag-pb directory with normal
  children, normal directory with identity child
- inline identity CID check (c.Prefix().MhType == mh.IDENTITY) in all
  emit paths: WalkDAG, WalkEntityRoots, and direct pin loops in both
  NewUniquePinnedProvider and NewPinnedEntityRootsProvider
- move all identity CID tests to dag/walker/identity_test.go
- add provider-level identity tests for direct pins and recursive DAGs
the stack-based DFS was pushing children in link order, causing
the last child to be popped first (right-to-left). reverse children
before pushing so the first link is on top and gets visited first.

this matches the legacy fetcherhelpers.BlockAll selector traversal
(ipld-prime iterates list/map entries in insertion order) and the
conventional DFS order described in IPIP-0412.

- walker.go, entity.go: slices.Reverse(children) before stack push
- walker.go: document traversal order in WalkDAG godoc
- entity.go: document order parity in WalkEntityRoots godoc
- walker_test.go, entity_test.go: add sibling order regression tests
a corrupted pin entry was stopping the entire provide cycle because
the goroutine returned on RecursiveKeys/DirectKeys error. change to
continue so remaining pins are still provided (best-effort).

the error from the pinner iterator already contains context (bad CID
bytes, datastore key, etc.) -- sc.Pin.Key is zero-value on error so
including it in the log would be noise.

matches the best-effort pattern used in WalkDAG/WalkEntityRoots
where fetch errors are logged and skipped.
- collectLinks: note that map keys are not recursed (no known codec
  uses link-typed map keys)
- detectEntityType: extract c.Prefix() once for readability
- grow: document MinBloomCapacity invariant that prevents small-bitset
  FP rate issues in grown blooms
@lidel lidel changed the title feat(dag/walker): BloomTracker experiment feat(dag/walker): opt-in BloomTracker to avoid duplicated walks Mar 23, 2026
Copy link
Copy Markdown
Contributor

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a few suggestions but nothing blocking.

// skip identity CIDs: content is inline, no need to provide.
// we still descend (above) so an inlined dag-pb directory's
// normal children get provided.
if c.Prefix().MhType == mh.IDENTITY {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be done before any other work is done?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question: the current order is intentional (but yes, maybe overly paranoid).

Identity CIDs can contain inlined dag-pb directories whose children are normal
(non-identity) CIDs that need to be provided. Moving the identity check before fetch/descend would silently drop those children.

I know this is low level unlikely edge case, so added explicit regression tests in dag/walker/identity_test.go:

  • TestWalkDAG_IdentityCID/"identity dag-pb directory with normal raw child"
  • TestWalkEntityRoots_IdentityCID/"identity directory with normal children"

Comment on lines +137 to +141
if tracker.Visit(sc.Pin.Key) {
if !emit(sc.Pin.Key) {
return
}
}
Copy link
Copy Markdown
Contributor

@gammazero gammazero Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches above logic flow.

Suggested change
if tracker.Visit(sc.Pin.Key) {
if !emit(sc.Pin.Key) {
return
}
}
// Skip if not first visit.
if !tracker.Visit(sc.Pin.Key) {
continue
}
// Emit pin key on first visit.
if !emit(sc.Pin.Key) {
return
}

Do we want to track how many pins are skipped and maybe log the number? Maybe even log that there is a x% chance skipping a pin that should be provided?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted the skip-early style in 0fc1a0b.

As for trasking skips, good idea! Added logging in 577fa3f and c8920a2. On bloom creation:

    bloom tracker created  capacity=2000000 fpRate="1 in 4750000 (~0.000021%)"

After each provide/reprovide cycle (wired in ipfs/kubo@8285ec9):

    unique reprovide cycle finished  providedCIDs=6 skippedBranches=1

Tests assert exact counts for both paths.

lidel added 4 commits March 25, 2026 23:59
uniquepinprovider: use skip-early style for tracker.Visit
in direct pin loops (clearer control flow)

visited.go: document that VisitedTracker implementations may
be probabilistic, and must keep FP rate negligible or allow
callers to adjust it
log capacity, FP rate, and hash parameters on creation.
log previous/new capacity and chain length on autoscale.
helps operators understand bloom sizing and detect unexpected
growth during reprovide cycles.
counts Visit() calls that returned false (CID already seen).
callers can log this after a reprovide cycle to show how much
dedup the bloom filter achieved.
@lidel lidel requested a review from gammazero March 29, 2026 22:55
@lidel lidel marked this pull request as ready for review March 29, 2026 22:55
@lidel lidel requested a review from a team as a code owner March 29, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants