Skip to content

fix(pushsync): prevent silent chunk loss on shallow receipts#5390

Open
gacevicljubisa wants to merge 7 commits intomasterfrom
fix/pushsync-out-of-depth-chunk-loss
Open

fix(pushsync): prevent silent chunk loss on shallow receipts#5390
gacevicljubisa wants to merge 7 commits intomasterfrom
fix/pushsync-out-of-depth-chunk-loss

Conversation

@gacevicljubisa
Copy link
Member

@gacevicljubisa gacevicljubisa commented Mar 9, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Chunks could silently disappear from the network after a successful upload due to three bugs in the pushsync/pusher pipeline.

Root causes & fixes

1. Storer stored chunks it was about to evict (root cause)

  • When a forwarder had no closer peer (ErrWantSelf) but the chunk was outside its AOR, it stored the chunk and sent a receipt anyway
  • The chunk landed in a low proximity bin and was evicted almost immediately by unreserve(), leaving the network with no copy
  • Fix: return the pre-existing but unused ErrOutOfDepthStoring when ErrWantSelf fires but proximity(chunk, self) < radius; the origin treats it as a hard failure and retries with the next peer

2. Shallow receipt abandoned inflight parallel pushes

  • On any shallow receipt, pushToClosest returned immediately, discarding results from other parallel multiplex pushes that could have been valid
  • Only 1 of the 32-attempt error budget was used per call
  • Fix: treat shallow receipt as a regular peer failure — burn the budget, skip the peer, retry; surface ErrShallowReceipt only after the full budget is exhausted

3. False ChunkSynced after retry exhaustion

  • After 6 pusher-level retries, chunks were marked ChunkSynced even though no node in the correct neighbourhood had stored them
  • Fix: report ChunkCouldNotSync instead

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

gacevicljubisa and others added 2 commits March 9, 2026 12:48
Updated comments to correct formatting and clarity.
@gacevicljubisa gacevicljubisa marked this pull request as ready for review March 10, 2026 14:33
@gacevicljubisa gacevicljubisa requested review from acud and janos March 10, 2026 14:33
// error budget and wait for any other inflight parallel pushes
// (e.g. multiplex forwards) before giving up. Only return
// ErrShallowReceipt once the entire budget is spent.
shallowReceiptResult = result.receipt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you've removed the skiplist addition line that was on the left side of the diff (L493). any particular reason?

Copy link
Member Author

@gacevicljubisa gacevicljubisa Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it is not removed, it is on the common failure path at pushsync.go:524

if ps.fullNode {
// If a peer already has the chunk (even at wrong depth), don't
// store locally — the chunk is closer to its neighbourhood than us.
if shallowReceiptResult != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this prevents the pss code from executing. i'm not sure if its 100% correct. maybe for the sake of eventual correctness you could put this below the pss unwrap block (before the returning of the error)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants