Update/store retrieve list separation #272
Conversation
Pull request overview
This PR separates peer eligibility into two distinct allowlists—one for routing/read lookups and one for store/write replication—so postponed peers can still participate in routing while only active peers are used as write targets.
Changes:
- Introduces new `storeAllowlist` and `eligibleForStore` gating alongside updated routing/read eligibility semantics.
- Updates store paths (`storeToAlphaNodes`, `IterateBatchStore`, `batchStoreNetwork`) to skip non-store-eligible peers and improves batch-store error reporting.
- Adjusts bootstrap syncing to produce both routing (Active + Postponed) and store (Active-only) allowlists, and updates ping-success handling to reflect the new split.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `p2p/kademlia/dht.go` | Adds store allowlist + store-eligibility checks; changes routing-eligibility behavior; updates batch-store candidate selection; disables delete/cleanup workers. |
| `p2p/kademlia/bootstrap.go` | Extends chain bootstrap to compute routing vs store allowlists and wires them into the DHT. |
| `p2p/kademlia/node_activity.go` | Updates ping-success logic to add nodes to routing while only marking "Active" when store-eligible. |
| `p2p/kademlia/dht_batch_store_test.go` | Updates the expected error substring for the new batch-store failure message. |
Extends PR #272's routing-vs-store allowlist split to cover the new SUPERNODE_STATE_STORAGE_FULL introduced by lumera #113 (Everlight).

Policy:
- routing (reads) = {ACTIVE, POSTPONED, STORAGE_FULL}
- store (writes) = {ACTIVE}

STORAGE_FULL nodes continue to serve reads and earn payout, but must not receive new STORE/BatchStore writes or be targeted by replication.

Changes:
- p2p/kademlia/supernode_state.go: SSoT helpers using the chain's sntypes.SuperNodeState enum (no numeric literals), a selfState cache, pruneIneligibleStorePeers, and selfStoreEligible.
- bootstrap.go: use isRoutingEligibleState / isStoreEligibleState; record self-state during chain sync; call pruneIneligibleStorePeers after setStoreAllowlist so replication_info.Active is cleared eagerly for ineligible peers (closes the ping-cadence window).
- network.go: STORE RPC self-guard for the single handleStoreData (reject a new-key write when self is not store-eligible; replication of an already-held key is still permitted); the BatchStoreData self-guard rejects when the batch contains any genuinely new keys.
- dht.go: selfState atomic fields; defensive filterEligibleNodes on the BatchRetrieve and BatchRetrieveStream closest-contact lists.
- dht_batch_store_test.go: accept the new "no eligible store peers" error variant alongside the legacy "no candidate nodes".
#284)

* docs: add everlight supernode compatibility implementation plan
* everlight: implement supernode compatibility with audit epoch reports
* host_reporter: measure disk usage on p2p data dir mount
* supernode: align probing roles with ACTIVE+STORAGE_FULL semantics
* deps: bind supernode + tests/system to lumera v1.12.0-rc

  Aligns both supernode go.mod and tests/system go.mod with the v1.12.0-rc release tag (lumera commit 7ca770a / Everlight #113). Resolves the install-lumera CI step, which requires a real downloadable release asset rather than a pseudo-version.

* DHT: Store only on ACTIVE supernodes
* fix bootstrap routing gate
* feat(p2p): Everlight STORAGE_FULL eligibility gate

  Extends PR #272's routing-vs-store allowlist split to cover the new SUPERNODE_STATE_STORAGE_FULL introduced by lumera #113 (Everlight).

  Policy:
  - routing (reads) = {ACTIVE, POSTPONED, STORAGE_FULL}
  - store (writes) = {ACTIVE}

  STORAGE_FULL nodes continue to serve reads and earn payout, but must not receive new STORE/BatchStore writes or be targeted by replication.

  Changes:
  - p2p/kademlia/supernode_state.go: SSoT helpers using the chain's sntypes.SuperNodeState enum (no numeric literals), a selfState cache, pruneIneligibleStorePeers, and selfStoreEligible.
  - bootstrap.go: use isRoutingEligibleState / isStoreEligibleState; record self-state during chain sync; call pruneIneligibleStorePeers after setStoreAllowlist so replication_info.Active is cleared eagerly for ineligible peers (closes the ping-cadence window).
  - network.go: STORE RPC self-guard for the single handleStoreData (reject a new-key write when self is not store-eligible; replication of an already-held key is still permitted); the BatchStoreData self-guard rejects when the batch contains any genuinely new keys.
  - dht.go: selfState atomic fields; defensive filterEligibleNodes on the BatchRetrieve and BatchRetrieveStream closest-contact lists.
  - dht_batch_store_test.go: accept the new "no eligible store peers" error variant alongside the legacy "no candidate nodes".
* test(p2p): unit tests for Everlight STORAGE_FULL eligibility gate

  Coverage matrix aligned with the invariant table in the plan doc:
  - I1 routing allowlist population/pre-init: TestEligibleForRouting_PreInit_AndPopulated
  - I2 store allowlist strictly ⊆ routing: TestEligibleForStore_StrictlyContainedInRouting
  - I3+I9 shouldRejectStore contract (ACTIVE, STORAGE_FULL, POSTPONED, DISABLED, with newKeys=0 vs >0, pre-init permissive): TestShouldRejectStore, TestSelfStoreEligible
  - I5 eager replication-info prune on storeAllowlist update: TestPruneIneligibleStorePeers_ClearsNonStorePeers, TestPruneIneligibleStorePeers_SkipsWhenNotReady
  - I6 state-classification SSoT (no drift possible): TestStateClassification_Table (parametric over all 7 chain enum values)

  Refactors the handleStoreData self-guard to use the shouldRejectStore helper for direct test coverage without a full *Network + hashtable boot. Adds a minimal in-package fakeStore (satisfies the full Store interface with no-op implementations) used only by the prune tests.

* test(e2e): assert fee-distribution value conservation for Everlight

  The cascade e2e asserted that a single coin_spent event in the MsgFinalizeAction tx carried amount == autoPrice. That was true pre-Everlight, when the action module paid the full fee to the supernode in one bank send. Under Everlight Phase 1 (lumera #113), x/action/v1/keeper DistributeFees splits the fee across multiple bank sends: the governance-tunable RegistrationFeeShareBps is routed to the x/supernode reward pool, any FoundationFeeShare goes to the community pool, and the remainder is paid to the supernode(s). No single coin_spent event equals autoPrice anymore, so the pre-Everlight assertion fails deterministically on the v1.12.0-rc bump introduced by this PR.

  Replace the single-event check with a value-conservation assertion:
  - Sum all msg_index-tagged coin_spent events per spender; feeSpent is true iff some spender (the action module) paid out autoPrice in total.
  - Sum all msg_index-tagged coin_received events across recipients; feeReceived is true iff the total equals autoPrice.
  - The msg_index discriminator excludes the tx-level gas-fee pair (emitted by x/auth, no msg_index) from the payout accounting.
  - fromAddress / toAddress are retained for the payment-flow log; toAddress picks the recipient of the largest share (the supernode payout under the default split ratio).

  This matches the real product invariant (value conservation across the payout flow) and survives future parameter changes to RegistrationFeeShareBps or FoundationFeeShare without further test churn.

* fix(p2p): close STORAGE_FULL transition race on cascade upload write path

  A production-gate devnet run (lumera@master + this branch) found that cascade uploads fail deterministically whenever an on-chain supernode state transition to STORAGE_FULL has just happened and the client's local storeAllowlist has not yet refreshed. Observed failure:

      ERROR Batch store to node failed err="batch store rejected: self not store-eligible" node=<STORAGE_FULL peer>
      → failed to achieve desired success rate, only: 71.43% successful
      → registration task failed → cascade upload FAILS

  Root cause (two compounding issues):
  1. setStoreAllowlist is only called from SyncBootstrapOnce, which ticks every bootstrapRefreshInterval (10 minutes). Between a chain STORAGE_FULL transition and the next tick, the caller's in-memory storeAllow still contains the transitioned peer. IterateBatchStore's eligibleForStore(n) gate therefore passes, and the client dispatches a STORE RPC that the peer correctly rejects via its own self-guard.
  2. The IterateBatchStore response handler counted the authoritative "self not store-eligible" rejection as a store failure. A single transitioned peer in the closest-contact list therefore dragged the measured success rate below the 75% minimumDataStoreSuccessRate threshold and failed the entire cascade upload.

  Fix (surgical, two coordinated parts, both required):
  1. Fast allowlist-only refresher — new StartAllowlistRefresher / RefreshAllowlistsFromChain in p2p/kademlia/bootstrap.go, ticking every 30s (allowlistRefreshInterval), plus an opportunistic best-effort refresh at the top of IterateBatchStore. Reuses loadBootstrapCandidatesFromChain's existing gRPC ListSuperNodes query; does NOT touch replication_info upserts, pings, routing-table seeding, or any other heavyweight bootstrap behavior. Converges the write-gate within O(seconds) of a chain transition.
  2. Tolerate authoritative peer rejections — new isPeerStoreIneligibleError helper + new pruneStoreAllowEntry method in p2p/kademlia/supernode_state.go. When the peer responds with the canonical "self not store-eligible" marker (emitted by both handleStoreData and handleBatchStoreData), the client now (a) decrements its request counter so the rejection is NOT in the success-rate denominator, and (b) removes the peer from the in-memory storeAllow so subsequent hot-path lookups in the same run do not re-pick it. The server-side emitter and the client-side recognizer share the peerStoreIneligibleMarker constant to prevent silent drift.

  Invariants locked in by new unit tests in peer_store_rejection_test.go:
  - isPeerStoreIneligibleError recognizes both the STORE and BatchStoreData server emitter phrasings (and wrapped variants).
  - isPeerStoreIneligibleError does NOT match unrelated store errors (timeout, rpc internal, near-miss typos, case mismatch).
  - pruneStoreAllowEntry removes the peer and decrements storeAllowCount atomically; it is idempotent.
  - pruneStoreAllowEntry is a no-op for a nil receiver, nil node, empty ID, or unset allowlist.
  - TestPeerStoreIneligibleMarker_PresentInServerEmitters cross-checks that the client-side recognizer matches the exact strings emitted on the server side.

  Blast radius:
  - Write-path only; the read path (BatchRetrieve) is unaffected.
  - No behavior change in the all-ACTIVE (unmixed) cluster case.
  - Pre-existing tests on #284 remain green (TestEligibleForRouting_*, TestEligibleForStore_*, TestShouldRejectStore, TestSelfStoreEligible, TestPruneIneligibleStorePeers_*, TestStateClassification_Table).

  Rollback: a single revert of this commit restores the previous behavior; the pre-existing STORAGE_FULL gate logic continues to function (it was correct on the server side; the fix only strengthens client-side convergence).

* fix(p2p): address Roo review findings on STORAGE_FULL gate

  Four issues raised on PR #284 by the Roo review bot, triaged and fixed here. Honest summary of severity (from the PR review-of-review):
  1. REAL BUG: setStoreAllowlist accepted an empty allowlist post-ready, clamping storeAllowCount to 0 and causing eligibleForStore to return false for every peer. A transient chain RPC failure (ListSuperNodes returning zero results) would therefore block ALL writes network-wide until the next bootstrap refresh. setRoutingAllowlist already guarded against this; setStoreAllowlist did not. Asymmetry.
  2. MINOR PERF: RefreshAllowlistsFromChain was called unconditionally at the top of IterateBatchStore. A burst of concurrent uploads would fan out to one chain ListSuperNodes query per batch. The 30-second background StartAllowlistRefresher already handles steady-state convergence; the opportunistic call only matters for the ~30s window immediately after a chain state transition.
  3. MAINTAINABILITY: handleBatchStoreData reimplemented the shouldRejectStore contract inline (selfStoreEligible() + a manual newKeys count) while handleStoreData used shouldRejectStore(1). Future refinements would risk silent divergence.
  4. COSMETIC: eligibleForStore checked n == nil AFTER consulting storeAllowReady / storeAllowCount; eligibleForRouting checked it first. No observable bug in current call sites (all iterate over slices built from non-nil sources), but defense-in-depth alignment is free.
  Changes:

  * p2p/kademlia/dht.go
    - setStoreAllowlist: reject empty-map updates post-ready, retaining the previous allowlist. WARN-level log on a transient outage, DEBUG-level log during bootstrap. Symmetric with setRoutingAllowlist.
    - eligibleForStore: move the `n == nil || len(n.ID) == 0` check to the top, matching eligibleForRouting's ordering. Defense in depth.
    - DHT struct: new lastAllowlistRefreshUnixNano atomic.Int64 to support the debounced refresh path.
    - IterateBatchStore: swap RefreshAllowlistsFromChain for the new MaybeRefreshAllowlists helper.
  * p2p/kademlia/bootstrap.go
    - new const allowlistOpportunisticMinInterval = 10s
    - RefreshAllowlistsFromChain: stamps lastAllowlistRefreshUnixNano on successful completion (so the 30s background refresher also feeds the debounce window).
    - new MaybeRefreshAllowlists: debounced hot-path variant. Stamps on ATTEMPT (not success) so a persistently failing chain cannot fan out to one RPC per caller; the background ticker retries on its own cadence.
  * p2p/kademlia/supernode_state.go
    - new shouldRejectBatchStore(ctx, payloads) ([]bool, int): the batch-sized companion to shouldRejectStore. Preserves the hot-path short-circuit (the selfStoreEligible() check bypasses the per-payload storage lookup in the 99% case) while routing the final decision through shouldRejectStore for single-source-of-truth semantics.
  * p2p/kademlia/network.go
    - handleBatchStoreData: replace the inline selfStoreEligible + manual newKeys counting with a shouldRejectBatchStore call. The RPC response phrasing ("batch store rejected: self not store-eligible") is unchanged, so the client-side isPeerStoreIneligibleError recognizer still matches.
  * p2p/kademlia/review_fixes_test.go (new)
    - TestSetStoreAllowlist_EmptyMapPostReady_RetainsPrevious / TestSetStoreAllowlist_EmptyMapPreReady_NoOp: invariant tests locking in issue (1). The prior allowlist must be retained when the chain returns empty; pre-ready must not falsely flip ready=true with count=0.
    - TestEligibleForStore_NilNode_ReturnsFalseEvenPreReady: locks in issue (4)'s nil ordering. A pre-ready valid node is still permissive (bootstrap not broken).
    - TestMaybeRefreshAllowlists_Debounces / TestMaybeRefreshAllowlists_FiresWhenStale: lock in issue (2)'s debounce semantics: within the 10s window opportunistic callers skip; outside it they fire.
    - TestShouldRejectBatchStore_ShortCircuitsOnSelfEligible / TestShouldRejectBatchStore_NilReceiver: verify the batch helper's performance-critical short-circuit (no storage lookups when self is store-eligible) and defensive nil-receiver handling.

  All existing p2p tests remain green. go vet clean.

  Why the author missed these during the original #284 authorship:
  - Confirmation bias: focusing on "does Phase 4 pass" after the first fix commit blinded me to asymmetric hardening between sibling functions.
  - Skipped my own invariant-first-coding skill: per-path violation tests for write-path entry points were not authored up front. A side-by-side of setRoutingAllowlist vs setStoreAllowlist would have surfaced (1) immediately.
  - Positive-path-only testing: peer_store_rejection_test.go covers the new tolerance path but not input edge cases to setStoreAllowlist (empty map, nil map, pre- vs post-ready semantics).

  These fixes tighten the invariants the first commit put in place and add the missing violation tests that would have surfaced them.

---------

Co-authored-by: NightCrawler <nightcrawler@lumera.ai>