Everlight: supernode compatibility + p2p STORAGE_FULL eligibility gate#284
mateeullahmalik merged 12 commits into master
Conversation
```go
func (s *DHT) setStoreAllowlist(ctx context.Context, allow map[[32]byte]struct{}) {
	if s == nil {
		return
	}
	// Integration tests may use synthetic bootstrap sets; do not enforce chain-state gating.
	if integrationTestEnabled() {
		return
	}

	s.storeAllowMu.Lock()
	s.storeAllow = allow
	s.storeAllowMu.Unlock()

	s.storeAllowCount.Store(int64(len(allow)))
	s.storeAllowReady.Store(true)

	logtrace.Debug(ctx, "store allowlist updated", logtrace.Fields{
		logtrace.FieldModule: "p2p",
		"store_peers":        len(allow),
	})
}
```
setStoreAllowlist unconditionally accepts an empty map and marks storeAllowReady=true with storeAllowCount=0. If a transient chain issue returns zero ACTIVE supernodes, eligibleForStore will return false for every peer, blocking all writes network-wide until the next bootstrap refresh (up to 10 min). setRoutingAllowlist explicitly guards against this by returning early when len(allow) == 0 and retaining the previous allowlist. The store allowlist should have the same protection.
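A minimal sketch of the suggested guard, mirroring setRoutingAllowlist's early return as described above (the exact log wording is an assumption):

```go
func (s *DHT) setStoreAllowlist(ctx context.Context, allow map[[32]byte]struct{}) {
	if s == nil || integrationTestEnabled() {
		return
	}
	// Guard: an empty set is more likely a transient chain outage than a
	// genuine zero-ACTIVE network; retain the previous allowlist instead.
	if len(allow) == 0 {
		logtrace.Warn(ctx, "store allowlist refresh returned no peers; retaining previous allowlist", logtrace.Fields{
			logtrace.FieldModule: "p2p",
		})
		return
	}
	// ... unchanged from the diff above: swap the map, update storeAllowCount,
	// mark storeAllowReady, and log.
}
```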
```go
func (s *DHT) eligibleForStore(n *Node) bool {
	if s == nil {
		return false
	}
	// In integration tests allow everything; chain state gating is not stable/available there.
	if integrationTestEnabled() {
		return true
	}
	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
	if !s.storeAllowReady.Load() {
		return true
	}
	// Once initialized, an empty active set means no write-eligible peers.
	if s.storeAllowCount.Load() == 0 {
		return false
	}
	if n == nil || len(n.ID) == 0 {
		return false
	}
```
The nil/empty node check (n == nil || len(n.ID) == 0) is placed after the storeAllowReady and storeAllowCount checks. When storeAllowReady is false (pre-bootstrap), this function returns true for a nil node, which could cause a nil pointer dereference if a nil *Node slips through a call site during bootstrap. eligibleForRouting checks n == nil as the very first thing (line 214). Moving the nil check before the ready/count checks would be consistent and defensive.
Suggested change:

```go
func (s *DHT) eligibleForStore(n *Node) bool {
	if s == nil {
		return false
	}
	if n == nil || len(n.ID) == 0 {
		return false
	}
	// In integration tests allow everything; chain state gating is not stable/available there.
	if integrationTestEnabled() {
		return true
	}
	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
	if !s.storeAllowReady.Load() {
		return true
	}
	// Once initialized, an empty active set means no write-eligible peers.
	if s.storeAllowCount.Load() == 0 {
		return false
	}
```
```go
if !s.dht.selfStoreEligible() {
	newKeys := 0
	for _, data := range request.Data {
		k, _ := utils.Blake3Hash(data)
		existing, rErr := s.dht.store.Retrieve(ctx, k)
		if rErr != nil || len(existing) == 0 {
			newKeys++
		}
	}
	if newKeys > 0 {
		logtrace.Warn(ctx, "rejecting BatchStore: self is not store-eligible", logtrace.Fields{
			logtrace.FieldModule: "p2p",
			"sender":             message.Sender.String(),
			"self_state":         s.dht.selfState.Load(),
			"new_keys":           newKeys,
			"total_keys":         len(request.Data),
		})
		return s.generateResponseMessage(ctx, BatchStoreData, message.Sender, ResultFailed, "batch store rejected: self not store-eligible")
	}
}
```
handleStoreData was refactored to use shouldRejectStore(1), but this batch handler still calls selfStoreEligible() directly and reimplements the newKeys-counting logic inline. If shouldRejectStore is updated later (e.g., to add logging, metrics, or a grace period), this path will silently diverge. Consider replacing the !s.dht.selfStoreEligible() guard + manual newKeys count with s.dht.shouldRejectStore(newKeys) after computing newKeys, matching the single-store handler's pattern.
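A hedged sketch of that consolidation, assuming shouldRejectStore(newKeys) carries the logging and rejection decision (the signature is inferred from the single-store handler's shouldRejectStore(1) call mentioned above):

```go
// Count keys in the batch that we do not already hold; replication of
// already-held keys remains permitted even when self is not store-eligible.
newKeys := 0
for _, data := range request.Data {
	k, _ := utils.Blake3Hash(data)
	if existing, rErr := s.dht.store.Retrieve(ctx, k); rErr != nil || len(existing) == 0 {
		newKeys++
	}
}
// Single source of truth: the same helper the single-store handler uses.
if s.dht.shouldRejectStore(newKeys) {
	return s.generateResponseMessage(ctx, BatchStoreData, message.Sender, ResultFailed, "batch store rejected: self not store-eligible")
}
```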
Aligns both the supernode go.mod and tests/system go.mod with the v1.12.0-rc release tag (lumera commit 7ca770a / Everlight #113). Fixes the install-lumera CI step, which requires a real downloadable release asset rather than a pseudo-version.
Extends PR #272's routing-vs-store allowlist split to cover the new SUPERNODE_STATE_STORAGE_FULL introduced by lumera #113 (Everlight).

Policy:
- routing (reads) = {ACTIVE, POSTPONED, STORAGE_FULL}
- store (writes) = {ACTIVE}

STORAGE_FULL nodes continue to serve reads and earn payout, but must not receive new STORE/BatchStore writes or be targeted by replication.

Changes:
- p2p/kademlia/supernode_state.go: SSoT helpers using chain's sntypes.SuperNodeState enum (no numeric literals), selfState cache, pruneIneligibleStorePeers, selfStoreEligible.
- bootstrap.go: use isRoutingEligibleState / isStoreEligibleState; record self-state during chain sync; call pruneIneligibleStorePeers after setStoreAllowlist so replication_info.Active is cleared eagerly for ineligible peers (closes ping-cadence window).
- network.go: STORE RPC self-guard for single handleStoreData (reject new-key write when self not store-eligible; replication of already-held key still permitted); BatchStoreData self-guard rejects when batch contains any genuinely new keys.
- dht.go: selfState atomic fields; defensive filterEligibleNodes on BatchRetrieve and BatchRetrieveStream closest-contact lists.
- dht_batch_store_test.go: accept new "no eligible store peers" error variant alongside legacy "no candidate nodes".
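For illustration, a hedged sketch of what the SSoT helpers could look like (the exact sntypes constant names are assumptions; the real file uses the chain enum directly, with no numeric literals):

```go
// isRoutingEligibleState: read-path set {ACTIVE, POSTPONED, STORAGE_FULL}.
func isRoutingEligibleState(st sntypes.SuperNodeState) bool {
	switch st {
	case sntypes.SuperNodeStateActive,
		sntypes.SuperNodeStatePostponed,
		sntypes.SuperNodeStateStorageFull:
		return true
	default:
		return false
	}
}

// isStoreEligibleState: write-path set {ACTIVE} only.
func isStoreEligibleState(st sntypes.SuperNodeState) bool {
	return st == sntypes.SuperNodeStateActive
}
```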
Coverage matrix aligned with the invariant table in the plan doc:
- I1 routing allowlist population/pre-init: TestEligibleForRouting_PreInit_AndPopulated
- I2 store allowlist strictly ⊆ routing: TestEligibleForStore_StrictlyContainedInRouting
- I3+I9 shouldRejectStore contract (ACTIVE, STORAGE_FULL, POSTPONED, DISABLED, with newKeys=0 vs >0, pre-init permissive): TestShouldRejectStore, TestSelfStoreEligible
- I5 eager replication-info prune on storeAllowlist update: TestPruneIneligibleStorePeers_ClearsNonStorePeers, TestPruneIneligibleStorePeers_SkipsWhenNotReady
- I6 state-classification SSoT (no drift possible): TestStateClassification_Table (parametric over all 7 chain enum values)

Refactors the handleStoreData self-guard to use the shouldRejectStore helper for direct test coverage without a full *Network + hashtable boot. Adds a minimal fakeStore in-package (satisfies the full Store interface with no-op implementations) used only by the prune tests.
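A hedged sketch of the parametric classification test named above (the sntypes constant names are assumptions; the real table covers all 7 enum values):

```go
func TestStateClassification_Table(t *testing.T) {
	cases := []struct {
		state   sntypes.SuperNodeState
		routing bool
		store   bool
	}{
		{sntypes.SuperNodeStateActive, true, true},
		{sntypes.SuperNodeStatePostponed, true, false},
		{sntypes.SuperNodeStateStorageFull, true, false},
		{sntypes.SuperNodeStateDisabled, false, false},
		// remaining enum values omitted here; all classify as false/false
	}
	for _, c := range cases {
		if got := isRoutingEligibleState(c.state); got != c.routing {
			t.Errorf("isRoutingEligibleState(%v) = %v, want %v", c.state, got, c.routing)
		}
		if got := isStoreEligibleState(c.state); got != c.store {
			t.Errorf("isStoreEligibleState(%v) = %v, want %v", c.state, got, c.store)
		}
	}
}
```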
Force-pushed from af9c32c to 7a2b535.
The cascade e2e asserted that a single coin_spent event in the MsgFinalizeAction tx carried amount == autoPrice. That was true pre-Everlight, when the action module paid the full fee to the supernode in one bank send. Under Everlight Phase 1 (lumera #113), x/action/v1/keeper DistributeFees splits the fee across multiple bank sends: the governance-tunable RegistrationFeeShareBps is routed to the x/supernode reward pool, any FoundationFeeShare goes to the community pool, and the remainder is paid to the supernode(s). No single coin_spent event equals autoPrice anymore, so the pre-Everlight assertion fails deterministically on the v1.12.0-rc bump introduced by this PR.

Replace the single-event check with a value-conservation assertion:
- Sum all msg_index-tagged coin_spent events per spender; feeSpent is true iff some spender (the action module) paid out autoPrice in total.
- Sum all msg_index-tagged coin_received events across recipients; feeReceived is true iff the total equals autoPrice.
- The msg_index discriminator excludes the tx-level gas-fee pair (emitted by x/auth, no msg_index) from the payout accounting.
- fromAddress / toAddress are retained for the payment-flow log; toAddress picks the recipient of the largest share (the supernode payout under the default split ratio).

This matches the real product invariant (value conservation across the payout flow) and survives future parameter changes to RegistrationFeeShareBps or FoundationFeeShare without further test churn.
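As a sketch of the value-conservation check (event attribute names follow Cosmos SDK bank module conventions; hasAttr, attr, and parseAmount are hypothetical helpers):

```go
spentBySpender := map[string]int64{}
var totalReceived int64
for _, ev := range txEvents {
	if !hasAttr(ev, "msg_index") { // excludes the tx-level x/auth gas-fee pair
		continue
	}
	switch ev.Type {
	case "coin_spent":
		spentBySpender[attr(ev, "spender")] += parseAmount(attr(ev, "amount"))
	case "coin_received":
		totalReceived += parseAmount(attr(ev, "amount"))
	}
}
feeSpent := false
for _, total := range spentBySpender {
	if total == autoPrice { // some spender (the action module) paid the full fee
		feeSpent = true
		break
	}
}
feeReceived := totalReceived == autoPrice
```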
…path
Production-gate devnet run (lumera@master + this branch) found that cascade
uploads fail deterministically whenever an on-chain supernode state
transition to STORAGE_FULL has just happened and the client's local
storeAllowlist has not yet refreshed.
Observed failure:
ERROR Batch store to node failed err="batch store rejected: self
not store-eligible" node=<STORAGE_FULL peer>
→ failed to achieve desired success rate, only: 71.43% successful
→ registration task failed → cascade upload FAILS
Root cause (two compounding issues):
1. setStoreAllowlist is only called from SyncBootstrapOnce, which ticks
every bootstrapRefreshInterval (10 minutes). Between a chain STORAGE_FULL
transition and the next tick, the caller's in-memory storeAllow still
contains the transitioned peer. IterateBatchStore's eligibleForStore(n)
gate therefore passes, and the client dispatches a STORE RPC that the
peer correctly rejects via its own self-guard.
2. The IterateBatchStore response handler counted the authoritative
"self not store-eligible" rejection as a store failure. A single
transitioned peer in the closest-contact list therefore dragged the
measured success rate below the 75% minimumDataStoreSuccessRate
threshold and failed the entire cascade upload.
Fix (surgical, two coordinated parts, both required):
1. Fast allowlist-only refresher — new StartAllowlistRefresher /
RefreshAllowlistsFromChain in p2p/kademlia/bootstrap.go, ticking every
30s (allowlistRefreshInterval), plus an opportunistic best-effort
refresh at the top of IterateBatchStore. Reuses loadBootstrapCandidatesFromChain's
existing gRPC ListSuperNodes query; does NOT touch
replication_info upserts, pings, routing-table seeding, or any other
heavyweight bootstrap behavior. Converges the write-gate within
O(seconds) of a chain transition.
2. Tolerate authoritative peer rejections — new isPeerStoreIneligibleError
helper + new pruneStoreAllowEntry method in p2p/kademlia/supernode_state.go.
When the peer responds with the canonical "self not store-eligible"
marker (emitted by both handleStoreData and handleBatchStoreData), the
client now:
(a) decrements its request counter so the rejection is NOT in the
success-rate denominator,
(b) removes the peer from the in-memory storeAllow so subsequent
hot-path lookups in the same run do not re-pick it.
The server-side emitter and the client-side recognizer share the
peerStoreIneligibleMarker constant to prevent silent drift.
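A minimal sketch of that shared-marker pattern, assuming a substring match over wrapped errors (the constant's value is taken from the rejection strings quoted above):

```go
// Shared by the server-side emitters (handleStoreData, handleBatchStoreData)
// and the client-side recognizer so the two cannot silently drift.
const peerStoreIneligibleMarker = "self not store-eligible"

// isPeerStoreIneligibleError reports whether err is an authoritative
// peer-side store-eligibility rejection, including wrapped variants.
func isPeerStoreIneligibleError(err error) bool {
	return err != nil && strings.Contains(err.Error(), peerStoreIneligibleMarker)
}
```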
Invariants locked in by new unit tests in peer_store_rejection_test.go:
- isPeerStoreIneligibleError recognizes both STORE and BatchStoreData
server emitter phrasings (and wrapped variants).
- isPeerStoreIneligibleError does NOT match unrelated store errors
(timeout, rpc internal, near-miss typos, case-mismatch).
- pruneStoreAllowEntry removes the peer and decrements
storeAllowCount atomically; is idempotent.
- pruneStoreAllowEntry is a no-op for nil receiver, nil node, empty
ID, or unset allowlist.
- TestPeerStoreIneligibleMarker_PresentInServerEmitters cross-checks
that the client-side recognizer matches the exact strings emitted
on the server side.
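Under those invariants, a plausible shape for the prune helper (the [32]byte key derivation from Node.ID is an assumption):

```go
// pruneStoreAllowEntry removes n from the in-memory store allowlist so
// subsequent hot-path lookups in the same run do not re-pick the peer.
// Idempotent; a no-op for nil receivers, nil nodes, empty IDs, or an
// allowlist that was never initialized.
func (s *DHT) pruneStoreAllowEntry(n *Node) {
	if s == nil || n == nil || len(n.ID) == 0 {
		return
	}
	var key [32]byte
	copy(key[:], n.ID)

	s.storeAllowMu.Lock()
	defer s.storeAllowMu.Unlock()
	if s.storeAllow == nil {
		return
	}
	if _, ok := s.storeAllow[key]; ok {
		delete(s.storeAllow, key)
		s.storeAllowCount.Add(-1)
	}
}
```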
Blast radius:
- Write-path only; read-path (BatchRetrieve) is unaffected.
- No behavior change in the all-ACTIVE (unmixed) cluster case.
- Pre-existing tests on #284 remain green
(TestEligibleForRouting_*, TestEligibleForStore_*, TestShouldRejectStore,
TestSelfStoreEligible, TestPruneIneligibleStorePeers_*,
TestStateClassification_Table).
Rollback: single revert of this commit restores previous behavior; the
pre-existing STORAGE_FULL gate logic continues to function (it was
correct on the server side; the fix only strengthens client-side
convergence).
```go
// Best-effort allowlist refresh on the write hot path. This is the
// first line of defense against a stale storeAllowlist on the very
// first upload after a chain state transition. The peer-rejection
// handling below is the second line of defense for races we miss.
if err := s.RefreshAllowlistsFromChain(ctx); err != nil {
	logtrace.Debug(ctx, "dht: opportunistic allowlist refresh failed", logtrace.Fields{
		logtrace.FieldModule: "dht",
		"task_id":            id,
		logtrace.FieldError:  err.Error(),
	})
}
```
RefreshAllowlistsFromChain calls ListSuperNodes (a full chain RPC) synchronously on every IterateBatchStore invocation. In a cascade upload this function is called per symbol batch, so every batch pays the latency of a chain round-trip and the chain node sees one ListSuperNodes query per batch. The 30-second background StartAllowlistRefresher already keeps the allowlist current for the steady state. Consider adding a time-based debounce here (e.g., skip if last refresh was <30s ago) so the opportunistic refresh fires at most once per upload rather than once per batch.
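A minimal sketch of such a debounce (the atomic field name mirrors the follow-up fix described below; the interval and stamp-on-attempt choice are assumptions):

```go
// MaybeRefreshAllowlists is the debounced hot-path variant: it fires at most
// once per window and stamps on attempt so concurrent callers cannot fan out.
func (s *DHT) MaybeRefreshAllowlists(ctx context.Context) {
	const window = 30 * time.Second
	now := time.Now().UnixNano()
	last := s.lastAllowlistRefreshUnixNano.Load()
	if now-last < int64(window) ||
		!s.lastAllowlistRefreshUnixNano.CompareAndSwap(last, now) {
		return // refreshed recently, or another caller just claimed the slot
	}
	if err := s.RefreshAllowlistsFromChain(ctx); err != nil {
		logtrace.Debug(ctx, "dht: debounced allowlist refresh failed", logtrace.Fields{
			logtrace.FieldModule: "dht",
			logtrace.FieldError:  err.Error(),
		})
	}
}
```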
Four issues raised on PR #284 by the Roo review bot, triaged and fixed here. Honest summary of severity (from the PR review-of-review):

1. REAL BUG: setStoreAllowlist accepted an empty allowlist post-ready, clamping storeAllowCount to 0 and causing eligibleForStore to return false for every peer. A transient chain RPC failure (ListSuperNodes returning zero results) would therefore block ALL writes network-wide until the next bootstrap refresh. setRoutingAllowlist already guarded against this; setStoreAllowlist did not. Asymmetry.
2. MINOR PERF: RefreshAllowlistsFromChain was called unconditionally at the top of IterateBatchStore. A burst of concurrent uploads would fan out to one chain ListSuperNodes query per batch. The 30-second background StartAllowlistRefresher already handles steady-state convergence; the opportunistic call only matters for the ~30s window immediately after a chain state transition.
3. MAINTAINABILITY: handleBatchStoreData reimplemented the shouldRejectStore contract inline (selfStoreEligible() + manual newKeys count) while handleStoreData used shouldRejectStore(1). Future refinements would risk silent divergence.
4. COSMETIC: eligibleForStore checked n == nil AFTER consulting storeAllowReady / storeAllowCount. eligibleForRouting checked it first. No observable bug in current call sites (all iterate over slices built from non-nil sources), but defense-in-depth alignment is free.

Changes:

* p2p/kademlia/dht.go
  - setStoreAllowlist: reject empty-map updates post-ready, retaining the previous allowlist. WARN-level log on a transient outage, DEBUG-level log during bootstrap. Symmetric with setRoutingAllowlist.
  - eligibleForStore: move `n == nil || len(n.ID) == 0` to the top, matching eligibleForRouting's ordering. Defense-in-depth.
  - DHT struct: new lastAllowlistRefreshUnixNano atomic.Int64 to support the debounced refresh path.
  - IterateBatchStore: swap RefreshAllowlistsFromChain for the new MaybeRefreshAllowlists helper.
* p2p/kademlia/bootstrap.go
  - new const allowlistOpportunisticMinInterval = 10s.
  - RefreshAllowlistsFromChain: stamps lastAllowlistRefreshUnixNano on successful completion (so the 30s background refresher also feeds the debounce window).
  - new MaybeRefreshAllowlists: debounced hot-path variant. Stamps on ATTEMPT (not success) so a persistently failing chain cannot fan out to one RPC per caller; the background ticker retries on its own cadence.
* p2p/kademlia/supernode_state.go
  - new shouldRejectBatchStore(ctx, payloads) ([]bool, int): the batch-sized companion to shouldRejectStore. Preserves the hot-path short-circuit (the selfStoreEligible() check bypasses the per-payload storage lookup in the 99% case) while routing the final decision through shouldRejectStore for single-source-of-truth semantics.
* p2p/kademlia/network.go
  - handleBatchStoreData: replace the inline selfStoreEligible + manual newKeys counting with a shouldRejectBatchStore call. The RPC response phrasing ("batch store rejected: self not store-eligible") is unchanged, so the client-side isPeerStoreIneligibleError recognizer still matches.
* p2p/kademlia/review_fixes_test.go (new)
  - TestSetStoreAllowlist_EmptyMapPostReady_RetainsPrevious / TestSetStoreAllowlist_EmptyMapPreReady_NoOp: invariant tests locking in issue (1). The prior allowlist must be retained when the chain returns empty; pre-ready must not falsely flip ready=true with count=0.
  - TestEligibleForStore_NilNode_ReturnsFalseEvenPreReady: locks in issue (4) nil-ordering. A pre-ready valid node is still permissive (bootstrap not broken).
  - TestMaybeRefreshAllowlists_Debounces / TestMaybeRefreshAllowlists_FiresWhenStale: lock in issue (2) debounce semantics. Within the 10s window opportunistic callers skip; outside it they fire.
  - TestShouldRejectBatchStore_ShortCircuitsOnSelfEligible / TestShouldRejectBatchStore_NilReceiver: verify the batch helper's performance-critical short-circuit (no storage lookups when self is store-eligible) and defensive nil-receiver handling.

All existing p2p tests remain green. go vet clean.

Why the author missed these during the original #284 authorship:

- Confirmation bias: focusing on "does Phase 4 pass" after the first fix commit blinded me to asymmetric hardening between sibling functions.
- Skipped my own invariant-first-coding skill: per-path violation tests for write-path entry points were not authored up front. A side-by-side of setRoutingAllowlist vs setStoreAllowlist would have surfaced (1) immediately.
- Positive-path-only testing: peer_store_rejection_test.go covers the new tolerance path but not input edge cases to setStoreAllowlist (empty map, nil map, pre- vs post-ready semantics).

These fixes tighten the invariants the first commit put in place and add the missing violation tests that would have surfaced them.
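A hedged sketch of the batch helper's shape (the mask semantics are an assumption inferred from the signature quoted above):

```go
// shouldRejectBatchStore returns a per-payload new-key mask and the count of
// genuinely new keys. When self is store-eligible it short-circuits without
// touching storage, which is the common case on the hot path.
func (s *DHT) shouldRejectBatchStore(ctx context.Context, payloads [][]byte) ([]bool, int) {
	if s == nil || s.selfStoreEligible() {
		return nil, 0 // nothing to reject; the caller proceeds with the store
	}
	mask := make([]bool, len(payloads))
	newKeys := 0
	for i, data := range payloads {
		k, _ := utils.Blake3Hash(data)
		if existing, err := s.store.Retrieve(ctx, k); err != nil || len(existing) == 0 {
			mask[i] = true
			newKeys++
		}
	}
	// The final decision still routes through shouldRejectStore at the call
	// site, keeping single- and batch-store semantics in one place.
	return mask, newKeys
}
```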
Consolidates the Everlight supernode workstream into one PR. Supersedes #282 (closed) and #272 (closed).
Implements the supernode side of lumera #113 (Everlight Phase 1). Three layers, stacked in commit order:
- Compatibility layer — host_reporter measures disk usage on the p2p data dir mount and emits cascade_kademlia_db_bytes + disk_usage_percent in audit epoch reports; probing + verifier accept ACTIVE and STORAGE_FULL as operational states; the supernode query client gains wrappers for the new chain state fields; go.mod bumped to lumera v1.12.0-rc.
- Dual allowlists — routingIDs (reads) and storeIDs (writes). Replaces the single-allowlist model and is a prerequisite for the STORAGE_FULL gate.
- STORAGE_FULL gate — handles the new SUPERNODE_STATE_STORAGE_FULL (enum 6). STORAGE_FULL nodes continue to serve reads and earn payout but do not receive new STORE / BatchStore writes and are not targeted by replication.

State-class matrix:

| State | Routing (reads) | Store (writes) |
| --- | --- | --- |
| ACTIVE | yes | yes |
| POSTPONED | yes | no |
| STORAGE_FULL | yes | no |
| all other states | no | no |
How
- State SSoT — p2p/kademlia/supernode_state.go imports sntypes.SuperNodeState from lumera. No numeric literals anywhere in p2p outside this file. Helpers: isRoutingEligibleState, isStoreEligibleState.
- Allowlist construction (bootstrap.go::loadBootstrapCandidatesFromChain) — routingIDs = {ACTIVE, POSTPONED, STORAGE_FULL}, storeIDs = {ACTIVE}.
- Self-state cache — DHT.selfState (atomic). Consumed by the STORE RPC self-guard.
- STORE RPC self-guard (network.go::handleStoreData, handleBatchStoreData) — if self is not store-eligible, reject STORE/BatchStore requests that contain any genuinely new keys. Replication of already-held keys is still allowed (preserves availability during transitions).
- Eager replication prune (bootstrap.go::SyncBootstrapOnce → pruneIneligibleStorePeers) — on every bootstrap refresh, flip replication_info.Active=false for any peer no longer in the store allowlist. Closes the 10-minute+ window between a chain STORAGE_FULL transition and the next successful ping.
- Read-path defense (dht.go::BatchRetrieve, BatchRetrieveStream) — filterEligibleNodes applied to the closest-contact list as belt-and-braces.
- Disk reporting (host_reporter/service.go) — measures the filesystem backing the p2p data dir rather than /, and reports cascade_kademlia_db_bytes so the chain can drive STORAGE_FULL / POSTPONED transitions deterministically.
- Probing + verification (supernode_metrics/active_probing.go, verifier/verifier.go) — ACTIVE and STORAGE_FULL both count as operational for probe-role assignment and challenge response.

Tests — invariant-oriented, one violation test per enforcement point
- TestEligibleForRouting_PreInit_AndPopulated
- TestEligibleForStore_StrictlyContainedInRouting
- TestShouldRejectStore, TestSelfStoreEligible
- TestPruneIneligibleStorePeers_*
- TestStateClassification_Table (parametric over all 7 chain enum values)
- host_reporter/tick_behavior_test.go
- verifier/verifier_test.go
- reachability_active_probing_test.go

No numeric state literals anywhere in p2p/kademlia/ outside supernode_state.go.

Verification
- go build ./... OK
- go vet ./... OK
- go test ./... OK, incl. integration
- Known failure: the cascade e2e asserts a single coin_spent event of 10000ulume, but the v1.12.0-rc fee payout is now split (200ulume protocol cut + 9800ulume supernode share). Fix is a test-only change to sum coin_spent amounts for the spender (or match on action_registered.fee). Pushing that fix next.
- tests/systemtests/everlight_p2p_test.go — Phase D.
- replication_info.Active (which is pruned eagerly here).
- Using the sntypes.SuperNodeState* constants directly makes this a compile error on the next go-mod bump rather than silent drift.
- pruneIneligibleStorePeers scans the full replication_info. O(n) per 10-min bootstrap refresh; negligible for current network size.
Per-feature reverts are clean:
Related: