
Update/store retrieve list seperation#272

Closed
mateeullahmalik wants to merge 4 commits into master from update/store-retrieve-list-seperation

Conversation

@mateeullahmalik
Collaborator

No description provided.

@roomote-v0

roomote-v0 Bot commented Feb 23, 2026


Reviewed the latest commit (810de25 -- fix bootstrap routing gate). The change splits the routingAllowReady and routingAllowCount checks in eligibleForRouting and filterEligibleNodes so that routing stays permissive during bootstrap (before the first non-empty chain allowlist arrives) instead of blocking all peers. The logic is sound and the atomics ordering is safe. No new issues found.
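The bootstrap-permissive gate the review describes can be sketched roughly as follows. This is an illustrative approximation, not the repository's actual `eligibleForRouting` implementation: the `routingGate` type, field names, and locking scheme are assumptions; only the ready/count split and the "permissive before the first non-empty allowlist" behavior come from the review above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// routingGate is a hypothetical stand-in for the DHT's routing allowlist state.
type routingGate struct {
	mu    sync.RWMutex
	allow map[string]struct{}
	ready atomic.Bool  // set once a non-empty chain allowlist has been applied
	count atomic.Int64 // size of the current allowlist
}

func (g *routingGate) setAllowlist(ids []string) {
	if len(ids) == 0 {
		return // empty updates never flip the gate closed
	}
	m := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		m[id] = struct{}{}
	}
	g.mu.Lock()
	g.allow = m
	g.mu.Unlock()
	g.count.Store(int64(len(m)))
	g.ready.Store(true)
}

func (g *routingGate) eligibleForRouting(id string) bool {
	if id == "" {
		return false
	}
	// Bootstrap: no non-empty allowlist yet, so stay permissive
	// instead of blocking all peers.
	if !g.ready.Load() || g.count.Load() == 0 {
		return true
	}
	g.mu.RLock()
	_, ok := g.allow[id]
	g.mu.RUnlock()
	return ok
}

func main() {
	g := &routingGate{}
	fmt.Println(g.eligibleForRouting("peerA")) // true: pre-bootstrap permissive
	g.setAllowlist([]string{"peerA"})
	fmt.Println(g.eligibleForRouting("peerB")) // false: enforced after first allowlist
}
```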


Contributor

Copilot AI left a comment


Pull request overview

This PR separates peer eligibility into two distinct allowlists—one for routing/read lookups and one for store/write replication—so postponed peers can still participate in routing while only active peers are used as write targets.

Changes:

  • Introduces a new storeAllowlist and eligibleForStore gating alongside updated routing/read eligibility semantics.
  • Updates store paths (storeToAlphaNodes, IterateBatchStore, batchStoreNetwork) to skip non-store-eligible peers and improves batch-store error reporting.
  • Adjusts bootstrap syncing to produce both routing (Active+Postponed) and store (Active-only) allowlists, and updates ping-success handling to reflect the new split.
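The store-path change in the second bullet amounts to filtering write candidates against the new store allowlist before dispatch. A minimal sketch, assuming a hypothetical helper name and plain-string peer IDs (the real code works on node structs inside storeToAlphaNodes/IterateBatchStore):

```go
package main

import "fmt"

// filterStoreCandidates keeps only peers permitted as write targets.
// Illustrative only: the repository's actual signature differs.
func filterStoreCandidates(peers []string, storeAllow map[string]struct{}) []string {
	out := make([]string, 0, len(peers))
	for _, p := range peers {
		if _, ok := storeAllow[p]; ok {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// A postponed peer stays routable but is dropped as a write target.
	allow := map[string]struct{}{"active-1": {}}
	fmt.Println(filterStoreCandidates([]string{"active-1", "postponed-1"}, allow))
}
```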

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File descriptions:

  • p2p/kademlia/dht.go - Adds store allowlist + store eligibility checks; changes routing eligibility behavior; updates batch store candidate selection; disables delete/cleanup workers.
  • p2p/kademlia/bootstrap.go - Extends chain bootstrap to compute routing vs store allowlists and wires them into the DHT.
  • p2p/kademlia/node_activity.go - Updates ping success logic to add nodes to routing while only marking “Active” when store-eligible.
  • p2p/kademlia/dht_batch_store_test.go - Updates expected error substring for the new batch-store failure message.


@mateeullahmalik mateeullahmalik marked this pull request as draft April 15, 2026 14:21
mateeullahmalik added a commit that referenced this pull request Apr 22, 2026
Extends PR #272's routing-vs-store allowlist split to cover the new
SUPERNODE_STATE_STORAGE_FULL introduced by lumera #113 (Everlight).

Policy:
- routing (reads)  = {ACTIVE, POSTPONED, STORAGE_FULL}
- store   (writes) = {ACTIVE}

STORAGE_FULL nodes continue to serve reads and earn payout, but must
not receive new STORE/BatchStore writes or be targeted by replication.
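The policy table above maps directly onto two single-source-of-truth classifiers. A sketch under stated assumptions: the local `SuperNodeState` enum and its values are illustrative stand-ins for the chain's `sntypes.SuperNodeState` (the commit explicitly avoids numeric literals, so only the state names below are taken from the message):

```go
package main

import "fmt"

// SuperNodeState stands in for the chain's sntypes.SuperNodeState enum;
// the numbering here is illustrative, not the chain's actual values.
type SuperNodeState int

const (
	StateActive SuperNodeState = iota
	StatePostponed
	StateStorageFull
	StateDisabled
)

// routing (reads) = {ACTIVE, POSTPONED, STORAGE_FULL}
func isRoutingEligibleState(s SuperNodeState) bool {
	switch s {
	case StateActive, StatePostponed, StateStorageFull:
		return true
	}
	return false
}

// store (writes) = {ACTIVE}
func isStoreEligibleState(s SuperNodeState) bool {
	return s == StateActive
}

func main() {
	// A STORAGE_FULL node keeps serving reads but takes no new writes.
	fmt.Println(isRoutingEligibleState(StateStorageFull)) // true
	fmt.Println(isStoreEligibleState(StateStorageFull))   // false
}
```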

Changes:
- p2p/kademlia/supernode_state.go: SSoT helpers using chain's
  sntypes.SuperNodeState enum (no numeric literals), selfState cache,
  pruneIneligibleStorePeers, selfStoreEligible.
- bootstrap.go: use isRoutingEligibleState / isStoreEligibleState;
  record self-state during chain sync; call pruneIneligibleStorePeers
  after setStoreAllowlist so replication_info.Active is cleared
  eagerly for ineligible peers (closes ping-cadence window).
- network.go: STORE RPC self-guard for single handleStoreData (reject
  new-key write when self not store-eligible; replication of already-
  held key still permitted); BatchStoreData self-guard rejects when
  batch contains any genuinely new keys.
- dht.go: selfState atomic fields; defensive filterEligibleNodes on
  BatchRetrieve and BatchRetrieveStream closest-contact lists.
- dht_batch_store_test.go: accept new "no eligible store peers"
  error variant alongside legacy "no candidate nodes".
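The network.go self-guard contract in the list above condenses to a small predicate. A hypothetical sketch: the real shouldRejectStore consults local storage to count which keys in the request are genuinely new, whereas here that count is passed in directly.

```go
package main

import "fmt"

// shouldRejectStore mirrors the described contract: reject only when
// this node is not store-eligible AND the request would introduce
// genuinely new keys. Replication of already-held keys (newKeys == 0)
// stays permitted so existing data keeps converging.
func shouldRejectStore(selfStoreEligible bool, newKeys int) bool {
	if selfStoreEligible {
		return false // hot path: eligible nodes accept all writes
	}
	return newKeys > 0
}

func main() {
	fmt.Println(shouldRejectStore(false, 0)) // false: pure replication allowed
	fmt.Println(shouldRejectStore(false, 2)) // true: new keys rejected
	fmt.Println(shouldRejectStore(true, 2))  // false: eligible node accepts
}
```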
@mateeullahmalik
Collaborator Author

mateeullahmalik commented Apr 22, 2026

Superseded by #284, which implements the same store/retrieve list separation but gated on the on-chain metrics_state (STORAGE_FULL / POSTPONED) introduced in lumera #113 and reported by #282. Closing in favor of the everlight-aligned approach.

mateeullahmalik added a commit that referenced this pull request Apr 24, 2026
#284)

* docs: add everlight supernode compatibility implementation plan

* everlight: implement supernode compatibility with audit epoch reports

* host_reporter: measure disk usage on p2p data dir mount

* supernode: align probing roles with ACTIVE+STORAGE_FULL semantics

* deps: bind supernode + tests/system to lumera v1.12.0-rc

Aligns both supernode go.mod and tests/system go.mod with the v1.12.0-rc
release tag (lumera commit 7ca770a / Everlight #113). Resolves the
install-lumera CI step which requires a real downloadable release asset
rather than a pseudo-version.

* DHT: Store only on ACTIVE supernodes

* fix bootstrap routing gate

* feat(p2p): Everlight STORAGE_FULL eligibility gate

Extends PR #272's routing-vs-store allowlist split to cover the new
SUPERNODE_STATE_STORAGE_FULL introduced by lumera #113 (Everlight).

Policy:
- routing (reads)  = {ACTIVE, POSTPONED, STORAGE_FULL}
- store   (writes) = {ACTIVE}

STORAGE_FULL nodes continue to serve reads and earn payout, but must
not receive new STORE/BatchStore writes or be targeted by replication.

Changes:
- p2p/kademlia/supernode_state.go: SSoT helpers using chain's
  sntypes.SuperNodeState enum (no numeric literals), selfState cache,
  pruneIneligibleStorePeers, selfStoreEligible.
- bootstrap.go: use isRoutingEligibleState / isStoreEligibleState;
  record self-state during chain sync; call pruneIneligibleStorePeers
  after setStoreAllowlist so replication_info.Active is cleared
  eagerly for ineligible peers (closes ping-cadence window).
- network.go: STORE RPC self-guard for single handleStoreData (reject
  new-key write when self not store-eligible; replication of already-
  held key still permitted); BatchStoreData self-guard rejects when
  batch contains any genuinely new keys.
- dht.go: selfState atomic fields; defensive filterEligibleNodes on
  BatchRetrieve and BatchRetrieveStream closest-contact lists.
- dht_batch_store_test.go: accept new "no eligible store peers"
  error variant alongside legacy "no candidate nodes".

* test(p2p): unit tests for Everlight STORAGE_FULL eligibility gate

Coverage matrix aligned with the invariant table in the plan doc:

- I1 routing allowlist population/pre-init: TestEligibleForRouting_PreInit_AndPopulated
- I2 store allowlist strictly ⊆ routing:   TestEligibleForStore_StrictlyContainedInRouting
- I3+I9 shouldRejectStore contract (ACTIVE, STORAGE_FULL, POSTPONED,
  DISABLED, with newKeys=0 vs >0, pre-init permissive):
  TestShouldRejectStore, TestSelfStoreEligible
- I5 eager replication-info prune on storeAllowlist update:
  TestPruneIneligibleStorePeers_ClearsNonStorePeers,
  TestPruneIneligibleStorePeers_SkipsWhenNotReady
- I6 state-classification SSoT (no drift possible):
  TestStateClassification_Table (parametric over all 7 chain enum values)

Refactors handleStoreData self-guard to use shouldRejectStore helper for
direct test coverage without a full *Network + hashtable boot.

Adds a minimal fakeStore in-package (satisfies the full Store interface
with no-op implementations) used only by prune tests.

* test(e2e): assert fee-distribution value conservation for Everlight

The cascade e2e asserted that a single coin_spent event in the
MsgFinalizeAction tx carried amount == autoPrice. That was true
pre-Everlight when the action module paid the full fee to the supernode
in one bank send.

Under Everlight Phase 1 (lumera #113) x/action/v1/keeper DistributeFees
splits the fee across multiple bank sends: the governance-tunable
RegistrationFeeShareBps is routed to the x/supernode reward pool, any
FoundationFeeShare goes to the community pool, and the remainder is paid
to the supernode(s). No single coin_spent event equals autoPrice
anymore, so the pre-Everlight assertion fails deterministically on the
v1.12.0-rc bump introduced by this PR.

Replace the single-event check with a value-conservation assertion:

- Sum all msg_index-tagged coin_spent events per spender; feeSpent is
  true iff some spender (the action module) paid out autoPrice in total.
- Sum all msg_index-tagged coin_received events across recipients;
  feeReceived is true iff the total equals autoPrice.
- msg_index discriminator excludes the tx-level gas-fee pair (emitted
  by x/auth, no msg_index) from the payout accounting.
- fromAddress / toAddress are retained for the payment-flow log;
  toAddress picks the recipient of the largest share (the supernode
  payout under the default split ratio).

This matches the real product invariant (value conservation across the
payout flow) and survives future parameter changes to
RegistrationFeeShareBps or FoundationFeeShare without further test
churn.
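The value-conservation check can be modeled as follows. This is a simplified sketch with assumed event and field names (the real e2e test reads ABCI event attributes from the tx result); only the per-spender summation, the total-received comparison, and the msg_index exclusion of the gas-fee pair come from the commit message above.

```go
package main

import "fmt"

// event is a simplified model of a coin_spent / coin_received ABCI event.
type event struct {
	kind       string // "coin_spent" or "coin_received"
	addr       string
	amount     int64
	msgIndexed bool // false for the tx-level gas-fee pair emitted by x/auth
}

// feeConserved reports whether some single spender (the action module)
// paid out exactly autoPrice in total, and whether recipients received
// autoPrice in total, counting only msg_index-tagged events.
func feeConserved(events []event, autoPrice int64) (feeSpent, feeReceived bool) {
	spentBy := map[string]int64{}
	var received int64
	for _, e := range events {
		if !e.msgIndexed {
			continue // exclude the gas-fee pair from payout accounting
		}
		switch e.kind {
		case "coin_spent":
			spentBy[e.addr] += e.amount
		case "coin_received":
			received += e.amount
		}
	}
	for _, total := range spentBy {
		if total == autoPrice {
			feeSpent = true
		}
	}
	return feeSpent, received == autoPrice
}

func main() {
	// Post-Everlight: the 100-unit fee is split across three bank sends.
	evs := []event{
		{"coin_spent", "action-module", 70, true},
		{"coin_spent", "action-module", 30, true},
		{"coin_received", "supernode", 70, true},
		{"coin_received", "reward-pool", 20, true},
		{"coin_received", "community-pool", 10, true},
		{"coin_spent", "fee-payer", 5, false}, // gas fee: no msg_index
	}
	spent, recv := feeConserved(evs, 100)
	fmt.Println(spent, recv) // true true
}
```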

* fix(p2p): close STORAGE_FULL transition race on cascade upload write path

Production-gate devnet run (lumera@master + this branch) found that cascade
uploads fail deterministically whenever an on-chain supernode state
transition to STORAGE_FULL has just happened and the client's local
storeAllowlist has not yet refreshed.

Observed failure:

  ERROR  Batch store to node failed  err="batch store rejected: self
         not store-eligible"  node=<STORAGE_FULL peer>
  → failed to achieve desired success rate, only: 71.43% successful
  → registration task failed → cascade upload FAILS

Root cause (two compounding issues):

1. setStoreAllowlist is only called from SyncBootstrapOnce, which ticks
   every bootstrapRefreshInterval (10 minutes). Between a chain STORAGE_FULL
   transition and the next tick, the caller's in-memory storeAllow still
   contains the transitioned peer. IterateBatchStore's eligibleForStore(n)
   gate therefore passes, and the client dispatches a STORE RPC that the
   peer correctly rejects via its own self-guard.

2. The IterateBatchStore response handler counted the authoritative
   "self not store-eligible" rejection as a store failure. A single
   transitioned peer in the closest-contact list therefore dragged the
   measured success rate below the 75% minimumDataStoreSuccessRate
   threshold and failed the entire cascade upload.

Fix (surgical, two coordinated parts, both required):

1. Fast allowlist-only refresher — new StartAllowlistRefresher /
   RefreshAllowlistsFromChain in p2p/kademlia/bootstrap.go, ticking every
   30s (allowlistRefreshInterval), plus an opportunistic best-effort
   refresh at the top of IterateBatchStore. Reuses loadBootstrapCandidates-
   FromChain's existing gRPC ListSuperNodes query; does NOT touch
   replication_info upserts, pings, routing-table seeding, or any other
   heavyweight bootstrap behavior. Converges the write-gate within
   O(seconds) of a chain transition.

2. Tolerate authoritative peer rejections — new isPeerStoreIneligibleError
   helper + new pruneStoreAllowEntry method in p2p/kademlia/supernode_state.go.
   When the peer responds with the canonical "self not store-eligible"
   marker (emitted by both handleStoreData and handleBatchStoreData), the
   client now:
     (a) decrements its request counter so the rejection is NOT in the
         success-rate denominator,
     (b) removes the peer from the in-memory storeAllow so subsequent
         hot-path lookups in the same run do not re-pick it.
   The server-side emitter and the client-side recognizer share the
   peerStoreIneligibleMarker constant to prevent silent drift.

Invariants locked in by new unit tests in peer_store_rejection_test.go:

  - isPeerStoreIneligibleError recognizes both STORE and BatchStoreData
    server emitter phrasings (and wrapped variants).
  - isPeerStoreIneligibleError does NOT match unrelated store errors
    (timeout, rpc internal, near-miss typos, case-mismatch).
  - pruneStoreAllowEntry removes the peer and decrements
    storeAllowCount atomically; is idempotent.
  - pruneStoreAllowEntry is a no-op for nil receiver, nil node, empty
    ID, or unset allowlist.
  - TestPeerStoreIneligibleMarker_PresentInServerEmitters cross-checks
    that the client-side recognizer matches the exact strings emitted
    on the server side.

Blast radius:

  - Write-path only; read-path (BatchRetrieve) is unaffected.
  - No behavior change in the all-ACTIVE (unmixed) cluster case.
  - Pre-existing tests on #284 remain green
    (TestEligibleForRouting_*, TestEligibleForStore_*,
    TestShouldRejectStore, TestSelfStoreEligible,
    TestPruneIneligibleStorePeers_*, TestStateClassification_Table).

Rollback: single revert of this commit restores previous behavior; the
pre-existing STORAGE_FULL gate logic continues to function (it was
correct on the server side; the fix only strengthens client-side
convergence).

* fix(p2p): address Roo review findings on STORAGE_FULL gate

Four issues raised on PR #284 by the Roo review bot, triaged and fixed
here. Honest summary of severity (from PR review-of-review):

1. REAL BUG: setStoreAllowlist accepted an empty allowlist post-ready,
   clamping storeAllowCount to 0 and causing eligibleForStore to return
   false for every peer. A transient chain RPC failure (ListSuperNodes
   returning zero results) would therefore block ALL writes
   network-wide until the next bootstrap refresh. setRoutingAllowlist
   already guarded against this; setStoreAllowlist did not. Asymmetry.

2. MINOR PERF: RefreshAllowlistsFromChain was called unconditionally at
   the top of IterateBatchStore. A burst of concurrent uploads would
   fan out to one chain ListSuperNodes query per batch. The 30-second
   background StartAllowlistRefresher already handles steady-state
   convergence; the opportunistic call only matters for the ~30s
   window immediately after a chain state transition.

3. MAINTAINABILITY: handleBatchStoreData reimplemented the
   shouldRejectStore contract inline (selfStoreEligible() + manual
   newKeys count) while handleStoreData used shouldRejectStore(1).
   Future refinements would risk silent divergence.

4. COSMETIC: eligibleForStore checked n == nil AFTER consulting
   storeAllowReady / storeAllowCount. eligibleForRouting checked it
   first. No observable bug in current call sites (all iterate over
   slices built from non-nil sources), but defense-in-depth alignment
   is free.

Changes:

* p2p/kademlia/dht.go
  - setStoreAllowlist: reject empty-map updates post-ready, retaining
    the previous allowlist. WARN-level log on a transient outage,
    DEBUG-level log during bootstrap. Symmetric with setRoutingAllowlist.
  - eligibleForStore: move `n == nil || len(n.ID) == 0` to the top,
    matching eligibleForRouting's ordering. Defense-in-depth.
  - DHT struct: new lastAllowlistRefreshUnixNano atomic.Int64 to
    support the debounced refresh path.
  - IterateBatchStore: swap RefreshAllowlistsFromChain for the new
    MaybeRefreshAllowlists helper.

* p2p/kademlia/bootstrap.go
  - new const allowlistOpportunisticMinInterval = 10s
  - RefreshAllowlistsFromChain: stamps lastAllowlistRefreshUnixNano
    on successful completion (so the 30s background refresher also
    feeds the debounce window).
  - new MaybeRefreshAllowlists: debounced hot-path variant. Stamps
    on ATTEMPT (not success) so a persistently-failing chain cannot
    fan out to one RPC per caller; the background ticker retries on
    its own cadence.

* p2p/kademlia/supernode_state.go
  - new shouldRejectBatchStore(ctx, payloads) ([]bool, int): the
    batch-sized companion to shouldRejectStore. Preserves the hot-path
    short-circuit (selfStoreEligible() check bypasses the per-payload
    storage lookup on the 99% case) while routing the final decision
    through shouldRejectStore for single-source-of-truth semantics.

* p2p/kademlia/network.go
  - handleBatchStoreData: replace the inline selfStoreEligible +
    manual newKeys counting with a shouldRejectBatchStore call. The
    RPC response phrasing ("batch store rejected: self not
    store-eligible") is unchanged so the client-side isPeerStoreIneligibleError
    recognizer still matches.

* p2p/kademlia/review_fixes_test.go (new)
  - TestSetStoreAllowlist_EmptyMapPostReady_RetainsPrevious
    TestSetStoreAllowlist_EmptyMapPreReady_NoOp
      Invariant test locking in issue (1). Prior allowlist must be
      retained when the chain returns empty; pre-ready must not
      falsely flip ready=true with count=0.
  - TestEligibleForStore_NilNode_ReturnsFalseEvenPreReady
      Locks in issue (4) nil-ordering. Pre-ready valid node still
      permissive (bootstrap not broken).
  - TestMaybeRefreshAllowlists_Debounces
    TestMaybeRefreshAllowlists_FiresWhenStale
      Lock in issue (2) debounce semantics: within the 10s window
      opportunistic callers skip; outside it they fire.
  - TestShouldRejectBatchStore_ShortCircuitsOnSelfEligible
    TestShouldRejectBatchStore_NilReceiver
      Verify the batch-helper's performance-critical short-circuit
      (no storage lookups when self is store-eligible) and defensive
      nil-receiver handling.

All existing p2p tests remain green. go vet clean.

Why the author missed these during the original #284 authorship:

- Confirmation bias: focus on "does Phase 4 pass" after the first fix
  commit blinded me to asymmetric hardening between sibling functions.
- Skipped my own invariant-first-coding skill: per-path violation
  tests for write-path entry points were not authored up front. A
  side-by-side of setRoutingAllowlist vs setStoreAllowlist would have
  surfaced (1) immediately.
- Positive-path-only testing: peer_store_rejection_test.go covers the
  new tolerance path but not input edge cases to setStoreAllowlist
  (empty map, nil map, pre- vs post-ready semantics).

These fixes tighten the invariants the first commit put in place and
add the missing violation tests that would have surfaced them.

---------

Co-authored-by: NightCrawler <nightcrawler@lumera.ai>
