fix(replication): handle inbound audit challenges concurrently #158
Pull request overview
This PR updates the replication subsystem to prevent head-of-line blocking by handling inbound audit challenges concurrently (bounded), while keeping all other inbound replication request types handled inline to preserve existing request/response latency characteristics.
Changes:
- Offloads inbound `AuditChallenge` handling from the P2P receive loop into a bounded `JoinSet` (via a semaphore) to allow concurrent digest computation.
- When at capacity, replies to audit challenges with an `AuditResponse::Rejected` using a specific overload reason string (no protocol change).
- Adds an `OverloadClaimTracker` (LRU + per-peer consecutive budget) so the auditor only honors a limited streak of overload rejections before treating them as audit failures.
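
The per-peer consecutive budget described in the last bullet could look roughly like the sketch below. Only the name `OverloadClaimTracker` comes from the overview; the budget of 3, the `OverloadVerdict` enum, the method names, and the string peer IDs are assumptions made for illustration, and the LRU is reduced to a last-touched counter.

```rust
use std::collections::HashMap;

/// Assumed budget: how many consecutive overload rejections are honored per peer.
const MAX_CONSECUTIVE_OVERLOAD_CLAIMS: u32 = 3;

/// How the auditor treats one overload rejection (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
enum OverloadVerdict {
    /// Still within budget: skip this audit without penalty.
    Honored,
    /// Budget exhausted: count the rejection as an audit failure.
    TreatAsFailure,
}

/// Tracks consecutive overload claims per peer, bounded by LRU eviction.
/// `capacity` is assumed to be at least 1.
struct OverloadClaimTracker {
    capacity: usize,
    clock: u64,
    /// peer id -> (consecutive claims, last-touched tick)
    peers: HashMap<String, (u32, u64)>,
}

impl OverloadClaimTracker {
    fn new(capacity: usize) -> Self {
        Self { capacity, clock: 0, peers: HashMap::new() }
    }

    /// Record an overload rejection from `peer` and decide how to treat it.
    fn record_overload(&mut self, peer: &str) -> OverloadVerdict {
        self.clock += 1;
        if !self.peers.contains_key(peer) && self.peers.len() >= self.capacity {
            self.evict_least_recent();
        }
        let entry = self.peers.entry(peer.to_string()).or_insert((0, 0));
        entry.0 += 1;
        entry.1 = self.clock;
        if entry.0 <= MAX_CONSECUTIVE_OVERLOAD_CLAIMS {
            OverloadVerdict::Honored
        } else {
            OverloadVerdict::TreatAsFailure
        }
    }

    /// Any real (non-overload) response ends the peer's streak.
    fn record_real_response(&mut self, peer: &str) {
        self.peers.remove(peer);
    }

    /// Drop the entry that was touched longest ago.
    fn evict_least_recent(&mut self) {
        let oldest = self
            .peers
            .iter()
            .min_by_key(|(_, v)| v.1)
            .map(|(k, _)| k.clone());
        if let Some(k) = oldest {
            self.peers.remove(&k);
        }
    }
}
```

With this shape, a peer that keeps claiming overload is excused a fixed number of times in a row, and a single genuine response resets its streak.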
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/replication/mod.rs | Adds bounded concurrent inbound audit-challenge handling, overload rejection response, and per-peer overload-claim tracking (plus unit tests). |
| src/replication/audit.rs | Integrates overload-claim tracking into the audit tick flow and preserves the previous public helper signature via a wrapper. |
Force-pushed: c1a7f9a to bf5f274
Thanks, both points are addressed in the latest push:
1. Overload responses awaited inline in the receive loop.
2. No focused test for the per-peer / saturation behaviour.

Validation after the change:
…receive loop

Answering a responsible-chunk AuditChallenge digests the stored bytes of every challenged key, yet it was the one audit responder still handled inline in the serial replication receive loop, so a single challenge blocked all other replication traffic until its digests completed (head-of-line blocking). The ADR-0002 subtree/byte audits (PR #128) were already spawned off the loop under flood-fair admission.

Route the AuditChallenge responder through the same admit_audit_responder admission (global MAX_CONCURRENT_AUDIT_RESPONSES cap + per-peer MAX_AUDIT_RESPONSES_PER_PEER cap) and a detached task, consistent with the subtree and byte challenge handlers. On admission failure the challenge is dropped, which an honest auditor graces as a timeout; a flooder is bounded to its per-peer share and cannot starve other peers. The cap is now a single shared ceiling across all three audit-responder types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The audit-responder pool is now shared across all three responder types (responsible-chunk, subtree round 1, byte round 2) since the responsible-chunk AuditChallenge was moved onto the same admission. Raise the global ceiling from 8 to 16 so the three types have more combined headroom before challenges are dropped. The per-peer cap (MAX_AUDIT_RESPONSES_PER_PEER = 2) is unchanged, so a single flooder still cannot occupy more than its share.

Doubles the worst-case concurrent get_raw reads / resident byte-serve memory; still bounded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed: bf5f274 to 5d82f5f
Summary
Rebased onto `main` after #128 (audit-on-gossip) landed. #128 already moved the
new storage-bound audits (`SubtreeAuditChallenge`, `SubtreeByteChallenge`) off
the serial replication receive loop via a bounded, flood-fair admission
(`admit_audit_responder`: a global cap and a per-peer cap, dropping on
overload). But it left the old responsible-chunk `AuditChallenge` (the
per-key possession-digest responder, also reused by prune-confirmation audits)
handled inline, so a single such challenge still blocks all other replication
traffic until its digests complete (head-of-line blocking).
This PR closes that one remaining serial path.
What it does
- Routes the `AuditChallenge` responder through `main`'s existing
  `admit_audit_responder` admission + a detached `tokio::spawn`, exactly like the
  subtree/byte handlers. Answering digests the stored bytes of every challenged
  key, so it belongs off the serial loop for the same reason.
- On admission failure the challenge is dropped (an honest auditor graces the
  non-response as a timeout); the per-peer cap (`MAX_AUDIT_RESPONSES_PER_PEER = 2`)
  guarantees no single flooder can occupy more than its share or starve other
  peers.
- The global cap is now a single shared ceiling across all three audit
  responder types (responsible-chunk + subtree + byte). Raise that global cap
  `MAX_CONCURRENT_AUDIT_RESPONSES` from 8 → 16 for combined headroom now that a
  third type shares it. (Per-peer cap unchanged.)
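
The admission rule described above (one global cap plus a per-peer cap, dropping on overload) can be illustrated with the standard library alone. This is a minimal sketch, not the code in `main`: `AuditAdmission`, `AuditPermit`, `try_admit`, and the string peer IDs are hypothetical, and the actual signature of `admit_audit_responder` is not shown in this PR. Only the two constants and their values are taken from the description.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Values quoted in this PR.
const MAX_CONCURRENT_AUDIT_RESPONSES: usize = 16;
const MAX_AUDIT_RESPONSES_PER_PEER: usize = 2;

#[derive(Default)]
struct AdmissionState {
    in_flight: usize,
    per_peer: HashMap<String, usize>,
}

/// Shared admission gate (hypothetical type for illustration).
#[derive(Clone, Default)]
struct AuditAdmission {
    state: Arc<Mutex<AdmissionState>>,
}

/// Held for the lifetime of one responder task; frees its slot on drop.
struct AuditPermit {
    state: Arc<Mutex<AdmissionState>>,
    peer: String,
}

impl AuditAdmission {
    /// Returns a permit, or `None` when either cap is reached.
    /// On `None` the caller drops the challenge without replying.
    fn try_admit(&self, peer: &str) -> Option<AuditPermit> {
        let mut s = self.state.lock().unwrap();
        let peer_count = s.per_peer.get(peer).copied().unwrap_or(0);
        if s.in_flight >= MAX_CONCURRENT_AUDIT_RESPONSES
            || peer_count >= MAX_AUDIT_RESPONSES_PER_PEER
        {
            return None;
        }
        s.in_flight += 1;
        s.per_peer.insert(peer.to_string(), peer_count + 1);
        Some(AuditPermit {
            state: Arc::clone(&self.state),
            peer: peer.to_string(),
        })
    }
}

impl Drop for AuditPermit {
    fn drop(&mut self) {
        let mut s = self.state.lock().unwrap();
        s.in_flight -= 1;
        // Decrement the peer's count and forget the peer once it reaches zero.
        let remaining = match s.per_peer.get_mut(&self.peer) {
            Some(n) => {
                *n -= 1;
                *n
            }
            None => return,
        };
        if remaining == 0 {
            s.per_peer.remove(&self.peer);
        }
    }
}
```

In an async handler the permit would be moved into the detached task so that the slot is released when the digest work finishes, whether the task completes or panics. Tying release to `Drop` is the design choice that keeps both counters from leaking.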
Relationship to the earlier version of this PR
The first version of this branch (pre-#128) introduced its own parallel
admission system: a `JoinSet` with `MAX_CONCURRENT_INBOUND_AUDIT_CHALLENGES`, an
overload reply (`AuditResponse::Rejected`) instead of a drop, and an
auditor-side `OverloadClaimTracker`. Once #128 merged its own
`admit_audit_responder`, keeping a second parallel system (two global semaphores,
two per-peer maps, two overload policies) was redundant. This PR now reuses
#128's mechanism for the responsible-chunk path, so the codebase has one
audit-admission system, far smaller and consistent with the subtree/byte
handlers.
Trade-off adopted from #128: drop-on-overload rather than reply-with-reason. A
busy-but-honest responder whose challenge is dropped can be charged a `Timeout`
audit failure (after the auditor's responsibility re-confirmation). In practice
this is bounded by the per-peer cap of 2 and the 30-min gossip-audit cooldown
(honest auditors challenge one-at-a-time), and it is the same trade-off `main`
already accepts for the subtree audits.
Why only audit challenges (and not all inbound handlers)
Unchanged rationale: only audit-challenge handling is offloaded; every other
inbound request type stays inline. The e2e harness runs on a single-threaded
Tokio runtime, so spawning the convergence-critical handlers (fresh-offer /
neighbor-sync / verification / fetch) delays their responses past the
request-response timeouts and regresses
`test_late_joiner_replicates_responsible_chunks`. Keeping them inline preserves
their original latency.
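
The split between offloaded and inline handlers can be summarised as a routing table. The sketch below is illustrative only: the `InboundRequest` and `Dispatch` enums and the `dispatch_for` function are hypothetical, and the variant names are derived from the request types mentioned in this description rather than from the actual message enum in the codebase.

```rust
/// Inbound replication request kinds mentioned in this PR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InboundRequest {
    AuditChallenge,
    SubtreeAuditChallenge,
    SubtreeByteChallenge,
    FreshOffer,
    NeighborSync,
    Verification,
    Fetch,
}

/// Where a request is handled relative to the serial receive loop.
#[derive(Debug, PartialEq, Eq)]
enum Dispatch {
    /// Handled inside the receive loop, preserving the original latency.
    Inline,
    /// Subject to admission, then handled on a detached task.
    Offloaded,
}

/// Only the storage-bound audit responders leave the receive loop.
fn dispatch_for(req: InboundRequest) -> Dispatch {
    match req {
        InboundRequest::AuditChallenge
        | InboundRequest::SubtreeAuditChallenge
        | InboundRequest::SubtreeByteChallenge => Dispatch::Offloaded,
        InboundRequest::FreshOffer
        | InboundRequest::NeighborSync
        | InboundRequest::Verification
        | InboundRequest::Fetch => Dispatch::Inline,
    }
}
```

Writing the match exhaustively, without a wildcard arm, means a newly added request type forces an explicit decision about whether it is convergence-critical.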
Validation
- `cargo fmt --all -- --check`: clean
- `cargo clippy --all-targets --all-features -- -D warnings`: clean
- `cargo test --lib replication::`: 375 passed
- `replication::test_audit_challenge_returns_correct_digest`: pass (the
  responsible-chunk responder path, now spawned off the loop)
- `replication::test_late_joiner_replicates_responsible_chunks`: pass (~71s), no convergence regression

Diff size
Two commits on top of `main`:
- `fix(replication): offload responsible-chunk audit challenges off the receive loop`: +42 / −11, one file
- `perf(replication): raise concurrent audit-responder cap to 16`: +12 / −11, one file

🤖 Generated with Claude Code