Skip to content

fix(replication): handle inbound audit challenges concurrently#158

Merged
mickvandijke merged 2 commits into
mainfrom
fix/concurrent-inbound-replication
Jun 25, 2026
Merged

fix(replication): handle inbound audit challenges concurrently#158
mickvandijke merged 2 commits into
mainfrom
fix/concurrent-inbound-replication

Conversation

@mickvandijke

@mickvandijke mickvandijke commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Rebased onto main after #128 (audit-on-gossip) landed. #128 already moved the
new storage-bound audits (SubtreeAuditChallenge, SubtreeByteChallenge) off
the serial replication receive loop via a bounded, flood-fair admission
(admit_audit_responder: a global cap and a per-peer cap, dropping on
overload). But it left the old responsible-chunk AuditChallenge — the
per-key possession-digest responder, also reused by prune-confirmation audits —
handled inline, so a single such challenge still blocks all other replication
traffic until its digests complete (head-of-line blocking).

This PR closes that one remaining serial path.

What it does

  • Route the responsible-chunk AuditChallenge responder through main's existing
    admit_audit_responder admission + a detached tokio::spawn, exactly like the
    subtree/byte handlers. Answering digests the stored bytes of every challenged
    key, so it belongs off the serial loop for the same reason.
  • On admission failure the challenge is dropped (the auditor graces a
    non-response as a timeout); the per-peer cap (MAX_AUDIT_RESPONSES_PER_PEER = 2)
    guarantees no single flooder can occupy more than its share or starve other
    peers.
  • The audit-responder pool is now a single shared ceiling across all three
    responder types (responsible-chunk + subtree + byte). Raise that global cap
    MAX_CONCURRENT_AUDIT_RESPONSES from 8 → 16 for combined headroom now that a
    third type shares it. (Per-peer cap unchanged.)

Relationship to the earlier version of this PR

The first version of this branch (pre-#128) introduced its own parallel
admission system — a JoinSet with MAX_CONCURRENT_INBOUND_AUDIT_CHALLENGES, an
overload reply (AuditResponse::Rejected) instead of a drop, and an
auditor-side OverloadClaimTracker. Once #128 merged its own
admit_audit_responder, keeping a second parallel system (two global semaphores,
two per-peer maps, two overload policies) was redundant. This PR now reuses
#128's mechanism
for the responsible-chunk path, so the codebase has one
audit-admission system — far smaller and consistent with the subtree/byte
handlers.

Trade-off adopted from #128: drop-on-overload rather than reply-with-reason. A
busy-but-honest responder whose challenge is dropped can be charged a Timeout
audit failure (after the auditor's responsibility re-confirmation). In practice
this is bounded by the per-peer cap of 2 and the 30-min gossip-audit cooldown
(honest auditors challenge one-at-a-time), and it is the same trade-off main
already accepts for the subtree audits.

Why only audit challenges (and not all inbound handlers)

Unchanged rationale: only audit-challenge handling is offloaded; every other
inbound request type stays inline
. The e2e harness runs on a single-threaded
Tokio runtime, so spawning the convergence-critical handlers (fresh-offer /
neighbor-sync / verification / fetch) delays their responses past the
request-response timeouts and regresses
test_late_joiner_replicates_responsible_chunks. Keeping them inline preserves
their original latency.

Validation

  • cargo fmt --all -- --check — clean
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo test --lib replication:: — 375 passed
  • e2e replication::test_audit_challenge_returns_correct_digest — pass (the
    responsible-chunk responder path, now spawned off the loop)
  • e2e replication::test_late_joiner_replicates_responsible_chunkspass (~71s), no convergence regression

Diff size

Two commits on top of main:

  • fix(replication): offload responsible-chunk audit challenges off the receive loop — +42 / −11, one file
  • perf(replication): raise concurrent audit-responder cap to 16 — +12 / −11, one file

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 23, 2026 11:55

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the replication subsystem to prevent head-of-line blocking by handling inbound audit challenges concurrently (bounded), while keeping all other inbound replication request types handled inline to preserve existing request/response latency characteristics.

Changes:

  • Offloads inbound AuditChallenge handling from the P2P receive loop into a bounded JoinSet (via a semaphore) to allow concurrent digest computation.
  • When at capacity, replies to audit challenges with an AuditResponse::Rejected using a specific overload reason string (no protocol change).
  • Adds an OverloadClaimTracker (LRU + per-peer consecutive budget) so the auditor only honors a limited streak of overload rejections before treating them as audit failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/replication/mod.rs Adds bounded concurrent inbound audit-challenge handling, overload rejection response, and per-peer overload-claim tracking (plus unit tests).
src/replication/audit.rs Integrates overload-claim tracking into the audit tick flow and preserves the previous public helper signature via a wrapper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mickvandijke mickvandijke force-pushed the fix/concurrent-inbound-replication branch 2 times, most recently from c1a7f9a to bf5f274 Compare June 23, 2026 15:26
Copilot AI review requested due to automatic review settings June 23, 2026 15:26
@mickvandijke

Copy link
Copy Markdown
Collaborator Author

Thanks — both points addressed in the latest push:

1. Overload responses awaited inline in the receive loop.
The AuditResponse::Rejected reply is now sent from a bounded task pool
(spawn_audit_overloaded_rejection + MAX_CONCURRENT_INBOUND_AUDIT_OVERLOAD_RESPONSES = 32)
instead of being .awaited inline. The receive loop no longer does any
per-message send work under an overload flood; if the rejection pool itself is
saturated the reply is dropped (the auditor then times out, which its
OverloadClaimTracker budget already tolerates). The pool is reaped in the
select loop and drained on shutdown alongside the handler pool.

2. No focused test for the per-peer / saturation behaviour.
The admission decision (global + per-peer permit acquisition) is now factored
out of the receive loop into try_admit_audit_challenge, returning
Admitted / GlobalFull / PeerFull, and is unit-tested directly:

  • audit_admission_enforces_per_peer_cap — a peer fills its per-peer budget,
    the next challenge is PeerFull while global capacity remains, a different
    peer is unaffected, and releasing permits frees the slots.
  • audit_admission_enforces_global_cap_across_peers — one slot per distinct
    peer until exactly the global cap is granted, then GlobalFull.

Validation after the change: cargo clippy --all-targets --all-features -- -D warnings
clean, cargo test --lib (557) green, replication::test_audit 3/3 and
test_late_joiner_replicates_responsible_chunks pass (~71s).

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

mickvandijke and others added 2 commits June 24, 2026 16:40
…receive loop

Answering a responsible-chunk AuditChallenge digests the stored bytes of
every challenged key, yet it was the one audit responder still handled
inline in the serial replication receive loop — so a single challenge
blocked all other replication traffic until its digests completed
(head-of-line blocking). The ADR-0002 subtree/byte audits (PR #128) were
already spawned off the loop under flood-fair admission.

Route the AuditChallenge responder through the same admit_audit_responder
admission (global MAX_CONCURRENT_AUDIT_RESPONSES cap + per-peer
MAX_AUDIT_RESPONSES_PER_PEER cap) and a detached task, consistent with the
subtree and byte challenge handlers. On admission failure the challenge is
dropped, which an honest auditor graces as a timeout; a flooder is bounded
to its per-peer share and cannot starve other peers. The cap is now a
single shared ceiling across all three audit-responder types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The audit-responder pool is now shared across all three responder types
(responsible-chunk, subtree round 1, byte round 2) since the responsible-
chunk AuditChallenge was moved onto the same admission. Raise the global
ceiling from 8 to 16 so the three types have more combined headroom before
challenges are dropped. The per-peer cap (MAX_AUDIT_RESPONSES_PER_PEER = 2)
is unchanged, so a single flooder still cannot occupy more than its share.

Doubles the worst-case concurrent get_raw reads / resident byte-serve
memory; still bounded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke force-pushed the fix/concurrent-inbound-replication branch from bf5f274 to 5d82f5f Compare June 24, 2026 14:51
@mickvandijke mickvandijke merged commit 0be9579 into main Jun 25, 2026
10 of 11 checks passed
@mickvandijke mickvandijke deleted the fix/concurrent-inbound-replication branch June 25, 2026 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants