fix(replication): handle inbound audit challenges concurrently by mickvandijke · Pull Request #158 · WithAutonomi/ant-node

mickvandijke · 2026-06-23T11:55:23Z

Summary

Rebased onto main after #128 (audit-on-gossip) landed. #128 already moved the
new storage-bound audits (SubtreeAuditChallenge, SubtreeByteChallenge) off
the serial replication receive loop via a bounded, flood-fair admission
(admit_audit_responder: a global cap and a per-peer cap, dropping on
overload). But it left the old responsible-chunk AuditChallenge — the
per-key possession-digest responder, also reused by prune-confirmation audits —
handled inline, so a single such challenge still blocks all other replication
traffic until its digests complete (head-of-line blocking).

This PR closes that one remaining serial path.

What it does

Route the responsible-chunk AuditChallenge responder through main's existing
admit_audit_responder admission + a detached tokio::spawn, exactly like the
subtree/byte handlers. Answering digests the stored bytes of every challenged
key, so it belongs off the serial loop for the same reason.
On admission failure the challenge is dropped (the auditor graces a
non-response as a timeout); the per-peer cap (MAX_AUDIT_RESPONSES_PER_PEER = 2)
guarantees no single flooder can occupy more than its share or starve other
peers.
The audit-responder pool is now a single shared ceiling across all three
responder types (responsible-chunk + subtree + byte). Raise that global cap
MAX_CONCURRENT_AUDIT_RESPONSES from 8 → 16 for combined headroom now that a
third type shares it. (Per-peer cap unchanged.)

Relationship to the earlier version of this PR

The first version of this branch (pre-#128) introduced its own parallel
admission system — a JoinSet with MAX_CONCURRENT_INBOUND_AUDIT_CHALLENGES, an
overload reply (AuditResponse::Rejected) instead of a drop, and an
auditor-side OverloadClaimTracker. Once #128 merged its own
admit_audit_responder, keeping a second parallel system (two global semaphores,
two per-peer maps, two overload policies) was redundant. This PR now reuses
#128's mechanism for the responsible-chunk path, so the codebase has one
audit-admission system — far smaller and consistent with the subtree/byte
handlers.

Trade-off adopted from #128: drop-on-overload rather than reply-with-reason. A
busy-but-honest responder whose challenge is dropped can be charged a Timeout
audit failure (after the auditor's responsibility re-confirmation). In practice
this is bounded by the per-peer cap of 2 and the 30-min gossip-audit cooldown
(honest auditors challenge one-at-a-time), and it is the same trade-off main
already accepts for the subtree audits.

Why only audit challenges (and not all inbound handlers)

Unchanged rationale: only audit-challenge handling is offloaded; every other
inbound request type stays inline. The e2e harness runs on a single-threaded
Tokio runtime, so spawning the convergence-critical handlers (fresh-offer /
neighbor-sync / verification / fetch) delays their responses past the
request-response timeouts and regresses
test_late_joiner_replicates_responsible_chunks. Keeping them inline preserves
their original latency.

Validation

cargo fmt --all -- --check — clean
cargo clippy --all-targets --all-features -- -D warnings — clean
cargo test --lib replication:: — 375 passed
e2e replication::test_audit_challenge_returns_correct_digest — pass (the
responsible-chunk responder path, now spawned off the loop)
e2e replication::test_late_joiner_replicates_responsible_chunks — pass (~71s), no convergence regression

Diff size

Two commits on top of main:

fix(replication): offload responsible-chunk audit challenges off the receive loop — +42 / −11, one file
perf(replication): raise concurrent audit-responder cap to 16 — +12 / −11, one file

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR updates the replication subsystem to prevent head-of-line blocking by handling inbound audit challenges concurrently (bounded), while keeping all other inbound replication request types handled inline to preserve existing request/response latency characteristics.

Changes:

Offloads inbound AuditChallenge handling from the P2P receive loop into a bounded JoinSet (via a semaphore) to allow concurrent digest computation.
When at capacity, replies to audit challenges with an AuditResponse::Rejected using a specific overload reason string (no protocol change).
Adds an OverloadClaimTracker (LRU + per-peer consecutive budget) so the auditor only honors a limited streak of overload rejections before treating them as audit failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`src/replication/mod.rs`	Adds bounded concurrent inbound audit-challenge handling, overload rejection response, and per-peer overload-claim tracking (plus unit tests).
`src/replication/audit.rs`	Integrates overload-claim tracking into the audit tick flow and preserves the previous public helper signature via a wrapper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mickvandijke · 2026-06-23T15:27:02Z

Thanks — both points addressed in the latest push:

1. Overload responses awaited inline in the receive loop.
The AuditResponse::Rejected reply is now sent from a bounded task pool
(spawn_audit_overloaded_rejection + MAX_CONCURRENT_INBOUND_AUDIT_OVERLOAD_RESPONSES = 32)
instead of being .awaited inline. The receive loop no longer does any
per-message send work under an overload flood; if the rejection pool itself is
saturated the reply is dropped (the auditor then times out, which its
OverloadClaimTracker budget already tolerates). The pool is reaped in the
select loop and drained on shutdown alongside the handler pool.

2. No focused test for the per-peer / saturation behaviour.
The admission decision (global + per-peer permit acquisition) is now factored
out of the receive loop into try_admit_audit_challenge, returning
Admitted / GlobalFull / PeerFull, and is unit-tested directly:

audit_admission_enforces_per_peer_cap — a peer fills its per-peer budget,
the next challenge is PeerFull while global capacity remains, a different
peer is unaffected, and releasing permits frees the slots.
audit_admission_enforces_global_cap_across_peers — one slot per distinct
peer until exactly the global cap is granted, then GlobalFull.

Validation after the change: cargo clippy --all-targets --all-features -- -D warnings
clean, cargo test --lib (557) green, replication::test_audit 3/3 and
test_late_joiner_replicates_responsible_chunks pass (~71s).

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

…receive loop Answering a responsible-chunk AuditChallenge digests the stored bytes of every challenged key, yet it was the one audit responder still handled inline in the serial replication receive loop — so a single challenge blocked all other replication traffic until its digests completed (head-of-line blocking). The ADR-0002 subtree/byte audits (PR #128) were already spawned off the loop under flood-fair admission. Route the AuditChallenge responder through the same admit_audit_responder admission (global MAX_CONCURRENT_AUDIT_RESPONSES cap + per-peer MAX_AUDIT_RESPONSES_PER_PEER cap) and a detached task, consistent with the subtree and byte challenge handlers. On admission failure the challenge is dropped, which an honest auditor graces as a timeout; a flooder is bounded to its per-peer share and cannot starve other peers. The cap is now a single shared ceiling across all three audit-responder types. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The audit-responder pool is now shared across all three responder types (responsible-chunk, subtree round 1, byte round 2) since the responsible- chunk AuditChallenge was moved onto the same admission. Raise the global ceiling from 8 to 16 so the three types have more combined headroom before challenges are dropped. The per-peer cap (MAX_AUDIT_RESPONSES_PER_PEER = 2) is unchanged, so a single flooder still cannot occupy more than its share. Doubles the worst-case concurrent get_raw reads / resident byte-serve memory; still bounded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 23, 2026 11:55

Copilot started reviewing on behalf of mickvandijke June 23, 2026 11:55 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

mickvandijke force-pushed the fix/concurrent-inbound-replication branch 2 times, most recently from c1a7f9a to bf5f274 Compare June 23, 2026 15:26

Copilot AI review requested due to automatic review settings June 23, 2026 15:26

Copilot started reviewing on behalf of mickvandijke June 23, 2026 15:27 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

mickvandijke mentioned this pull request Jun 24, 2026

fix(replication): handle inbound overload backpressure #156

Closed

mickvandijke and others added 2 commits June 24, 2026 16:40

mickvandijke force-pushed the fix/concurrent-inbound-replication branch from bf5f274 to 5d82f5f Compare June 24, 2026 14:51

mickvandijke merged commit 0be9579 into main Jun 25, 2026
10 of 11 checks passed

mickvandijke deleted the fix/concurrent-inbound-replication branch June 25, 2026 13:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(replication): handle inbound audit challenges concurrently#158

fix(replication): handle inbound audit challenges concurrently#158
mickvandijke merged 2 commits into
mainfrom
fix/concurrent-inbound-replication

mickvandijke commented Jun 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

mickvandijke commented Jun 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mickvandijke commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What it does

Relationship to the earlier version of this PR

Why only audit challenges (and not all inbound handlers)

Validation

Diff size

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

mickvandijke commented Jun 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mickvandijke commented Jun 23, 2026 •

edited

Loading