Skip to content

fix(replication): handle inbound overload backpressure#156

Closed
mickvandijke wants to merge 4 commits into
WithAutonomi:mainfrom
mickvandijke:fix/replication-overload-handling
Closed

fix(replication): handle inbound overload backpressure#156
mickvandijke wants to merge 4 commits into
WithAutonomi:mainfrom
mickvandijke:fix/replication-overload-handling

Conversation

@mickvandijke

Copy link
Copy Markdown
Collaborator

Summary

  • Bound inbound replication request handling with global and per-peer semaphores, plus capped overload response tasks so the P2P event loop does not run expensive replication handlers inline.
  • Add an explicit Overloaded replication message for request types without typed rejection responses, and map overload into neutral deferrals for neighbor sync, verification, prune audit, fetch, and audit flows.
  • Track overload-deferred bootstrap syncs, priority retry those peers, and prevent bootstrap from draining before deferred sync work is resolved.
  • Add per-peer overload-claim budgets for audit/fetch so an unauthenticated overload reason cannot indefinitely avoid accountability.
  • Preserve pending verification work across inconclusive rounds with a bounded retry counter, and let bootstrapping replica hints fetch from present holders even when full quorum is not available.

Details

  • Inbound request handling now accepts work only when both global and per-peer limits have capacity. Accepted work is spawned into tracked tasks; rejected request-response messages receive typed overload responses where possible.
  • NeighborSyncAttempt distinguishes completed, failed, and overloaded sync attempts. Overloaded bootstrap syncs block drain until a priority retry completes, fails, or the peer leaves.
  • Verification handles Overloaded responses as unresolved evidence, caps consecutive inconclusive rounds, and returns stale evictions to bootstrap bookkeeping.
  • Fetch overload errors are retried against the same source within a small budget, then treated as source failures to allow alternate sources.
  • Audit overload claims are honored briefly, then counted as audit failures once the peer exceeds the consecutive-claim budget. Prune audits treat overload as neutral because it only keeps the local replica.

Validation

  • git diff --check
  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --lib --all-features replication::
  • cargo test --test e2e --all-features replication::test_late_joiner_replicates_responsible_chunks -- --exact

Note

A broader cargo test --all-targets --all-features replication:: run completed the library suite and 29/30 replication e2e tests but had one late-joiner convergence timeout after 90s. The exact failed e2e test passed on rerun in 71.36s.

Copilot AI review requested due to automatic review settings June 22, 2026 16:14

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds explicit backpressure handling for inbound replication traffic so the P2P receive loop doesn’t execute expensive replication handlers inline under load, while also making downstream replication flows treat overload as a neutral deferral (with bounded budgets where overload claims are unauthenticated).

Changes:

  • Added global + per-peer inbound request semaphores and bounded overload-response task spawning, plus a new Overloaded replication message for request types without typed rejection responses.
  • Updated neighbor sync, verification, fetch, audit, and prune-audit paths to classify overload responses as neutral deferrals, with per-peer overload-claim budgets for audit/fetch.
  • Added bounded retry tracking for inconclusive verification rounds and enhanced bootstrap drain accounting to include overload-deferred initial sync retries.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/poc_d1_bounded_queues.rs Updates test helpers to include the new inconclusive_rounds field.
src/replication/types.rs Extends replication state types: VerificationEntry.inconclusive_rounds and BootstrapState.overload_deferred_sync_peers.
src/replication/scheduling.rs Adds inconclusive tracking helpers, fetch-source introspection, and returns evicted keys from stale eviction.
src/replication/quorum.rs Treats Overloaded and unexpected messages as unresolved evidence in verification rounds.
src/replication/pruning.rs Treats prune-audit overload rejections as neutral (no failure) when the overload reason matches admission-control reasons.
src/replication/protocol.rs Introduces ReplicationMessageBody::Overloaded(OverloadedNotice) with a roundtrip test.
src/replication/neighbor_sync.rs Adds NeighborSyncAttempt (Completed/Failed/Overloaded) and prioritizes overload-deferred peers.
src/replication/mod.rs Implements inbound admission control + overload replies, overload-claim tracking, fetch deferral retrying, and bootstrap drain integration points.
src/replication/config.rs Adds MAX_INCONCLUSIVE_VERIFY_ROUNDS to bound repeated inconclusive verification retries.
src/replication/bootstrap.rs Blocks bootstrap drain on overload-deferred sync retries and adds APIs to track/clear them.
src/replication/audit.rs Adds overload-claim budgeting to the audit path via an overload tracker (engine reuses one across ticks).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/replication/mod.rs
.await;
}

let pending_keys = {
Comment on lines 350 to 352
config: &ReplicationConfig,
is_bootstrapping: bool,
) -> Option<NeighborSyncResponse> {
Copilot AI review requested due to automatic review settings June 22, 2026 20:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comment on lines +132 to +136
/// logging parity with the typed overload rejections.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OverloadedNotice {
/// Human-readable overload reason (for logging only).
pub reason: String,
Comment on lines +119 to +121
/// Sent in place of a typed response when the receiver's inbound handler
/// capacity is exhausted and a request type has no natural rejection
/// response to carry the overload reason (currently `NeighborSyncRequest`).
Comment on lines +353 to +358
match sync_with_peer_with_outcome(peer, p2p_node, storage, paid_list, config, is_bootstrapping)
.await
.map(|outcome| outcome.response)
{
NeighborSyncAttempt::Completed(outcome) => Some(outcome.response),
NeighborSyncAttempt::Failed | NeighborSyncAttempt::Overloaded => None,
}
Comment thread src/replication/mod.rs
Comment on lines +118 to +121
fn is_inbound_replication_overloaded_reason(reason: &str) -> bool {
reason == INBOUND_REPLICATION_OVERLOADED_REASON
|| reason == INBOUND_REPLICATION_PEER_OVERLOADED_REASON
}
@mickvandijke

Copy link
Copy Markdown
Collaborator Author

superseded by #158

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants