
fix(kv): bound verifyLeaderEngine ReadIndex with 5s deadline #745

Merged

bootjp merged 3 commits into main from fix/verify-leader-deadline on May 8, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented May 8, 2026

Summary

verifyLeaderEngine() called engine.VerifyLeader with context.Background(), so callers without an upstream context blocked indefinitely on a ReadIndex round-trip. A single transient stall accumulated callers permanently. This caps the no-context path at 5s.
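
The shape of the change, as a minimal sketch (the LeaderView here is narrowed to the one method the sketch needs; the real call sits behind the full raftengine.LeaderView and carries a longer rationale comment):

```go
package kv

import (
	"context"
	"time"
)

// LeaderView is narrowed to the one method this sketch needs; the real
// raftengine.LeaderView interface is wider.
type LeaderView interface {
	VerifyLeader(ctx context.Context) error
}

// verifyLeaderTimeout matches leaderForwardTimeout: a verify slower than a
// single forward RPC is useless as a freshness check.
const verifyLeaderTimeout = 5 * time.Second

func verifyLeaderEngine(engine LeaderView) error {
	// Previously engine.VerifyLeader(context.Background()): an unbounded wait.
	ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
	defer cancel()
	return engine.VerifyLeader(ctx)
}
```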

Production incident — 2026-05-08

Follower 192.168.0.214 lost its network route (no route to host, ARP INCOMPLETE). The leader's ReadIndex completion stalled intermittently and verify-callers piled up at ~9/sec without bound.

After ~37 minutes the leader (192.168.0.212) showed:

  • 20,560 goroutines, 20,478 of them in etcd.(*Engine).submitRead [select, 35-39 minutes]
  • CPU 1870% (Engine.run Ready loop walks pendingReads O(N) per tick → queue feeds back on itself)
  • Host MemAvailable trending toward 0 → OOM
  • Each new leader after failover re-entered the same death spiral

Mitigation: docker restart elastickv on 212 dropped it to 74% CPU / 163 MiB. 214 was hardware-rebooted and is REACHABLE again. This PR prevents the next leader from re-entering the spiral.

Affected callers

All use the no-context verifyLeaderEngine variant:

  • kv/leader_proxy.go — LeaderProxy.Commit / .Abort (every Redis write)
  • kv/coordinator.go — Coordinate.VerifyLeader
  • kv/sharded_coordinator.go — ShardedCoordinator.VerifyLeader / VerifyLeaderForKey
  • adapter/s3.go — isVerifiedS3Leader / inline VerifyLeader at line 2291 (healthz)
  • adapter/sqs.go — isVerifiedSQSLeader (healthz)
  • main_admin.go — LeaderProbe callback for /admin/healthz/leader

Failure mode on timeout

context.DeadlineExceeded surfaces to the caller. LeaderProxy falls back to forwardWithRetry (the existing path for any verify failure). Healthz handlers report 503 not-leader. Background loops (lock resolver, HLC lease) skip this tick.

No new infinite loop: even when this node is the leader, a verify-failure → forward path already exists in LeaderProxy.Commit; that path is bounded by leaderProxyRetryBudget = 5s and maxForwardRetries = 3.
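
Continuing the sketch above, the caller shape on timeout (hedged: the func fields are stand-ins for the real kv/leader_proxy.go internals, which this sketch does not reproduce):

```go
// Stand-in for kv/leader_proxy.go; forward and apply model forwardWithRetry
// and the local commit path.
type LeaderProxy struct {
	engine  LeaderView
	forward func() error
	apply   func() error
}

func (p *LeaderProxy) Commit() error {
	if err := verifyLeaderEngine(p.engine); err != nil {
		// context.DeadlineExceeded from the new bound lands here and takes
		// the same soft-failure route as any other verify failure.
		return p.forward()
	}
	return p.apply()
}
```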

Self-review (5 lenses)

  1. Data loss — none. The fix only shortens a never-returning wait. verifyLeaderEngine is a freshness check, not a write path. Already-committed proposals are unaffected.
  2. Concurrency — the new ctx is local to each call (defer cancel), no shared state, no lock changes. Engine-side blocking semantics unchanged; we just stop waiting forever.
  3. Performance — net positive. Removes the unbounded goroutine pile-up and the O(N) pendingReads walk it caused. No new allocations on the success path beyond the WithTimeout context.
  4. Data consistency — ReadIndex still completes when quorum heartbeats land within 5s. A timeout means the caller could not confirm leadership freshness, which the existing "fall through to forward" path already treats as a soft failure.
  5. Test coverage — kv/raft_engine_test.go::TestVerifyLeaderEngine_BoundsBlockingReadIndex pins the regression: a blockingLeaderView that holds VerifyLeader on its ctx must surface DeadlineExceeded within 2 * verifyLeaderTimeout (sketched below).
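
A sketch of that regression test, continuing the types above (imports: context, errors, testing, time; stdlib errors here, which a later review round swaps for cockroachdb/errors and augments with a testing.Short() guard and a lower-bound assertion):

```go
// blockingLeaderView simulates a ReadIndex that never completes: it holds
// VerifyLeader until the caller's context expires.
type blockingLeaderView struct{}

func (blockingLeaderView) VerifyLeader(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
	t.Parallel()
	start := time.Now()
	err := verifyLeaderEngine(blockingLeaderView{})
	elapsed := time.Since(start)

	if !errors.Is(err, context.DeadlineExceeded) {
		t.Fatalf("want context.DeadlineExceeded, got %v", err)
	}
	if elapsed > 2*verifyLeaderTimeout {
		t.Fatalf("returned after %s; want <= 2x verifyLeaderTimeout (%s)",
			elapsed, verifyLeaderTimeout)
	}
}
```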

Test plan

  • go test -race -count=1 ./kv — 9.3s, all green
  • New regression test TestVerifyLeaderEngine_BoundsBlockingReadIndex covers the blocking case
  • Roll out to 192.168.0.x cluster after merge, watch CPU/Mem panel for the next 4-6h to confirm no more OOM cascade

Future work (separate PRs)

Plumb real request contexts through LeaderProxy.Commit/Abort and the healthz handlers so client-side deadlines cascade naturally instead of relying on this fixed bound. Today the Redis adapter's per-command deadline doesn't reach LeaderProxy; the proxy interface takes []*pb.Request only.
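
The intended split, sketched under the assumption that the ctx-taking variant simply passes the caller's context through (the follow-up later in this thread confirms the verifyLeaderEngineCtx name; this exact body is an assumption):

```go
// verifyLeaderEngineCtx honours the caller's own deadline.
func verifyLeaderEngineCtx(ctx context.Context, engine LeaderView) error {
	return engine.VerifyLeader(ctx)
}

// The no-arg verifyLeaderEngine then reduces to a wrapper that keeps the
// 5s bound as defense-in-depth for callers with no upstream context:
//
//	ctx, cancel := context.WithTimeout(context.Background(), verifyLeaderTimeout)
//	defer cancel()
//	return verifyLeaderEngineCtx(ctx, engine)
```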

Summary by CodeRabbit

  • Bug Fixes

    • Implemented timeout bounds for leader verification operations with a 5-second limit to prevent indefinite blocking.
  • Tests

    • Added test to verify leader verification properly handles timeout scenarios and completes within the expected timeframe under stalled conditions.

verifyLeaderEngine() called engine.VerifyLeader with context.Background(),
so any caller without an upstream context blocked indefinitely on a
ReadIndex round-trip. A single transient stall accumulated callers
permanently because they never timed out and never returned.

Production hit this on 2026-05-08: follower 192.168.0.214 lost its
network route (no route to host, ARP INCOMPLETE), the leader's
ReadIndex completion stalled intermittently, and verify-callers piled
up at roughly 9/sec without bound. After ~37 minutes the leader
(192.168.0.212) held 20,560 goroutines (20,478 in submitRead select,
oldest 39 minutes), CPU pinned at 1870% (Engine.run Ready loop walks
pendingReads O(N) per tick, so the queue feeds back on itself), and
host MemAvailable trended toward 0 until OOM. Each new leader after
failover re-entered the same death spiral.

Affected callers (all use the no-context variant):
- LeaderProxy.Commit / .Abort -- every Redis write hits this
- Coordinate.VerifyLeader / ShardedCoordinator.VerifyLeader[ForKey]
- adapter S3/SQS /healthz/leader handlers (Caddy probes)
- main_admin.go LeaderProbe (admin dashboard /admin/healthz/leader)
- adapter/sqs.go isVerifiedSQSLeader, adapter/s3.go isVerifiedS3Leader

Fix: cap the no-context path at 5s (matching leaderForwardTimeout). On
timeout, callers see context.DeadlineExceeded -- LeaderProxy falls back
to forwardWithRetry as it already does for any verify failure, healthz
handlers report not-leader, and the lock resolver skips this tick.

Self-review (5 lenses):
1. Data loss -- none. The fix only shortens a never-returning wait.
   verifyLeaderEngine is a freshness check, not a write path.
2. Concurrency -- the new ctx is local to each call (defer cancel),
   no shared state, no lock changes. Engine-side blocking semantics
   unchanged; we just stop waiting forever.
3. Performance -- positive. Removes the unbounded goroutine pile-up
   and the O(N) pendingReads walk it caused. No new allocations on
   the success path beyond the WithTimeout context.
4. Data consistency -- ReadIndex still completes when quorum heartbeats
   land within 5s. A timeout means the caller could not confirm
   leadership freshness, which the existing "fall through to forward"
   path already treats as a soft failure.
5. Test coverage -- kv/raft_engine_test.go pins the regression: a
   blockingLeaderView that holds VerifyLeader on its ctx must surface
   DeadlineExceeded within 2x verifyLeaderTimeout.

Test: go test -race -count=1 ./kv -- 9.3s, all green.

Future work (separate PRs): plumb real request contexts through
LeaderProxy.Commit/Abort and the healthz handlers so a client-side
deadline cascades naturally instead of relying on this fixed bound.
@coderabbitai

coderabbitai Bot commented May 8, 2026


Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 55 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce81b9c5-3e55-4c87-9298-a4efb6e044c9

📥 Commits

Reviewing files that changed from the base of the PR and between ad924ad and a2ab245.

📒 Files selected for processing (2)
  • kv/raft_engine.go
  • kv/raft_engine_test.go
📝 Walkthrough

This PR adds a 5-second timeout guard to the verifyLeaderEngine function to prevent unbounded blocking on stalled leader verification operations. The implementation wraps the underlying call with a timeout context and includes a test that verifies the deadline is enforced even when the leader read operation blocks indefinitely.

Changes

Leader Verification Timeout

| Layer / File(s) | Summary |
|---|---|
| Timeout Constant & Rationale (kv/raft_engine.go) | Introduces the verifyLeaderTimeout constant (5 seconds) with extensive comments documenting the prior unbounded behavior and operational failure mode. |
| Implementation Change (kv/raft_engine.go) | verifyLeaderEngine now wraps the leader verification call in context.WithTimeout(context.Background(), verifyLeaderTimeout) and defers cancellation, replacing the prior unbounded call. |
| Test Support & Deadline Verification (kv/raft_engine_test.go) | Adds a blockingLeaderView mock and a TestVerifyLeaderEngine_BoundsBlockingReadIndex test verifying that the timeout is enforced and the function returns context.DeadlineExceeded within the deadline window. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A timeout, so wise and so keen,
Five seconds to bound the in-between,
No more shall the leader read stall,
The deadline will answer the call,
A test hops along to ensure all is right! 🕒

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: bounding the verifyLeaderEngine function's ReadIndex operation with a 5-second deadline. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@bootjp
Owner Author

bootjp commented May 8, 2026

@claude review

@claude

claude Bot commented May 8, 2026

Claude finished @bootjp's task in 2m 44s


Code Review

  • Gather context and read changed files
  • Review kv/raft_engine.go changes
  • Review kv/raft_engine_test.go changes
  • Cross-check callers and failure modes
  • Post final review

The fix is correct and minimal. The root cause analysis is solid — verifyLeaderEngine calling engine.VerifyLeader with context.Background() is a classic goroutine leak pattern, and the O(N) pendingReads walk compounding the pile-up explains the CPU death spiral. The 5-second cap is the right level of fix here.

Three observations, ordered by severity:


1. Incident narrative in source code (style)

kv/raft_engine.go:12–35 — the comment on verifyLeaderTimeout carries IP addresses, goroutine counts, CPU figures, and timestamps. This is post-mortem/commit-message material, not source code. The specific details will become stale (or misleading if those nodes change roles or leave the cluster), and future readers who encounter 192.168.0.214 will have no way to verify the claim.

The rationale is exactly right and should stay — "why 5s," "why this constant exists," the O(N) pendingReads feedback loop. Everything from "Production hit this on 2026-05-08..." down to "...until OOM" can be collapsed to a one-liner:

// 5s matches leaderForwardTimeout: a verify that takes longer than a
// single forward RPC is useless as a freshness check, and the proxy's
// verify-then-forward path stays within its 5s retry budget.
//
// See incident 2026-05-08 (PR #745) for the goroutine-pile production failure
// this prevents.
const verifyLeaderTimeout = 5 * time.Second



2. Test uses stdlib errors.Is — inconsistent with project conventions

kv/raft_engine_test.go:45:

if !stderrors.Is(err, context.DeadlineExceeded) {

This works correctly (cockroachdb/errors.WithStack implements Unwrap() so stdlib errors.Is traverses the chain), but the project convention from CLAUDE.md is to use github.com/cockroachdb/errors at all boundaries. The errors.Is check here should use the cockroachdb import for consistency, and avoids any future footgun if the wrapping ever uses a mark-based equivalence instead of a direct chain:

import "github.com/cockroachdb/errors"
...
if !errors.Is(err, context.DeadlineExceeded) {



3. Slow test — add testing.Short() guard

kv/raft_engine_test.go:35 — TestVerifyLeaderEngine_BoundsBlockingReadIndex blocks for verifyLeaderTimeout = 5 seconds on every run, including make test. Go convention for slow tests is to gate on testing.Short():

func TestVerifyLeaderEngine_BoundsBlockingReadIndex(t *testing.T) {
    t.Parallel()
    if testing.Short() {
        t.Skip("skipping: blocks for verifyLeaderTimeout (5s)")
    }
    ...

This keeps go test -short ./kv/... fast without removing regression coverage from the default make test run.



No issues found

  • defer cancel() placement is correct — no leak on either success or timeout path.
  • context.DeadlineExceeded is not classified as isTransientLeaderError (it's not ErrNotLeader, ErrLeadershipLost, or ErrLeadershipTransferInProgress), so the LeaderProxy forward path behaves correctly.
  • verifyLeaderTimeout = 5s aligning with leaderForwardTimeout is well-reasoned and clearly documented.
  • blockingLeaderView satisfies the full raftengine.LeaderView interface (State, Leader, VerifyLeader, LinearizableRead). ✓
  • The forwardWithRetry fallback on timeout is bounded by leaderProxyRetryBudget = 5s, so worst-case the partition path is still bounded (verify 5s + forward budget 5s = 10s max). ✓

The first observation is worth a quick fixup commit; the second and third are minor. The core fix is sound.

Contributor

@gemini-code-assist Bot left a comment

Code Review

This pull request introduces a 5-second timeout to the verifyLeaderEngine function to prevent indefinite blocking and potential resource exhaustion during ReadIndex stalls, addressing a production issue that previously led to goroutine leaks and OOM errors. A regression test was added to verify that the function now returns a deadline exceeded error within a reasonable timeframe. The review feedback suggests enhancing the test by asserting a minimum elapsed time to ensure the timeout logic is correctly exercised and not returning early for other reasons.

Comment thread: kv/raft_engine_test.go, lines +50 to +52

if elapsed > 2*verifyLeaderTimeout {
	t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
}
Severity: medium

To ensure that the test is actually exercising the timeout logic and not returning early for some other reason, consider asserting a minimum elapsed time. This confirms that the blockingLeaderView mock correctly held the call until the context deadline was reached.

	if elapsed < verifyLeaderTimeout {
		t.Fatalf("verifyLeaderEngine(blocking) returned too early after %s; want >= %s", elapsed, verifyLeaderTimeout)
	}
	if elapsed > 2*verifyLeaderTimeout {
		t.Fatalf("verifyLeaderEngine(blocking) returned after %s; want <= 2x verifyLeaderTimeout (%s)", elapsed, verifyLeaderTimeout)
	}

Round-1 review on commit ad924ad:

- gemini medium: assert minimum elapsed time, not just upper bound.
  Without a lower bound a future regression returning DeadlineExceeded
  before actually waiting would silently pass. Added a lower-bound
  check with 200ms slack so a slow CI scheduler that trips ctx.Done()
  a hair before the wall clock catches up does not flake.

- claude bot (style): trim incident narrative from the
  verifyLeaderTimeout doc comment. Specific IPs, goroutine counts,
  and timestamps belong in the commit message and PR body, not the
  source — they go stale and add no design value at the call site.
  Kept the rationale (why 5s, the O(N) pendingReads feedback loop,
  match leaderForwardTimeout) and added a "see PR #745" reference
  for readers who want the full incident detail.

- claude bot: switch the regression test from stdlib
  `stderrors "errors"` to `github.com/cockroachdb/errors` for
  consistency with the project convention (CLAUDE.md mandates
  cockroachdb/errors at boundaries; the test is a boundary). Also
  drops the import alias which made gci complain about ordering.

- claude bot: add `if testing.Short() { t.Skip(...) }` guard so
  `go test -short ./kv/...` does not pay the verifyLeaderTimeout
  (5s) wait. Default `make test` still runs it.

- reviewdog/golangci (gci): import order fixed by reordering
  third-party (cockroachdb) above local (bootjp) per the project's
  default gci sections (standard, default, prefix(...)).

No behavior change. Test still passes:
  go test -race -count=1 -run TestVerifyLeaderEngine ./kv -- 6.06s
  go test -race -count=1 -short ./kv -- 5.55s (test skipped under -short)

No semantic change to the production fix; this commit is test +
comment polish only, so no caller audit needed.
@bootjp
Owner Author

bootjp commented May 8, 2026

@claude review

@claude

claude Bot commented May 8, 2026

Claude finished @bootjp's task in 1m 43s


Round-2 Code Review

  • Gather context and read changed files
  • Verify round-1 issues resolved
  • Check remaining linter warning
  • Post final review

All three round-1 issues are resolved, and the Gemini lower-bound suggestion has been incorporated. One item remains open.


Remaining issue: gci import ordering in test file

kv/raft_engine_test.go:8–11 — the golangci gci formatter is still flagging the test file. The import block has two non-stdlib groups separated by a blank line:

import (
    "context"
    "testing"
    "time"

    "github.com/cockroachdb/errors"          // ← its own group

    "github.com/bootjp/elastickv/internal/raftengine"  // ← separate group
)

The project convention (see kv/coordinator.go:3–16) puts all non-stdlib imports in a single group with no blank line between third-party and local packages:

import (
    "context"
    "testing"
    "time"

    "github.com/bootjp/elastickv/internal/raftengine"
    "github.com/cockroachdb/errors"
)

gofmt won't catch this — gci specifically enforces the grouping, which is why the reviewdog bot flagged it.



Minor nit: comment references review tooling

kv/raft_engine_test.go:61 — the inline comment says "Pulled in from gemini's PR #745 round-1 review". This is process metadata that will rot once the PR is merged and the review link is no longer in context. Consider trimming it to just explain the design reason:

// Lower bound: confirm the engine actually held the call until the
// deadline fired, not that some other error path returned immediately.

The PR description already credits the source; the source file only needs the rationale.


Everything else looks correct

  • Comment on verifyLeaderTimeout ✓ — incident narrative removed; design rationale (O(N) pendingReads, leaderForwardTimeout alignment) and PR reference kept.
  • errors.Is ✓ — uses github.com/cockroachdb/errors correctly.
  • testing.Short() guard ✓ — correctly placed after t.Parallel().
  • Lower bound (elapsed+slack < verifyLeaderTimeout) ✓ — catches early-return regressions; 200ms slack is appropriate for CI schedulers.
  • Upper bound (elapsed > 2*verifyLeaderTimeout) ✓ — proves the call returned at all.
  • defer cancel() placement ✓ — no leak on either success or timeout path.
  • blockingLeaderView satisfies the full raftengine.LeaderView interface ✓.

The gci fix is one import-block tweak; everything else is shippable.

Round-2 review on commit 0be9fc7:

- reviewdog/golangci (gci): test file still failed gci because the
  project convention puts ALL non-stdlib imports in a single group
  (no blank line between third-party and local). The round-1 fixup's
  "cockroachdb above bootjp with a blank line between" guess was
  wrong; merging into one alphabetical group (bootjp before
  cockroachdb) matches kv/coordinator.go and resolves the warning.

- claude bot (nit): drop "Pulled in from gemini's PR #745 round-1
  review" from the lower-bound rationale comment. Process metadata
  rots once the PR is merged; the design reason is what the source
  needs.

No behavior change. Test still passes 6.05s.
@bootjp
Owner Author

bootjp commented May 8, 2026

@claude review

@claude

claude Bot commented May 8, 2026

Claude finished @bootjp's task in 1m 5s


Round-3 Code Review

  • Gather context from previous review rounds
  • Read current state of kv/raft_engine.go
  • Read current state of kv/raft_engine_test.go
  • Verify round-2 issues resolved
  • Post final review

Both round-2 issues are resolved. This is clean.

Round-2 issues — verified closed

gci import ordering (kv/raft_engine_test.go:8–9) ✓
Both non-stdlib packages are now in the same group with no blank line between them, alphabetical order (bootjp before cockroachdb), matching the kv/coordinator.go convention exactly. The golangci gci warning is gone.

import (
    "context"
    "testing"
    "time"

    "github.com/bootjp/elastickv/internal/raftengine"
    "github.com/cockroachdb/errors"
)

Comment referencing review tooling (kv/raft_engine_test.go:55–67) ✓
The comment now explains only the design rationale (lower-bound slack, why the check prevents silent early-return regressions). No reference to Gemini or round-1 review.


Everything checks out

  • verifyLeaderTimeout comment — clean. Rationale (O(N) pendingReads, alignment with leaderForwardTimeout), PR reference, no incident narrative. ✓
  • verifyLeaderEngine — context.WithTimeout + defer cancel(), no leak on any path. ✓
  • blockingLeaderView — satisfies all four methods of raftengine.LeaderView (State, Leader, VerifyLeader, LinearizableRead). ✓
  • Test structure — t.Parallel() → testing.Short() guard → lower bound (with 200ms slack) → upper bound (2×). ✓
  • errors.Is — uses github.com/cockroachdb/errors per project convention. ✓

No remaining issues. This PR is ready to merge.

@bootjp bootjp merged commit 5f12d5d into main May 8, 2026
9 checks passed
@bootjp bootjp deleted the fix/verify-leader-deadline branch May 8, 2026 08:51
bootjp added a commit that referenced this pull request May 9, 2026
Follow-up to #745 (verifyLeaderEngine 5s deadline incident hotfix).
PR #745 capped the no-context verify path at 5s as defense-in-depth;
this PR plumbs the caller's context.Context end-to-end through:

  Dispatch path:
    ShardedCoordinator.Dispatch(ctx) → dispatchTxn(ctx) →
      dispatchSingleShardTxn(ctx) /
      commitPrimaryTxn(ctx) /
      commitSecondaryTxns(ctx) /
      abortPreparedTxn(ctx) /
      broadcastToAllGroups(ctx) /
      dispatchDelPrefixBroadcast(ctx)
    → Transactional.Commit/Abort(ctx, reqs)
    → leaseRefreshingTxn / LeaderProxy / TransactionManager / ShardRouter
    → applyRequests(ctx) → proposer.Propose(ctx)
    → verifyLeaderEngineCtx(ctx)

  Lock-resolver / shard-store path:
    LockResolver.resolveExpiredLock(ctx) → applyTxnResolution(ctx) /
      ShardStore.applyScanLockResolutions(ctx) /
      ShardStore.tryAbortExpiredPrimary(ctx)
    → g.Txn.Commit(ctx)

  Leader-probe / healthz path:
    Coordinator.VerifyLeader(ctx) /
    Coordinator.VerifyLeaderForKey(ctx, key)
    LeaderProbe.IsVerifiedLeader(ctx)
    adapter S3/SQS/DynamoDB /healthz/leader handlers feed r.Context()

  Adapter healthz helpers:
    isVerifiedSQSLeader(ctx, coordinator)
    isVerifiedDynamoLeader(ctx, coordinator)
    S3Server.isVerifiedS3Leader(ctx)
    DistributionServer.verifyCatalogLeader(ctx)

  Admin LeaderProbe wiring:
    main_admin.go newAdminLeaderProbe → coordinate.VerifyLeader(ctx)

After: a Redis BLPOP timeout=2s whose dispatch lands on a slow
ReadIndex now fails after 2s (its own deadline), not after the 5s
verifyLeaderTimeout safety bound. A Caddy active health probe with a
1s budget likewise fails after 1s. The 5s bound stays in place as
defense-in-depth for the no-arg verifyLeaderEngine() — still hit by
LockResolver background loops, HLC lease ticks, and any future
internal caller that genuinely cannot inherit a deadline.

Also fixes proposer.Propose at kv/transaction.go:152 which used
context.Background() inline; same shape as the original
verifyLeaderEngine bug, just on the propose path. Now plumbed.

Batched commit caveat: TransactionManager.commitRaw merges many
callers into a single propose via a separate goroutine, so no
single ctx can bound the underlying applyRequests. The wait site
in commitRaw selects between item.done and ctx.Done so per-caller
cancellation still works (the propose continues, the abandoned
caller exits early); the goroutine's applyRequests call uses
context.Background by design and is documented as such.

Test stub updates (3 ripples through the Transactional interface,
6 ripples through Coordinator, 2 through LeaderProbe):
  kv/coordinator_txn_test.go::stubTransactional
  kv/coordinator_retry_test.go::scriptedTransactional
  kv/sharded_coordinator_txn_test.go::recordingTransactional
  kv/sharded_coordinator_abort_test.go::failingTransactional
  kv/sharded_lease_test.go::fixedTransactional
  kv/shard_router_test.go::fakeTM
  kv/shard_router_partition_test.go::fakeTxn
  kv/leader_routed_store_test.go::stubLeaderCoordinator
  adapter/redis_retry_test.go::retryOnceCoordinator
  adapter/redis_info_test.go::infoTestCoordinator
  adapter/redis_hello_test.go::helloTestCoordinator
  adapter/redis_keys_pattern_test.go::stubAdapterCoordinator
  adapter/distribution_server_test.go::distributionCoordinatorStub
  adapter/dynamodb_test.go::testCoordinatorWrapper
  adapter/s3_test.go::routeAwareS3Coordinator + followerS3Coordinator
  internal/admin/router_test.go::LeaderProbeFunc literal forms

Self-review (5 lenses):
1. Data loss — none. ctx is value-passed; no persisted state changes.
   verifyLeaderEngineCtx already existed and is what we now route to;
   only the caller of the no-arg variant changes (background loops
   only).
2. Concurrency — ctx flows through the existing locking unchanged.
   commitRaw goroutine intentionally uses Background to avoid mixing
   per-caller deadlines into a shared batched propose; documented.
3. Performance — same call count, no extra allocations beyond the
   ctx value passed by reference. context.WithTimeout wrapping in the
   no-ctx fallback is unchanged.
4. Data consistency — verify is a freshness check, not a write path.
   Shorter caller deadlines surface ErrLeaderNotFound earlier instead
   of waiting out the 5s safety bound.
5. Test coverage — every Transactional / Coordinator / LeaderProbe
   stub gains the ctx parameter (currently `_ context.Context`,
   ready for future tests asserting cancel propagation).

Test:
  go test -race -count=1 -short ./kv ./adapter ./internal/admin
  -- 9.6m total, all green (adapter is the long pole on Jepsen-shaped
  tests; kv 6.6s, admin 1.4s).

Design doc: docs/design/2026_05_10_proposed_kv_ctx_plumbing.md.
bootjp added a commit that referenced this pull request May 10, 2026
#749)

## Summary

Follow-up to #745. PR #745 capped `verifyLeaderEngine`'s no-context call
site at 5s as an incident hotfix; this PR plumbs the caller's
`context.Context` end-to-end through the kv write path and the
verify-leader path so callers that hold a request deadline see THEIR
deadline honoured, not the 5s safety bound.

The 5s bound stays in place as defense-in-depth for callers without an
upstream context (LockResolver, HLC lease tick).

## What changed

**Interface signatures (kv-internal):**

```go
type Transactional interface {
    Commit(ctx context.Context, reqs []*pb.Request) (*TransactionResponse, error)
    Abort(ctx context.Context, reqs []*pb.Request) (*TransactionResponse, error)
}

type Coordinator interface {
    // …
    VerifyLeader(ctx context.Context) error
    VerifyLeaderForKey(ctx context.Context, key []byte) error
}

type LeaderProbe interface {
    IsVerifiedLeader(ctx context.Context) bool
}
```

**Plumbing chains** (every step now carries ctx):

```
Dispatch path:
  ShardedCoordinator.Dispatch(ctx)
    → dispatchTxn(ctx) / dispatchSingleShardTxn(ctx) /
      commitPrimaryTxn(ctx) / commitSecondaryTxns(ctx) /
      abortPreparedTxn(ctx) / broadcastToAllGroups(ctx) /
      dispatchDelPrefixBroadcast(ctx)
    → Transactional.Commit/Abort(ctx)
    → leaseRefreshingTxn / LeaderProxy / TransactionManager / ShardRouter
    → applyRequests(ctx) → proposer.Propose(ctx)
    → verifyLeaderEngineCtx(ctx)

Lock-resolver / shard-store path:
  LockResolver.resolveExpiredLock(ctx)
    → applyTxnResolution(ctx) / applyScanLockResolutions(ctx) /
      tryAbortExpiredPrimary(ctx)
    → g.Txn.Commit(ctx)

Healthz:
  S3 / SQS / DynamoDB /healthz/leader handlers
    → isVerifiedSQSLeader(ctx, …) / isVerifiedDynamoLeader(ctx, …) /
      S3Server.isVerifiedS3Leader(ctx) /
      DistributionServer.verifyCatalogLeader(ctx)
    → coordinator.VerifyLeader(ctx) / VerifyLeaderForKey(ctx, key)

Admin LeaderProbe:
  /admin/healthz/leader handler → LeaderProbe.IsVerifiedLeader(r.Context())
    → coordinate.VerifyLeader(ctx)
```

Also fixes the inline `context.Background()` at `kv/transaction.go:152`
(`proposer.Propose`) — same shape as the original `verifyLeaderEngine`
bug, on the propose path.

## Behaviour

For callers with upstream deadlines:

| caller | before | after |
|---|---|---|
| Redis BLPOP `timeout=2s` on slow ReadIndex | fails after 5s (verifyLeaderTimeout) | fails after 2s (caller's deadline) |
| Caddy active probe `timeout=1s` | fails after 5s | fails after 1s |
| gRPC `Dispatch(ctx)` with explicit deadline | dropped to 5s | honours ctx deadline |

For background callers without upstream ctx:

- LockResolver tick, HLC lease renewer, etc. continue to hit
`verifyLeaderEngine()` (no-arg variant) which still wraps with
`context.WithTimeout(context.Background(), verifyLeaderTimeout)`. PR
#745's 5s bound is unchanged for them.

## Batched-commit caveat

`TransactionManager.commitRaw` merges many callers into a single propose
via a separate goroutine, so no single ctx can bound the underlying
`applyRequests`. The fix:

- Wait site in `commitRaw` selects between `item.done` and `ctx.Done()`
so per-caller cancellation works (caller exits early; the propose still
completes; other waiters in the same batch get their results normally).
- Goroutine's `applyRequests` call uses `context.Background` by design,
documented in the code.
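
A hedged sketch of that wait site (`done` stands in for `item.done`; the batching goroutine and result plumbing are elided):

```go
func waitForBatchResult(ctx context.Context, done <-chan error) error {
	select {
	case err := <-done:
		// The shared propose (running on context.Background) completed;
		// every waiter in the batch receives its result.
		return err
	case <-ctx.Done():
		// This caller stops waiting; the propose continues and the other
		// waiters in the same batch are unaffected.
		return ctx.Err()
	}
}
```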

## Self-review (5 lenses)

1. **Data loss** — none. ctx is value-passed; no persisted state
changes. `verifyLeaderEngineCtx` is the existing path; only the caller
of the no-arg variant changes.
2. **Concurrency** — ctx flows through the existing locking unchanged.
`commitRaw` goroutine intentionally uses Background to avoid mixing
per-caller deadlines into a shared batched propose; documented.
3. **Performance** — same call count, no extra allocations beyond the
ctx value passed by reference. `context.WithTimeout` wrapping in the
no-ctx fallback is unchanged.
4. **Data consistency** — verify is a freshness check, not a write path.
Shorter caller deadlines surface `ErrLeaderNotFound` earlier instead of
waiting out the 5s safety bound.
5. **Test coverage** — every Transactional / Coordinator / LeaderProbe
stub gains the ctx parameter (currently `_ context.Context`, ready for
future tests asserting cancel propagation). 16 stubs / signatures
updated; the stub shape is sketched after this list.
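
Hypothetical stub shape after the signature change (names and return values are stand-ins; the real stubs live in the test files listed in the commits above):

```go
type stubTransactional struct{ resp *TransactionResponse }

// ctx is accepted but deliberately unused for now, leaving room for
// future tests that assert cancel propagation.
func (s *stubTransactional) Commit(_ context.Context, _ []*pb.Request) (*TransactionResponse, error) {
	return s.resp, nil
}

func (s *stubTransactional) Abort(_ context.Context, _ []*pb.Request) (*TransactionResponse, error) {
	return s.resp, nil
}
```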

## Test plan

- [x] `go test -race -count=1 -short ./kv ./adapter ./internal/admin` —
all green
  - kv: 6.6s
  - adapter: 565s (long Jepsen-shaped tests)
  - internal/admin: 1.4s
- [x] No semantic change beyond the ctx propagation; all existing tests
pass without modification of their assertions.

## Follow-up scope (not in this PR)

The audit also surfaced ~210 `context.Background()` usages in tree,
mostly concentrated in:

| file | count | nature |
|---|---|---|
| adapter/redis_compat_commands.go | 66 | dispatch helpers, internal commands |
| adapter/redis_lua_context.go | 25 | Lua script context loaders |
| adapter/redis.go | 25 | handler context fallback paths |
| proxy/* | 28 | proxy package, separate concern |

These are "easy win" candidates where a `ctx` is in scope but
`Background()` was used instead. Not bundled here to keep the PR
reviewable; a follow-up "audit pass" PR targeting Redis adapter helpers
is the natural next step.

The single `context.TODO()` (kv/fsm.go) is structural — raft `Apply()`
does not provide a ctx — and is left as is.

## Design doc

`docs/design/2026_05_10_proposed_kv_ctx_plumbing.md` — included in this
PR.


## Summary by CodeRabbit

* **Refactor**
* Updated internal APIs to propagate request context through leader
verification and transaction operations, enabling request deadlines to
be respected instead of using fixed internal timeouts.

* **Documentation**
* Added design document outlining context plumbing through KV write
paths and leader-probe verification across adapters.
