Skip to content

fix(proxy): SSH close race that drops agent's exec reply (EOF on session.Output)#41

Merged
nnemirovsky merged 4 commits into
mainfrom
fix/ssh-close-race-agent-replies
May 12, 2026
Merged

fix(proxy): SSH close race that drops agent's exec reply (EOF on session.Output)#41
nnemirovsky merged 4 commits into
mainfrom
fix/ssh-close-race-agent-replies

Conversation

@nnemirovsky
Copy link
Copy Markdown
Owner

@nnemirovsky nnemirovsky commented May 12, 2026

Summary

sshHandleChannel waits for the three upstream-to-agent forwarders to drain (data copy, stderr copy, request forwarder) before closing srcChan. It does not wait for the agent-to-upstream forwarder, which is the one that calls req.Reply(ok, nil) to ack the agent's WantReply=true requests. A fast upstream that responds to exec with reply + data + exit-status + close in one burst lets sluice's wait complete and close srcChan while this forwarder is still mid-reply.

The agent's gossh client, blocked in session.SendRequest("exec", true, ...) on <-ch.msg, sees SSH_MSG_CHANNEL_CLOSE before the SUCCESS reply. gossh closes ch.msg on close (channel.go:385), and the blocked SendRequest returns io.EOF (channel.go:603-606). session.Output("whoami") surfaces this as "exec command via SSH: EOF".

Repro

Deterministic on the e2e-linux push-event runner since d27b05e narrowed the close timing window:

  • 8 successive main-branch e2e-linux runs passed before d27b05e
  • 2 next main-branch runs (after PR #38 and PR #40 merges) failed at TestCredential_SSHInjection with the EOF symptom
  • PR-event runs of the same code consistently pass — different runner scheduling hides the race

Fix

Track in-flight agent-to-upstream requests with a custom inflightBarrier (mutex + cond + draining flag). enter() returns false once draining has begun so the forwarder rejects further requests without bumping the counter; drain() blocks until any current in-flight handler calls leave(). sshHandleChannel calls drain() after the three upstream-side signals but before srcChan.CloseWrite() / srcChan.Close().

A sync.WaitGroup would have been the obvious choice but is wrong here: Add can race with Wait while the counter is at zero (Go runtime panics with "sync: WaitGroup misuse"). The forwarder ranges over srcReqs and can enter a new iteration at any moment, so the mutex+cond pattern is necessary.

Test plan

  • go test ./... (2476 tests across 13 packages)
  • gofumpt -l internal/ clean
  • golangci-lint run ./internal/... clean
  • New TestSSHJumpHost_BurstCloseDoesNotDropExecReply (50 iterations against a no-sleep burst-close SSH server) — passes with fix, deterministically fails without
  • Local e2e Docker compose TestCredential_SSHInjection — PASS
  • e2e-linux on this PR
  • e2e-linux on push to main after merge

…ng srcChan

When sshHandleChannel processes a session, two independent forwarders
run in parallel: agent-to-upstream requests (e.g. exec, env, pty-req)
go through sshForwardChannelRequests(srcReqs, dstChan), and
upstream-to-agent requests (e.g. exit-status) go the other way. The
agent-to-upstream loop is NOT awaited before sluice closes srcChan.

A fast upstream that, in response to exec, immediately replies + writes
data + sends exit-status + closes lets sluice's wait on the three
upstream-to-agent goroutines complete and close srcChan while the
agent-to-upstream forwarder is still mid-flight: it has received the
upstream's CHANNEL_REQUEST_SUCCESS reply via dstChan.SendRequest but
has not yet called req.Reply on the agent side. The agent then receives
SSH_MSG_CHANNEL_CLOSE before its session.SendRequest("exec", true, ...)
sees the SUCCESS message on ch.msg. gossh closes ch.msg on
SSH_MSG_CHANNEL_CLOSE, which causes the blocked SendRequest to return
io.EOF (channel.go:603-606). session.Output("cmd") surfaces this as
"exec command via SSH: EOF" even though the upstream replied
successfully.

The fix wraps each iteration of the agent-to-upstream forwarder in a
sync.WaitGroup. sshHandleChannel waits for the WaitGroup to drain after
the three upstream-side signals, then closes srcChan. Any in-flight
exec reply has been fully written to srcChan before the close arrives.

The race was deterministically reproducible on the e2e-linux push-event
runner since commit d27b05e moved srcChan.CloseWrite() out of the data
goroutine, narrowing the window between the upstream's reply landing
and sluice's close. Eight successive main-branch e2e runs passed
before d27b05e; the next two (after PR #38 and PR #40 merges) both
failed at TestCredential_SSHInjection with the EOF symptom.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an SSH channel close race in the SSH jump host proxy where the agent can observe SSH_MSG_CHANNEL_CLOSE before receiving the CHANNEL_REQUEST_SUCCESS/FAILURE reply for WantReply=true requests (notably exec), causing session.Output(...) / session.SendRequest(...) to fail with io.EOF.

Changes:

  • Add tracking for agent→upstream per-channel requests and wait for in-flight request replies before closing the agent-facing channel.
  • Introduce a new sshForwardChannelRequestsTracked helper and a pre-close barrier (inFlight.Wait()) in sshHandleChannel.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/proxy/ssh.go
Comment thread internal/proxy/ssh.go Outdated
Addresses Copilot review feedback on PR #41.

sync.WaitGroup.Add and .Wait race when Add is called concurrently with
Wait while the counter is zero — the Go runtime panics ("sync: WaitGroup
misuse: Add called concurrently with Wait"). The previous tracked
forwarder called Add(1) from inside a for-range loop that could enter a
new iteration at any moment, racing the main goroutine's barrier.Wait()
on a fresh counter.

Replaces the WaitGroup with an inflightBarrier struct (mutex + cond +
draining flag + counter). enter() returns false once draining is set so
the forwarder rejects any further request without bumping the counter,
which means drain() can converge even if a new agent request arrives
mid-drain. Replying false to a WantReply request unblocks any caller
blocked on ch.msg.

Adds TestSSHJumpHost_BurstCloseDoesNotDropExecReply, a 50-iteration
regression test that drives a no-sleep burst-close SSH server through
the jump host and asserts that session.Output never returns EOF. The
existing startTestSSHServer used by other ssh_test.go tests papers over
the race with a 50ms sleep before returning from the channel handler;
the new server omits that sleep so the race is deterministically
triggered without the inflightBarrier fix.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread internal/proxy/ssh.go
Comment thread internal/proxy/ssh_test.go
Comment thread internal/proxy/ssh_test.go
…ction per iter

Addresses Copilot review feedback on PR #41.

1. The burst-close test server's per-channel goroutine now defers
   ch.Close, so a request loop that exits without hitting the exec
   path (early agent close, non-exec-only request stream) still
   releases the server-side channel and its driving goroutine.

2. TestSSHJumpHost_BurstCloseDoesNotDropExecReply's loop now closes
   agentSSH and agentConn explicitly and waits on errCh with a 5s
   timeout per iteration. A leaked HandleConnection goroutine surfaces
   as a deterministic test failure rather than as silent resource
   exhaustion that could mask a regression in the close path.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread internal/proxy/ssh_test.go Outdated
Addresses Copilot review feedback on PR #41. The burst-close test
previously discarded the value from errCh, which would let a future
regression that produces a non-nil error from HandleConnection
teardown pass silently. The select now captures the return and fails
the iteration when non-nil, so any unexpected teardown error surfaces
explicitly.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@nnemirovsky nnemirovsky merged commit a8317a5 into main May 12, 2026
10 checks passed
@nnemirovsky nnemirovsky deleted the fix/ssh-close-race-agent-replies branch May 12, 2026 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants