Skip to content

fix: env-file ownership + response panic recovery + oauth header injection#38

Merged
nnemirovsky merged 10 commits into
mainfrom
v0.13.2-chown-env
May 7, 2026
Merged

fix: env-file ownership + response panic recovery + oauth header injection#38
nnemirovsky merged 10 commits into
mainfrom
v0.13.2-chown-env

Conversation

@nnemirovsky
Copy link
Copy Markdown
Owner

@nnemirovsky nnemirovsky commented May 7, 2026

Summary

Production-deployment fixes that turned up while validating v0.13.1 against a live Hermes deployment running through sluice.

1. Chown env file back to the agent runtime user (the original motivation)

Sluice writes phantom tokens into the agent's env file via docker exec running the image's USER (root for the upstream openclaw and hermes images). The result was a root-owned env file, blocking the agent's own write paths — most visibly, hermes claw migrate failed with Permission denied after a sluice restart. The script now stat | chowns the file back to the parent directory's owner, so a non-root agent can keep editing the file. Works on both GNU and BSD stat (Linux containers + tart macOS guests) via a -c then -f fallback.

2. Response handler panic recovery

The OpenAI Codex device-code OAuth flow surfaced a runtime error: invalid memory address or nil pointer dereference panic from somewhere downstream of the OAuth response handler. go-mitmproxy's outer recover swallowed it, abandoned the response body, and the agent received empty bytes — Hermes failed with JSONDecodeError even though sluice had successfully phantom-swapped the tokens.

Response and StreamResponseModifier now have deferred recover guards that log the full stack trace and fall back to safe defaults. Real tokens cannot leak — the OAuth handler has its own snapshot/rollback that runs BEFORE the panic, so f.Response.Body is either fully phantom-swapped or untouched at the time of the recover. StreamResponseModifier also gets a defensive nil-input guard plus a buffered-body capture so a panic after io.ReadAll drains the input still hands the agent the buffered bytes (rather than an empty reader from the now-drained stream).

3. OAuth-vs-static dispatch for header injection (HTTP/1+2 and HTTP/3)

Reproduced live: when an OAuth-type credential has a header binding with a {value} template, the previous code was substituting the entire JSON envelope into the header (Bearer {"access_token":"<jwt>","refresh_token":"...","token_url":"..."}), and upstream APIs reject that with 401 "Could not parse your authentication token". This bug had been latent since the bidirectional OAuth phantom swap was introduced in v0.6.0; it only manifested for OAuth credentials with header-binding templates, and the typical pre-v0.6.0 setup used static credentials so it never surfaced.

Fix: a new extractInjectableSecret helper consults the OAuthIndex (rebuilt from credential_meta on startup and SIGHUP) to dispatch by credential metadata type, not by JSON shape inference. Static credentials whose value happens to be OAuth-shaped JSON are NOT misclassified. OAuth credentials parse via vault.ParseOAuth and yield just access_token. Server.UpdateOAuthIndex now mirrors the index into the QUIC proxy too so HTTP/3 header injection follows the same dispatch as HTTP/1+2; the index is built once during setup and applied to both the addon and the QUIC proxy after both are initialized so the startup ordering bug from the first iteration is gone.

Test plan

  • go build ./...
  • go test ./... — 2447 passed across 13 packages, including new focused tests for: chown step under both content branches, panic recovery in Response (test injection point), panic recovery in StreamResponseModifier (nil input + buffered body), extractInjectableSecret dispatching on metadata for static / OAuth-shaped-static / oauth-with-leading-whitespace / corrupted-payload / nil-index cases.
  • gofumpt -l clean
  • golangci-lint run ./... clean
  • live OAuth flow against auth.openai.com and auth.anthropic.com returns a phantom-swapped body without the JSONDecodeError or Could not parse your authentication token failures that happened under v0.13.1.
  • live OAuth-managed Codex (gpt-5.5) and Claude Code (Pro/Max OAuth) both confirmed end-to-end through the deployed sluice on the production host.
  • CI: test, e2e-linux, e2e-macos, gofumpt, golangci-lint, openapi

Sluice's docker exec runs the env-injection script as the image's
USER (root for the upstream openclaw and hermes images). The awk
pre-pass renames the file (new file inherits the writer's UID) and
the heredoc append writes the new managed block; both leave the
file owned by root. The agent runtime later runs as a non-root user
(UID 10000 for the upstream hermes image, configurable via
HERMES_UID), and a root-owned env file blocks agent-side write
paths like `hermes claw migrate` or in-agent secret edits with
EACCES.

Inheriting the parent directory's owner is portable: every agent's
entrypoint chowns its data dir to its runtime user before sluice
ever exec's in. The chown is best-effort — a `stat` or `chown`
failure does not abort the script (a read-only env file still
satisfies sluice's own usage), so containers without GNU stat
degrade gracefully.

The chown step is appended on the SAME line as the heredoc `<<`
operator with `&&` short-circuiting, because shell will not let
`&&` at the start of a new line attach to a command whose heredoc
body sits in between. New unit test asserts the chown is present
both when the managed block has content and when it is empty.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes agent env-file permission regressions caused by docker exec running as the image USER (often root), which can leave ~/<profile>/.env root-owned and block agent-side writes after restart (e.g., hermes claw migrate).

Changes:

  • Appends a best-effort “derive parent dir owner → chown env file” step to the generated env injection script.
  • Ensures the ownership fix runs even when no managed env entries are emitted (empty managed block case).
  • Adds a focused unit test asserting the script contains the stat + chown step for both “has values” and “no values” cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
internal/container/types.go Adds a best-effort ownership correction step to the generated env injection script (including empty-block scenario).
internal/container/agent_profile_test.go Adds a test to assert the generated script includes the stat + chown ownership fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/container/types.go Outdated
Three fixes that turned up while validating v0.13.1's OAuth flow on
the live deployment.

- proxy/addon: add a top-level recover in Response and
  StreamResponseModifier with full stack capture. The OAuth Codex
  device-code flow surfaced a "nil pointer dereference" panic from
  somewhere downstream that go-mitmproxy's outer recover swallowed,
  leaving the agent reading an empty response body. The panic
  source itself remains unidentified, but escaping it into
  go-mitmproxy means losing the response — both handlers now log
  the stack and fall back to safe defaults (Response leaves
  f.Response unchanged at the panic point; StreamResponseModifier
  hands back the original input reader). Real tokens cannot leak:
  any panic in Response runs after processOAuthResponseIfMatching's
  own snapshot/rollback, so f.Response.Body is either fully
  phantom-swapped or untouched at the time of the panic.
- proxy/addon: defensive nil check on the input reader for
  StreamResponseModifier. go-mitmproxy normally passes a non-nil
  reader, but a nil here causes io.LimitReader + io.ReadAll to
  nil-deref on the first Read call. When in is nil we now return
  nil, which makes go-mitmproxy's reply() skip its copyStream
  branch and fall through to writing f.Response.Body (already
  phantom-swapped by the Response handler).
- container/types: portable stat fallback. The chown step from the
  earlier commit only worked under GNU stat (Linux containers
  alpine/debian); BSD stat (macOS guests booted via the tart
  backend) uses `-f`. We now try `-c` first and fall back to `-f`
  on non-zero exit so the chown succeeds across both userlands.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread internal/proxy/addon.go
Comment thread internal/proxy/addon.go Outdated
Comment thread internal/proxy/addon.go
@nnemirovsky nnemirovsky changed the title fix(container): chown env file back to agent runtime user fix: env-file ownership + response handler panic recovery May 7, 2026
- Add a test-only responsePanicHook field to SluiceAddon. When
  non-nil it fires between processOAuthResponseIfMatching and the
  DLP scan, letting tests force the downstream-of-OAuth panic
  shape we observed in production without engineering a malformed
  Flow that triggers a real nil deref. Always nil in production.
- TestAddonResponse_RecoversFromDownstreamPanic uses the hook to
  panic between OAuth swap and DLP scan, then asserts: the test
  itself doesn't see the panic propagate, real tokens stayed out
  of f.Response.Body, and the phantom token from the swap is
  still present.
- TestAddonStreamResponseModifier_HandlesNilInput passes nil to
  StreamResponseModifier and asserts no panic + the handler
  returns nil so go-mitmproxy's reply() falls through to writing
  f.Response.Body.
- Drop the line-broken hyphen in the Response handler's comment
  ("processOAuthResponseIfMatching" was wrapping with a hyphen
  inside the identifier).
OAuth credentials are stored in the vault as JSON-marshalled
OAuthCredential structs (access_token + refresh_token + token_url +
expires_at). The header-injection path in addon.injectHeader and
quic.go was substituting the binding's `{value}` placeholder with
the entire JSON blob, producing requests like

  Authorization: Bearer {"access_token":"<jwt>","refresh_token":...}

that upstream APIs reject with 401 "Could not parse your
authentication token" (reproduced live against chatgpt.com via
the OpenAI Codex device-code OAuth flow). Static credentials are
stored as plain strings so they were unaffected; the bug only
manifested for OAuth-type credentials with a header binding that
uses a template.

This bug has been latent since the bidirectional OAuth phantom
swap was introduced (commit 0be82d4); before that, OAuth tokens
were typically loaded into the vault as raw strings via `sluice
cred update` and the JSON envelope shape never reached the
injection path. The buildPhantomPairs / phantom-swap path was
already OAuth-aware via vault.IsOAuth, so phantom-token replace
in the request body kept working; only the header-injection path
was broken.

Fix: detect a JSON envelope by shape (first byte `{`, parses as
OAuthCredential, has non-empty access_token) and substitute just
the access_token. Static credentials and any other plain string
pass through unchanged. Both the buffered-HTTP path
(internal/proxy/addon.go) and the QUIC/HTTP3 path
(internal/proxy/quic.go) call the same helper.

New test asserts pass-through for static/empty/malformed values
and access_token extraction for an OAuth blob.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread internal/proxy/addon.go Outdated
Comment thread internal/proxy/addon.go
The previous extractInjectableSecret inferred OAuth-vs-static from
the secret's JSON shape (first byte `{`, parses as OAuthCredential).
Two issues with that:

- A static credential whose value happened to be OAuth-shaped JSON
  (e.g. an upstream that already stores access_token+token_url)
  would be silently treated as OAuth, with only access_token
  injected. Other code paths in sluice already key on
  credential_meta.cred_type to avoid this exact misclassification.
- Leading whitespace or newlines before the brace bypassed the
  shape check entirely, even though json.Unmarshal would still
  parse it. The OAuth envelope would round-trip through the
  injection unchanged.

Fix: dispatch on the credential's metadata type via the
OAuthIndex.Has(credName) lookup. The OAuthIndex is rebuilt from
credential_meta on startup and SIGHUP, so the injection path now
asks the same source-of-truth as the response interceptor. A new
free function extractInjectableSecret(idx, name, secret) takes the
index explicitly so the QUIC path (which carries its own pointer)
shares the same logic with the addon.

Also wires the QUIC proxy with its own oauthIndex pointer plus a
SetOAuthIndex setter; Server.UpdateOAuthIndex now mirrors index
updates into both the addon and the QUIC proxy so HTTP/3 header
injection follows the same dispatch rules as HTTP/1+2.

Test coverage exercises:
- Static cred with plain string -> pass-through.
- Static cred whose value is OAuth-shaped JSON -> still
  pass-through (because it's not in the OAuth index).
- OAuth cred with normal JSON -> access_token extraction.
- OAuth cred with leading whitespace -> still extracts
  access_token (json.Unmarshal handles whitespace).
- OAuth cred with corrupted vault payload -> falls back to
  raw secret with a diagnostic log line.
- nil index -> every credential treated as static (early-startup
  path before UpdateOAuthIndex fires).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread internal/proxy/server.go
Comment thread internal/proxy/addon.go Outdated
Comment thread internal/proxy/addon.go Outdated
- proxy/addon: extractInjectableSecret log uses generic [INJECT]
  prefix because the helper now serves both the HTTP/1+2 addon
  path (internal/proxy/addon.go) and the HTTP/3 (QUIC) path
  (internal/proxy/quic.go). [ADDON-INJECT] would mislead a reader
  who saw the line in a deployment that uses HTTP/3 exclusively.
- proxy/addon: StreamResponseModifier now returns http.NoBody (an
  empty reader) instead of nil when the input reader is nil. The
  nil case only fires under f.Stream=true (where the Response
  addon callback was skipped, so f.Response.Body is empty), so a
  literal nil return left reply() with no body source AND no body
  bytes — undefined-length responses trip some HTTP clients.
  http.NoBody is well-framed: reply() copies its EOF immediately
  and the agent sees a zero-byte body. The nil-input test was
  updated to assert the new behavior.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread internal/proxy/server.go
Comment thread internal/proxy/addon.go
@nnemirovsky nnemirovsky changed the title fix: env-file ownership + response handler panic recovery fix: env-file ownership + response panic recovery + oauth header injection May 7, 2026
The deferred recover on StreamResponseModifier previously fell back
to `out = in` on panic. After io.ReadAll consumes the input stream
(line 901), `in` is drained — a panic later in swapOAuthTokens or
the body-replacement steps would hand the agent an empty reader
even though we already had the bytes in memory.

Capture the bytes into a closure-scoped `bufferedBody` slice
immediately after the read completes. The recover prefers
bufferedBody if it is set, otherwise falls back to `in` (the read
hasn't started yet, stream is intact). Either way the agent gets a
usable body.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread internal/proxy/addon.go
Comment thread internal/proxy/server.go
Two PR #38 round 6 follow-ups.

Server now caches the latest credential_meta slice it saw via
UpdateOAuthIndex. setupInjection re-applies the cache to the
QUIC proxy right after assigning s.quicProxy, so an
UpdateOAuthIndex call that arrives before the proxy exists is
not silently dropped. Without the cache, a future reorder of
the startup sequence would re-introduce the OAuth-envelope
header injection bug for HTTP/3 traffic.

Response addon also gets a nil-flow guard mirroring
StreamResponseModifier so the deferred recover does not
dereference a nil flow when tests build a minimal Response
addon path.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread internal/proxy/addon.go Outdated
Copilot round 7: the deferred recover in StreamResponseModifier
returned the buffered raw upstream bytes when a panic fired
after io.ReadAll. For an OAuth-matched token endpoint those
bytes contain real access and refresh tokens, so a panic in
swapOAuthTokens (or anywhere between the read and the swap)
would have leaked real credentials to the agent.

Replace bufferedBody with safeFallback. We assign it ONLY
after a successful swap, when the buffer is the phantom-only
modified body. A panic before that point falls back to
http.NoBody. The agent sees an empty 2xx token response and
surfaces the failure as a parse error, which is strictly
preferable to handing over a real bearer.

The non-panic error path (swap parse failure on a non-OAuth
body, e.g. an HTML error page from a misconfigured token
endpoint) keeps the historical pass-through behavior but no
longer assigns safeFallback so a late panic would still hit
http.NoBody rather than echo whatever that body was.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread internal/proxy/addon.go
Copilot round 8: the nil-input guard ran AFTER the early
return for nil flow / nil request, so a call where both `f`
and `in` were nil would skip the guard, leave `out = in` (nil),
and trip a nil io.Reader nil-deref in go-mitmproxy's downstream
copy. Move the nil-input check to the top of the function so
http.NoBody is returned regardless of the flow's shape.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@nnemirovsky nnemirovsky merged commit 61f4c00 into main May 7, 2026
10 checks passed
@nnemirovsky nnemirovsky deleted the v0.13.2-chown-env branch May 7, 2026 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants