Harden tray<->gateway keep-alive and reconnect lifecycle #627

Draft
RBrid wants to merge 2 commits into openclaw:master from RBrid:user/rbrid/KeepAliveFixes2

Conversation

RBrid (Contributor) commented Jun 2, 2026

WebSocketClientBase:

  • Add _disposing flag distinct from _disposed; set before OnDisposing, promote to _disposed after, so teardown callbacks see a stable state.
  • Gate ConnectAsync, ReconnectWithBackoffAsync, listener-finally (CAS/event/reconnect-kickoff), and reconnect kickoff sites on (!_disposed && !_disposing) to prevent post-Dispose work.
  • Re-check disposal after OnConnectedAsync and before spawning the listener Task so a Dispose racing the connect path does not leak a background listener.
  • CAS-clear + Abort() + Dispose() any ClientWebSocket installed into _webSocket if disposal wins the race during ConnectAsync; mirror in the orphan-clean else-if branch.
  • Abort() before Dispose() on owned sockets so peers see an immediate reset instead of waiting on a CLOSE frame that is never sent.
  • Volatile reads on shutdown flags in HeartbeatLoopAsync so the loop observes flag changes made on other threads instead of reusing a stale, hoisted value.
  • Catch ObjectDisposedException narrowly in SendRawAsync and CloseWebSocketAsync so torn-down sockets do not surface as errors.
  • Guard OnDisposing with try/catch so a throwing subclass cannot skip later cleanup steps.
  • Per-event try/catch wrappers around status/error event raises so a throwing subscriber cannot block teardown.
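
The teardown rules in the list above can be sketched in a language-neutral way. The PR itself is C#; the Python below is only an illustration, and every name in it (ClientSketch, FakeSocket, on_connected, on_disposing, subscribers) is hypothetical rather than taken from the PR. A lock stands in for the compare-and-swap, each subscriber is wrapped individually, and the exact ordering of cleanup steps after the hook is an assumption.

```python
import threading


class FakeSocket:
    """Hypothetical stand-in for ClientWebSocket; records teardown calls."""

    def __init__(self):
        self.calls = []

    def abort(self):
        self.calls.append("abort")

    def dispose(self):
        self.calls.append("dispose")


class ClientSketch:
    """Illustrative only; not the PR's WebSocketClientBase."""

    def __init__(self):
        self._lock = threading.Lock()
        self._disposing = False
        self._disposed = False
        self._socket = None
        self.subscribers = []  # status-event handlers
        self.errors = []       # swallowed exceptions, kept for inspection

    def on_connected(self):    # subclass hook (OnConnectedAsync analogue)
        pass

    def on_disposing(self):    # subclass hook (OnDisposing analogue)
        pass

    def _shutting_down(self):
        with self._lock:
            return self._disposing or self._disposed

    def _clear_socket_if(self, expected):
        # Lock-based emulation of compare-and-swap: clear the slot only if it
        # still holds `expected`, so exactly one caller tears the socket down.
        with self._lock:
            if self._socket is expected:
                self._socket = None
                return True
            return False

    def _raise_status(self, status):
        for handler in list(self.subscribers):
            try:
                handler(status)
            except Exception as exc:  # a throwing subscriber must not block teardown
                self.errors.append(exc)

    def connect(self, socket):
        if self._shutting_down():
            return False
        with self._lock:
            self._socket = socket
        self.on_connected()
        if self._shutting_down():
            # Disposal won the race: start no listener, and clean up the
            # socket only if dispose() has not already taken it.
            if self._clear_socket_if(socket):
                socket.abort()
                socket.dispose()
            return False
        self._raise_status("connected")
        return True  # a real client would spawn its listener task here

    def dispose(self):
        with self._lock:
            if self._disposing or self._disposed:
                return
            self._disposing = True       # set before the hook runs
        try:
            self.on_disposing()
        except Exception as exc:         # a throwing hook must not skip cleanup
            self.errors.append(exc)
        with self._lock:
            self._disposed = True        # promoted after the hook
            socket, self._socket = self._socket, None
        if socket is not None:
            socket.abort()               # abort first, then release
            socket.dispose()
        self._raise_status("disposed")
```

In this sketch, a subclass whose on_connected hook calls dispose() models the connect/Dispose race: connect() returns False and the socket is aborted and disposed exactly once, because whichever path clears the slot first owns the cleanup.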

OpenClawGatewayClient:

  • Apply matching reconnect/teardown hygiene around the keep-alive and heartbeat paths so connection state stays consistent across forced disconnects.
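
The PR text does not spell out what this hygiene looks like in OpenClawGatewayClient. As a rough, hypothetical illustration of the general idea (a heartbeat loop that wakes promptly on shutdown and treats a torn-down socket as a normal exit), here is a Python sketch; heartbeat_loop, SocketClosedError, and the parameter names are all invented for the example and do not come from the PR.

```python
import threading


class SocketClosedError(Exception):
    """Hypothetical analogue of ObjectDisposedException."""


def heartbeat_loop(send_ping, stop, interval_s):
    """Send pings every `interval_s` seconds until `stop` is set.

    Returns the number of pings that were sent successfully.
    """
    sent = 0
    # Event.wait returns True as soon as `stop` is set, so shutdown is
    # observed without waiting out the rest of the interval.
    while not stop.wait(interval_s):
        try:
            send_ping()
            sent += 1
        except SocketClosedError:
            # Narrow catch: the socket was torn down underneath us, which is
            # expected during shutdown and is not reported as an error.
            break
    return sent
```

Waiting on the shutdown signal instead of sleeping unconditionally is what keeps the loop from sending one more ping after teardown has begun.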

Tests:

  • Relax reconnect-backoff log assertion to tolerate jitter in the delay value (still asserts attempt number).
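
A jitter-tolerant assertion of this kind might look like the following Python sketch. The log format ("Reconnect attempt N in D ms") and the helper name are assumptions made for illustration; the PR does not show the actual log line or test code.

```python
import re


def matches_backoff_log(line, attempt):
    """Return True if `line` reports the given attempt number.

    The attempt number must match exactly; the delay may be any
    non-negative number, so random jitter does not break the check.
    """
    pattern = rf"Reconnect attempt {attempt} in \d+(?:\.\d+)? ?ms"
    return re.search(pattern, line) is not None
```

For example, both "Reconnect attempt 3 in 4123 ms" and "Reconnect attempt 3 in 3877.5 ms" satisfy matches_backoff_log(line, 3), while a line reporting attempt 2 does not.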

Validation:

  • ./build.ps1 clean (0/0)
  • Shared.Tests: 2045 passed / 29 skipped
  • Tray.Tests: 877 passed
  • Manual: tray launched from net10.0-windows10.0.22621.0 build

clawsweeper (Bot) commented Jun 2, 2026

Codex review: needs real behavior proof before merge. Reviewed June 4, 2026, 7:40 PM ET / 23:40 UTC.

Summary
Review failed before ClawSweeper could summarize the requested change.

Reproducibility: unclear. The review failed before ClawSweeper could establish a reproduction path.

Review metrics: none identified.

Merge readiness
Overall: 🌊 off-meta tidepool
Proof: 🌊 off-meta tidepool
Patch quality: 🌊 off-meta tidepool
Result: rating does not apply to this item.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • [P1] No close action taken because the review did not complete.

Maintainer options:

  1. Decide the mitigation before merge
    Retry the Codex review after fixing the execution failure.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P1] Review did not complete, so no work-lane recommendation was made.

Review details

Best possible solution:

Retry the Codex review after fixing the execution failure.

Do we have a high-confidence way to reproduce the issue?

Unclear. The review failed before ClawSweeper could establish a reproduction path.

Is this the best way to solve the issue?

Unclear. Retry the review first so ClawSweeper can evaluate the actual issue and fix direction.

AGENTS.md: unclear because the file could not be read completely.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 99efc50cbc22.

Label changes:

  • add rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🌊 off-meta tidepool, so this older rating label is no longer current.
  • remove status: 📣 needs proof: Current PR status no longer selects a status label.
  • remove P2: Current review triage priority is none.
  • remove merge-risk: 🚨 compatibility: Current PR review selected no merge-risk labels.
  • remove merge-risk: 🚨 availability: Current PR review selected no merge-risk labels.

Label justifications:

  • rating: 🌊 off-meta tidepool: Overall readiness is 🌊 off-meta tidepool; proof is 🌊 off-meta tidepool and patch quality is 🌊 off-meta tidepool.

Evidence reviewed

What I checked:

  • failure reason: timeout.
  • codex failure detail: Codex review failed for this PR: spawnSync codex ETIMEDOUT.
  • codex stdout: Per-item Codex failure; continuing with the rest of the shard.

Likely related people:

  • unknown: Codex failed before it could trace repository history. (role: review did not complete; confidence: low)

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper (Bot) added labels on Jun 2, 2026:

  • rating: 🧂 unranked krab (Not merge-ready due to missing proof or serious correctness/safety concerns.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
  • merge-risk: 🚨 compatibility (Merging this PR could break existing users, config, migrations, defaults, or upgrades.)
  • merge-risk: 🚨 availability (Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages.)

WebSocketClientBase:
- Add _disposing flag distinct from _disposed; set before OnDisposing,
  promote to _disposed after, so teardown callbacks see a stable state.
- Gate ConnectAsync, ReconnectWithBackoffAsync, listener-finally
  (CAS/event/reconnect-kickoff), and reconnect kickoff sites on
  (!_disposed && !_disposing) to prevent post-Dispose work.
- Re-check disposal after OnConnectedAsync and before spawning the
  listener Task so a Dispose racing the connect path does not leak a
  background listener.
- CAS-clear + Abort() + Dispose() any ClientWebSocket installed into
  _webSocket if disposal wins the race during ConnectAsync; mirror in
  the orphan-clean else-if branch.
- Abort() before Dispose() on owned sockets so peers see a clean RST
  instead of an unsent CLOSE.
- Volatile reads on shutdown flags in HeartbeatLoopAsync to avoid
  cache staleness across cores.
- Catch ObjectDisposedException narrowly in SendRawAsync and
  CloseWebSocketAsync so torn-down sockets do not surface as errors.
- Guard OnDisposing with try/catch so a throwing subclass cannot skip
  later cleanup steps.
- Per-event try/catch wrappers around status/error event raises so a
  throwing subscriber cannot block teardown.

OpenClawGatewayClient:
- Apply matching reconnect/teardown hygiene around the keep-alive and
  heartbeat paths so connection state stays consistent across forced
  disconnects.

Tests:
- Relax reconnect-backoff log assertion to tolerate jitter in the
  delay value (still asserts attempt number).

Validation:
- ./build.ps1 clean (0/0)
- Shared.Tests: 2045 passed / 29 skipped
- Tray.Tests:   877 passed
- Manual: tray launched from net10.0-windows10.0.22621.0 build

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RBrid force-pushed the user/rbrid/KeepAliveFixes2 branch from b42fb20 to 1f48af3 on June 2, 2026 02:07
RBrid marked this pull request as ready for review on June 2, 2026 02:20
No code changes; previous push 1f48af3 addressed the Dispose() race finding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RBrid marked this pull request as draft on June 4, 2026 23:29
clawsweeper (Bot) changed labels on Jun 4, 2026:

  • added rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
  • removed rating: 🧂 unranked krab (Not merge-ready due to missing proof or serious correctness/safety concerns.)
  • removed status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)
Labels

  • merge-risk: 🚨 availability (Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages.)
  • merge-risk: 🚨 compatibility (Merging this PR could break existing users, config, migrations, defaults, or upgrades.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
  • rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
