Skip to content

[Repo Assist] test(chat): add regression coverage for failed-tool + final-assistant sequence#692

Merged
shanselman merged 1 commit into
masterfrom
repo-assist/improve-tool-error-then-final-assistant-tests-29903afd9bf40cc1
Jun 7, 2026
Merged

[Repo Assist] test(chat): add regression coverage for failed-tool + final-assistant sequence#692
shanselman merged 1 commit into
masterfrom
repo-assist/improve-tool-error-then-final-assistant-tests-29903afd9bf40cc1

Conversation

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot commented Jun 5, 2026

🤖 This PR was created by Repo Assist, an automated AI assistant.

Summary

Adds 5 new ChatTimelineReducer unit tests that pin the expected state for the sequence:

ChatToolStartEvent → ChatToolErrorEvent → ChatMessageEvent(final) → ChatTurnEndEvent

Motivation

This is regression coverage for issue #672 (Windows Companion Chat shows stale failed tool event while web UI shows completed assistant response).

The bug is in the rendering layerOpenClawChatTimeline.cs reorders entries within each turn so all ToolCall entries appear after all non-ToolCall entries. For the failure case, this puts the failed tool event at the visual bottom (where auto-scroll lands), while the final assistant response is above it and off-screen.

The reducer state is already correct — this PR locks that behavior with tests so any future fix can proceed with confidence.

Tests added (ChatTimelineReducerTests.cs)

Test Verifies
ToolError_ThenFinalAssistant_ProducesToolAndAssistantEntries Both ToolCall (Error) and Assistant entries exist
ToolError_ThenFinalAssistant_AssistantHasFinalText Assistant entry has the expected final text
ToolError_ThenFinalAssistant_TurnIsEnded Turn is ended cleanly, active IDs cleared
ToolError_ThenFinalAssistant_ToolEntryPrecedesAssistantInState ToolCall entry is at index 0, Assistant at index 1 (state insertion order)
ToolError_ThenFinalAssistant_ToolOutputContainsErrorText ToolCall entry carries the error text

Relationship to issue #672

The ToolEntryPrecedesAssistantInState test explicitly documents the state insertion order. This is the baseline for a future rendering fix: the rendering layer should keep the final assistant response as the last visible item after a failed tool.

Test Status

  • dotnet test OpenClaw.Tray.Tests — 941 total, 939 passed, 2 skipped (infrastructure-related), 0 failed
  • dotnet test OpenClaw.Shared.Tests — 8 pre-existing failures on master (unrelated to these changes), confirmed identical on both master and this branch
  • ⚠️ ./build.ps1 — not run (Windows PowerShell build script, not available in Linux CI environment; this is an infrastructure limitation)

Generated by 🌈 Repo Assist. Learn more.

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@97143ac59cb3a13ef2a77581f929f06719c7402a

…ant sequence (#672)

Add 5 new ChatTimelineReducer tests that pin the expected state for the
sequence: ToolStart → ToolError → final ChatMessageEvent → ChatTurnEnd.

This is regression coverage for issue #672 (Companion shows stale failed
tool event while web UI shows completed response). The bug is in the
rendering layer (OpenClawChatTimeline reorders ToolCall entries to appear
after non-ToolCall entries within a turn, placing the error at visual
bottom while the assistant reply scrolls off screen), but the reducer
state is correct — these tests document and lock that.

Tests added:
- ToolError_ThenFinalAssistant_ProducesToolAndAssistantEntries
- ToolError_ThenFinalAssistant_AssistantHasFinalText
- ToolError_ThenFinalAssistant_TurnIsEnded
- ToolError_ThenFinalAssistant_ToolEntryPrecedesAssistantInState
- ToolError_ThenFinalAssistant_ToolOutputContainsErrorText

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented Jun 5, 2026

Codex review: found issues before merge. Reviewed June 4, 2026, 9:52 PM ET / 01:52 UTC.

Summary
This PR adds five ChatTimelineReducer unit tests for the failed-tool followed by final-assistant sequence.

Reproducibility: yes. for the reducer baseline from source inspection: the ToolStart, ToolError, final ChatMessage, and TurnEnd path produces the state shape the PR tests. The actual Windows Companion visual bug remains represented by the linked issue rather than reproduced here in a live UI run.

Review metrics: 2 noteworthy metrics.

  • Changed files: 1 test file changed. The patch is test-only, so it cannot resolve the linked visible renderer bug by itself.
  • Added tests: 5 reducer tests added. The new coverage pins reducer state for the failed-tool/final-assistant sequence that the renderer later reorders.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🌊 off-meta tidepool
Patch quality: 🦐 gold shrimp
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Run or collect the repository-required Windows validation: ./build.ps1, shared tests, and tray tests.
  • [P2] Decide whether to land this reducer baseline separately or include it with the renderer fix for the linked issue.

Risk before merge

Maintainer options:

  1. Decide the mitigation before merge
    Land this reducer baseline only if maintainers want a separate checkpoint, then fix the visible renderer ordering or scroll behavior in the linked issue with Windows Companion proof.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P2] The remaining action is maintainer judgment plus validation, not a narrow automated code repair for this PR branch.

Security
Cleared: The diff only adds unit tests under tests/ and does not change production code, dependencies, CI, secrets handling, or build scripts.

Review findings

  • [P2] Complete required validation before merge — tests/OpenClaw.Tray.Tests/ChatTimelineReducerTests.cs:263
Review details

Best possible solution:

Land this reducer baseline only if maintainers want a separate checkpoint, then fix the visible renderer ordering or scroll behavior in the linked issue with Windows Companion proof.

Do we have a high-confidence way to reproduce the issue?

Yes for the reducer baseline from source inspection: the ToolStart, ToolError, final ChatMessage, and TurnEnd path produces the state shape the PR tests. The actual Windows Companion visual bug remains represented by the linked issue rather than reproduced here in a live UI run.

Is this the best way to solve the issue?

Unclear as a complete fix: the tests are a reasonable baseline, but the visible bug lives in OpenClawChatTimeline.cs. Maintainers should decide whether to land this checkpoint separately or fold it into the renderer fix.

Full review comments:

  • [P2] Complete required validation before merge — tests/OpenClaw.Tray.Tests/ChatTimelineReducerTests.cs:263
    The PR body says ./build.ps1 was not run and shared tests still had failures. AGENTS.md requires the full build, shared tests, and tray tests after changes, so this needs completed Windows validation or an explicit maintainer override before merge.
    Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.86

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 99efc50cbc22.

Label changes

Label changes:

  • add rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🌊 off-meta tidepool and patch quality is 🦐 gold shrimp.
  • add status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Not applicable: The PR is authored by an automation bot and only adds reducer unit tests, so the external-contributor real behavior proof gate does not apply.
  • remove rating: 🐚 platinum hermit: Current PR rating is rating: 🦐 gold shrimp, so this older rating label is no longer current.
  • remove status: 👀 ready for maintainer look: Current PR status label is status: ⏳ waiting on author.

Label justifications:

  • P3: This is low-risk test-only baseline coverage for an existing chat display issue, not the user-visible fix itself.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🌊 off-meta tidepool and patch quality is 🦐 gold shrimp.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Not applicable: The PR is authored by an automation bot and only adds reducer unit tests, so the external-contributor real behavior proof gate does not apply.
Evidence reviewed

Acceptance criteria:

  • [P1] ./build.ps1.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj --no-restore.

What I checked:

Likely related people:

  • Christine Yan: Current blame for the reducer, renderer ordering block, and existing reducer tests points to the same recent area commit. (role: recent area contributor; confidence: medium; commits: 85445c78066b; files: src/OpenClaw.Chat/ChatTimelineReducer.cs, src/OpenClaw.Tray.WinUI/Chat/OpenClawChatTimeline.cs, tests/OpenClaw.Tray.Tests/ChatTimelineReducerTests.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P3 Low-risk cleanup, docs, polish, ergonomics, or speculative feature. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 5, 2026
@shanselman shanselman marked this pull request as ready for review June 7, 2026 22:23
@shanselman shanselman merged commit 559f411 into master Jun 7, 2026
7 checks passed
@shanselman shanselman deleted the repo-assist/improve-tool-error-then-final-assistant-tests-29903afd9bf40cc1 branch June 7, 2026 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automation enhancement New feature or request P3 Low-risk cleanup, docs, polish, ergonomics, or speculative feature. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. repo-assist status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant