
fix(web): preserve eval compare fields when hydrating session messages from eval history #402

Open
Dsazz wants to merge 5 commits into google:main from Dsazz:fix/session-eval-combined-stability

Dsazz commented Mar 1, 2026

Summary

This PR fixes a regression where failed eval cases loaded from Eval history did not show the Actual vs Expected comparison in the chat panel, even though backend responses contained the required data.

It also adds focused hardening in eval error handling and test coverage to prevent silent regressions.

Problem

When opening a failed eval case from Eval history, the chat session loaded, but failed-message comparison details were missing:

  • Actual Response / Expected Response
  • Actual Tool Uses / Expected Tool Uses
  • Score / Threshold context

The API payload already included these fields.

Root Cause

EvalTabComponent annotates session events with eval metadata (e.g. evalStatus, failedMetric, actualFinalResponse, expectedFinalResponse). In the session-hydration path, however, the event-to-message mapping dropped those fields before rendering.

As a result, the chat panel's render conditions (e.g. for the failed-eval comparison) received incomplete message objects and skipped the corresponding UI sections.

What Changed

1) Preserve eval metadata in message hydration path

Updated chat message mapping so eval annotation fields survive event -> message conversion consistently across session load paths.
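
For readers who have not opened the diff, here is a minimal sketch of what preserving eval annotations across the event -> message conversion could look like. It is not the code from this PR: the `SessionEvent` and `ChatMessage` shapes and the helper names `pickEvalAnnotations` and `eventToMessage` are hypothetical, and only the four field names mentioned in the Root Cause section are used.

```typescript
// Eval annotation fields named in the PR description (given there as examples).
const EVAL_FIELDS = [
  "evalStatus",
  "failedMetric",
  "actualFinalResponse",
  "expectedFinalResponse",
] as const;

type EvalField = (typeof EVAL_FIELDS)[number];
type EvalAnnotations = Partial<Record<EvalField, unknown>>;

// Hypothetical shapes, for illustration only.
interface SessionEvent extends EvalAnnotations {
  author: string;
  text: string;
}

interface ChatMessage extends EvalAnnotations {
  role: "user" | "bot";
  text: string;
}

// Copy every eval annotation that is set on the event, including empty strings.
function pickEvalAnnotations(event: SessionEvent): EvalAnnotations {
  const picked: EvalAnnotations = {};
  for (const field of EVAL_FIELDS) {
    if (event[field] !== undefined) {
      picked[field] = event[field];
    }
  }
  return picked;
}

// A single mapping function shared by all session load paths, so no path
// can silently drop the annotations.
function eventToMessage(event: SessionEvent): ChatMessage {
  return {
    role: event.author === "user" ? "user" : "bot",
    text: event.text,
    ...pickEvalAnnotations(event),
  };
}
```

Routing every load path through one mapper is the design point here: the bug described above came from one path building messages without the annotation fields.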

2) Keep compare rendering valid for empty-string actual response

Compare rendering now checks for presence (the value is neither null nor undefined) rather than truthiness, so an empty-string actual response still renders as valid compare content.
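
The difference between the two checks can be sketched as follows. The `CompareMessage` interface and both function names are hypothetical; the actual condition in the chat panel template may be written differently.

```typescript
// Hypothetical message shape, for illustration only.
interface CompareMessage {
  actualFinalResponse?: string | null;
  expectedFinalResponse?: string | null;
}

// Presence check: true for any value except null and undefined.
function isPresent<T>(value: T | null | undefined): value is T {
  return value !== null && value !== undefined;
}

// Truthiness check, shown for contrast: an empty string counts as "missing".
function shouldRenderCompareTruthy(message: CompareMessage): boolean {
  return Boolean(message.actualFinalResponse);
}

// Presence-based check: an empty string is a valid (empty) actual response.
function shouldRenderCompare(message: CompareMessage): boolean {
  return isPresent(message.actualFinalResponse);
}
```

With an agent that returned an empty final response, `shouldRenderCompareTruthy` hides the comparison while `shouldRenderCompare` still shows it.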

3) Harden EvalTab error handling

Improved EvalTab HTTP error behavior:

  • 404 on eval results is treated as expected “no history yet” ([])
  • non-404 errors are no longer silently flattened into empty results
  • removed brittle statusText === 'Not Found' checks in favor of status === 404
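
The error-handling rules above can be sketched as a pure function. This is an assumption-laden illustration, not the PR's code: `HttpErrorLike`, `HistoryState`, and `applyEvalHistoryError` are hypothetical names, and the real component presumably applies the same decision inside its HTTP error callback. Keeping the existing results on non-404 failures follows the fourth commit message.

```typescript
// Minimal stand-in for an HTTP error object; only `status` drives the decision.
interface HttpErrorLike {
  status: number;
  statusText?: string;
}

// Hypothetical state shape: current results plus the last surfaced error.
interface HistoryState<T> {
  results: T[];
  error: HttpErrorLike | null;
}

function applyEvalHistoryError<T>(
  current: T[],
  error: HttpErrorLike,
): HistoryState<T> {
  // 404 means "no history yet": an expected, empty result rather than a failure.
  if (error.status === 404) {
    return { results: [], error: null };
  }
  // Any other failure keeps the data already shown and surfaces the error
  // instead of flattening it into an empty list.
  return { results: current, error };
}
```

Comparing the numeric `status` avoids depending on `statusText`, which servers and proxies do not always populate with the standard reason phrase.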

4) Add regression tests

Added/updated tests to cover:

  • stable eval history behavior
  • 404 vs non-404 handling paths
  • tab visibility behavior for missing eval sets
  • failed compare data availability expectations

Why This Approach

  • Keeps backend contract unchanged
  • Fixes the bug at the data propagation boundary (root cause), not via template-only workaround
  • Preserves intended UX for real “no history yet” cases while avoiding hidden failures for real backend errors

Validation

  • Reproduced with failed eval history runs and confirmed compare sections appear
  • Confirmed score/threshold and tool-use compare fields display correctly
  • Confirmed 404/no-history behavior remains user-friendly
  • Added targeted test coverage for new behavior

Related

Dsazz added 4 commits March 1, 2026 19:49
Ensure session URL handling and eval comparison message mapping stay stable so selected sessions and failed-eval response fields render reliably in the chat panel.

Made-with: Cursor
Precompute eval history render fields and align dynamic eval-tab lifecycle wiring to avoid refresh-time races and change-detection churn on first load.

Made-with: Cursor
Use messages as the single source of truth in ChatPanel by removing the displayMessages snapshot layer and related render-key plumbing to keep the implementation simpler and easier to reason about.

Made-with: Cursor
Handle 404 responses as expected empty eval history while preserving existing UI data on non-404 failures, and add targeted EvalTab tests for 404/non-404 behavior and tab visibility logic.

Made-with: Cursor

google-cla bot commented Mar 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.
