Skip to content

Conversation

@mmabrouk
Copy link
Member

@mmabrouk mmabrouk commented Feb 3, 2026

Summary

Stacked on #3622

Updates evaluation OpenAPI parsing to prefer the new x-agenta-flags.is_chat vendor extension, with fallback to legacy heuristics.

Changes

  • Parse x-agenta-flags.is_chat from OpenAPI /test or /run operations
  • Fall back to legacy heuristic (check for messages property or x-parameter: messages)
  • Thread is_chat through payload construction so message parsing only runs for chat apps
  • Add temporary logging to distinguish which detection path was used

Logging (temporary)

Chat detection from x-agenta-flags  is_chat=True  path=/test

or

Chat detection fallback to heuristic  is_chat=True  path=/test

These logs will be removed after validation.

Files Changed

  • api/oss/src/services/llm_apps_service.py
  • api/oss/src/core/evaluations/tasks/legacy.py
  • docs/design/chat-interface-rfc/status.md

- Parse x-agenta-flags.is_chat from OpenAPI operations when available
- Fall back to legacy heuristic based on messages fields
- Thread is_chat into evaluation payload building and add temporary logs
@vercel
Copy link

vercel bot commented Feb 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agenta-documentation Ready Ready Preview, Comment Feb 9, 2026 10:32pm

Request Review

The SDK (PR #3622) changed the OpenAPI vendor extension from a flat
'x-agenta-flags' key to a nested 'x-agenta: {flags: {...}}' structure.
Update _get_openapi_chat_flag to read from the new nested path.

Also removes unused imports (common, make_hash_id) caught by ruff.
@mmabrouk mmabrouk force-pushed the feat/chat-interface-eval-detection branch from 8ef59cf to a7478cd Compare February 9, 2026 19:31
@mmabrouk mmabrouk marked this pull request as ready for review February 9, 2026 19:37
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Open in Devin Review

…llback, restore removed variables

- Fix TODO comment: says 'messages' column, not 'chats'
- Remove datapoint.get('chat') fallback — 'chat' was the old column name,
  the FE now uses 'messages'. No need for backward compat.
- Restore references/links variables + imports that were removed by ruff
  as unused — they belong to a commented-out make_hash_id call and are
  out of scope for this PR.
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 9, 2026
…SaveTestsetModal

SaveTestsetModal.tsx hardcodes 'chat' as the column name when re-saving
evaluation results to a testset. Several FE readers (DebugSection,
CHAT_ARRAY_KEYS, legacy evaluation exports) also reference 'chat'.
Keep the fallback: prefer 'messages', fall back to 'chat'.
@mmabrouk
Copy link
Member Author

mmabrouk commented Feb 9, 2026

Testing Results

What was tested

  • Deployed feat/chat-interface-eval-detection on dev environment (port 8180)
  • Ran an evaluation on a chat app with a testset containing a messages column
  • Verified worker logs confirm x-agenta.flags path is used (not heuristic fallback):
    Chat detection from x-agenta.flags  is_chat=True  path=/test
    

What's confirmed working

  • _get_openapi_chat_flag() correctly reads the nested x-agenta: {flags: {is_chat: true}} from OpenAPI
  • is_chat=True is passed through batch_invokerun_with_retryinvoke_appmake_payload
  • make_payload gates payload["messages"] behind is_chat (no longer injected unconditionally)

Additional tests needed

  • Non-chat app evaluation: run eval on a completion (non-chat) app and confirm is_chat=False or Nonepayload["messages"] should NOT be injected
  • Heuristic fallback: test with an app that doesn't emit x-agenta.flags (e.g. older SDK) — should fall back to messages property/parameter heuristic and log "Chat detection fallback to heuristic"
  • chat column fallback: run eval with a testset that has a chat column (not messages) — verify datapoint.get("chat") fallback works

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backend Evaluation size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant