diff --git a/packages/junior-evals/README.md b/packages/junior-evals/README.md index 18445bc4..a3a736ae 100644 --- a/packages/junior-evals/README.md +++ b/packages/junior-evals/README.md @@ -4,9 +4,9 @@ Evals are end-to-end Slack conversation evaluations. They are the integration-style test layer for agent-facing behavior when model interpretation is part of the contract. -- We define conversation cases inline in TypeScript using `slackEval()`. +- We define conversation cases inline in TypeScript using `describeEval()` and the shared `slackEvals` harness options. - We run the real runtime/harness against those fixtures. -- We score outcomes with an LLM judge via `vitest-evals`. +- We score outcomes with a `vitest-evals` judge that reuses the Slack harness prompt seam, backed by Junior's Pi client and the Vercel AI Gateway model `openai/gpt-5.4`. ## Layer Boundaries @@ -52,7 +52,7 @@ Not in scope: ## Execution Model -For each case (`slackEval()` call): +For each `it()` case inside a `describeEval()` suite: 1. Replay events through the harness via `runEvalScenario()`. 2. Create a fresh runtime instance for the case via the chat composition root; do not mutate the production singleton runtime. @@ -97,7 +97,7 @@ Evals require real Vercel Sandbox access. If sandbox bootstrap fails, the eval f ## Authoring Rules -- Add core cases under `evals/core/*.eval.ts` and plugin-specific cases under `evals/<plugin>/` using `slackEval()`. +- Add core cases under `evals/core/*.eval.ts` and plugin-specific cases under `evals/<plugin>/` using `describeEval()` with `slackEvals`. - Use event builders (`mention`, `threadMessage`, `threadStart`) from `evals/helpers.ts`. - Use `auto_complete_mcp_oauth` or `auto_complete_oauth` when the harness should instantly complete the fake provider callback after our app has genuinely initiated auth. - For multi-turn, pass the same `thread` override so events land in one thread. @@ -109,7 +109,7 @@ Evals require real Vercel Sandbox access.
If sandbox bootstrap fails, the eval f - `allow` should list acceptable optional variations. - `fail` should list forbidden outputs or failure conditions. - Do not write judge criteria as one dense paragraph. -- Let the `describe()` block own the behavior area. The file path and `describe()` context already provide scope. +- Let the `describeEval()` block own the behavior area. The file path and `describeEval()` context already provide scope. - Each eval name should only state the specific scenario and outcome. - Prefer `when <scenario>, <outcome>` over vague labels like `continuity: remembers prior turn context`. - Keep user prompts natural. They should read like plausible user requests, not scripted implementation instructions. @@ -159,13 +159,18 @@ Avoid: ## Minimal Case ```typescript -import { mention, rubric, slackEval } from "../helpers"; - -slackEval("when explicitly mentioned, post one direct reply", { - events: [mention("<@U_APP> summarize this")], - criteria: rubric({ - contract: "An explicit mention gets one direct reply.", - pass: ["The assistant posts exactly one reply to the mention."], - }), +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals } from "../helpers"; + +describeEval("Routing", slackEvals, (it) => { + it("when explicitly mentioned, post one direct reply", async ({ run }) => { + await run({ + events: [mention("<@U_APP> summarize this")], + criteria: rubric({ + contract: "An explicit mention gets one direct reply.", + pass: ["The assistant posts exactly one reply to the mention."], + }), + }); + }); }); ``` diff --git a/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts b/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts index 0e0cfd1a..f30a9b56 100644 --- a/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts +++ b/packages/junior-evals/evals/core/lifecycle-and-resilience.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval, threadStart }
from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals, threadStart } from "../helpers"; -describe("Lifecycle and Resilience", () => { - slackEval( - "when an assistant thread starts, set title and prompts without posting a reply", - { +describeEval("Lifecycle and Resilience", slackEvals, (it) => { + it("when an assistant thread starts, set title and prompts without posting a reply", async ({ + run, + }) => { + await run({ events: [threadStart()], criteria: rubric({ contract: @@ -15,12 +16,13 @@ describe("Lifecycle and Resilience", () => { "Suggested prompts are set exactly once.", ], }), - }, - ); + }); + }); - slackEval( - "when reply generation fails before any answer, post one clear error reply", - { + it("when reply generation fails before any answer, post one clear error reply", async ({ + run, + }) => { + await run({ overrides: { fail_reply_call: 1 }, events: [mention("What's the status of the deploy?")], criteria: rubric({ @@ -34,12 +36,13 @@ describe("Lifecycle and Resilience", () => { "Do not leak stack traces, exception text, or debugging narration in the reply.", ], }), - }, - ); + }); + }); - slackEval( - "when a short reply is interrupted by the provider, keep the partial answer in one marked post", - { + it("when a short reply is interrupted by the provider, keep the partial answer in one marked post", async ({ + run, + }) => { + await run({ overrides: { reply_results: [ { @@ -63,6 +66,6 @@ describe("Lifecycle and Resilience", () => { "Do not mention provider internals, execution failure details, or logged-for-debugging text.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/media-and-attachments.eval.ts b/packages/junior-evals/evals/core/media-and-attachments.eval.ts index bb5e3aac..27cb1718 100644 --- a/packages/junior-evals/evals/core/media-and-attachments.eval.ts +++ b/packages/junior-evals/evals/core/media-and-attachments.eval.ts @@ -1,10 +1,11 @@ -import { describe } 
from "vitest"; -import { mention, rubric, slackEval } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals } from "../helpers"; -describe("Media and Attachments", () => { - slackEval( - "when the user asks for an image, attach an image instead of replying with text alone", - { +describeEval("Media and Attachments", slackEvals, (it) => { + it("when the user asks for an image, attach an image instead of replying with text alone", async ({ + run, + }) => { + await run({ overrides: { mock_image_generation: true }, events: [mention("show me how you feel")], criteria: rubric({ @@ -17,6 +18,6 @@ describe("Media and Attachments", () => { "Do not include sandbox setup failure text.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/oauth-workflows.eval.ts b/packages/junior-evals/evals/core/oauth-workflows.eval.ts index 96ce3583..a90c2c9b 100644 --- a/packages/junior-evals/evals/core/oauth-workflows.eval.ts +++ b/packages/junior-evals/evals/core/oauth-workflows.eval.ts @@ -1,16 +1,17 @@ -import { describe } from "vitest"; -import { rubric, slackEval, threadMessage } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { rubric, slackEvals, threadMessage } from "../helpers"; -describe("OAuth Workflows", () => { +describeEval("OAuth Workflows", slackEvals, (it) => { const mcpAuthResumeThread = { id: "thread-auth-resume", channel_id: "C-auth-resume", thread_ts: "17000000.auth-resume", }; - slackEval( - "when MCP auth pauses a turn, resume in the same thread with prior context intact", - { + it("when MCP auth pauses a turn, resume in the same thread with prior context intact", async ({ + run, + }) => { + await run({ overrides: { auto_complete_mcp_oauth: ["eval-auth"], plugin_dirs: ["evals/fixtures/plugins"], @@ -29,7 +30,6 @@ describe("OAuth Workflows", () => { ), ], taskTimeout: 120_000, - timeout: 300_000, criteria: rubric({ contract: "After MCP authorization completes, the same 
thread gets a resumed answer that keeps prior context.", @@ -49,8 +49,8 @@ describe("OAuth Workflows", () => { "Do not post a generic failure message.", ], }), - }, - ); + }); + }); const oauthResumeThread = { id: "thread-oauth-resume", @@ -58,9 +58,10 @@ describe("OAuth Workflows", () => { thread_ts: "17000000.oauth-resume", }; - slackEval( - "when generic OAuth pauses a turn, resume in the same thread with prior context intact", - { + it("when generic OAuth pauses a turn, resume in the same thread with prior context intact", async ({ + run, + }) => { + await run({ overrides: { auto_complete_oauth: ["eval-oauth"], plugin_dirs: ["evals/fixtures/plugins"], @@ -79,7 +80,6 @@ describe("OAuth Workflows", () => { ), ], taskTimeout: 120_000, - timeout: 300_000, criteria: rubric({ contract: "After generic OAuth authorization completes, the same thread gets a resumed answer that keeps prior context.", @@ -97,8 +97,8 @@ describe("OAuth Workflows", () => { "Do not post a generic failure message.", ], }), - }, - ); + }); + }); const oauthReconnectThread = { id: "thread-oauth-reconnect", @@ -106,9 +106,10 @@ describe("OAuth Workflows", () => { thread_ts: "17000000.oauth-reconnect", }; - slackEval( - "when the user explicitly asks to reconnect, confirm reconnection without auto-resuming another task", - { + it("when the user explicitly asks to reconnect, confirm reconnection without auto-resuming another task", async ({ + run, + }) => { + await run({ overrides: { auto_complete_oauth: ["eval-oauth"], plugin_dirs: ["evals/fixtures/plugins"], @@ -120,7 +121,6 @@ describe("OAuth Workflows", () => { ), ], taskTimeout: 120_000, - timeout: 300_000, criteria: rubric({ contract: "An explicit reconnect request can drive a fresh authorization cycle to completion in the same thread.", @@ -137,6 +137,6 @@ describe("OAuth Workflows", () => { "Do not post a generic failure message.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/output-contract.eval.ts 
b/packages/junior-evals/evals/core/output-contract.eval.ts index e3782e2d..6095f1a1 100644 --- a/packages/junior-evals/evals/core/output-contract.eval.ts +++ b/packages/junior-evals/evals/core/output-contract.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals } from "../helpers"; -describe("Output Contract", () => { - slackEval( - "when asked for a structured overview, use bolded section labels instead of markdown headings", - { +describeEval("Output Contract", slackEvals, (it) => { + it("when asked for a structured overview, avoid hash markdown headings", async ({ + run, + }) => { + await run({ events: [ mention( "Give me a short overview of how OAuth 2.0 authorization code flow works. Cover the authorization request, token exchange, and refresh. Keep it to a few short sections.", @@ -13,22 +14,26 @@ describe("Output Contract", () => { requireSandboxReady: false, criteria: rubric({ contract: - "Structured multi-section replies use Slack-friendly bolded section labels, not markdown heading syntax.", + "Structured multi-section replies do not use hash-prefixed markdown heading markers.", pass: [ "The assistant posts one reply that covers the authorization request, token exchange, and refresh.", - "Section labels appear as bolded short phrases on their own line, not as markdown headings.", + "No section label line starts with `#`, `##`, or `###`.", + ], + allow: [ + "Bolded title lines, bolded section labels, and numbered bold labels are acceptable.", ], fail: [ - "Do not use markdown heading syntax (lines beginning with `#`, `##`, or `###`) for section labels.", - "Do not paste a heading line like `# Authorization Request` at the start of a section.", + "Do not use lines beginning with `#`, `##`, or `###` for section labels.", + "Do not paste a hash-heading line like `# Authorization Request` at the start of a 
section.", ], }), - }, - ); + }); + }); - slackEval( - "when the reply contains multiple URLs, use plain URLs instead of markdown link syntax", - { + it("when the reply contains multiple URLs, use plain URLs instead of markdown link syntax", async ({ + run, + }) => { + await run({ events: [ mention( "Where can I find the official documentation for the Slack Web API, Slack Bolt JS, and Slack Block Kit? Just point me at the three canonical starting pages.", @@ -47,12 +52,13 @@ describe("Output Contract", () => { "Do not wrap URLs in Slack `<url|label>` link syntax unless the user explicitly asked for that form.", ], }), - }, - ); + }); + }); - slackEval( - "when asked to compare two options, use bullets instead of a markdown table", - { + it("when asked to compare two options, use bullets instead of a markdown table", async ({ + run, + }) => { + await run({ events: [ mention( "Give me a short comparison of REST and GraphQL across these three dimensions: caching, over-fetching, and tooling maturity. Keep it tight.", @@ -71,6 +77,6 @@ describe("Output Contract", () => { "Do not include a row like `| REST | GraphQL |` or similar pipe-delimited structures.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/passive-behavior.eval.ts b/packages/junior-evals/evals/core/passive-behavior.eval.ts index ab72bd28..2636aa8d 100644 --- a/packages/junior-evals/evals/core/passive-behavior.eval.ts +++ b/packages/junior-evals/evals/core/passive-behavior.eval.ts @@ -1,40 +1,44 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval, threadMessage } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals, threadMessage } from "../helpers"; -describe("Passive Behavior", () => { +describeEval("Passive Behavior", slackEvals, (it) => { const sideConversationThread = { id: "thread-passive-side-conversation", channel_id: "C-passive-side-conversation", thread_ts: "17000000.passive-side-conversation", }; -
slackEval("when a later question is human-to-human, stay out of the thread", { - overrides: { - reply_texts: [ - "The deploy changed the billing worker and the API auth flow.", - ], - }, - events: [ - mention( - "Summarize this deploy in one sentence. It changed the billing worker and the API auth flow.", - { + it("when a later question is human-to-human, stay out of the thread", async ({ + run, + }) => { + await run({ + overrides: { + reply_texts: [ + "The deploy changed the billing worker and the API auth flow.", + ], + }, + events: [ + mention( + "Summarize this deploy in one sentence. It changed the billing worker and the API auth flow.", + { + thread: sideConversationThread, + }, + ), + threadMessage("@sam can you take the billing worker rollback?", { thread: sideConversationThread, - }, - ), - threadMessage("@sam can you take the billing worker rollback?", { - thread: sideConversationThread, - }), - ], - criteria: rubric({ - contract: - "A later human-to-human question is ignored even when it is phrased like something Junior could answer.", - pass: [ - "The assistant posts exactly one reply: the initial helpful answer about the deploy.", - ], - fail: [ - "Do not answer the later question addressed to @sam about who should take the rollback.", + }), ], - }), + criteria: rubric({ + contract: + "A later human-to-human question is ignored even when it is phrased like something Junior could answer.", + pass: [ + "The assistant posts exactly one reply: the initial helpful answer about the deploy.", + ], + fail: [ + "Do not answer the later question addressed to @sam about who should take the rollback.", + ], + }), + }); }); const directedFollowUpThread = { @@ -43,9 +47,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-directed-follow-up", }; - slackEval( - "when a follow-up is clearly directed at Junior's prior answer, reply without another @mention", - { + it("when a follow-up is clearly directed at Junior's prior answer, reply without 
another @mention", async ({ + run, + }) => { + await run({ overrides: { reply_texts: ["You need the budget by Friday."], }, @@ -65,8 +70,8 @@ describe("Passive Behavior", () => { "The second reply plainly restates that the budget is needed by Friday.", ], }), - }, - ); + }); + }); const casualPronounThread = { id: "thread-passive-casual-pronoun", @@ -74,9 +79,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-casual-pronoun", }; - slackEval( - "when a casual pronoun question reads like coworker talk, stay out of the thread", - { + it("when a casual pronoun question reads like coworker talk, stay out of the thread", async ({ + run, + }) => { + await run({ overrides: { reply_texts: [ "The deploy changed the billing worker and the API auth flow.", @@ -101,8 +107,8 @@ describe("Passive Behavior", () => { "Do not reply to the later casual question 'Is that the right approach?'", ], }), - }, - ); + }); + }); const domainVocabThread = { id: "thread-passive-domain-vocab", @@ -110,9 +116,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-domain-vocab", }; - slackEval( - "when a later question only shares topic vocabulary, do not treat it as directed at Junior", - { + it("when a later question only shares topic vocabulary, do not treat it as directed at Junior", async ({ + run, + }) => { + await run({ overrides: { reply_texts: [ "The billing worker handles invoice processing and payment retries.", @@ -136,8 +143,8 @@ describe("Passive Behavior", () => { "Do not reply to the later question about the billing worker timeline.", ], }), - }, - ); + }); + }); const canYouThread = { id: "thread-passive-can-you", @@ -145,9 +152,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-can-you", }; - slackEval( - "when 'can you' is directed at a coworker, stay out of the thread", - { + it("when 'can you' is directed at a coworker, stay out of the thread", async ({ + run, + }) => { + await run({ overrides: { reply_texts: ["Here's 
the deployment status."], }, @@ -163,8 +171,8 @@ describe("Passive Behavior", () => { ], fail: ["Do not reply to the later 'Can you check on this?' message."], }), - }, - ); + }); + }); const genuineFollowUpThread = { id: "thread-passive-genuine-follow-up", @@ -172,9 +180,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-genuine-follow-up", }; - slackEval( - "when the user explicitly asks Junior to elaborate, post a second reply", - { + it("when the user explicitly asks Junior to elaborate, post a second reply", async ({ + run, + }) => { + await run({ overrides: { reply_texts: ["The deploy changed three services."], }, @@ -197,8 +206,8 @@ describe("Passive Behavior", () => { "The second reply provides more detail about the deploy changes.", ], }), - }, - ); + }); + }); const terseFollowUpThread = { id: "thread-passive-terse-follow-up", @@ -206,9 +215,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-terse-follow-up", }; - slackEval( - "when a terse clarification comes right after Junior's answer, treat it as directed back to Junior", - { + it("when a terse clarification comes right after Junior's answer, treat it as directed back to Junior", async ({ + run, + }) => { + await run({ overrides: { reply_texts: [ "The deploy changed billing, auth, and the API gateway.", @@ -231,8 +241,8 @@ describe("Passive Behavior", () => { "The second reply clarifies which services changed.", ], }), - }, - ); + }); + }); const humansTookFloorThread = { id: "thread-passive-humans-took-floor", @@ -240,9 +250,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.passive-humans-took-floor", }; - slackEval( - "when humans resume the thread, keep ignoring same-topic questions unless they turn back to Junior", - { + it("when humans resume the thread, keep ignoring same-topic questions unless they turn back to Junior", async ({ + run, + }) => { + await run({ overrides: { reply_texts: ["The deploy changed billing, auth, and the API 
gateway."], }, @@ -265,8 +276,8 @@ describe("Passive Behavior", () => { ], fail: ["Do not answer the later billing worker timeline question."], }), - }, - ); + }); + }); const optOutThread = { id: "thread-opt-out", @@ -274,9 +285,10 @@ describe("Passive Behavior", () => { thread_ts: "17000000.optout", }; - slackEval( - "when the user says to stop participating, stay quiet until re-mentioned", - { + it("when the user says to stop participating, stay quiet until re-mentioned", async ({ + run, + }) => { + await run({ overrides: { reply_texts: [ "I can help in this thread.", @@ -305,6 +317,6 @@ describe("Passive Behavior", () => { ], fail: ["Do not treat the stop message like an ordinary help request."], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/research-reply-shape.eval.ts b/packages/junior-evals/evals/core/research-reply-shape.eval.ts index 01e10364..86644362 100644 --- a/packages/junior-evals/evals/core/research-reply-shape.eval.ts +++ b/packages/junior-evals/evals/core/research-reply-shape.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals } from "../helpers"; -describe("Research Reply Shape", () => { - slackEval( - "when summarizing multiple sources, show initial progress and return a concise answer without process chatter", - { +describeEval("Research Reply Shape", slackEvals, (it) => { + it("when summarizing multiple sources, show initial progress and return a concise answer without process chatter", async ({ + run, + }) => { + await run({ events: [ mention( "Read these three sources and give me one brief, coherent summary of how modern Slack agent streaming works. 
Keep it short enough to fit in one normal Slack reply, and do not include code samples: https://docs.slack.dev/ai/developing-agents/ , https://docs.slack.dev/reference/methods/chat.startStream/ , https://docs.slack.dev/reference/methods/chat.stopStream/ .", @@ -15,7 +16,6 @@ describe("Research Reply Shape", () => { }, requireSandboxReady: false, taskTimeout: 150_000, - timeout: 210_000, criteria: rubric({ contract: "A multi-source research request returns a concise Slack-style answer without process chatter.", @@ -33,12 +33,13 @@ describe("Research Reply Shape", () => { "Do not send caveats about inaccessible or partial sources as a stray status-like note.", ], }), - }, - ); + }); + }); - slackEval( - "when long-form research is requested as a reusable reference, create a canvas and keep the thread reply brief", - { + it("when long-form research is requested as a reusable reference, create a canvas and keep the thread reply brief", async ({ + run, + }) => { + await run({ events: [ mention( "Read these three sources and put together a detailed timeline and implementation reference for modern Slack agent streaming that I can come back to later. 
Cover how the APIs evolved, the key methods, the current limits, and the migration gotchas: https://docs.slack.dev/ai/developing-agents/ , https://docs.slack.dev/reference/methods/chat.startStream/ , https://docs.slack.dev/reference/methods/chat.stopStream/ .", @@ -49,7 +50,6 @@ describe("Research Reply Shape", () => { }, requireSandboxReady: false, taskTimeout: 180_000, - timeout: 240_000, criteria: rubric({ contract: "A long-form research deliverable becomes a Slack canvas, with the thread reserved for a short summary and pointer.", @@ -68,6 +68,6 @@ describe("Research Reply Shape", () => { "Do not add process chatter such as 'let me check', 'fetching', or similar tool-progress narration.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/routing-and-continuity.eval.ts b/packages/junior-evals/evals/core/routing-and-continuity.eval.ts index 1d2ffcf7..4434cbd4 100644 --- a/packages/junior-evals/evals/core/routing-and-continuity.eval.ts +++ b/packages/junior-evals/evals/core/routing-and-continuity.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval, threadMessage } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals, threadMessage } from "../helpers"; -describe("Routing and Continuity", () => { - slackEval( - "when a thread message explicitly mentions Junior, post a direct reply", - { +describeEval("Routing and Continuity", slackEvals, (it) => { + it("when a thread message explicitly mentions Junior, post a direct reply", async ({ + run, + }) => { + await run({ events: [threadMessage("<@U_APP> what is 2+2?", { is_mention: true })], criteria: rubric({ contract: @@ -15,12 +16,13 @@ describe("Routing and Continuity", () => { ], fail: ["Do not return sandbox setup failure text."], }), - }, - ); + }); + }); - slackEval( - "when asked to post in channel, send a channel post instead of a thread reply", - { + it("when asked to post in channel, send 
a channel post instead of a thread reply", async ({ + run, + }) => { + await run({ events: [mention("@bot say hello to the channel!")], criteria: rubric({ contract: @@ -33,12 +35,13 @@ describe("Routing and Continuity", () => { "A lightweight acknowledgement reaction in reactions is acceptable.", ], }), - }, - ); + }); + }); - slackEval( - "when asked to post in another named channel, explain the limitation instead", - { + it("when asked to post in another named channel, explain the limitation instead", async ({ + run, + }) => { + await run({ events: [ mention( "@bot post this in #discuss-design-engineering instead: Heads up, design review starts in 10 minutes.", @@ -57,12 +60,13 @@ describe("Routing and Continuity", () => { "Do not claim the message was posted to #discuss-design-engineering.", ], }), - }, - ); + }); + }); - slackEval( - "when the request is reaction-only, add a reaction without reply clutter", - { + it("when the request is reaction-only, add a reaction without reply clutter", async ({ + run, + }) => { + await run({ events: [mention("react to this")], criteria: rubric({ contract: @@ -73,8 +77,8 @@ describe("Routing and Continuity", () => { "Do not add a short acknowledgement reply such as 'Done'.", ], }), - }, - ); + }); + }); const continuityThread = { id: "thread-continuity", @@ -82,9 +86,10 @@ describe("Routing and Continuity", () => { thread_ts: "17000000.continuity", }; - slackEval( - "when a follow-up asks about the prior turn, recall the earlier budget context", - { + it("when a follow-up asks about the prior turn, recall the earlier budget context", async ({ + run, + }) => { + await run({ events: [ mention("I need the budget by Friday.", { thread: continuityThread }), threadMessage("what did i just ask?", { @@ -101,6 +106,6 @@ describe("Routing and Continuity", () => { ], fail: ["Do not return sandbox setup failure text."], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/core/skill-infra.eval.ts 
b/packages/junior-evals/evals/core/skill-infra.eval.ts index 56c86692..dc7b5010 100644 --- a/packages/junior-evals/evals/core/skill-infra.eval.ts +++ b/packages/junior-evals/evals/core/skill-infra.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval, threadMessage } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals, threadMessage } from "../helpers"; -describe("Skill Infrastructure", () => { - slackEval( - "when the candidate brief command runs, return one candidate brief reply", - { +describeEval("Skill Infrastructure", slackEvals, (it) => { + it("when the candidate brief command runs, return one candidate brief reply", async ({ + run, + }) => { + await run({ overrides: { skill_dirs: ["evals/fixtures/skills"] }, events: [mention("/candidate-brief David Cramer")], criteria: rubric({ @@ -16,8 +17,8 @@ describe("Skill Infrastructure", () => { ], fail: ["Do not include sandbox setup failure text."], }), - }, - ); + }); + }); const candidateBriefThread = { id: "thread-candidate-brief-repeat", @@ -25,9 +26,10 @@ describe("Skill Infrastructure", () => { thread_ts: "17000000.candidate-brief", }; - slackEval( - "when the candidate brief command runs twice in one thread, keep the replies ordered", - { + it("when the candidate brief command runs twice in one thread, keep the replies ordered", async ({ + run, + }) => { + await run({ overrides: { skill_dirs: ["evals/fixtures/skills"] }, events: [ mention("/candidate-brief Alice Example", { @@ -48,12 +50,13 @@ describe("Skill Infrastructure", () => { ], fail: ["Do not include sandbox setup failure text."], }), - }, - ); + }); + }); - slackEval( - "when the working-directory command runs, return one file-list reply", - { + it("when the working-directory command runs, return one file-list reply", async ({ + run, + }) => { + await run({ overrides: { skill_dirs: ["evals/fixtures/skills"] }, events: 
[mention("/list-working-directory")], criteria: rubric({ @@ -65,12 +68,13 @@ describe("Skill Infrastructure", () => { ], fail: ["Do not include sandbox setup failure text."], }), - }, - ); + }); + }); - slackEval( - "when asked to double-check a source-backed fact, use the source and answer completely", - { + it("when asked to double-check a source-backed fact, use the source and answer completely", async ({ + run, + }) => { + await run({ overrides: { skill_dirs: ["evals/fixtures/skills"] }, events: [ mention( @@ -93,12 +97,13 @@ describe("Skill Infrastructure", () => { "Do not claim that a closed issue is enough to prove the capability exists.", ], }), - }, - ); + }); + }); - slackEval( - "when an MCP-backed skill handles a lookup, return the provider-backed answer", - { + it("when an MCP-backed skill handles a lookup, return the provider-backed answer", async ({ + run, + }) => { + await run({ overrides: { plugin_dirs: ["evals/fixtures/plugins"], }, @@ -108,7 +113,6 @@ describe("Skill Infrastructure", () => { ), ], taskTimeout: 120_000, - timeout: 300_000, criteria: rubric({ contract: "An MCP-backed skill can complete a natural lookup by using the provider result instead of surfacing tool validation errors.", @@ -128,6 +132,6 @@ describe("Skill Infrastructure", () => { "Do not say the MCP runtime is broken or that the lookup cannot be attempted.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/github/skill-workflows.eval.ts b/packages/junior-evals/evals/github/skill-workflows.eval.ts index 69a251df..d615c424 100644 --- a/packages/junior-evals/evals/github/skill-workflows.eval.ts +++ b/packages/junior-evals/evals/github/skill-workflows.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval, threadMessage } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals, threadMessage } from "../helpers"; -describe("GitHub Skill Workflows", () => { - 
slackEval( - "when the GitHub credential smoke command runs, return one CREDENTIAL_OK reply", - { +describeEval("GitHub Skill Workflows", slackEvals, (it) => { + it("when the GitHub credential smoke command runs, return one CREDENTIAL_OK reply", async ({ + run, + }) => { + await run({ overrides: { skill_dirs: ["evals/fixtures/skills"], enable_test_credentials: true, @@ -22,12 +23,13 @@ describe("GitHub Skill Workflows", () => { ], fail: ["Do not include sandbox setup failure text."], }), - }, - ); + }); + }); - slackEval( - "when creating a GitHub issue, skip duplicate-search narration in the reply", - { + it("when creating a GitHub issue, skip duplicate-search narration in the reply", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github"], @@ -52,8 +54,8 @@ describe("GitHub Skill Workflows", () => { "Do not report that no duplicates were found.", ], }), - }, - ); + }); + }); const reporterRequesterThread = { id: "thread-reporter-requester", @@ -61,9 +63,10 @@ describe("GitHub Skill Workflows", () => { thread_ts: "17000000.reporter-requester", }; - slackEval( - "when one user reports and another files an issue, keep attribution roles separate", - { + it("when one user reports and another files an issue, keep attribution roles separate", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github"], @@ -120,12 +123,13 @@ describe("GitHub Skill Workflows", () => { "Do not omit reporter attribution when showing the filed issue content.", ], }), - }, - ); + }); + }); - slackEval( - "when a GitHub task mentions a Sentry product area, do not prompt for Sentry auth first", - { + it("when a GitHub task mentions a Sentry product area, do not prompt for Sentry auth first", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github", "@sentry/junior-sentry"], @@ -151,12 
+155,13 @@ describe("GitHub Skill Workflows", () => { "Do not ask to inspect live Sentry data before doing the GitHub task.", ], }), - }, - ); + }); + }); - slackEval( - "when asked an implementation question about this repo, answer from repository evidence", - { + it("when asked an implementation question about this repo, answer from repository evidence", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github"], @@ -181,12 +186,13 @@ describe("GitHub Skill Workflows", () => { "Do not answer purely from generic GitHub or OAuth knowledge without repo evidence.", ], }), - }, - ); + }); + }); - slackEval( - "when asked about PR auth sequencing, mention push auth before PR auth", - { + it("when asked about PR auth sequencing, mention push auth before PR auth", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github"], @@ -211,8 +217,8 @@ describe("GitHub Skill Workflows", () => { "Do not omit the explicit push-auth step.", ], }), - }, - ); + }); + }); const defaultRepoThread = { id: "thread-default-repo", @@ -225,9 +231,10 @@ describe("GitHub Skill Workflows", () => { thread_ts: "17000000.default-repo-issue", }; - slackEval( - "when creating an issue after repo setup, use the stored repo without inventing tool failures", - { + it("when creating an issue after repo setup, use the stored repo without inventing tool failures", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github"], @@ -266,12 +273,13 @@ describe("GitHub Skill Workflows", () => { "Do not create or report an issue for a repository other than getsentry/junior.", ], }), - }, - ); + }); + }); - slackEval( - "when a default repo is set in one turn, reuse it in the next turn without asking again", - { + it("when a default repo is set in one turn, reuse it in the next turn without asking again", async ({ + 
run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-github"], @@ -306,6 +314,6 @@ describe("GitHub Skill Workflows", () => { "Do not say a live GitHub lookup is required before answering.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/evals/helpers.ts b/packages/junior-evals/evals/helpers.ts index e45cf9cf..b638568a 100644 --- a/packages/junior-evals/evals/helpers.ts +++ b/packages/junior-evals/evals/helpers.ts @@ -1,6 +1,17 @@ -import { configure, evaluate } from "vitest-evals/evaluate"; -import { gateway } from "@ai-sdk/gateway"; -import { z } from "zod"; +import { + namedJudge, + type DescribeEvalOptions, + type JudgeContext, +} from "vitest-evals"; +import { completeText, resolveGatewayModel } from "@/chat/pi/client"; +import { + toJsonValue, + type Harness, + type HarnessRun, + type JsonValue, + type NormalizedMessage, + type ToolCallRecord, +} from "vitest-evals/harness"; import { registerLogRecordSink, type EmittedLogRecord } from "@/chat/logging"; import { type EvalEvent, @@ -9,141 +20,6 @@ import { runEvalScenario, } from "./behavior-harness"; -configure({ model: gateway("openai/gpt-5.2") }); - -// ── Eval output schema ───────────────────────────────────── - -const slackMetadataSchema = z.object({ - thread_title_set: z - .boolean() - .describe("Whether the assistant set a title on the Slack thread"), - suggested_prompts_set: z - .boolean() - .describe( - "Whether the assistant set suggested prompts on the Slack thread", - ), - assistant_status_pending: z - .boolean() - .describe( - "Whether any assistant thread still has a non-empty status indicator after the turn completed (should always be false)", - ), -}); - -const attachedFileSchema = z.object({ - filename: z - .string() - .describe("Filename of an actual file attached to the assistant post"), - isImage: z.boolean().describe("Whether the attached file is an image"), - mimeType: z - .string() - .optional() - .describe("MIME 
type of the attached file when known"), - sizeBytes: z - .number() - .optional() - .describe("File size in bytes when the harness has the binary payload"), -}); - -const assistantPostSchema = z.object({ - channel: z - .string() - .optional() - .describe("Slack channel ID where this assistant thread post was sent"), - files: z - .array(attachedFileSchema) - .describe( - "Actual files attached to this assistant thread post, not text describing files", - ), - text: z.string().describe("Visible text the assistant posted in the thread"), - thread_ts: z - .string() - .optional() - .describe("Slack thread timestamp for this assistant thread post"), -}); - -const canvasSchema = z.object({ - title: z.string().describe("Title of a Slack canvas created during the turn"), - markdown: z - .string() - .describe( - "Initial markdown body written into the created Slack canvas during the turn", - ), -}); - -const evalOutputSchema = z.object({ - assistant_posts: z - .array(assistantPostSchema) - .describe( - "Visible assistant replies in the evaluated Slack thread, including attached files and auth-resume replies", - ), - observed_tool_invocations: z - .array( - z.object({ - tool: z.string().describe("Tool name the assistant attempted to call"), - bash_command: z - .string() - .optional() - .describe("Bash command when the invoked tool is bash"), - skill_name: z - .string() - .optional() - .describe("Skill name when the invoked tool is loadSkill"), - mcp_tool_name: z - .string() - .optional() - .describe("MCP tool name when the invoked tool is callMcpTool"), - mcp_arguments: z - .record(z.string(), z.unknown()) - .optional() - .describe( - "MCP provider arguments nested under callMcpTool.arguments", - ), - }), - ) - .describe("Sanitized tool invocations observed during the eval"), - canvases: z - .array(canvasSchema) - .describe("Slack canvases created during the turn"), - channel_posts: z - .array( - z.object({ - channel: z - .string() - .describe("Slack channel ID where a direct 
channel post was sent"), - text: z - .string() - .describe("Message text sent via Slack chat.postMessage"), - thread_ts: z - .string() - .optional() - .describe( - "Slack thread timestamp when the message was sent as a thread reply", - ), - }), - ) - .describe("Slack channel messages sent by the assistant"), - reactions: z - .array( - z.object({ - channel: z - .string() - .describe("Slack channel ID where the reaction was added"), - emoji: z - .string() - .describe("Emoji reaction name sent via Slack reactions.add"), - timestamp: z - .string() - .describe( - "Target message timestamp reacted to via Slack reactions.add", - ), - }), - ) - .describe("Slack reactions added by the assistant"), - slack_metadata: slackMetadataSchema.describe( - "Slack thread metadata set by the assistant", - ), -}); - function hasAssistantStatusPending(result: EvalResult): boolean { const lastByThread = new Map(); for (const call of result.slackAdapter.statusCalls) { @@ -155,22 +31,103 @@ function hasAssistantStatusPending(result: EvalResult): boolean { return false; } -function serializeEvalResult(result: EvalResult): string { - const output: z.input = { - assistant_posts: result.posts, - observed_tool_invocations: result.toolInvocations, - canvases: result.canvases, - channel_posts: result.channelPosts, - reactions: result.reactions, +function toJson(value: unknown): JsonValue { + return toJsonValue(value) ?? 
null; +} + +function toJsonRecord( + value: Record, +): Record { + const record: Record = {}; + for (const [key, entry] of Object.entries(value)) { + record[key] = toJson(entry); + } + return record; +} + +function buildEvalOutput(result: EvalResult): Record { + return { + assistant_posts: toJson(result.posts), + observed_tool_invocations: toJson(result.toolInvocations), + canvases: toJson(result.canvases), + channel_posts: toJson(result.channelPosts), + reactions: toJson(result.reactions), slack_metadata: { thread_title_set: result.slackAdapter.titleCalls.length > 0, suggested_prompts_set: result.slackAdapter.promptCalls.length > 0, assistant_status_pending: hasAssistantStatusPending(result), }, }; +} + +function serializeEvalOutput(output: Record): string { return JSON.stringify(output, null, 2); } +function toToolCallRecord( + invocation: EvalResult["toolInvocations"][number], +): ToolCallRecord { + const args: Record = {}; + if (invocation.bash_command) { + args.command = invocation.bash_command; + } + if (invocation.skill_name) { + args.skill_name = invocation.skill_name; + } + if (invocation.mcp_tool_name) { + args.tool_name = invocation.mcp_tool_name; + } + if (invocation.mcp_arguments) { + args.arguments = toJson(invocation.mcp_arguments); + } + + return { + name: invocation.tool, + ...(Object.keys(args).length > 0 ? { arguments: args } : {}), + }; +} + +function toHarnessRun(result: EvalResult): HarnessRun { + const output = buildEvalOutput(result); + const toolCalls = result.toolInvocations.map(toToolCallRecord); + const messages: NormalizedMessage[] = [ + ...result.posts.map( + (post): NormalizedMessage => ({ + role: "assistant", + content: post.text, + metadata: toJsonRecord({ + ...(post.channel ? { channel: post.channel } : {}), + ...(post.thread_ts ? { thread_ts: post.thread_ts } : {}), + files: post.files, + }), + }), + ), + ...(toolCalls.length > 0 + ? 
[ + { + role: "assistant" as const, + toolCalls, + }, + ] + : []), + ]; + + return { + output, + session: { + messages, + outputText: serializeEvalOutput(output), + metadata: toJsonRecord({ + slack_metadata: output.slack_metadata, + }), + }, + usage: { + toolCalls: toolCalls.length, + }, + errors: [], + }; +} + // ── Core eval wrapper ────────────────────────────────────── interface EvalRubric { @@ -180,14 +137,12 @@ interface EvalRubric { fail?: readonly string[]; } -interface SlackEvalOptions { +export interface SlackEvalInput { events: EvalEvent[]; overrides?: EvalOverrides; criteria: EvalRubric; requireGatewayReady?: boolean; taskTimeout?: number; - threshold?: number; - timeout?: number; requireSandboxReady?: boolean; } @@ -220,7 +175,11 @@ function formatRubric(criteria: EvalRubric): string { .join("\n\n"); } -function assertGatewayReady(name: string, result: EvalResult): void { +function getEvalLabel(input: SlackEvalInput): string { + return input.criteria.contract; +} + +function assertGatewayReady(input: SlackEvalInput, result: EvalResult): void { const failure = result.logRecords.find((record) => { if (record.eventName !== "ai_completion_failed") { return false; @@ -239,12 +198,12 @@ function assertGatewayReady(name: string, result: EvalResult): void { failure.body || "AI Gateway authentication failed"; throw new Error( - `Eval gateway bootstrap failed for "${name}". Received "${message}". ` + + `Eval gateway bootstrap failed for "${getEvalLabel(input)}". Received "${message}". ` + "Refresh AI Gateway auth first (for example via `vercel env pull`) and retry.", ); } -function assertSandboxReady(name: string, result: EvalResult): void { +function assertSandboxReady(input: SlackEvalInput, result: EvalResult): void { const failingPosts = result.posts.filter((post) => post.text.includes(SANDBOX_SETUP_FAILED_TEXT), ); @@ -254,12 +213,12 @@ function assertSandboxReady(name: string, result: EvalResult): void { const sample = failingPosts[0]?.text ?? 
SANDBOX_SETUP_FAILED_TEXT; throw new Error( - `Eval sandbox bootstrap failed for "${name}". Received "${sample}". ` + + `Eval sandbox bootstrap failed for "${getEvalLabel(input)}". Received "${sample}". ` + "Evals require a working Vercel Sandbox and do not permit local fallback.", ); } -function assertStatusCleared(name: string, result: EvalResult): void { +function assertStatusCleared(input: SlackEvalInput, result: EvalResult): void { const lastByThread = new Map(); for (const call of result.slackAdapter.statusCalls) { const key = `${call.channelId}:${call.threadTs}`; @@ -268,7 +227,7 @@ function assertStatusCleared(name: string, result: EvalResult): void { for (const [thread, text] of lastByThread) { if (text !== "") { throw new Error( - `Eval "${name}" left assistant status pending on thread ${thread}: "${text}". ` + + `Eval "${getEvalLabel(input)}" left assistant status pending on thread ${thread}: "${text}". ` + "Every turn must clear the assistant status indicator before completing.", ); } @@ -286,56 +245,170 @@ export function rubric(criteria: EvalRubric): EvalRubric { return criteria; } -/** Defines one end-to-end conversational eval case for the Slack harness. */ -export function slackEval(name: string, opts: SlackEvalOptions) { - evaluate(name, { - timeout: opts.timeout ?? 120_000, - threshold: opts.threshold ?? 0.75, - task: async () => { - const logRecords: EmittedLogRecord[] = []; - const unregisterLogSink = registerLogRecordSink((record) => { - logRecords.push(record); - }); - try { - const taskPromise = runEvalScenario( - { - events: opts.events, - overrides: opts.overrides, - }, - { logRecords }, - ); - const result = - typeof opts.taskTimeout === "number" && opts.taskTimeout > 0 - ? 
await Promise.race([ - taskPromise, - new Promise((_, reject) => - setTimeout( - () => - reject( - new Error( - `Eval harness timed out after ${opts.taskTimeout}ms before judge evaluation`, - ), +type JudgeAnswer = "A" | "B" | "C" | "D" | "E"; + +interface JudgeResultPayload { + answer: JudgeAnswer; + rationale: string; +} + +const CHOICE_SCORES: Record = { + A: 1, + B: 0.75, + C: 0.5, + D: 0.25, + E: 0, +}; + +const EVAL_SYSTEM = + 'You are assessing a submitted output based on a given criterion. Ignore differences in style, grammar, punctuation, or length. Focus only on whether the criterion is met. Return only raw JSON matching {"answer":"A","rationale":"..."}.'; +const EVAL_JUDGE_MODEL_ID = resolveGatewayModel("openai/gpt-5.4").id; + +function formatJudgePrompt(output: string, criteria: string): string { + return ` +${output} + + + +${criteria} + + +Does the submission meet the criteria? Select one option: +(A) The criteria is fully met with no issues +(B) The criteria is mostly met with minor gaps +(C) The criteria is partially met with notable gaps +(D) The criteria is barely met or only tangentially addressed +(E) The criteria is not met at all + +Return only a JSON object with: +- answer: one of "A", "B", "C", "D", "E" +- rationale: a concise explanation`; +} + +function isJudgeAnswer(value: unknown): value is JudgeAnswer { + return ( + typeof value === "string" && + Object.prototype.hasOwnProperty.call(CHOICE_SCORES, value) + ); +} + +function parseJudgeResult(text: string): JudgeResultPayload { + const parsed = JSON.parse(text) as unknown; + if ( + !parsed || + typeof parsed !== "object" || + !isJudgeAnswer((parsed as Record).answer) || + typeof (parsed as Record).rationale !== "string" + ) { + throw new Error(`Rubric judge returned invalid JSON: ${text}`); + } + return parsed as JudgeResultPayload; +} + +/** Replays Slack events through the real runtime and returns normalized artifacts. 
*/ +export const slackHarness: Harness = { + name: "slack", + prompt: async (input, options) => { + const { text } = await completeText({ + modelId: EVAL_JUDGE_MODEL_ID, + system: options?.system, + messages: [ + { + role: "user", + content: input, + timestamp: Date.now(), + }, + ], + temperature: 0, + metadata: options?.metadata, + }); + return text; + }, + run: async (input) => { + const logRecords: EmittedLogRecord[] = []; + const unregisterLogSink = registerLogRecordSink((record) => { + logRecords.push(record); + }); + try { + const taskPromise = runEvalScenario( + { + events: input.events, + overrides: input.overrides, + }, + { logRecords }, + ); + const result = + typeof input.taskTimeout === "number" && input.taskTimeout > 0 + ? await Promise.race([ + taskPromise, + new Promise((_, reject) => + setTimeout( + () => + reject( + new Error( + `Eval harness timed out after ${input.taskTimeout}ms before judge evaluation`, ), - opts.taskTimeout, - ), + ), + input.taskTimeout, ), - ]) - : await taskPromise; - if (opts.requireGatewayReady ?? true) { - assertGatewayReady(name, result); - } - if (opts.requireSandboxReady ?? true) { - assertSandboxReady(name, result); - } - assertStatusCleared(name, result); - return serializeEvalResult(result); - } finally { - unregisterLogSink(); + ), + ]) + : await taskPromise; + if (input.requireGatewayReady ?? true) { + assertGatewayReady(input, result); } - }, - criteria: formatRubric(opts.criteria), - }); -} + if (input.requireSandboxReady ?? true) { + assertSandboxReady(input, result); + } + assertStatusCleared(input, result); + return toHarnessRun(result); + } finally { + unregisterLogSink(); + } + }, +}; + +/** Scores Slack eval output against the case rubric. 
*/ +export const RubricJudge = namedJudge( + "RubricJudge", + async ({ + inputValue, + output, + harness, + }: JudgeContext< + SlackEvalInput, + Record, + typeof slackHarness + >) => { + const object = parseJudgeResult( + await harness.prompt( + formatJudgePrompt(output, formatRubric(inputValue.criteria)), + { + system: EVAL_SYSTEM, + metadata: { + judge: "RubricJudge", + }, + }, + ), + ); + const answer = object.answer as keyof typeof CHOICE_SCORES; + + return { + score: CHOICE_SCORES[answer], + metadata: { + answer, + rationale: object.rationale, + }, + }; + }, +); + +/** Shared vitest-evals suite options for Slack conversation evals. */ +export const slackEvals = { + harness: slackHarness, + judges: [RubricJudge], + judgeThreshold: 0.75, +} satisfies DescribeEvalOptions; // ── Event builders ───────────────────────────────────────── diff --git a/packages/junior-evals/evals/sentry/skill-workflows.eval.ts b/packages/junior-evals/evals/sentry/skill-workflows.eval.ts index e0983eb5..7d69236e 100644 --- a/packages/junior-evals/evals/sentry/skill-workflows.eval.ts +++ b/packages/junior-evals/evals/sentry/skill-workflows.eval.ts @@ -1,10 +1,11 @@ -import { describe } from "vitest"; -import { mention, rubric, slackEval } from "../helpers"; +import { describeEval } from "vitest-evals"; +import { mention, rubric, slackEvals } from "../helpers"; -describe("Sentry Skill Workflows", () => { - slackEval( - "when the Sentry credential smoke command runs, return one CREDENTIAL_OK reply", - { +describeEval("Sentry Skill Workflows", slackEvals, (it) => { + it("when the Sentry credential smoke command runs, return one CREDENTIAL_OK reply", async ({ + run, + }) => { + await run({ overrides: { skill_dirs: ["evals/fixtures/skills"], enable_test_credentials: true, @@ -22,12 +23,13 @@ describe("Sentry Skill Workflows", () => { ], fail: ["Do not include sandbox setup failure text."], }), - }, - ); + }); + }); - slackEval( - "when listing Sentry organizations, use the current org command 
surface", - { + it("when listing Sentry organizations, use the current org command surface", async ({ + run, + }) => { + await run({ overrides: { enable_test_credentials: true, plugin_packages: ["@sentry/junior-sentry"], @@ -49,6 +51,6 @@ describe("Sentry Skill Workflows", () => { "Do not ask the user to reconnect Sentry unless the command returns an auth failure.", ], }), - }, - ); + }); + }); }); diff --git a/packages/junior-evals/package.json b/packages/junior-evals/package.json index 7739c338..45ecd929 100644 --- a/packages/junior-evals/package.json +++ b/packages/junior-evals/package.json @@ -9,14 +9,12 @@ "evals": "JUNIOR_STATE_ADAPTER=memory pnpm exec vitest run -c vitest.evals.config.ts" }, "devDependencies": { - "@ai-sdk/gateway": "^3.0.99", "@sentry/junior": "workspace:*", "@sentry/junior-github": "workspace:*", "@sentry/junior-sentry": "workspace:*", "chat": "4.26.0", "typescript": "^5.9.3", "vitest": "^4.1.4", - "vitest-evals": "^0.7.0", - "zod": "^4.3.6" + "vitest-evals": "0.9.0-beta.1" } } diff --git a/packages/junior-evals/vitest.evals.config.ts b/packages/junior-evals/vitest.evals.config.ts index 82e17bac..b7b548ed 100644 --- a/packages/junior-evals/vitest.evals.config.ts +++ b/packages/junior-evals/vitest.evals.config.ts @@ -34,5 +34,6 @@ export default defineConfig({ include: ["evals/**/*.eval.ts"], setupFiles: [path.resolve(juniorPackageRoot, "tests/msw/setup.ts")], reporters: [new DefaultEvalReporter()], + testTimeout: 300_000, }, }); diff --git a/packages/junior/src/chat/prompt.ts b/packages/junior/src/chat/prompt.ts index 1c51fc10..05cda928 100644 --- a/packages/junior/src/chat/prompt.ts +++ b/packages/junior/src/chat/prompt.ts @@ -428,7 +428,7 @@ function buildOutputSection(): string { return [ openTag, "- Start with the answer or result, not internal process narration.", - "- Use Slack-flavored Markdown: **bold** section labels, `code`, [text](url) links, bullet lists, and fenced code blocks. 
No tables.", + "- Use Slack-flavored Markdown: **bold** section labels, `code`, [text](url) links, bullet lists, and fenced code blocks. No tables. When the answer primarily lists several URLs, show each URL bare instead of as a labeled link.", "- Keep replies brief and scannable; use bullets or short code blocks when helpful, and one compact thread reply when it fits.", "- When a research or document-style answer would benefit from continuation, multiple sections, or future reference value, create a Slack canvas and keep the thread reply to one or two short sentences plus the link; do not recap the canvas contents.", "- Unless a successful Slack side-effect tool intentionally satisfied the request by itself, end every turn with a final user-facing markdown response.", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 6ec27fcf..55e7610f 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -197,9 +197,6 @@ importers: packages/junior-evals: devDependencies: - "@ai-sdk/gateway": - specifier: ^3.0.99 - version: 3.0.99(zod@4.3.6) "@sentry/junior": specifier: workspace:* version: file:packages/junior(@aws-sdk/credential-provider-web-identity@3.972.33)(@sentry/node@10.48.0) @@ -219,11 +216,8 @@ importers: specifier: ^4.1.4 version: 4.1.4(@edge-runtime/vm@3.2.0)(@opentelemetry/api@1.9.1)(@types/node@25.6.0)(msw@2.13.3(@types/node@25.6.0)(typescript@5.9.3))(vite@8.0.3(@types/node@25.6.0)(esbuild@0.27.4)(jiti@2.6.1)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3)) vitest-evals: - specifier: ^0.7.0 - version: 0.7.0(ai@6.0.162(zod@4.3.6))(tinyrainbow@3.1.0)(vitest@4.1.4(@edge-runtime/vm@3.2.0)(@opentelemetry/api@1.9.1)(@types/node@25.6.0)(msw@2.13.3(@types/node@25.6.0)(typescript@5.9.3))(vite@8.0.3(@types/node@25.6.0)(esbuild@0.27.4)(jiti@2.6.1)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3)))(zod@4.3.6) - zod: - specifier: ^4.3.6 - version: 4.3.6 + specifier: 0.9.0-beta.1 + version: 
0.9.0-beta.1(ai@6.0.162(zod@4.3.6))(tinyrainbow@3.1.0)(vitest@4.1.4(@edge-runtime/vm@3.2.0)(@opentelemetry/api@1.9.1)(@types/node@25.6.0)(msw@2.13.3(@types/node@25.6.0)(typescript@5.9.3))(vite@8.0.3(@types/node@25.6.0)(esbuild@0.27.4)(jiti@2.6.1)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3)))(zod@4.3.6) packages/junior-github: {} @@ -10639,15 +10633,15 @@ packages: vite: optional: true - vitest-evals@0.7.0: + vitest-evals@0.9.0-beta.1: resolution: { - integrity: sha512-ZHvgKeP+DgL11wpS/GXDQTt0zlFkJwpk1PLBybXZfJnuXRBSkNrO3nVticVtHO8soHoP3rK0TtBUxV5du29OuQ==, + integrity: sha512-Y3BfT0SStUgl7pwgWsN4/EVwl85jQQ8+sgLOh7BVd8PEvWaZeLbnhy6/bVjxDB1vis9DlwwgxJvHlQ3Ca5qdHQ==, } peerDependencies: ai: ">=4 <7" tinyrainbow: ">=2 <4" - vitest: ">=3 <5" + vitest: ">=4 <5" zod: ">=3 <5" peerDependenciesMeta: ai: @@ -18286,7 +18280,7 @@ snapshots: optionalDependencies: vite: 6.4.1(@types/node@25.6.0)(jiti@2.6.1)(lightningcss@1.32.0)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3) - vitest-evals@0.7.0(ai@6.0.162(zod@4.3.6))(tinyrainbow@3.1.0)(vitest@4.1.4(@edge-runtime/vm@3.2.0)(@opentelemetry/api@1.9.1)(@types/node@25.6.0)(msw@2.13.3(@types/node@25.6.0)(typescript@5.9.3))(vite@8.0.3(@types/node@25.6.0)(esbuild@0.27.4)(jiti@2.6.1)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3)))(zod@4.3.6): + vitest-evals@0.9.0-beta.1(ai@6.0.162(zod@4.3.6))(tinyrainbow@3.1.0)(vitest@4.1.4(@edge-runtime/vm@3.2.0)(@opentelemetry/api@1.9.1)(@types/node@25.6.0)(msw@2.13.3(@types/node@25.6.0)(typescript@5.9.3))(vite@8.0.3(@types/node@25.6.0)(esbuild@0.27.4)(jiti@2.6.1)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3)))(zod@4.3.6): dependencies: tinyrainbow: 3.1.0 vitest: 4.1.4(@edge-runtime/vm@3.2.0)(@opentelemetry/api@1.9.1)(@types/node@25.6.0)(msw@2.13.3(@types/node@25.6.0)(typescript@5.9.3))(vite@8.0.3(@types/node@25.6.0)(esbuild@0.27.4)(jiti@2.6.1)(terser@5.46.1)(tsx@4.21.0)(yaml@2.8.3)) diff --git a/specs/testing/evals-spec.md b/specs/testing/evals-spec.md index a6770418..8f11c31b 100644 --- a/specs/testing/evals-spec.md +++ 
b/specs/testing/evals-spec.md @@ -3,10 +3,11 @@ ## Metadata - Created: 2026-03-03 -- Last Edited: 2026-04-21 +- Last Edited: 2026-05-03 ## Changelog +- 2026-05-03: Updated authoring rules for the vitest-evals harness-first API: suites use `describeEval()` with shared Slack harness options, cases call `run(...)` directly, and LLM judges reuse the harness prompt seam. - 2026-04-21: Described evals as the integration-style layer for agent-facing behavior and clarified the boundary against ordinary runtime/product integration tests. - 2026-03-03: Standardized metadata headers and reconciled spec references/structure. - 2026-03-04: Normalized section shape by introducing explicit `Non-Goals`. @@ -16,7 +17,7 @@ ## Intent -Evals validate end-to-end conversational behavior outcomes through the runtime harness and LLM-judged criteria. Treat them as the integration-style layer for agent-facing behavior: use them when the contract depends on natural-language interpretation, continuity, prompt behavior, or reply quality. +Evals validate end-to-end conversational behavior outcomes through the runtime harness and LLM-judged criteria. Treat them as the integration-style layer for agent-facing behavior: use them when the contract depends on natural-language interpretation, continuity, prompt behavior, or reply quality. The Slack eval judge uses the same harness prompt seam as the suite, backed by Junior's Pi client and Vercel AI Gateway. ## Scope @@ -33,7 +34,7 @@ In scope: ## Authoring Rules -1. Define cases via `slackEval()` and event builders. +1. Define suites via `describeEval()` with the shared Slack harness options, and define cases as plain `it()` tests that call `run(...)` with event builders. 2. Keep each case focused on one primary behavior outcome. 3. Express expectations through the structured rubric shape used by `rubric({ contract, pass, allow, fail })`. 4. Every new or edited eval must keep its rubric human-readable to maintainers. 
@@ -42,7 +43,7 @@ In scope: `allow` lists acceptable optional variations. `fail` lists failure conditions or forbidden output. 5. Do not write judge criteria as one dense paragraph. -6. Let the `describe()` block own the behavior area. The file path and `describe()` context already provide scope, so each individual eval name should only state the specific scenario and outcome. +6. Let the `describeEval()` block own the behavior area. The file path and `describeEval()` context already provide scope, so each individual eval name should only state the specific scenario and outcome. 7. Prefer `when <scenario>, <outcome>` over vague labels like `continuity: remembers prior turn context`. 8. Avoid asserting tool-internal mechanics unless explicitly user-visible. 9. Keep user prompts natural and product-realistic. Do not script exact internal commands, tool names, or implementation steps into the prompt just to force a path.
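The A–E letter scale and 0.75 `judgeThreshold` that the new `RubricJudge` and `slackEvals` options rely on can be sketched as pure scoring logic. This is a minimal standalone sketch of the pattern, not the helpers themselves; `scoreJudgeReply` is an illustrative name, and the threshold default mirrors the one set in `slackEvals`:

```typescript
// Sketch of the letter-grade scoring used by the rubric judge.
// Assumes the A–E scale and 0.75 threshold shown in helpers.ts;
// scoreJudgeReply is a hypothetical name for illustration only.
type JudgeAnswer = "A" | "B" | "C" | "D" | "E";

const CHOICE_SCORES: Record<JudgeAnswer, number> = {
  A: 1,
  B: 0.75,
  C: 0.5,
  D: 0.25,
  E: 0,
};

// Narrow an unknown value to a valid judge answer without consulting
// the prototype chain (mirrors the hasOwnProperty guard in helpers.ts).
function isJudgeAnswer(value: unknown): value is JudgeAnswer {
  return (
    typeof value === "string" &&
    Object.prototype.hasOwnProperty.call(CHOICE_SCORES, value)
  );
}

// Parse the judge's raw JSON reply and convert the letter grade into a
// numeric score plus a pass/fail decision against the suite threshold.
function scoreJudgeReply(
  text: string,
  threshold = 0.75,
): { score: number; passed: boolean } {
  const parsed = JSON.parse(text) as { answer?: unknown };
  if (!isJudgeAnswer(parsed.answer)) {
    throw new Error(`Rubric judge returned invalid JSON: ${text}`);
  }
  const score = CHOICE_SCORES[parsed.answer];
  return { score, passed: score >= threshold };
}
```

Under this shape, an "A" or "B" answer clears the 0.75 threshold while "C" and below fail, which is why rubrics should state hard requirements in `pass`/`fail` rather than leaving them to the judge's discretion in a dense paragraph.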