
ref(evals): Upgrade Slack evals to vitest-evals 0.9 #283

Merged
dcramer merged 1 commit into main from ref/junior-evals-vitest-0-9 on May 4, 2026

Conversation


@dcramer dcramer commented May 3, 2026

Upgrade the Slack eval suite to vitest-evals@0.9.0-beta.1 and cut over to the harness-first API. Eval cases now use describeEval() with direct it(..., { run }) calls, and the old slackEval(...) wrapper is gone.
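The harness-first style might look roughly like the sketch below. Note that `describeEval`, `it`, and the `run` signature here are simplified stand-ins written for illustration, not the real vitest-evals 0.9 API, and the Slack harness is replaced by an echo stub.

```typescript
// Stand-in types and registration, mimicking the describeEval()/it(..., { run })
// authoring shape described in this PR. Not the actual vitest-evals implementation.
type Run = (input: string) => Promise<string>;

interface EvalCase {
  name: string;
  run: Run;
}

const registered: EvalCase[] = [];

function describeEval(_suite: string, body: () => void): void {
  body();
}

function it(name: string, opts: { run: Run }): void {
  registered.push({ name, run: opts.run });
}

describeEval("slack behavior", () => {
  it("answers a deploy question", {
    // The real suite would drive the Slack harness here; this stub just echoes.
    run: async (input) => `reply: ${input}`,
  });
});

(async () => {
  for (const c of registered) {
    console.log(`${c.name} -> ${await c.run("is main deployed?")}`);
  }
})();
```

The point of the shape is that each case carries its own `run` function, so the suite no longer needs a wrapper like the removed `slackEval(...)`.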

Harness Judge Path

The Slack harness now owns the judge prompt seam. RubricJudge reads JudgeContext.harness.prompt(...), which keeps judging on Junior's Pi client and Vercel AI Gateway path with openai/gpt-5.4.

Dependency Cleanup

@sentry/junior-evals no longer depends directly on @ai-sdk/gateway or zod. The suite relies on Junior's existing Pi/Gateway client and a small local parser for the judge response shape.
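A zod-free parser for the judge response could be as small as the following. The `JudgeResult` shape (`score`, `rationale`) is a hypothetical example; the suite's actual response shape is not shown in this PR.

```typescript
// Hypothetical judge-response shape; the real fields may differ.
interface JudgeResult {
  score: number;
  rationale: string;
}

// Hand-rolled validation in place of a zod schema: parse JSON, then check
// each expected field's type before returning a typed result.
function parseJudgeResult(raw: string): JudgeResult {
  const value = JSON.parse(raw) as Record<string, unknown> | null;
  if (typeof value?.score !== "number" || typeof value?.rationale !== "string") {
    throw new Error("unexpected judge response shape");
  }
  return { score: value.score, rationale: value.rationale };
}
```

For a two-field object this is arguably simpler than carrying a schema library as a direct dependency.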

Eval Authoring

Eval docs and the testing spec now describe describeEval() as the canonical authoring style. The output-contract eval also narrows the heading rule to the actual contract: avoid hash-prefixed markdown headings (Slack does not render them).
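An illustrative check for the narrowed rule, i.e. flagging only hash-prefixed markdown headings rather than all heading-like formatting (the eval's real rubric wording and helper names are not shown in this PR):

```typescript
// True when any line starts with 1-6 hashes followed by whitespace,
// which is the markdown ATX heading form the contract forbids.
const hasHashHeading = (text: string): boolean =>
  text.split("\n").some((line) => /^#{1,6}\s/.test(line));
```

Under this narrower rule, bold text like `*Status*` passes while `# Status` fails.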

- Migrate Slack behavior evals to the harness-first describeEval API.
- Remove the old slackEval wrapper.
- Reuse the Slack harness prompt seam for judging through Junior's Pi client and Vercel AI Gateway.
- Drop direct eval-package dependencies on AI SDK Gateway and Zod after the clean cutover.

Co-Authored-By: GPT-5 Codex <codex@openai.com>

vercel Bot commented May 3, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| junior-docs | Ready | Preview, Comment | May 3, 2026 10:58pm |


@dcramer dcramer marked this pull request as ready for review May 4, 2026 01:22
@dcramer dcramer merged commit 58eaca4 into main May 4, 2026
13 of 14 checks passed
@dcramer dcramer deleted the ref/junior-evals-vitest-0-9 branch May 4, 2026 01:22
Comment on lines +383 to +385
```typescript
const object = parseJudgeResult(
  await harness.prompt(
    formatJudgePrompt(output, formatRubric(inputValue.criteria)),
```

Bug: RubricJudge passes the output object (Record<string, unknown>) directly to formatJudgePrompt(), which expects a string, so the object is silently coerced into the judge prompt.
Severity: CRITICAL

Suggested Fix

The HarnessRun context provides session.outputText, which is a string representation of the output. Use session.outputText instead of the output object when calling formatJudgePrompt to ensure the correct data is passed. Alternatively, serialize the output object to a string (e.g., using JSON.stringify) before the function call.
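A minimal repro of the reported coercion and the suggested fix. The `formatJudgePrompt` body here is assumed from its usage in the snippet above, and the sample output object is invented for illustration:

```typescript
// Assumed shape: takes the model output and a rubric, both as strings.
function formatJudgePrompt(output: string, rubric: string): string {
  return `Output:\n${output}\n\nRubric:\n${rubric}`;
}

const output: Record<string, unknown> = { text: "Deploy finished" };

// Bug: an object interpolated into the template literal collapses
// to the literal string "[object Object]".
const broken = formatJudgePrompt(output as unknown as string, "no hash headings");

// Fix: pass a real string instead (session.outputText in the harness,
// or a JSON serialization of the output object).
const fixed = formatJudgePrompt(JSON.stringify(output), "no hash headings");
```

Because the coercion happens inside a template literal, TypeScript only catches it if the call site is typed honestly, which is why the bug can pass the compiler when the argument arrives as `unknown`.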

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: packages/junior-evals/evals/helpers.ts#L383-L385

Potential issue: The `RubricJudge` function receives an `output` parameter from
`JudgeContext` which is typed as `Record<string, unknown>`. This object is then passed
directly to the `formatJudgePrompt` function, which expects its first argument to be a
string. Due to JavaScript's type coercion, the object is converted to the literal string
`"[object Object]"`. This results in every evaluation being judged against a
meaningless, corrupted prompt, leading to incorrect scores and silently failing evals.
The `HarnessRun` object contains both a record `output` and a string
`session.outputText`, suggesting the latter should have been used.



@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.


