ref(evals): Upgrade Slack evals to vitest-evals 0.9#283
Merged
Migrate Slack behavior evals to the harness-first describeEval API. Remove the old slackEval wrapper. Reuse the Slack harness prompt seam for judging through Junior's Pi client and Vercel AI Gateway. Drop direct eval-package dependencies on AI SDK Gateway and Zod after the clean cutover. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Comment on lines +383 to +385:

    const object = parseJudgeResult(
      await harness.prompt(
        formatJudgePrompt(output, formatRubric(inputValue.criteria)),
Bug: RubricJudge passes the output object (Record<string, unknown>) directly to formatJudgePrompt(), which expects a string; JavaScript coerces the object to the literal string "[object Object]".
Severity: CRITICAL
Suggested Fix
The HarnessRun context provides session.outputText, which is a string representation of the output. Use session.outputText instead of the output object when calling formatJudgePrompt to ensure the correct data is passed. Alternatively, serialize the output object to a string (e.g., using JSON.stringify) before the function call.
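The coercion and the suggested serialization fix can be sketched as follows (formatJudgePrompt here is a minimal stand-in for illustration, not the project's actual helper):

```typescript
// Hypothetical stand-in for formatJudgePrompt; the real helper in
// packages/junior-evals/evals/helpers.ts may build a richer prompt.
function formatJudgePrompt(output: string, rubric: string): string {
  return `Output:\n${output}\n\nRubric:\n${rubric}`;
}

const output: Record<string, unknown> = { text: "LGTM", blocks: [] };

// Buggy call: the object is coerced to the literal "[object Object]".
const buggy = formatJudgePrompt(output as unknown as string, "criteria");

// One possible fix: serialize the object first (the review suggests
// session.outputText as the preferred source when it is available).
const fixed = formatJudgePrompt(JSON.stringify(output), "criteria");

console.log(buggy.includes("[object Object]")); // true
console.log(fixed.includes('"LGTM"')); // true
```

Either route gives the judge a real string; using session.outputText avoids re-serializing when the harness already formatted the output.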
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: packages/junior-evals/evals/helpers.ts#L383-L385
Potential issue: The `RubricJudge` function receives an `output` parameter from
`JudgeContext` which is typed as `Record<string, unknown>`. This object is then passed
directly to the `formatJudgePrompt` function, which expects its first argument to be a
string. Due to JavaScript's type coercion, the object is converted to the literal string
`"[object Object]"`. This results in every evaluation being judged against a
meaningless, corrupted prompt, leading to incorrect scores and silently failing evals.
The `HarnessRun` object contains both a record `output` and a string
`session.outputText`, suggesting the latter should have been used.
Upgrade the Slack eval suite to vitest-evals@0.9.0-beta.1 and cut over to the harness-first API. Eval cases now use describeEval() with direct it(..., { run }) calls, and the old slackEval(...) wrapper is gone.

Harness Judge Path
The Slack harness now owns the judge prompt seam.
RubricJudge reads JudgeContext.harness.prompt(...), which keeps judging on Junior's Pi client and Vercel AI Gateway path with openai/gpt-5.4.

Dependency Cleanup
@sentry/junior-evals no longer depends directly on @ai-sdk/gateway or zod. The suite relies on Junior's existing Pi/Gateway client and a small local parser for the judge response shape.

Eval Authoring
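A zod-free parser of that response shape might look like this minimal sketch (the JudgeResult fields and the parseJudgeResult signature here are assumptions; the real parser in junior-evals may differ):

```typescript
// Hypothetical judge-response shape; the real fields may differ.
interface JudgeResult {
  score: number;
  rationale: string;
}

// Minimal hand-rolled validation in place of a zod schema.
function parseJudgeResult(raw: string): JudgeResult {
  const value = JSON.parse(raw) as Record<string, unknown>;
  if (typeof value.score !== "number" || typeof value.rationale !== "string") {
    throw new Error("judge response did not match the expected shape");
  }
  return { score: value.score, rationale: value.rationale };
}

const result = parseJudgeResult('{"score":0.9,"rationale":"meets rubric"}');
console.log(result.score); // 0.9
```

A hand-rolled check like this trades zod's composability for zero runtime dependencies, which matches the stated goal of the cleanup.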
Eval docs and the testing spec now describe describeEval() as the canonical authoring style. The output-contract eval also narrows the heading rule to the actual contract: avoid hash-prefixed markdown headings.