fix(api): include invocation steps in evaluation metrics refresh #3754
jp-agenta merged 1 commit into release/v0.85.6 from
Conversation
Metrics refresh only collected trace IDs from annotation steps, so cost and token metrics from invocation traces were never computed. This regression was introduced in 9500279 (2026-01-30, v0.81.2).
Investigation: removal of the JSON `MetricSpec`:

```python
        path=metric.get("path") or "*",
    )
    for metric in step_metrics_keys
] + [
    MetricSpec(
        type=MetricType.JSON,
        path="attributes.ag",
    )
]
```
🔴 JSON MetricSpec removed for all step types, but PR intent is to keep it for annotation steps
The PR description explicitly states change #3: "Only append the JSON metric spec (attributes.ag) for annotation steps." However, the code removes the JSON MetricSpec entirely for all step types, including annotation steps.
Root Cause and Impact
The old code unconditionally appended a MetricSpec(type=MetricType.JSON, path="attributes.ag") to every step's specs list. The stated fix was to stop appending it for invocation steps (to avoid unwanted JSON output metrics) while keeping it for annotation steps.
But the new code at lines 1059–1065 simply builds specs from step_metrics_keys without any conditional JSON spec:
```python
specs = [
    MetricSpec(
        type=MetricType(metric.get("type")),
        path=metric.get("path") or "*",
    )
    for metric in step_metrics_keys
]
```

The old code had:
```python
] + [
    MetricSpec(
        type=MetricType.JSON,
        path="attributes.ag",
    )
]
```

This additional JSON spec is now gone for all steps. For annotation steps, this means the analytics query no longer requests JSON-type extraction of `attributes.ag`, which was previously used to produce JSON metric output (see `api/oss/src/dbs/postgres/tracing/utils.py:983` and `:1848`, where `MetricType.JSON` drives specific CTE logic).
The `step_types_by_key` dict created at line 892 tracks each step's type, so the loop at line 1034 could conditionally add the JSON spec for annotation steps using `step_types_by_key.get(step_key) == "annotation"`, but this check is missing.
Impact: Annotation steps lose their JSON metric output that was previously computed, causing a regression in evaluation metrics for annotation steps.
(Refers to lines 1059-1065)
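The conditional the reviewer describes could be sketched as follows. This is a hypothetical fix, not the merged code: `MetricSpec` and `MetricType` are minimal stand-ins for the real API types, and `build_specs` is an invented helper wrapping the loop body.

```python
from dataclasses import dataclass
from enum import Enum


# Minimal stand-ins for the real agenta types, which live in the API codebase.
class MetricType(str, Enum):
    NUMERIC = "numeric"
    JSON = "json"


@dataclass
class MetricSpec:
    type: MetricType
    path: str


def build_specs(step_metrics_keys, step_types_by_key, step_key):
    """Build metric specs; append the JSON spec only for annotation steps."""
    specs = [
        MetricSpec(type=MetricType(m.get("type")), path=m.get("path") or "*")
        for m in step_metrics_keys
    ]
    # The check the reviewer says is missing: keep attributes.ag extraction
    # for annotation steps only.
    if step_types_by_key.get(step_key) == "annotation":
        specs.append(MetricSpec(type=MetricType.JSON, path="attributes.ag"))
    return specs


# Annotation steps keep the JSON spec; invocation steps do not.
ann = build_specs([{"type": "numeric", "path": "metrics.costs"}],
                  {"eval": "annotation"}, "eval")
inv = build_specs([{"type": "numeric", "path": "metrics.costs"}],
                  {"run": "invocation"}, "run")
assert ann[-1].path == "attributes.ag" and len(inv) == 1
```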
This was explicitly removed. See reasoning. @jp-agenta
Summary
Cost and token metrics are not computed in the evaluation table, even though they appear correctly in traces.
The bug
The `_refresh_metrics` method in `api/oss/src/core/evaluations/service.py` collects trace IDs only from annotation steps. It then runs analytics on those traces to compute metrics. Because cost and token data lives on invocation traces (not annotation traces), those metrics are never included in the evaluation metrics output.

The metrics endpoint returns only `attributes.ag.metrics.duration.cumulative` and `attributes.ag.data.outputs.success` (from evaluator annotation traces), but not `attributes.ag.metrics.costs.cumulative.total` or `attributes.ag.metrics.tokens.cumulative.total`.

When the regression happened
Commit `9500279ed2` ("quick review", by Juan Pablo Vega, 2026-01-30) changed the step filter in `_refresh_metrics` from all steps to annotation-only.

First affected release: v0.81.2.
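The effect of that filter change can be illustrated with hypothetical step data (the real diff is not shown in this excerpt; the dict shape is assumed):

```python
# Illustrative only: two steps, one per type, showing why invocation
# metrics disappeared after the annotation-only filter was introduced.
steps = [
    {"type": "invocation", "trace_id": "inv-1"},  # carries cost/token metrics
    {"type": "annotation", "trace_id": "ann-1"},  # carries evaluator outputs
]

# Before commit 9500279ed2: every step's trace ID was collected.
trace_ids_before = [s["trace_id"] for s in steps if s.get("trace_id")]

# After: only annotation steps survive, so cost/token metrics on
# invocation traces are never queried.
trace_ids_after = [
    s["trace_id"]
    for s in steps
    if s.get("type") == "annotation" and s.get("trace_id")
]

assert trace_ids_before == ["inv-1", "ann-1"]
assert trace_ids_after == ["ann-1"]
```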
The fix
Three changes in `_refresh_metrics`:

1. Collect trace IDs from both `invocation` and `annotation` steps (controlled by a new `METRICS_STEP_TYPES` constant).
2. … (`input`) during metrics key initialization.
3. Only append the JSON metric spec (`attributes.ag`) for annotation steps. This prevents invocation steps from generating unwanted JSON output metrics; they only contribute the default scalar metrics (cost, tokens, duration, errors).
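Change 1 can be sketched as follows, assuming a `METRICS_STEP_TYPES` constant as named in the PR; the step dict shape and the `collect_trace_ids` helper are hypothetical:

```python
# Step types whose traces should feed evaluation metrics.
METRICS_STEP_TYPES = ("invocation", "annotation")


def collect_trace_ids(steps):
    """Keep trace IDs from every step type listed in METRICS_STEP_TYPES."""
    return [
        s["trace_id"]
        for s in steps
        if s.get("type") in METRICS_STEP_TYPES and s.get("trace_id")
    ]


steps = [
    {"type": "invocation", "trace_id": "t1"},
    {"type": "annotation", "trace_id": "t2"},
    {"type": "input", "trace_id": "t3"},  # not a metrics step type, skipped
]
assert collect_trace_ids(steps) == ["t1", "t2"]
```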