
fix(api): include invocation steps in evaluation metrics refresh#3754

Merged
jp-agenta merged 1 commit into release/v0.85.6 from feat/metrics-not-computed on Feb 14, 2026

Conversation

mmabrouk (Member) commented on Feb 13, 2026:

Summary

Cost and token metrics are not computed in the evaluation table, even though they appear correctly in traces.

The bug

The _refresh_metrics method in api/oss/src/core/evaluations/service.py collects trace IDs only from annotation steps. It then runs analytics on those traces to compute metrics. Because cost and token data lives on invocation traces (not annotation traces), those metrics are never included in the evaluation metrics output.

The metrics endpoint returns only attributes.ag.metrics.duration.cumulative and attributes.ag.data.outputs.success (from evaluator annotation traces), but not attributes.ag.metrics.costs.cumulative.total or attributes.ag.metrics.tokens.cumulative.total.
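For illustration, the contrast can be written out as two sets (key names are from the endpoint description above; the grouping into sets is an illustration, not a verbatim API response):

```python
# Illustrative contrast between the keys the metrics endpoint returns
# and the keys that are missing before the fix.
returned_keys = {  # annotation-derived, still present
    "attributes.ag.metrics.duration.cumulative",
    "attributes.ag.data.outputs.success",
}
missing_keys = {  # invocation-derived, absent before the fix
    "attributes.ag.metrics.costs.cumulative.total",
    "attributes.ag.metrics.tokens.cumulative.total",
}
```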

When the regression happened

Commit 9500279ed2 ("quick review", by Juan Pablo Vega, 2026-01-30) changed the step filter in _refresh_metrics from all steps to annotation-only:

# Before (working): collected all step types
steps_metrics_keys: Dict[str, List[Dict[str, str]]] = dict()

# After (broken): only annotation steps
steps_metrics_keys = {
    step.key: [] for step in run.data.steps if step.type == "annotation"
}

First affected release: v0.81.2.

The fix

Three changes in _refresh_metrics:

  1. Collect trace IDs from both invocation and annotation steps (controlled by a new METRICS_STEP_TYPES constant).
  2. Skip steps that are neither invocation nor annotation (e.g. input) during metrics key initialization.
  3. Only append the JSON metric spec (attributes.ag) for annotation steps. This prevents invocation steps from generating unwanted JSON output metrics; they only contribute the default scalar metrics (cost, tokens, duration, errors).
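The first two changes can be sketched as follows. This is a simplified illustration: the `Step` dataclass and the helper name are assumptions for the sketch; only the `METRICS_STEP_TYPES` constant comes from the PR description.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    key: str
    type: str  # e.g. "invocation", "annotation", "input"

# Step types whose traces should contribute to evaluation metrics.
METRICS_STEP_TYPES = ("invocation", "annotation")

def init_steps_metrics_keys(steps: List[Step]) -> Dict[str, list]:
    # Changes 1 and 2: initialize metric-key buckets for both invocation
    # and annotation steps; other step types (e.g. "input") are skipped.
    return {step.key: [] for step in steps if step.type in METRICS_STEP_TYPES}
```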


@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 13, 2026
@dosubot dosubot bot added the bug Something isn't working label Feb 13, 2026
devin-ai-integration bot (Contributor) commented:

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


Commit message:

Metrics refresh only collected trace IDs from annotation steps,
so cost and token metrics from invocation traces were never computed.
This regression was introduced in 9500279 (2026-01-30, v0.81.2).
mmabrouk (Member, Author) commented:
Investigation: removal of the attributes.ag JSON metric spec

What it was

The _refresh_metrics method appended a MetricSpec(type=MetricType.JSON, path="attributes.ag") to the list of metric specs for every annotation step. The JSON metric type in the analytics engine (build_json_blocks in api/oss/src/dbs/postgres/tracing/utils.py) only counts how many spans contain a valid JSON object at the given path. It does not descend into the object or compute field-level statistics.

The result was a single entry in the metrics response per annotation step:

"attributes.ag": { "type": "json", "count": 1 }

This tells you "1 span had a JSON object at attributes.ag". Nothing more.
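The semantics described above can be mirrored in a few lines. This is a hedged sketch: `build_json_blocks` itself emits SQL CTEs against the tracing tables, and the flat-dict span shape used here is an assumption for illustration.

```python
def count_json_metric(spans: list, path: str = "attributes.ag") -> dict:
    # Count spans carrying a valid JSON object at `path`. No descent into
    # the object, no field-level statistics -- just a presence count, as
    # described above.
    count = sum(1 for span in spans if isinstance(span.get(path), dict))
    return {path: {"type": "json", "count": count}}
```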

When it was introduced

It was added in commit 9500279ed2 ("quick review", Juan Pablo Vega, 2026-01-30). This is the same commit that introduced the regression. Before that commit, the JSON spec did not exist.

Who consumes it

We traced the full pipeline on both sides.

Backend: the analytics engine produces the {"type": "json", "count": 1} value and returns it in the metrics bucket. The evaluation service stores it in the metrics table as part of the scenario metrics data. No other backend code reads or acts on this key.

Frontend: we investigated the evaluation table and overview views thoroughly.

  • Table columns are generated solely from run.data.mappings (atoms/table/columns.ts:319). No mapping ever points to bare attributes.ag. The key does not create or hide any column.
  • Cell values are resolved by looking up specific paths in the flattened metric data (atoms/metrics.ts:532). No cell renderer looks up attributes.ag.
  • Overview charts discover metrics by scanning the flattened run-level stats map for keys prefixed with each step key (OverviewView/utils/evaluatorMetrics.ts:58). The attributes.ag entry gets stored in the flat map but no chart, spider axis, or summary row references it.
  • The evaluator output fields (e.g. attributes.ag.data.outputs.success, attributes.ag.data.outputs.score) are handled by separate typed metric specs generated from the evaluator schema. These are independent of the JSON spec and are not affected by its removal.

In short, the attributes.ag JSON spec produced a value that flowed through the system end to end, was stored in the database, sent to the frontend, and then ignored everywhere.
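The inertness argument for the table columns can be condensed into a toy model (written in Python for consistency with the rest of this PR; the real code is TypeScript, and the mapping/metrics shapes here are assumptions):

```python
def visible_columns(mappings: list, flat_metrics: dict) -> list:
    # Columns come solely from run.data.mappings; a metrics key that no
    # mapping points to (like the bare "attributes.ag" entry) never
    # produces a column, so it is stored but ignored.
    return [m["path"] for m in mappings if m["path"] in flat_metrics]
```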

Why we removed it

  1. It serves no functional purpose. No code on either side reads or acts on it.
  2. It was introduced without explanation in the same commit that caused the regression.
  3. It adds noise to the metrics payload (one extra key per annotation step per scenario).
  4. Keeping dead data paths makes the system harder to reason about.

devin-ai-integration bot (Contributor) commented:

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.


Comment on lines 1062 to 1065
path=metric.get("path") or "*",
)
for metric in step_metrics_keys
] + [
MetricSpec(
type=MetricType.JSON,
path="attributes.ag",
)
]

🔴 JSON MetricSpec removed for all step types, but PR intent is to keep it for annotation steps

The PR description explicitly states change #3: "Only append the JSON metric spec (attributes.ag) for annotation steps." However, the code removes the JSON MetricSpec entirely for all step types, including annotation steps.

Root Cause and Impact

The old code unconditionally appended a MetricSpec(type=MetricType.JSON, path="attributes.ag") to every step's specs list. The stated fix was to stop appending it for invocation steps (to avoid unwanted JSON output metrics) while keeping it for annotation steps.

But the new code at lines 1059–1065 simply builds specs from step_metrics_keys without any conditional JSON spec:

specs = [
    MetricSpec(
        type=MetricType(metric.get("type")),
        path=metric.get("path") or "*",
    )
    for metric in step_metrics_keys
]

The old code had:

] + [
    MetricSpec(
        type=MetricType.JSON,
        path="attributes.ag",
    )
]

This additional JSON spec is now gone for all steps. For annotation steps, this means the analytics query no longer requests JSON-type extraction of attributes.ag, which was previously used to produce JSON metric output (see api/oss/src/dbs/postgres/tracing/utils.py:983 and :1848 where MetricType.JSON drives specific CTE logic).

The step_types_by_key dict created at line 892 tracks each step's type, so the loop at line 1034 could conditionally add the JSON spec for annotation steps using step_types_by_key.get(step_key) == "annotation", but this check is missing.

Impact: Annotation steps lose their JSON metric output that was previously computed, causing a regression in evaluation metrics for annotation steps.

(Refers to lines 1059-1065)


mmabrouk (Member, Author) replied on Feb 13, 2026:

This was explicitly removed; see the reasoning above. @jp-agenta

@mmabrouk mmabrouk requested a review from jp-agenta February 13, 2026 19:08
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 14, 2026
@jp-agenta jp-agenta changed the base branch from main to release/v0.85.6 February 14, 2026 13:50
@jp-agenta jp-agenta merged commit 1e2b2bb into release/v0.85.6 Feb 14, 2026
8 checks passed

Labels

  • bug (Something isn't working)
  • lgtm (This PR has been approved by a maintainer)
  • size:S (This PR changes 10-29 lines, ignoring generated files)
