
fix(api): include invocation steps in evaluation metrics refresh#3754

Merged
jp-agenta merged 1 commit into release/v0.85.6 from feat/metrics-not-computed on Feb 14, 2026

Conversation

mmabrouk (Member) commented on Feb 13, 2026:

Summary

Cost and token metrics are not computed in the evaluation table, even though they appear correctly in traces.

The bug

The _refresh_metrics method in api/oss/src/core/evaluations/service.py collects trace IDs only from annotation steps. It then runs analytics on those traces to compute metrics. Because cost and token data lives on invocation traces (not annotation traces), those metrics are never included in the evaluation metrics output.

The metrics endpoint returns only attributes.ag.metrics.duration.cumulative and attributes.ag.data.outputs.success (from evaluator annotation traces), but not attributes.ag.metrics.costs.cumulative.total or attributes.ag.metrics.tokens.cumulative.total.
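For illustration, the contrast can be written out as two sets (key names are from the endpoint description above; the grouping into sets is an illustration, not a verbatim API response):

```python
# Illustrative contrast between the keys the metrics endpoint returns
# and the keys that are missing before the fix.
returned_keys = {  # annotation-derived, still present
    "attributes.ag.metrics.duration.cumulative",
    "attributes.ag.data.outputs.success",
}
missing_keys = {  # invocation-derived, absent before the fix
    "attributes.ag.metrics.costs.cumulative.total",
    "attributes.ag.metrics.tokens.cumulative.total",
}
```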

When the regression happened

Commit 9500279ed2 ("quick review", by Juan Pablo Vega, 2026-01-30) changed the step filter in _refresh_metrics from all steps to annotation-only:

# Before (working): collected all step types
steps_metrics_keys: Dict[str, List[Dict[str, str]]] = dict()

# After (broken): only annotation steps
steps_metrics_keys = {
    step.key: [] for step in run.data.steps if step.type == "annotation"
}

First affected release: v0.81.2.

The fix

Three changes in _refresh_metrics:

  1. Collect trace IDs from both invocation and annotation steps (controlled by a new METRICS_STEP_TYPES constant).
  2. Skip steps that are neither invocation nor annotation (e.g. input) during metrics key initialization.
  3. Only append the JSON metric spec (attributes.ag) for annotation steps. This prevents invocation steps from generating unwanted JSON output metrics; they only contribute the default scalar metrics (cost, tokens, duration, errors).
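The first two changes can be sketched as follows. This is a simplified illustration: the `Step` dataclass and the helper name are assumptions for the sketch; only the `METRICS_STEP_TYPES` constant comes from the PR description.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Step:
    key: str
    type: str  # e.g. "invocation", "annotation", "input"

# Step types whose traces should contribute to evaluation metrics.
METRICS_STEP_TYPES = ("invocation", "annotation")

def init_steps_metrics_keys(steps: List[Step]) -> Dict[str, list]:
    # Changes 1 and 2: initialize metric-key buckets for both invocation
    # and annotation steps; other step types (e.g. "input") are skipped.
    return {step.key: [] for step in steps if step.type in METRICS_STEP_TYPES}
```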


@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 13, 2026
@dosubot dosubot bot added the bug Something isn't working label Feb 13, 2026
devin-ai-integration bot (Contributor) commented:

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


Commit message:

Metrics refresh only collected trace IDs from annotation steps,
so cost and token metrics from invocation traces were never computed.
This regression was introduced in 9500279 (2026-01-30, v0.81.2).
mmabrouk (Member, Author) commented:
Investigation: removal of the attributes.ag JSON metric spec

What it was

The _refresh_metrics method appended a MetricSpec(type=MetricType.JSON, path="attributes.ag") to the list of metric specs for every annotation step. The JSON metric type in the analytics engine (build_json_blocks in api/oss/src/dbs/postgres/tracing/utils.py) only counts how many spans contain a valid JSON object at the given path. It does not descend into the object or compute field-level statistics.

The result was a single entry in the metrics response per annotation step:

"attributes.ag": { "type": "json", "count": 1 }

This tells you "1 span had a JSON object at attributes.ag". Nothing more.
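The semantics described above can be mirrored in a few lines. This is a hedged sketch: `build_json_blocks` itself emits SQL CTEs against the tracing tables, and the flat-dict span shape used here is an assumption for illustration.

```python
def count_json_metric(spans: list, path: str = "attributes.ag") -> dict:
    # Count spans carrying a valid JSON object at `path`. No descent into
    # the object, no field-level statistics -- just a presence count, as
    # described above.
    count = sum(1 for span in spans if isinstance(span.get(path), dict))
    return {path: {"type": "json", "count": count}}
```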

When it was introduced

It was added in commit 9500279ed2 ("quick review", Juan Pablo Vega, 2026-01-30). This is the same commit that introduced the regression. Before that commit, the JSON spec did not exist.

Who consumes it

We traced the full pipeline on both sides.

Backend: the analytics engine produces the {"type": "json", "count": 1} value and returns it in the metrics bucket. The evaluation service stores it in the metrics table as part of the scenario metrics data. No other backend code reads or acts on this key.

Frontend: we investigated the evaluation table and overview views thoroughly.

  • Table columns are generated solely from run.data.mappings (atoms/table/columns.ts:319). No mapping ever points to bare attributes.ag. The key does not create or hide any column.
  • Cell values are resolved by looking up specific paths in the flattened metric data (atoms/metrics.ts:532). No cell renderer looks up attributes.ag.
  • Overview charts discover metrics by scanning the flattened run-level stats map for keys prefixed with each step key (OverviewView/utils/evaluatorMetrics.ts:58). The attributes.ag entry gets stored in the flat map but no chart, spider axis, or summary row references it.
  • The evaluator output fields (e.g. attributes.ag.data.outputs.success, attributes.ag.data.outputs.score) are handled by separate typed metric specs generated from the evaluator schema. These are independent of the JSON spec and are not affected by its removal.

In short, the attributes.ag JSON spec produced a value that flowed through the system end to end, was stored in the database, sent to the frontend, and then ignored everywhere.
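The inertness argument for the table columns can be condensed into a toy model (written in Python for consistency with the rest of this PR; the real code is TypeScript, and the mapping/metrics shapes here are assumptions):

```python
def visible_columns(mappings: list, flat_metrics: dict) -> list:
    # Columns come solely from run.data.mappings; a metrics key that no
    # mapping points to (like the bare "attributes.ag" entry) never
    # produces a column, so it is stored but ignored.
    return [m["path"] for m in mappings if m["path"] in flat_metrics]
```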

Why we removed it

  1. It serves no functional purpose. No code on either side reads or acts on it.
  2. It was introduced without explanation in the same commit that caused the regression.
  3. It adds noise to the metrics payload (one extra key per annotation step per scenario).
  4. Keeping dead data paths makes the system harder to reason about.

devin-ai-integration bot (Contributor) commented:

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.


Comment on lines 1062 to 1065
path=metric.get("path") or "*",
)
for metric in step_metrics_keys
] + [
MetricSpec(
type=MetricType.JSON,
path="attributes.ag",
)
]

🔴 JSON MetricSpec removed for all step types, but PR intent is to keep it for annotation steps

The PR description explicitly states change #3: "Only append the JSON metric spec (attributes.ag) for annotation steps." However, the code removes the JSON MetricSpec entirely for all step types, including annotation steps.

Root Cause and Impact

The old code unconditionally appended a MetricSpec(type=MetricType.JSON, path="attributes.ag") to every step's specs list. The stated fix was to stop appending it for invocation steps (to avoid unwanted JSON output metrics) while keeping it for annotation steps.

But the new code at lines 1059–1065 simply builds specs from step_metrics_keys without any conditional JSON spec:

specs = [
    MetricSpec(
        type=MetricType(metric.get("type")),
        path=metric.get("path") or "*",
    )
    for metric in step_metrics_keys
]

The old code had:

] + [
    MetricSpec(
        type=MetricType.JSON,
        path="attributes.ag",
    )
]

This additional JSON spec is now gone for all steps. For annotation steps, this means the analytics query no longer requests JSON-type extraction of attributes.ag, which was previously used to produce JSON metric output (see api/oss/src/dbs/postgres/tracing/utils.py:983 and :1848 where MetricType.JSON drives specific CTE logic).

The step_types_by_key dict created at line 892 tracks each step's type, so the loop at line 1034 could conditionally add the JSON spec for annotation steps using step_types_by_key.get(step_key) == "annotation", but this check is missing.

Impact: Annotation steps lose their JSON metric output that was previously computed, causing a regression in evaluation metrics for annotation steps.

(Refers to lines 1059-1065)


mmabrouk (Member, Author) replied on Feb 13, 2026:

This was explicitly removed; see the reasoning above. @jp-agenta

@mmabrouk mmabrouk requested a review from jp-agenta February 13, 2026 19:08
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 14, 2026
@jp-agenta jp-agenta changed the base branch from main to release/v0.85.6 February 14, 2026 13:50
@jp-agenta jp-agenta merged commit 1e2b2bb into release/v0.85.6 Feb 14, 2026
8 checks passed

Labels

  • bug (Something isn't working)
  • lgtm (This PR has been approved by a maintainer)
  • size:S (This PR changes 10-29 lines, ignoring generated files)
