feat(metrics): add legacy_streaming_tps toggle for LoadGen-parity QPS/TPS by viraatc · Pull Request #372 · mlcommons/endpoints

viraatc · 2026-06-23T09:22:08Z

What

Adds a legacy_streaming_tps runtime flag (default True → existing behavior is unchanged). When set to false (--legacy-streaming-tps false), QPS/TPS are computed to match MLPerf LoadGen's Server-scenario "completed" throughput:

QPS = (completed − failed − 1) / T
TPS = output_tokens / T

where T (loadgen_window_duration_ns) spans from the performance-tracking start to the completion of the last-issued request that completed successfully — the analog of LoadGen's final_query_all_samples_done_time. On a clean run (every request completes) this is exactly the last-issued request.

Today the metric uses tracked_duration_ns (start → last request to finish) with n_samples_completed and no fencepost, so relative to LoadGen it (a) ends the window later (at the last completion rather than the last-issued completion) and (b) counts failed-but-completed samples in the numerator.

Reference: mlcommons/inference loadgen/results.cc — Server completed: (sample_count-1)/final_query_all_samples_done_time and token_count/final_query_all_samples_done_time (tokens are not decremented).

How

Aggregator always emits a new loadgen_window_duration_ns counter alongside tracked_duration_ns (computed in MetricsTable). The snapshot therefore stays config-agnostic and fully reinterpretable; the flag is compute-only in Report and is never plumbed into the aggregator subprocess.
MetricsTable tracks, among successfully-completed tracked samples, the one with the max issued_ns, and records its completion as the window end. SampleRow.failed (set on the ERROR event, which precedes COMPLETE) excludes failures from the window.
Report gains qps_legacy/qps_loadgen (and tps_*); qps()/tps() dispatch on the flag. Legacy bodies are unchanged.
Edge handling — LoadGen QPS/TPS render N/A when: no tracked request completed successfully; completed − failed < 2; or more than one performance-tracking block exists (the numerator counters are global, so a single-block window would be inconsistent).

Tests

tests/unit/metrics/test_report_builder.py — legacy unchanged, loadgen numerator/denominator, failure subtraction, N<2 → N/A, window-absent → N/A.
tests/unit/async_utils/services/metrics_aggregator/test_metrics_table.py — window anchoring under out-of-order completion, failed last-issued, never-completing last-issued, no-successful-completions → 0, multi-block → 0.
168 unit tests pass; mypy/ruff clean.

Notes

Default is True, so this is opt-in and changes nothing unless enabled.
The selected view (legacy_streaming_tps) is recorded in config.yaml and the Report JSON; the snapshot carries both durations, so a saved run is reinterpretable either way.
Branch is based on an older main (14ca0541); the change is additive but a rebase onto current main before merge may be wanted.

🤖 Generated with Claude Code

github-actions · 2026-06-23T09:22:22Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces support for MLPerf LoadGen Server 'completed' window throughput metrics, including a configuration toggle (legacy_streaming_tps) to switch between legacy and LoadGen-aligned throughput calculations. However, several configuration templates were updated to use agentic_inference and agentic_inference_inline which are not yet supported by the schema in schema.py, leading to validation failures. Additionally, a potential state inconsistency was identified in metrics_table.py when handling out-of-bounds tracking block indices.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-23T09:24:58Z

    prompt: text_input
  accuracy_config: null  # Accuracy evaluation settings
-  multi_turn: null  # Multi-turn conversation configuration
+  agentic_inference: null  # Agentic inference conversation configuration


The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields (extra="forbid"). Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

multi_turn: null # Multi-turn conversation configuration

gemini-code-assist · 2026-06-23T09:24:59Z

    system: system_prompt
  accuracy_config:  # Accuracy evaluation settings
-    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
+    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench


The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.

eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench

gemini-code-assist · 2026-06-23T09:24:59Z

    extractor: boxed_math_extractor  # Answer extractor (abcd_extractor, boxed_math_extractor, identity_extractor, python_code_extractor)
    num_repeats: 1  # Repeat dataset N times for evaluation
-  multi_turn: null  # Multi-turn conversation configuration
+  agentic_inference: null  # Agentic inference conversation configuration


The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

multi_turn: null # Multi-turn conversation configuration

gemini-code-assist · 2026-06-23T09:24:59Z

    prompt: text_input
  accuracy_config: null  # Accuracy evaluation settings
-  multi_turn: null  # Multi-turn conversation configuration
+  agentic_inference: null  # Agentic inference conversation configuration


The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

multi_turn: null # Multi-turn conversation configuration

gemini-code-assist · 2026-06-23T09:24:59Z

    system: system_prompt
  accuracy_config:  # Accuracy evaluation settings
-    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
+    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench


The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.

eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench

gemini-code-assist · 2026-06-23T09:24:59Z

    system: system_prompt
  accuracy_config:  # Accuracy evaluation settings
-    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
+    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench


The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.

eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench

gemini-code-assist · 2026-06-23T09:24:59Z

    extractor: boxed_math_extractor  # Answer extractor (abcd_extractor, boxed_math_extractor, identity_extractor, python_code_extractor)
    num_repeats: 1  # Repeat dataset N times for evaluation
-  multi_turn: null  # Multi-turn conversation configuration
+  agentic_inference: null  # Agentic inference conversation configuration


The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

multi_turn: null # Multi-turn conversation configuration

gemini-code-assist · 2026-06-23T09:24:59Z

+        if (
+            not row.failed
+            and row.issued_ns is not None
+            and row.issued_ns > self._loadgen_max_completed_issued_ns
+        ):
+            self._loadgen_max_completed_issued_ns = row.issued_ns
+            self._loadgen_window_end_ns = complete_ns
+            idx = row.tracked_block_idx
+            if 0 <= idx < len(self.tracked_blocks):
+                self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns


If idx is out of bounds, self._loadgen_window_block_start_ns is not updated, but self._loadgen_max_completed_issued_ns and self._loadgen_window_end_ns are. This can lead to an inconsistent or stale loadgen window calculation. We should only update the loadgen window state if idx is valid.

Suggested change

if (

not row.failed

and row.issued_ns is not None

and row.issued_ns > self._loadgen_max_completed_issued_ns

):

self._loadgen_max_completed_issued_ns = row.issued_ns

self._loadgen_window_end_ns = complete_ns

idx = row.tracked_block_idx

if 0 <= idx < len(self.tracked_blocks):

self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns

idx = row.tracked_block_idx

if (

not row.failed

and row.issued_ns is not None

and row.issued_ns > self._loadgen_max_completed_issued_ns

and 0 <= idx < len(self.tracked_blocks)

):

self._loadgen_max_completed_issued_ns = row.issued_ns

self._loadgen_window_end_ns = complete_ns

self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns

gemini-code-assist · 2026-06-23T09:24:59Z

    n_samples_to_issue: null  # Sample count override
  load_pattern:
-    type: concurrency  # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step
+    type: concurrency  # Load pattern type | options: max_throughput, poisson, concurrency, agentic_inference, burst, step


The LoadPatternType enum in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference. Please revert this option in the comment to avoid confusion.

type: concurrency # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step

gemini-code-assist · 2026-06-23T09:24:59Z

    n_samples_to_issue: null  # Sample count override
  load_pattern:
-    type: poisson  # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step
+    type: poisson  # Load pattern type | options: max_throughput, poisson, concurrency, agentic_inference, burst, step


The LoadPatternType enum in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference. Please revert this option in the comment to avoid confusion.

type: poisson # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step

…/TPS Add a `legacy_streaming_tps` runtime flag (default True, so existing behavior is unchanged). When disabled, QPS/TPS match MLPerf LoadGen's Server-scenario "completed" throughput: QPS = (completed - failed - 1) / T TPS = output_tokens / T where T is the window from the last-ISSUED request's tracking-block start to that request's own completion (LoadGen `final_query_all_samples_done_time`), rather than the current window that ends at the last request to FINISH. The aggregator always emits a new `loadgen_window_duration_ns` counter, so the flag is compute-only in Report and a saved snapshot stays fully reinterpretable either way. LoadGen QPS/TPS render N/A when the last-issued request failed or never completed (faithfulness-or-nothing). Ref: mlcommons/inference loadgen/results.cc (Server scenario, completed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

viraatc · 2026-06-23T10:09:13Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: standard | Commit: 40dbf898

Found 4 findings (1 already fixed in this push). No critical/high blockers — the core numerator/denominator logic faithfully matches loadgen/results.cc (Server completed). Posting as a synthesized summary; items 2–3 are design-level decisions for the author rather than line bugs.

#	File	Severity	Category	Reviewer(s)	Finding
1	`config/templates/*.yaml`	medium	~~bug~~ fixed	Codex	Resolved in `40dbf898`. Templates had been regenerated by a system-python `pre-commit` run against a newer installed package (`agentic_inference`, `drain_timeout 300`) and omitted the new field — inconsistent with this branch's `multi_turn` enum. Regenerated from the branch's editable env: now `multi_turn` + `legacy_streaming_tps: true`, consistent.
2	`metrics_table.py` (`total_loadgen_window_ns` / block-start)	medium	bug	Claude	LoadGen window starts at `START_PERFORMANCE_TRACKING`, not first issue. For `poisson`/online, the first sample fires at `start + first_delay`, so the window includes one pre-issue inter-arrival gap → LoadGen QPS biased low by ~`1/(N-1)` (negligible at large N, material for short runs). `max_throughput`/`concurrency` unaffected. LoadGen schedules its first Server query at delta 0. For exact parity, anchor the window start at the first issued sample's `issued_ns`.
3	`metrics_table.py` (`_update_tracked_block`) + `schema.py` help	low	design	Codex + Opus	Failed/hung last-issued request: the window anchors to the last successfully-completed issued request (graceful) rather than rendering N/A. Deliberate (avoids fragile, non-reproducible N/A on a single tail failure; identical to LoadGen on clean runs). Note: the flag help says "last-issued" — should read "last successfully-completed issued request" to match the code.
4	`tests/`	low	testing	Claude	No aggregator-level `ISSUED→ERROR→COMPLETE` integration test exercising the `(completed-failed)` numerator and failed-last-issued exclusion end-to-end (unit tests cover `mark_sample_failed` and the window directly, but not the live event stream).

Disposition: #1 fixed. #2 (Poisson start bias) and #3 (failed-last-issued philosophy + help wording) are author calls — see PR thread. #4 is a nice-to-have follow-up.

github-actions Bot requested review from arekay-nv and nvzhihanj June 23, 2026 09:22

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from b2db0a7 to 0751e69 Compare June 23, 2026 09:43

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 0751e69 to 40dbf89 Compare June 23, 2026 10:08

Conversation

viraatc commented Jun 23, 2026

What

How

Tests

Notes

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

viraatc commented Jun 23, 2026

Review Council — Multi-AI Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant