Skip to content

feat(metrics): add legacy_streaming_tps toggle for LoadGen-parity QPS/TPS#372

Draft
viraatc wants to merge 1 commit into
mainfrom
feat/viraatc-loadgen-tps-parity
Draft

feat(metrics): add legacy_streaming_tps toggle for LoadGen-parity QPS/TPS#372
viraatc wants to merge 1 commit into
mainfrom
feat/viraatc-loadgen-tps-parity

Conversation

@viraatc

@viraatc viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

What

Adds a legacy_streaming_tps runtime flag (default True → existing behavior is unchanged). When set to false (--legacy-streaming-tps false), QPS/TPS are computed to match MLPerf LoadGen's Server-scenario "completed" throughput:

  • QPS = (completed − failed − 1) / T
  • TPS = output_tokens / T

where T (loadgen_window_duration_ns) spans from the performance-tracking start to the completion of the last-issued request that completed successfully — the analog of LoadGen's final_query_all_samples_done_time. On a clean run (every request completes) this is exactly the last-issued request.

Today the metric uses tracked_duration_ns (start → last request to finish) with n_samples_completed and no fencepost, so relative to LoadGen it (a) ends the window later (at the last completion rather than the last-issued completion) and (b) counts failed-but-completed samples in the numerator.

Reference: mlcommons/inference loadgen/results.cc — Server completed: (sample_count-1)/final_query_all_samples_done_time and token_count/final_query_all_samples_done_time (tokens are not decremented).

How

  • Aggregator always emits a new loadgen_window_duration_ns counter alongside tracked_duration_ns (computed in MetricsTable). The snapshot therefore stays config-agnostic and fully reinterpretable; the flag is compute-only in Report and is never plumbed into the aggregator subprocess.
  • MetricsTable tracks, among successfully-completed tracked samples, the one with the max issued_ns, and records its completion as the window end. SampleRow.failed (set on the ERROR event, which precedes COMPLETE) excludes failures from the window.
  • Report gains qps_legacy/qps_loadgen (and tps_*); qps()/tps() dispatch on the flag. Legacy bodies are unchanged.
  • Edge handling — LoadGen QPS/TPS render N/A when: no tracked request completed successfully; completed − failed < 2; or more than one performance-tracking block exists (the numerator counters are global, so a single-block window would be inconsistent).

Tests

  • tests/unit/metrics/test_report_builder.py — legacy unchanged, loadgen numerator/denominator, failure subtraction, N<2 → N/A, window-absent → N/A.
  • tests/unit/async_utils/services/metrics_aggregator/test_metrics_table.py — window anchoring under out-of-order completion, failed last-issued, never-completing last-issued, no-successful-completions → 0, multi-block → 0.
  • 168 unit tests pass; mypy/ruff clean.

Notes

  • Default is True, so this is opt-in and changes nothing unless enabled.
  • The selected view (legacy_streaming_tps) is recorded in config.yaml and the Report JSON; the snapshot carries both durations, so a saved run is reinterpretable either way.
  • Branch is based on an older main (14ca0541); the change is additive but a rebase onto current main before merge may be wanted.

🤖 Generated with Claude Code

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj June 23, 2026 09:22
@github-actions

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for MLPerf LoadGen Server 'completed' window throughput metrics, including a configuration toggle (legacy_streaming_tps) to switch between legacy and LoadGen-aligned throughput calculations. However, several configuration templates were updated to use agentic_inference and agentic_inference_inline which are not yet supported by the schema in schema.py, leading to validation failures. Additionally, a potential state inconsistency was identified in metrics_table.py when handling out-of-bounds tracking block indices.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

prompt: text_input
accuracy_config: null # Accuracy evaluation settings
multi_turn: null # Multi-turn conversation configuration
agentic_inference: null # Agentic inference conversation configuration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields (extra="forbid"). Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

  multi_turn: null  # Multi-turn conversation configuration

system: system_prompt
accuracy_config: # Accuracy evaluation settings
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.

    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench

extractor: boxed_math_extractor # Answer extractor (abcd_extractor, boxed_math_extractor, identity_extractor, python_code_extractor)
num_repeats: 1 # Repeat dataset N times for evaluation
multi_turn: null # Multi-turn conversation configuration
agentic_inference: null # Agentic inference conversation configuration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

  multi_turn: null  # Multi-turn conversation configuration

prompt: text_input
accuracy_config: null # Accuracy evaluation settings
multi_turn: null # Multi-turn conversation configuration
agentic_inference: null # Agentic inference conversation configuration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

  multi_turn: null  # Multi-turn conversation configuration

system: system_prompt
accuracy_config: # Accuracy evaluation settings
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.

    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench

system: system_prompt
accuracy_config: # Accuracy evaluation settings
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.

    eval_method: pass_at_1  # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench

extractor: boxed_math_extractor # Answer extractor (abcd_extractor, boxed_math_extractor, identity_extractor, python_code_extractor)
num_repeats: 1 # Repeat dataset N times for evaluation
multi_turn: null # Multi-turn conversation configuration
agentic_inference: null # Agentic inference conversation configuration

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.

  multi_turn: null  # Multi-turn conversation configuration

Comment on lines +651 to +660
if (
not row.failed
and row.issued_ns is not None
and row.issued_ns > self._loadgen_max_completed_issued_ns
):
self._loadgen_max_completed_issued_ns = row.issued_ns
self._loadgen_window_end_ns = complete_ns
idx = row.tracked_block_idx
if 0 <= idx < len(self.tracked_blocks):
self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If idx is out of bounds, self._loadgen_window_block_start_ns is not updated, but self._loadgen_max_completed_issued_ns and self._loadgen_window_end_ns are. This can lead to an inconsistent or stale loadgen window calculation. We should only update the loadgen window state if idx is valid.

Suggested change
if (
not row.failed
and row.issued_ns is not None
and row.issued_ns > self._loadgen_max_completed_issued_ns
):
self._loadgen_max_completed_issued_ns = row.issued_ns
self._loadgen_window_end_ns = complete_ns
idx = row.tracked_block_idx
if 0 <= idx < len(self.tracked_blocks):
self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns
idx = row.tracked_block_idx
if (
not row.failed
and row.issued_ns is not None
and row.issued_ns > self._loadgen_max_completed_issued_ns
and 0 <= idx < len(self.tracked_blocks)
):
self._loadgen_max_completed_issued_ns = row.issued_ns
self._loadgen_window_end_ns = complete_ns
self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns

n_samples_to_issue: null # Sample count override
load_pattern:
type: concurrency # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step
type: concurrency # Load pattern type | options: max_throughput, poisson, concurrency, agentic_inference, burst, step

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The LoadPatternType enum in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference. Please revert this option in the comment to avoid confusion.

    type: concurrency  # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step

n_samples_to_issue: null # Sample count override
load_pattern:
type: poisson # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step
type: poisson # Load pattern type | options: max_throughput, poisson, concurrency, agentic_inference, burst, step

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The LoadPatternType enum in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference. Please revert this option in the comment to avoid confusion.

    type: poisson  # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step

@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from b2db0a7 to 0751e69 Compare June 23, 2026 09:43
…/TPS

Add a `legacy_streaming_tps` runtime flag (default True, so existing behavior
is unchanged). When disabled, QPS/TPS match MLPerf LoadGen's Server-scenario
"completed" throughput:

  QPS = (completed - failed - 1) / T
  TPS = output_tokens / T

where T is the window from the last-ISSUED request's tracking-block start to
that request's own completion (LoadGen `final_query_all_samples_done_time`),
rather than the current window that ends at the last request to FINISH.

The aggregator always emits a new `loadgen_window_duration_ns` counter, so the
flag is compute-only in Report and a saved snapshot stays fully
reinterpretable either way. LoadGen QPS/TPS render N/A when the last-issued
request failed or never completed (faithfulness-or-nothing).

Ref: mlcommons/inference loadgen/results.cc (Server scenario, completed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 0751e69 to 40dbf89 Compare June 23, 2026 10:08
@viraatc

viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: standard | Commit: 40dbf898

Found 4 findings (1 already fixed in this push). No critical/high blockers — the core numerator/denominator logic faithfully matches loadgen/results.cc (Server completed). Posting as a synthesized summary; items 2–3 are design-level decisions for the author rather than line bugs.

# File Severity Category Reviewer(s) Finding
1 config/templates/*.yaml medium bug fixed Codex Resolved in 40dbf898. Templates had been regenerated by a system-python pre-commit run against a newer installed package (agentic_inference, drain_timeout 300) and omitted the new field — inconsistent with this branch's multi_turn enum. Regenerated from the branch's editable env: now multi_turn + legacy_streaming_tps: true, consistent.
2 metrics_table.py (total_loadgen_window_ns / block-start) medium bug Claude LoadGen window starts at START_PERFORMANCE_TRACKING, not first issue. For poisson/online, the first sample fires at start + first_delay, so the window includes one pre-issue inter-arrival gap → LoadGen QPS biased low by ~1/(N-1) (negligible at large N, material for short runs). max_throughput/concurrency unaffected. LoadGen schedules its first Server query at delta 0. For exact parity, anchor the window start at the first issued sample's issued_ns.
3 metrics_table.py (_update_tracked_block) + schema.py help low design Codex + Opus Failed/hung last-issued request: the window anchors to the last successfully-completed issued request (graceful) rather than rendering N/A. Deliberate (avoids fragile, non-reproducible N/A on a single tail failure; identical to LoadGen on clean runs). Note: the flag help says "last-issued" — should read "last successfully-completed issued request" to match the code.
4 tests/ low testing Claude No aggregator-level ISSUED→ERROR→COMPLETE integration test exercising the (completed-failed) numerator and failed-last-issued exclusion end-to-end (unit tests cover mark_sample_failed and the window directly, but not the live event stream).

Disposition: #1 fixed. #2 (Poisson start bias) and #3 (failed-last-issued philosophy + help wording) are author calls — see PR thread. #4 is a nice-to-have follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant