feat(metrics): add legacy_streaming_tps toggle for LoadGen-parity QPS/TPS #372
viraatc wants to merge 1 commit into
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces support for MLPerf LoadGen Server 'completed' window throughput metrics, including a configuration toggle (legacy_streaming_tps) to switch between legacy and LoadGen-aligned throughput calculations. However, several configuration templates were updated to use agentic_inference and agentic_inference_inline which are not yet supported by the schema in schema.py, leading to validation failures. Additionally, a potential state inconsistency was identified in metrics_table.py when handling out-of-bounds tracking block indices.
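The two validation failures named in this summary can be illustrated without the project itself. The following is a minimal stdlib-only sketch, not the code in schema.py: the `ScorerMethod` members are copied from the option list in the template comment, while `ALLOWED_DATASET_KEYS`, `check_dataset_keys`, and `parse_scorer` are hypothetical stand-ins for what a Pydantic model with `extra="forbid"` and an enum-typed field would do.

```python
from enum import Enum


class ScorerMethod(str, Enum):
    # Members taken from the template's option list; the real enum may differ.
    PASS_AT_1 = "pass_at_1"
    STRING_MATCH = "string_match"
    ROUGE = "rouge"
    CODE_BENCH_SCORER = "code_bench_scorer"
    SHOPIFY_CATEGORY_F1 = "shopify_category_f1"
    VBENCH = "vbench"


# Hypothetical subset of the Dataset fields, used only for this illustration.
ALLOWED_DATASET_KEYS = {"prompt", "accuracy_config", "multi_turn"}


def check_dataset_keys(raw: dict) -> None:
    """Reject keys the schema does not declare (stand-in for extra="forbid")."""
    unknown = sorted(set(raw) - ALLOWED_DATASET_KEYS)
    if unknown:
        raise ValueError(f"extra fields not permitted: {unknown}")


def parse_scorer(name: str) -> ScorerMethod:
    """Look up a scorer by value; an unknown name raises ValueError."""
    return ScorerMethod(name)
```

Under these assumptions, `check_dataset_keys({"agentic_inference": None})` and `parse_scorer("agentic_inference_inline")` both raise `ValueError`, which mirrors why the renamed template entries cannot load until the schema is updated to match.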
| prompt: text_input | ||
| accuracy_config: null # Accuracy evaluation settings | ||
| multi_turn: null # Multi-turn conversation configuration | ||
| agentic_inference: null # Agentic inference conversation configuration |
The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields (extra="forbid"). Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.
multi_turn: null # Multi-turn conversation configuration
| system: system_prompt | ||
| accuracy_config: # Accuracy evaluation settings | ||
| eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench | ||
| eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench |
The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
| extractor: boxed_math_extractor # Answer extractor (abcd_extractor, boxed_math_extractor, identity_extractor, python_code_extractor) | ||
| num_repeats: 1 # Repeat dataset N times for evaluation | ||
| multi_turn: null # Multi-turn conversation configuration | ||
| agentic_inference: null # Agentic inference conversation configuration |
The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.
multi_turn: null # Multi-turn conversation configuration
| prompt: text_input | ||
| accuracy_config: null # Accuracy evaluation settings | ||
| multi_turn: null # Multi-turn conversation configuration | ||
| agentic_inference: null # Agentic inference conversation configuration |
The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.
multi_turn: null # Multi-turn conversation configuration
| system: system_prompt | ||
| accuracy_config: # Accuracy evaluation settings | ||
| eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench | ||
| eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench |
The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
| system: system_prompt | ||
| accuracy_config: # Accuracy evaluation settings | ||
| eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench | ||
| eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, agentic_inference_inline, vbench |
The ScorerMethod enum in src/inference_endpoint/config/schema.py does not contain agentic_inference_inline. Specifying it in the comment options or trying to use it will fail validation. Please revert this to the supported options.
eval_method: pass_at_1 # Scorer method | options: pass_at_1, string_match, rouge, code_bench_scorer, shopify_category_f1, vbench
| extractor: boxed_math_extractor # Answer extractor (abcd_extractor, boxed_math_extractor, identity_extractor, python_code_extractor) | ||
| num_repeats: 1 # Repeat dataset N times for evaluation | ||
| multi_turn: null # Multi-turn conversation configuration | ||
| agentic_inference: null # Agentic inference conversation configuration |
The Dataset model in src/inference_endpoint/config/schema.py defines multi_turn instead of agentic_inference, and forbids extra fields. Using agentic_inference here will cause a Pydantic validation error when loading this template. Please revert this to multi_turn.
multi_turn: null # Multi-turn conversation configuration
| if ( | ||
| not row.failed | ||
| and row.issued_ns is not None | ||
| and row.issued_ns > self._loadgen_max_completed_issued_ns | ||
| ): | ||
| self._loadgen_max_completed_issued_ns = row.issued_ns | ||
| self._loadgen_window_end_ns = complete_ns | ||
| idx = row.tracked_block_idx | ||
| if 0 <= idx < len(self.tracked_blocks): | ||
| self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns |
If idx is out of bounds, self._loadgen_window_block_start_ns is not updated, but self._loadgen_max_completed_issued_ns and self._loadgen_window_end_ns are. This can lead to an inconsistent or stale loadgen window calculation. We should only update the loadgen window state if idx is valid.
| if ( | |
| not row.failed | |
| and row.issued_ns is not None | |
| and row.issued_ns > self._loadgen_max_completed_issued_ns | |
| ): | |
| self._loadgen_max_completed_issued_ns = row.issued_ns | |
| self._loadgen_window_end_ns = complete_ns | |
| idx = row.tracked_block_idx | |
| if 0 <= idx < len(self.tracked_blocks): | |
| self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns | |
| idx = row.tracked_block_idx | |
| if ( | |
| not row.failed | |
| and row.issued_ns is not None | |
| and row.issued_ns > self._loadgen_max_completed_issued_ns | |
| and 0 <= idx < len(self.tracked_blocks) | |
| ): | |
| self._loadgen_max_completed_issued_ns = row.issued_ns | |
| self._loadgen_window_end_ns = complete_ns | |
| self._loadgen_window_block_start_ns = self.tracked_blocks[idx].start_ns |
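To make the effect of the suggested change concrete, here is a small self-contained sketch of the all-or-nothing update. It is not the project's `MetricsTable`: the `Block`, `Row`, and `WindowState` classes, the shortened attribute names, and the initial value of `-1` for the max-issued marker are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    start_ns: int


@dataclass
class Row:
    issued_ns: Optional[int]
    tracked_block_idx: int
    failed: bool = False


@dataclass
class WindowState:
    tracked_blocks: List[Block] = field(default_factory=list)
    max_completed_issued_ns: int = -1  # assumed sentinel, not from the PR
    window_end_ns: int = 0
    window_block_start_ns: int = 0

    def on_complete(self, row: Row, complete_ns: int) -> bool:
        """Update all three window fields together, or none of them."""
        idx = row.tracked_block_idx
        if (
            not row.failed
            and row.issued_ns is not None
            and row.issued_ns > self.max_completed_issued_ns
            and 0 <= idx < len(self.tracked_blocks)  # guard moved into the condition
        ):
            self.max_completed_issued_ns = row.issued_ns
            self.window_end_ns = complete_ns
            self.window_block_start_ns = self.tracked_blocks[idx].start_ns
            return True
        return False
```

With the bounds check inside the outer condition, a row whose `tracked_block_idx` is out of range leaves the previous window untouched, so the end timestamp can never be paired with a stale block start.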
| n_samples_to_issue: null # Sample count override | ||
| load_pattern: | ||
| type: concurrency # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step | ||
| type: concurrency # Load pattern type | options: max_throughput, poisson, concurrency, agentic_inference, burst, step |
| n_samples_to_issue: null # Sample count override | ||
| load_pattern: | ||
| type: poisson # Load pattern type | options: max_throughput, poisson, concurrency, multi_turn, burst, step | ||
| type: poisson # Load pattern type | options: max_throughput, poisson, concurrency, agentic_inference, burst, step |
b2db0a7 to 0751e69
…/TPS

Add a `legacy_streaming_tps` runtime flag (default True, so existing behavior is unchanged). When disabled, QPS/TPS match MLPerf LoadGen's Server-scenario "completed" throughput:

    QPS = (completed - failed - 1) / T
    TPS = output_tokens / T

where T is the window from the last-ISSUED request's tracking-block start to that request's own completion (LoadGen `final_query_all_samples_done_time`), rather than the current window that ends at the last request to FINISH.

The aggregator always emits a new `loadgen_window_duration_ns` counter, so the flag is compute-only in Report and a saved snapshot stays fully reinterpretable either way.

LoadGen QPS/TPS render N/A when the last-issued request failed or never completed (faithfulness-or-nothing).

Ref: mlcommons/inference loadgen/results.cc (Server scenario, completed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
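The difference between the two window ends is easiest to see on a tiny timeline. The sketch below follows the wording of this commit message only; the `Request` tuple and both helper functions are hypothetical and do not appear in the PR.

```python
from typing import NamedTuple, Optional, Sequence


class Request(NamedTuple):
    issued_ns: int
    complete_ns: Optional[int]  # None if the request never completed
    failed: bool = False


def legacy_window_end(requests: Sequence[Request]) -> Optional[int]:
    """Legacy window end: the last request to FINISH."""
    finished = [r.complete_ns for r in requests if r.complete_ns is not None]
    return max(finished) if finished else None


def loadgen_window_end(requests: Sequence[Request]) -> Optional[int]:
    """LoadGen-style window end: the last-ISSUED request's own completion.

    Returns None (rendered as N/A) if that request failed or never completed.
    """
    if not requests:
        return None
    last_issued = max(requests, key=lambda r: r.issued_ns)
    if last_issued.failed or last_issued.complete_ns is None:
        return None
    return last_issued.complete_ns


# Three requests; the one issued second is the last to finish.
requests = [
    Request(issued_ns=0, complete_ns=50),
    Request(issued_ns=10, complete_ns=90),
    Request(issued_ns=20, complete_ns=60),
]
```

For this timeline the legacy window ends at 90 (the slow second request), while the LoadGen-style window ends at 60 (the completion of the request issued last), so the LoadGen denominator `T` is shorter.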
0751e69 to 40dbf89
Review Council: Multi-AI Code Review
Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: standard | Commit:
Found 4 findings (1 already fixed in this push). No critical/high blockers; the core numerator/denominator logic faithfully matches
Disposition: #1 fixed. #2 (Poisson start bias) and #3 (failed-last-issued philosophy + help wording) are author calls; see PR thread. #4 is a nice-to-have follow-up.
What

Adds a `legacy_streaming_tps` runtime flag (default `True`, so existing behavior is unchanged). When set to `false` (`--legacy-streaming-tps false`), QPS/TPS are computed to match MLPerf LoadGen's Server-scenario "completed" throughput:

- `QPS = (completed − failed − 1) / T`
- `TPS = output_tokens / T`

where `T` (`loadgen_window_duration_ns`) spans from the performance-tracking start to the completion of the last-issued request that completed successfully, the analog of LoadGen's `final_query_all_samples_done_time`. On a clean run (every request completes) this is exactly the last-issued request.

Today the metric uses `tracked_duration_ns` (start → last request to finish) with `n_samples_completed` and no fencepost, so relative to LoadGen it (a) ends the window later (at the last completion rather than the last-issued completion) and (b) counts failed-but-completed samples in the numerator.

Reference: `mlcommons/inference` `loadgen/results.cc`, Server completed: `(sample_count-1)/final_query_all_samples_done_time` and `token_count/final_query_all_samples_done_time` (tokens are not decremented).

How

- The aggregator always emits a `loadgen_window_duration_ns` counter alongside `tracked_duration_ns` (computed in `MetricsTable`). The snapshot therefore stays config-agnostic and fully reinterpretable; the flag is compute-only in `Report` and is never plumbed into the aggregator subprocess.
- `MetricsTable` tracks, among successfully-completed tracked samples, the one with the max `issued_ns`, and records its completion as the window end.
- `SampleRow.failed` (set on the `ERROR` event, which precedes `COMPLETE`) excludes failures from the window.
- `Report` gains `qps_legacy`/`qps_loadgen` (and `tps_*`); `qps()`/`tps()` dispatch on the flag. Legacy bodies are unchanged.
- LoadGen QPS/TPS render N/A when … `completed − failed < 2`; or more than one performance-tracking block exists (the numerator counters are global, so a single-block window would be inconsistent).

Tests

- `tests/unit/metrics/test_report_builder.py`: legacy unchanged, loadgen numerator/denominator, failure subtraction, `N<2` → N/A, window-absent → N/A.
- `tests/unit/async_utils/services/metrics_aggregator/test_metrics_table.py`: window anchoring under out-of-order completion, failed last-issued, never-completing last-issued, no-successful-completions → 0, multi-block → 0.
- `mypy`/`ruff` clean.

Notes

- The default is `True`, so this is opt-in and changes nothing unless enabled.
- The flag (`legacy_streaming_tps`) is recorded in `config.yaml` and the Report JSON; the snapshot carries both durations, so a saved run is reinterpretable either way.
- … `main` (`14ca0541`); the change is additive but a rebase onto current `main` before merge may be wanted.

🤖 Generated with Claude Code
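As a rough illustration of the compute-only dispatch described under "How", the sketch below computes both variants from one snapshot. It is not the PR's `Report` class: the `Snapshot` dataclass, the `n_samples_failed` and `output_tokens` field names, the use of `None` for N/A, and the convention that a zero `loadgen_window_duration_ns` means "window absent" are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional

NS_PER_S = 1_000_000_000


@dataclass
class Snapshot:
    n_samples_completed: int
    n_samples_failed: int            # hypothetical field name
    output_tokens: int               # hypothetical field name
    tracked_duration_ns: int
    loadgen_window_duration_ns: int  # assumed to be 0 when the window is absent


def qps_legacy(s: Snapshot) -> Optional[float]:
    """Legacy QPS: all completed samples over the tracked duration, no fencepost."""
    if s.tracked_duration_ns <= 0:
        return None
    return s.n_samples_completed / (s.tracked_duration_ns / NS_PER_S)


def qps_loadgen(s: Snapshot) -> Optional[float]:
    """LoadGen-style QPS: (completed - failed - 1) / T, or None for N/A."""
    n = s.n_samples_completed - s.n_samples_failed
    if s.loadgen_window_duration_ns <= 0 or n < 2:
        return None
    return (n - 1) / (s.loadgen_window_duration_ns / NS_PER_S)


def tps_loadgen(s: Snapshot) -> Optional[float]:
    """LoadGen-style TPS: output_tokens / T (tokens are not decremented)."""
    n = s.n_samples_completed - s.n_samples_failed
    if s.loadgen_window_duration_ns <= 0 or n < 2:
        return None
    return s.output_tokens / (s.loadgen_window_duration_ns / NS_PER_S)


def qps(s: Snapshot, legacy_streaming_tps: bool = True) -> Optional[float]:
    """Dispatch on the flag; the snapshot itself never changes."""
    return qps_legacy(s) if legacy_streaming_tps else qps_loadgen(s)
```

Because both durations live in the snapshot, the same saved run can be reported either way by flipping only the `legacy_streaming_tps` argument, which is the reinterpretability property the Notes section describes.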