
feat: per-dataset max_new_tokens override #356

Merged
nvzhihanj merged 1 commit into mlcommons:release/v0.5 from roborluo:dev-bofengl-per-dataset-max-new-tokens
Jun 16, 2026

@roborluo

What does this PR do?

When running a combined performance + accuracy benchmark in a single --mode both invocation, the two phases need opposite generation caps, but the harness currently exposes only one global model_params.max_new_tokens:

  • Performance phase needs a small cap. max_new_tokens is sent to the server as the per-request max_tokens, and a disaggregated decode scheduler reserves and plans decode KV for that declared upper bound, even though generation actually stops at EOS far sooner. A large cap (e.g. 32768) over-reserves decode KV (~3.2× relative to a 10240 cap), starves admittable decode slots at high concurrency, and triggers KV-transfer-timeout storms on the context→gen path. A realistic small cap avoids this.
  • Accuracy phase needs a large cap; otherwise long reasoning outputs get truncated and scores are artificially deflated. This matches the MLPerf Inference gpt-oss-120b reference, where the performance and accuracy workloads use different token settings (see language/gpt-oss-120b → Model and Dataset download).

Without a per-dataset override, you cannot satisfy both requirements in one --mode both run.
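The ~3.2× figure follows from the two caps quoted above, assuming (as the description states) that the decode scheduler plans KV in proportion to the declared max_tokens of every in-flight request. A minimal sketch of that arithmetic; the function name and the concurrency value are hypothetical and chosen only for illustration:

```python
def reserved_decode_tokens(max_new_tokens: int, concurrency: int) -> int:
    """Upper bound on decode tokens planned for, assuming the scheduler
    reserves the full declared cap for every in-flight request."""
    return max_new_tokens * concurrency


LARGE_CAP = 32768   # large cap quoted in the description
SMALL_CAP = 10240   # smaller cap the description compares against
CONCURRENCY = 256   # hypothetical value, for illustration only

ratio = reserved_decode_tokens(LARGE_CAP, CONCURRENCY) / reserved_decode_tokens(
    SMALL_CAP, CONCURRENCY
)
print(f"over-reservation factor: {ratio:.1f}x")  # prints "over-reservation factor: 3.2x"
```

The factor is independent of concurrency; concurrency only determines how quickly the absolute reservation exhausts the available decode KV.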

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

N/A

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@roborluo roborluo requested a review from a team as a code owner June 13, 2026 05:10
@github-actions

github-actions Bot commented Jun 13, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a per-dataset max_new_tokens override capability to allow performance and accuracy datasets to use different token limits, falling back to the global model_params when unset. The feedback suggests encapsulating the override logic into a helper method get_model_params on the Dataset class to eliminate code duplication across the accuracy and performance dataset loading paths, and adding corresponding unit tests for this helper.
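The helper suggested in this review can be sketched as follows. This is not the PR's code: the PR uses pydantic models and `model_copy(update=...)`, whereas this sketch substitutes frozen standard-library dataclasses and `dataclasses.replace` so that it runs without third-party packages. The `name` field and the default value are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ModelParams:
    """Stand-in for the global model_params (only one field shown)."""
    max_new_tokens: int = 1024  # hypothetical default


@dataclass(frozen=True)
class Dataset:
    """Stand-in for the per-dataset config."""
    name: str  # hypothetical field, for illustration only
    max_new_tokens: Optional[int] = None  # None means "use the global value"

    def get_model_params(self, model_params: ModelParams) -> ModelParams:
        # Fallback: no per-dataset cap, so reuse the global params as-is.
        if self.max_new_tokens is None:
            return model_params
        # Override: return a copy; the global params object is not mutated.
        return replace(model_params, max_new_tokens=self.max_new_tokens)


global_params = ModelParams(max_new_tokens=32768)
perf = Dataset(name="perf", max_new_tokens=10240)
acc = Dataset(name="acc")

perf_params = perf.get_model_params(global_params)
acc_params = acc.get_model_params(global_params)
```

With the logic in one method, both the accuracy and performance load paths reduce to a single call, which is the deduplication the review asks for.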

Comment thread src/inference_endpoint/config/schema.py
Comment on lines +287 to +294
    # Per-dataset max_new_tokens override (falls back to global model_params).
    acc_model_params = (
        config.model_params
        if acc_cfg.max_new_tokens is None
        else config.model_params.model_copy(
            update={"max_new_tokens": acc_cfg.max_new_tokens}
        )
    )

medium

Use the new get_model_params helper method on the Dataset configuration model to simplify the override logic and eliminate duplication.

Suggested change
- # Per-dataset max_new_tokens override (falls back to global model_params).
- acc_model_params = (
-     config.model_params
-     if acc_cfg.max_new_tokens is None
-     else config.model_params.model_copy(
-         update={"max_new_tokens": acc_cfg.max_new_tokens}
-     )
- )
+ acc_model_params = acc_cfg.get_model_params(config.model_params)

Comment on lines +307 to +314
    # Per-dataset max_new_tokens override (falls back to global model_params).
    perf_model_params = (
        config.model_params
        if perf_cfg.max_new_tokens is None
        else config.model_params.model_copy(
            update={"max_new_tokens": perf_cfg.max_new_tokens}
        )
    )

medium

Use the new get_model_params helper method on the Dataset configuration model to simplify the override logic and eliminate duplication.

Suggested change
- # Per-dataset max_new_tokens override (falls back to global model_params).
- perf_model_params = (
-     config.model_params
-     if perf_cfg.max_new_tokens is None
-     else config.model_params.model_copy(
-         update={"max_new_tokens": perf_cfg.max_new_tokens}
-     )
- )
+ perf_model_params = perf_cfg.get_model_params(config.model_params)

Comment thread tests/unit/config/test_schema.py
@arekay-nv

Collaborator

@roborluo Can you look at #344, which addresses the same issue? We can consolidate the two here and merge this one. I think the other one also has the templates correctly populated, which is the check you are failing in CI.

roborluo added a commit to roborluo/endpoints that referenced this pull request Jun 16, 2026
…get_model_params

Address review feedback on PR mlcommons#356:
- Add Dataset.get_model_params(model_params) helper that applies the
  per-dataset max_new_tokens override (falls back to the global
  model_params when unset), removing the duplicated override logic from
  both call sites in benchmark/execute.py.
- Add unit tests for the helper (fallback + override + frozen-source
  preservation).
- Regenerate *_template_full.yaml (the new Dataset.max_new_tokens field
  now appears as `max_new_tokens: null` in the dataset block; the
  model_params comment drops because the field name now collides across
  ModelParams/Dataset and the comment generator skips ambiguous names).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add an optional per-dataset `max_new_tokens` that overrides the global
`model_params.max_new_tokens` (sent as the per-request max_tokens). Lets a
performance dataset use a small cap (avoiding server-side KV
over-reservation/overload at high concurrency) while accuracy datasets use a
larger cap (avoiding truncation of long reasoning output). Falls back to the
global value when unset.

- schema: add Dataset.max_new_tokens (gt=0) and a Dataset.get_model_params()
  helper that applies the override, keeping the logic in one place.
- benchmark/execute: both the accuracy and performance load paths use the
  helper instead of duplicating the override.
- tests: per-dataset field validation + get_model_params() fallback/override.
- templates: regenerate *_template_full.yaml for the new field.
- chore: bump aiohttp 3.14.0 -> 3.14.1 to clear pip-audit CVEs
  (CVE-2026-54273..54280).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
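The `gt=0` constraint on `Dataset.max_new_tokens` mentioned in this commit message can be illustrated without pydantic. The sketch below is an assumption-laden stand-in: it mimics the constraint with a frozen dataclass and a `__post_init__` check, and the names `DatasetCap` and `is_valid_cap` are hypothetical, not part of the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetCap:
    """Hypothetical stand-in holding only the new per-dataset field."""
    max_new_tokens: Optional[int] = None

    def __post_init__(self) -> None:
        # Mimics a gt=0 constraint: None is allowed (fall back to the
        # global value); any explicit value must be strictly positive.
        if self.max_new_tokens is not None and self.max_new_tokens <= 0:
            raise ValueError("max_new_tokens must be greater than 0")


def is_valid_cap(value: Optional[int]) -> bool:
    """Return True if the value would be accepted as a per-dataset cap."""
    try:
        DatasetCap(max_new_tokens=value)
    except ValueError:
        return False
    return True
```

Rejecting zero and negative caps at config-load time means a bad value fails before any requests are sent, rather than surfacing as a server-side error mid-run.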
@roborluo roborluo force-pushed the dev-bofengl-per-dataset-max-new-tokens branch from 83e8461 to 458b8fa Compare June 16, 2026 18:17
@nvzhihanj nvzhihanj merged commit bbc8697 into mlcommons:release/v0.5 Jun 16, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 16, 2026
