
Add DSV4 FP4 GB300 dynamo-sglang MTP disagg benchmarks#1297

Open
ch-wan wants to merge 17 commits into main from sglang-disagg-gb300-mtp-0507

Conversation


@ch-wan ch-wan commented May 7, 2026

Summary

  • Adds 6 disagg MTP recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/ (low-latency 1p1d-tp4 / 1p6d-dep4-tp4 + mid-curve dep4-dep8/dep16 with 1p, 2p, 4p prefill)
  • Wires them into dsv4-fp4-gb300-dynamo-sglang-mtp in .github/configs/nvidia-master.yaml, each entry carrying spec-decoding: "mtp" and the corresponding topology
  • Recipes adapted from elvischenv/srt-slurm@dsv4-gb300-disagg-8k1k-mtp, repointed at the public lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev container and the deepseek-v4-pro model alias

Test plan

  • /sweep on this PR — verify the matrix dispatches the 6 new MTP entries
  • Confirm the dsv4-fp4-gb300-dynamo-sglang-mtp rows appear in the sweep matrix listing
  • Eval-only entry (max-conc) produces lm-eval scores

🤖 Generated with Claude Code

@ch-wan ch-wan requested a review from a team May 7, 2026 18:02

github-actions Bot commented May 7, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please open a documentation PR there first before we merge your PR into the master branch. Let's keep the documentation first-class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@ch-wan ch-wan changed the title from "Add DeepSeek V4 Pro FP4 GB300 disaggregated SGLang MTP benchmarks" to "Add DSV4 FP4 GB300 dynamo-sglang MTP disagg benchmarks" May 7, 2026

github-actions Bot commented May 7, 2026

Comment thread on perf-changelog.yaml (Outdated)
model:
path: "deepseek-v4-pro"
container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
precision: "mxfp4"

🟡 All 6 new MTP recipes set model.precision: "mxfp4", but every existing sibling dsv4 SGLang recipe in benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/ uses precision: "fp4", even though they share the same moe-runner-backend: flashinfer_mxfp4. The matrix entry dsv4-fp4-gb300-dynamo-sglang-mtp itself also declares precision: fp4. Nit: align all 6 MTP recipes to precision: "fp4" to match the established convention. This is metadata-only — InferenceX aggregation keys off the matrix-level precision, not the recipe yaml — so runtime impact is minimal.

Extended reasoning...

What the inconsistency is

Each of the 6 new files at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/*.yaml has:

model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
  precision: "mxfp4"

Whereas all 6 pre-existing sibling recipes at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-*.yaml use precision: "fp4" (line 37 of each), despite carrying the same moe-runner-backend: "flashinfer_mxfp4" setting in their sglang_config. The matrix entry added in .github/configs/nvidia-master.yaml for these MTP recipes also uses precision: fp4, and AGENTS.md lists only fp4 and fp8 as recognized precisions in the project.

Step-by-step proof of the divergence

  1. Open benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/disagg-low-latency-1p1d-tp4-tp4.yaml line 15: precision: "mxfp4".
  2. Open benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-dep4-dep8-3-c256.yaml (or any of the 6 sibling recipes added in Update DeepSeek V4 Pro FP4 GB300 disaggregated SGLang benchmarks #1295) around line 37: precision: "fp4".
  3. Both files set moe-runner-backend: "flashinfer_mxfp4" in their sglang_config.decode blocks.
  4. Open .github/configs/nvidia-master.yaml at the new dsv4-fp4-gb300-dynamo-sglang-mtp: block: precision: fp4.

So within the same PR, the matrix says fp4 and the recipe yamls say mxfp4, while the equivalent non-MTP sibling recipes that share the same MoE backend say fp4 at the recipe level too. That is a copy-paste inconsistency with the established convention.

Addressing the refutation: what the runtime impact actually is

The refutation correctly notes that InferenceX's own aggregation pipelines (utils/summarize.py, utils/collect_eval_results.py, utils/matrix_logic/generate_sweep_configs.py, launch_gb300-cw.sh) key off the matrix-level precision field from nvidia-master.yaml, not the recipe yaml's model.precision. Since the matrix entry is correctly fp4, in-repo aggregation/labeling is unaffected — the original framing of "confusing labels in eval/result aggregation pipelines" overstates the impact. The recipe-level field is consumed externally by srt-slurm/srtctl, and the upstream source (elvischenv/srt-slurm@dsv4-gb300-disagg-8k1k-mtp) presumably accepts mxfp4. So this is not a runtime breakage.

Why it's still worth fixing

It is purely a cross-recipe metadata uniformity nit: every sibling dsv4 SGLang recipe in the same directory tree, even ones using the identical flashinfer_mxfp4 MoE backend, declares precision: "fp4" at the recipe level. The mxfp4 label here will trip up future grep-based audits and contradicts the project-wide enum in AGENTS.md. The fix is to replace precision: "mxfp4" with precision: "fp4" on line 15 of all 6 new MTP recipes — no other change required.
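For reference, the corrected model block in each MTP recipe would then read (a sketch assembled from the recipe snippet quoted above; only the precision value changes):

```yaml
model:
  path: "deepseek-v4-pro"
  container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev"
  precision: "fp4"  # was "mxfp4"; matches the sibling 8k1k recipes and the matrix entry
```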

@ch-wan ch-wan force-pushed the sglang-disagg-gb300-mtp-0507 branch from ea35b7b to ce53cf1 Compare May 8, 2026 06:06

ch-wan commented May 8, 2026

/sweep


github-actions Bot commented May 8, 2026

@ch-wan Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25540780423
Command: ``
Pinned ref: dba5e0d
Approval: not required (trusted collaborator).


ch-wan commented May 8, 2026

/sweep


github-actions Bot commented May 8, 2026

@ch-wan Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25541720592
Command: ``
Pinned ref: 5e30f2c
Approval: not required (trusted collaborator).


ch-wan commented May 8, 2026

/sweep


github-actions Bot commented May 8, 2026

@ch-wan Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25542023314
Command: ``
Pinned ref: ff0df99
Approval: not required (trusted collaborator).

@ch-wan ch-wan closed this May 8, 2026

github-actions Bot commented May 8, 2026

… custom_tokenizer

NVIDIA/srt-slurm#144 adds __call__ / __getattr__ to
SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics
(benchmark_serving.py:657 — `tokenizer(text).input_ids`) can count
generated tokens for DSv4-Pro multi-node MTP runs without throwing
``TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable``.

Until that PR merges, pin gb300-cw's sglang launcher to
``ch-wan/srt-slurm @ c901ad38`` (the same fix), and restore
``custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer``
in the 6 MTP recipes. ``use_chat_template: true`` is required by
AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw
random tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
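The shape of the NVIDIA/srt-slurm#144 fix can be sketched as follows. This is illustrative only — the helper classes and toy token ids here are invented for the sketch, not the actual patch — but it shows how adding `__call__` and `__getattr__` to a tokenizer wrapper makes sa-bench's `tokenizer(text).input_ids` pattern work:

```python
class _Encoding:
    """Minimal stand-in for a HF-style BatchEncoding with an input_ids field."""
    def __init__(self, input_ids):
        self.input_ids = input_ids


class _InnerTokenizer:
    """Illustrative inner tokenizer; the real one wraps the DSv4 vocab."""
    def encode(self, text):
        return [ord(c) for c in text]  # toy token ids for demonstration


class SGLangDeepseekV4Tokenizer:
    def __init__(self, inner=None):
        self._inner = inner or _InnerTokenizer()

    def __call__(self, text, **kwargs):
        # Makes the wrapper callable, as calculate_metrics in
        # benchmark_serving.py expects (`tokenizer(text).input_ids`).
        return _Encoding(self._inner.encode(text))

    def __getattr__(self, name):
        # Delegate any attribute we don't define to the wrapped tokenizer.
        return getattr(self._inner, name)


tok = SGLangDeepseekV4Tokenizer()
n_tokens = len(tok("hi").input_ids)  # no longer raises "object is not callable"
```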

ch-wan and others added 2 commits May 8, 2026 14:40
…ain)

Pinned to the multi-arch image produced by sgl-project/sglang Build and
Push Development Docker Images run #25574279419 (head_sha 2cf1a4ab,
HEAD of sglang main). Replaces the older staging image
lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev (May 7).

The nightly-dev-cu13 image carries the full sglang main as of 2026-05-08
21:06 UTC, including upstream fixes since the May-7 staging snapshot.
Multi-arch manifest covers amd64 + arm64, so it works on the gb300
(Grace) compute nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit accidentally bumped the non-MTP base entry's
image too. The base 8k1k recipes still pin
``container: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev``,
and the launcher requires the matrix's ``image:`` to match the
recipe's ``container:`` (it templates ``\"\${IMAGE}\": \${SQUASH_FILE}``
into srtslurm.yaml). Mismatching them would break the base sweep.

Only the dsv4-fp4-gb300-dynamo-sglang-mtp entry needs the
nightly-dev-cu13 bump (paired with the MTP recipe ``container:``
field).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 8, 2026

The 6 MTP recipes were imported with dynamo hash 9d3c913d from the
upstream srt-slurm fork, but the working non-MTP base recipes already
on this branch use 81d0555ee23519cea80a42b4fe824e30368b7300 — paired
with the sglang nightly cu13 main builds.

The 9d3c913d wheel is incompatible with sglang main 2cf1a4ab: the
decode scheduler subprocess (rank 0) is SIGQUIT'd during sgl.Engine()
init at dynamo.sglang.init_llm:77, surfacing as "Rank 0 scheduler died
during initialization (exit code: -3)" in CI run 25580956722.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 8, 2026

The 4 multi-node decode MTP recipes had a comment saying
SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 was "intentionally NOT set", but
sglang main 2cf1a4ab defaults this on. CAR_V2 is single-node only,
and on multi-node decode it silently fails to construct its backing
``self.obj``, then segfaults during cuda graph capture:

  AttributeError: 'CustomAllReduceV2' object has no attribute 'obj'
    at custom_all_reduce_v2.py:97 in capture()

The scheduler is SIGQUIT'd, surfacing as
"Rank 0 scheduler died during initialization (exit code: -3)" in
dynamo's wrapper. Explicitly setting the env to "0" matches the
intent of the pre-existing comment.

Affects: dep4-dep8, dep4-dep16, 2p1d-dep4-dep8, 4p1d-dep4-dep8.
Single-node decode recipes (1p1d-tp4-tp4, 1p6d-dep4-tp4) keep the
default since CAR_V2 works in single-node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
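Concretely, the fix in each of the four affected multi-node decode recipes would look something like this sketch (the `decode_environment` block name follows the recipe fields discussed later in this thread; surrounding keys omitted):

```yaml
decode_environment:
  # CAR_V2 is single-node only; sglang main 2cf1a4ab defaults it on,
  # so multi-node decode must disable it explicitly rather than rely
  # on a comment saying it is "intentionally NOT set".
  SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"
```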

github-actions Bot commented May 8, 2026

Apply the same explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"``
to the existing 8k1k base decode recipes that had only the
``intentionally NOT set`` comment. The MTP fix in 6d28994 proved the
comment-only pattern is brittle: sglang main 2cf1a4ab defaults the
env on, and CAR_V2 segfaults during cuda graph capture on multi-node
decode. Make the disable explicit so a future image bump on the base
sweep can't regress the same way.

Affects 6 recipes: 1p1d-tp4-tp4-2-c1, 1p1d-dep4-dep16-5-c1024,
4p1d-dep4-dep16-8-c1024, 8p1d-dep4-dep16-12-c4096,
10p1d-dep4-dep16-14-c8192, 12p1d-dep4-dep12-15-c21504.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 8, 2026

sglang main 2cf1a4ab moved ``SGLANG_ENABLE_THINKING`` →
``SGLANG_DEFAULT_THINKING`` and ``SGLANG_REASONING_EFFORT`` →
``SGLANG_DSV4_REASONING_EFFORT``. The deprecation helper
``_print_deprecated_env`` (environ.py:642) only emits a warning — it
does NOT propagate the value to the new name. So the old env vars
were silently ignored: server defaulted to non-thinking mode with
empty reasoning effort, dropping GSM8K accuracy from ~95% to ~40%
(eval_results_all from run 25583345967: em_strict=0.4291 for
1p6d-dep4-tp4 conc=64, 0.4056 for 4p1d-dep4-dep8 conc=1024).

Set both names in prefill_environment and decode_environment of all
six MTP recipes:
  * old names — read by the sa-bench client tokenizer
    (sa_bench_tokenizers.sglang_deepseek_v4) for prompt-rendering
    parity with the server.
  * new names — read by the sglang server in 2cf1a4ab+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
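The resulting dual-name environment block would look roughly like this (a sketch — the "1"/"high" values are illustrative, not taken from the recipes; the same block goes in both prefill_environment and decode_environment):

```yaml
prefill_environment:
  SGLANG_ENABLE_THINKING: "1"           # old name: read by the sa-bench client tokenizer
  SGLANG_DEFAULT_THINKING: "1"          # new name: read by the sglang server on 2cf1a4ab+
  SGLANG_REASONING_EFFORT: "high"       # old name
  SGLANG_DSV4_REASONING_EFFORT: "high"  # new name
```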

github-actions Bot commented May 9, 2026

…egression fix)

GSM8K accuracy on the latest sweep dropped from the expected ~95% to
~40% (em_strict=0.4291 for 1p6d-dep4-tp4 conc=64; 0.4056 for
4p1d-dep4-dep8 conc=1024 — run 25583345967 eval_results_all).

Inspecting samples_gsm8k_*.jsonl revealed every response was prefixed
with junk like "Weapon:" / "Weaponized" / "We黑白颠倒", and the
reasoning often answered a different question than what was asked —
classic symptom of a malformed chat-template prompt.

Root cause in sglang main 2cf1a4ab
(entrypoints/openai/serving_chat.py:296):

    def _resolve_chat_encoding_spec(self) -> Optional[str]:
        if self.tool_call_parser == "deepseekv4":
            return "dsv4"
        if self.tool_call_parser == "deepseekv32":
            return "dsv32"

The dsv4 chat-encoding spec — which routes DSV4 prompts through
``encoding_dsv4.encode_messages`` with thinking-mode and
reasoning-effort handling — only activates when
``--tool-call-parser deepseekv4`` is set. Without it the server falls
back to the vanilla HF chat template (``apply_chat_template``), which
doesn't know about DSV4's special tokens, ``<think>`` blocks, or the
``thinking_mode`` argument. The MTP recipes never set this flag, so
ServerArgs reports ``tool_call_parser=None`` and the model receives a
malformed prompt.

Add ``tool-call-parser: deepseekv4`` to both prefill and decode
``sglang_config`` blocks in all 6 MTP recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
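In recipe terms, the addition would look roughly like this (sketch; the prefill/decode sub-blocks of sglang_config follow the structure described earlier in this thread, with other keys omitted):

```yaml
sglang_config:
  prefill:
    tool-call-parser: deepseekv4  # activates the dsv4 chat-encoding spec
  decode:
    tool-call-parser: deepseekv4
```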

github-actions Bot commented May 9, 2026

Restore the 6 base recipes to their state on origin/main; the
explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: \"0\"`` was added
defensively in 9c4c244, but the base sweep is happy on its current
staging-dev image and shouldn't be touched in this PR.

Reverts files:
  disagg-gb300-1p1d-tp4-tp4-2-c1.yaml
  disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml
  disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml
  disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml
  disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml
  disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 9, 2026

- Drop ``SGLANG_ENABLE_THINKING`` / ``SGLANG_REASONING_EFFORT``
  (deprecated since sglang main 2cf1a4ab); keep only the new names
  ``SGLANG_DEFAULT_THINKING`` / ``SGLANG_DSV4_REASONING_EFFORT``.
- Bump the srt-slurm fork pin to 51847632 so the sa-bench client
  tokenizer reads the new env names (with old names as fallback).
- Trim multi-line block comments down to one-line tail comments
  for the CAR_V2 disable and ``tool-call-parser: deepseekv4`` flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>