launcher: add Nemotron-3-Super-120B-A12B-BF16 MTP vLLM specdec bench config #1714
Walkthrough

A new YAML benchmark configuration file is added for the NVIDIA Nemotron-3-Super-120B-A12B BF16 model. It defines two pipeline tasks.

Estimated code review effort: 1 (Trivial), ~3 minutes
…config

Adds SPEED-bench MTP speculative-decoding YAML for NVIDIA-Nemotron-3-Super-120B-A12B-BF16 via vLLM, covering the qualitative and throughput_32k splits with tp_size=4.

Part of OMNIML-5095 / OMNIML-5098.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
b536d3e to a85498f
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #1714 +/- ##
==========================================
+ Coverage 76.92% 76.95% +0.03%
==========================================
Files 511 511
Lines 56360 56360
==========================================
+ Hits 43356 43373 +17
+ Misses 13004 12987 -17
Adds a SPEED-bench MTP speculative-decoding YAML for `NVIDIA-Nemotron-3-Super-120B-A12B-BF16` via vLLM.

Covers two splits:

- `qualitative`: 32 concurrent, 4096 output tokens
- `throughput_32k`: 8 concurrent, 80 requests, 4096 output tokens

Both tasks run `tp_size=4` on a single 4×H100/A100 node.

Part of OMNIML-5095 / OMNIML-5098.
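As a reading aid, the two splits described here can be written down as plain data. The sketch below is not the YAML added in this PR and does not follow the launcher's schema: the `BenchSplit` class, its field names, and the helper function are hypothetical, and only the numbers come from the description. It also assumes that 4096 is a per-request cap on output tokens, which gives an upper bound on generated tokens for `throughput_32k`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchSplit:
    """Hypothetical container; field names are illustrative, not the launcher's schema."""
    name: str
    concurrency: int                    # simultaneous in-flight requests
    max_output_tokens: int              # assumed to be a per-request cap
    num_requests: Optional[int] = None  # None where the description gives no count
    tp_size: int = 4                    # tensor-parallel size stated for both tasks


# Values taken from the PR description; everything else is an assumption.
SPLITS = [
    BenchSplit("qualitative", concurrency=32, max_output_tokens=4096),
    BenchSplit("throughput_32k", concurrency=8, max_output_tokens=4096, num_requests=80),
]


def max_generated_tokens(split: BenchSplit) -> Optional[int]:
    """Upper bound on output tokens for a split, or None if the request count is unknown."""
    if split.num_requests is None:
        return None
    return split.num_requests * split.max_output_tokens


for split in SPLITS:
    bound = max_generated_tokens(split)
    print(f"{split.name}: tp_size={split.tp_size}, "
          f"concurrency={split.concurrency}, max output tokens={bound}")
```

Under that assumption, `throughput_32k` generates at most 80 × 4096 = 327,680 output tokens, while no bound can be derived for `qualitative` because the description gives no request count for it.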
Test plan
`uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Nemotron-h/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/specdec_bench_mtp_vllm.yaml --dryrun --yes -v`