Fix launcher Slurm mounts in installed MCP mode#1811
Conversation
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthrough
ChangesConditional ModelOpt Mounts in Slurm Executor
MCP Bridge Reinstall Flag
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
/claude review |
There was a problem hiding this comment.
Claude review passed — no blocking issues found.
Traced the fix end-to-end:
launch.pysetsMODELOPT_SRC_PATH=Nonein installed mode and a real path in source mode; this flows throughrun_jobsintobuild_slurm_executor, where the newif modelopt_src_path:guard correctly suppresses themodelopt/modelopt_recipescontainer overlay mounts that don't exist in the remote installed package — the root cause of the pre-job container failure.- Positional arg order in both
run_jobscallers matches the updatedbuild_slurm_executor/build_docker_executorsignatures (newmodelopt_src_pathslot beforeexperiment_title). build_docker_executorbehavior is unchanged (still defaults to cwd), correct for local/dev Docker._launcher_argvadds--reinstall-package modelopt-launcheronly on the managed-source branch; validuv runoption placement, installed launches untouched. The per-launch reinstall cost is the documented intentional tradeoff to avoid running a stale cached package.
Test coverage exercises installed-skip, source-add, and the argv change. Risk is low — scope is confined to tools/ launcher/MCP infra with backward-compatible optional args.
LGTM
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1811 +/- ##
==========================================
- Coverage 56.84% 56.53% -0.32%
==========================================
Files 510 510
Lines 56615 56615
==========================================
- Hits 32183 32006 -177
- Misses 24432 24609 +177
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Summary
modelopt-launcherfrom the selected checkout so source refs do not reuse a stale cached packageRoot cause
PR #1799 correctly stopped packaging
modules/Model-Optimizer/*whenmodelopt-launcherruns from an installed package. However,build_slurm_executor()still unconditionally added container mounts forcode/modules/Model-Optimizer/modeloptandmodelopt_recipes. In installed MCP mode those paths do not exist in the remote package, so the container runtime fails before the job script starts.A second issue appeared during validation: the MCP managed-source path can materialize the right git checkout but still execute a cached
modelopt-launcherpackage with the same version. Addinguv run --reinstall-package modelopt-launcherensures the selected source ref is what actually runs.Validation
uv run pytest tests/test_bridge.py -qfromtools/mcp: 51 passeduv run pytest tests/test_slurm_executor.py tests/test_core.py -qfromtools/launcher: 24 passedpre-commit run --files tools/launcher/core.py tools/launcher/tests/test_slurm_executor.py tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.py: passednvidia-smiran successfully and the smoke script completed.Summary by CodeRabbit
Release Notes
Refactor
Tests