Fix ModelOpt MCP Slurm launcher submit#1799
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a GPU smoke test script and example YAML for Slurm/CUDA validation. Expands ChangesLauncher and MCP Bridge
Sequence Diagram(s)sequenceDiagram
participant Client
participant submit_job_impl
participant subprocess
participant _launcher_reported_error
participant ID_Parsers
Client->>submit_job_impl: submit job
submit_job_impl->>subprocess: run launcher process
subprocess-->>submit_job_impl: returncode, stdout, stderr
submit_job_impl->>_launcher_reported_error: check stderr for fatal error
_launcher_reported_error-->>submit_job_impl: error_detected (bool)
alt returncode != 0 or error_detected
submit_job_impl-->>Client: failure (launch_py_failed)
else
submit_job_impl->>ID_Parsers: parse experiment_id
ID_Parsers-->>submit_job_impl: experiment_id or None
submit_job_impl->>ID_Parsers: parse slurm_job_id
ID_Parsers-->>submit_job_impl: slurm_job_id or None
alt both ids found
submit_job_impl-->>Client: success
else id not found
submit_job_impl-->>Client: failure (launch_result_unparsed)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/mcp/modelopt_mcp/bridge.py (1)
712-733: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winDo not return Slurm success without an
experiment_id.Line 712 only fails when both IDs are absent, so stdout containing only
Submitted batch job 123returnsok=Truewithexperiment_id=None. The MCPsubmit_jobcontract tells clients to polljob_status(experiment_id), so this becomes an unmonitorable success response. Treat missingexperiment_idas a distinct partial/unparsed failure, or add a client-facing status path keyed byslurm_job_id.🐛 Possible shape-preserving fix
- if not experiment_id and not slurm_job_id: + if not experiment_id: return { "ok": False, "executor": "slurm", "reason": "launch_result_unparsed", "exit_code": 0, "stdout_tail": stdout_tail, "stderr_tail": stderr_tail, + "slurm_job_id": slurm_job_id, "diagnostic": ( - "launch.py exited 0 but did not report an experiment_id " - "or Slurm job id. Treating this as failed so callers do " - "not assume work was submitted." + "launch.py exited 0 but did not report an experiment_id. " + "Treating this as failed so callers do not receive a " + "Slurm success response that cannot be monitored via " + "job_status." ), "argv": argv, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 712 - 733, The condition at line 712 that checks for missing IDs only fails when both experiment_id and slurm_job_id are absent, but this allows a response with ok=True and experiment_id=None to be returned when only slurm_job_id is present. Since the MCP contract requires experiment_id for job status polling, modify the condition to treat a missing experiment_id as a distinct failure case. Change the condition from checking "not experiment_id and not slurm_job_id" to also fail when experiment_id is missing (regardless of whether slurm_job_id exists), and update the failure response diagnostic message to clearly indicate that experiment_id was not found or parsed from the launch output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 905-910: The roots list in the launcher experiment discovery logic
is missing the launcher's own experiments directory that is documented in the
docstring. Add back the root directory for the launcher's experiments folder to
the roots list. This should be included alongside the existing roots for
NEMORUN_HOME/experiments, ./experiments, and ./local_experiments to ensure jobs
created under the launcher's working directory can be properly resolved.
- Around line 905-917: The experiment_id parameter is used unsafely in path
operations without validation, allowing potential path traversal attacks (via
sequences like ..) and glob pattern injection. Before using experiment_id in the
path joins at line 912 (root / experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.
---
Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 712-733: The condition at line 712 that checks for missing IDs
only fails when both experiment_id and slurm_job_id are absent, but this allows
a response with ok=True and experiment_id=None to be returned when only
slurm_job_id is present. Since the MCP contract requires experiment_id for job
status polling, modify the condition to treat a missing experiment_id as a
distinct failure case. Change the condition from checking "not experiment_id and
not slurm_job_id" to also fail when experiment_id is missing (regardless of
whether slurm_job_id exists), and update the failure response diagnostic message
to clearly indicate that experiment_id was not found or parsed from the launch
output.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8e8aa525-4735-4ed1-b0bc-8d5aebf3abf8
📒 Files selected for processing (7)
tools/launcher/common/smoke/nvidia_smi.shtools/launcher/core.pytools/launcher/examples/smoke/nvidia_smi.yamltools/launcher/launch.pytools/launcher/tests/test_core.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/tests/test_bridge.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1799 +/- ##
==========================================
- Coverage 77.29% 76.45% -0.84%
==========================================
Files 511 511
Lines 56513 58554 +2041
==========================================
+ Hits 43681 44768 +1087
- Misses 12832 13786 +954
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
98b4113 to
350fdab
Compare
|
/claude review |
| if m: | ||
| slurm_job_id = m.group(1) | ||
|
|
||
| if not experiment_id and not slurm_job_id: |
There was a problem hiding this comment.
[IMPORTANT Compatibility] This only fails when both ids are missing, so a Slurm submit that prints Submitted batch job 123 / Job id: 123 but whose experiment-id line drifts out of all four regexes returns ok=True with experiment_id=None.
(1) What: The submit_job MCP contract tells callers to poll job_status(experiment_id), and job_status_impl / _resolve_experiment_dir are keyed entirely on experiment_id. A success response with experiment_id=None is unmonitorable — the caller believes work was submitted but can never check its status or fetch logs.
(2) Why it matters: This is exactly the "format may shift across versions" scenario the surrounding comment acknowledges. On any future nemo_run output drift that breaks the experiment-id patterns while leaving the Slurm Job id: line intact, every submit silently becomes a dead-end success.
(3) Fix: Treat a missing experiment_id as the unparsed/partial failure (fail when not experiment_id, surfacing slurm_job_id in the payload for manual scancel), or add a status path keyed by slurm_job_id. This corroborates CodeRabbit's same finding on this line.
There was a problem hiding this comment.
Claude review — scope: all 7 changed files reviewed (tools/launcher/core.py, launch.py, examples, common, tests; tools/mcp/modelopt_mcp/bridge.py; tools/mcp/tests/test_bridge.py). This PR is entirely launcher/MCP tooling — no modelopt/ algorithm, mode-registration, config-schema, or export code is touched, so the Algorithm/ModeState/Export categories do not apply.
Findings: CRITICAL: 0 | IMPORTANT: 1 | SUGGESTION: 0
Most impactful finding:
- [IMPORTANT Compatibility] bridge.py:712 — submit_job_impl returns ok=True with experiment_id=None whenever the four experiment-id regexes all miss but a Slurm "Job id:"/"Submitted batch job" line is present. Since job_status/job_logs are keyed entirely on experiment_id, that response is an unmonitorable success — the caller can never poll status or fetch logs. Recommend failing on missing experiment_id (surfacing slurm_job_id for manual cleanup) or adding a slurm_job_id-keyed status path. (CodeRabbit independently flagged the same line.)
Notes on things that checked out:
- core.py patching of SandboxTask0–SandboxTask4: each is its own @DataClass, so per-class dataclass_fields/annotations/init.annotations patching is correct — no shared-mutable aliasing. Test coverage extended appropriately.
- launch.py packaging rewrite (relative globs to on-disk-conditional absolute paths): the always-added examples/common paths flow through the exact same _add_package_path mechanism the author end-to-end smoke job exercised (common/smoke/nvidia_smi.sh packaged + ran), giving good evidence the PatternPackager semantics hold.
- Fatal-error-on-exit-0 detection (_launcher_reported_error) and the dry-run mirror are sound and well-tested.
- The path-traversal/glob-injection concern on experiment_id is already raised by CodeRabbit; not duplicating it here.
Overall risk: LOW. Tooling-only change, well-tested, with one should-fix contract gap on the submit-to-poll path.
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
350fdab to
bfcf6fb
Compare
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tools/mcp/modelopt_mcp/bridge.py (2)
360-368: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick winTerminate SSH option parsing before the user-supplied target.
argvlist form avoids shell injection, but it does not stopsshfrom treating acluster_hostthat starts with-as another option; options like-oProxyCommand=...can execute a local command. Add--before the target and reject whitespace/leading-option targets at the MCP boundary. As per coding guidelines, "Validate external input once at the interface boundary."🛡️ Proposed hardening
if identity: argv += ["-i", identity] target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host - argv += [target, "whoami && hostname"] + if target.startswith("-") or any(ch.isspace() for ch in target): + return { + "ok": False, + "executor": "slurm", + "cluster_host": cluster_host, + "cluster_user": cluster_user, + "reason": "invalid_ssh_target", + "diagnostic": "cluster_host/cluster_user must form a single non-option SSH target.", + } + argv += ["--", target, "whoami && hostname"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 360 - 368, The SSH command construction does not properly terminate option parsing before the user-supplied target, which allows targets starting with `-` to be interpreted as SSH options (like `-oProxyCommand=...`) enabling command injection. Add `--` to the argv list immediately before the target variable to signal the end of SSH options, and add input validation to reject cluster_host and cluster_user values that contain whitespace or start with `-` at the point where these values are initially received from external input, before they are used in the subprocess.run call.Source: Coding guidelines
698-757: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winValidate the parsed
experiment_idbefore returning submit success.The new
Experiment.from_id(...)/Experiment Status for ...patterns accept arbitrary non-space strings, butjob_status_implandjob_logs_implonly accept_SAFE_EXPERIMENT_ID_RE. Without validating here, submit can returnok=Truewith an ID that callers cannot poll. As per coding guidelines, "Validate external input once at the interface boundary."🧩 Proposed validation
else: m = re.search(r"Job id:\s*(\d+)", stdout_tail, re.IGNORECASE) if m: slurm_job_id = m.group(1) + if experiment_id: + invalid = _validate_experiment_id(experiment_id) + if invalid: + return { + **invalid, + "executor": "slurm", + "exit_code": 0, + "stdout_tail": stdout_tail, + "stderr_tail": stderr_tail, + "argv": argv, + } + if not experiment_id: return { "ok": False,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 698 - 757, The code parses experiment_id from various regex patterns but does not validate that the parsed ID conforms to the safety requirements before returning success. Add validation to check if the parsed experiment_id matches the _SAFE_EXPERIMENT_ID_RE pattern before accepting it as valid. If the experiment_id exists but does not match the safe pattern, treat it as a failed parse by returning the same failure response structure that is currently returned when experiment_id is empty or not found.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/core.py`:
- Around line 399-409: Remove all inline `# nosec` directives from the _git_info
function, including the `# nosec B404` comment on the subprocess import
statement and the `# nosec B603 B607` comments on both subprocess.run calls.
Either refactor the implementation to legitimately pass Bandit security checks
without these suppressions, or if the current subprocess approach is necessary,
route the security exception through the approved security-exception process in
PR metadata instead of using inline code comments that violate repo policy.
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Line 38: Remove all `# nosec B404` inline suppression comments from the
subprocess import statement and all other locations where they appear (lines 38,
239, 293, 368, 614, 642, 855, 1280, 1459, 1485 as noted in the comment).
According to repository policy, inline Bandit bypasses are not permitted; either
remove the suppressions entirely so Bandit passes without them, or work with the
code owner to formally document and route a security exception through the
proper approval process before merge.
- Around line 936-951: The function `_resolve_experiment_dir()` now searches
multiple root directories including the launcher package's experiments directory
(via launcher_dir), but the error diagnostic message shown to users when an
experiment is not found has not been updated to match. Locate the failure
diagnostic in the error handling code that follows the root search loop (around
lines 973-984) and update the error message to include all the roots being
searched: NEMORUN_HOME/experiments, current working directory experiments,
current working directory local_experiments, and the launcher package's
experiments directory. This ensures users see all the locations that were
checked rather than only the older roots.
---
Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 360-368: The SSH command construction does not properly terminate
option parsing before the user-supplied target, which allows targets starting
with `-` to be interpreted as SSH options (like `-oProxyCommand=...`) enabling
command injection. Add `--` to the argv list immediately before the target
variable to signal the end of SSH options, and add input validation to reject
cluster_host and cluster_user values that contain whitespace or start with `-`
at the point where these values are initially received from external input,
before they are used in the subprocess.run call.
- Around line 698-757: The code parses experiment_id from various regex patterns
but does not validate that the parsed ID conforms to the safety requirements
before returning success. Add validation to check if the parsed experiment_id
matches the _SAFE_EXPERIMENT_ID_RE pattern before accepting it as valid. If the
experiment_id exists but does not match the safe pattern, treat it as a failed
parse by returning the same failure response structure that is currently
returned when experiment_id is empty or not found.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 00f70104-4e10-4552-b650-a8cc0c1948f1
📒 Files selected for processing (7)
tools/launcher/common/smoke/nvidia_smi.shtools/launcher/core.pytools/launcher/examples/smoke/nvidia_smi.yamltools/launcher/launch.pytools/launcher/tests/test_core.pytools/mcp/modelopt_mcp/bridge.pytools/mcp/tests/test_bridge.py
✅ Files skipped from review due to trivial changes (1)
- tools/launcher/examples/smoke/nvidia_smi.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- tools/launcher/common/smoke/nvidia_smi.sh
- tools/launcher/launch.py
- tools/launcher/tests/test_core.py
| import os | ||
| import re | ||
| import subprocess # nosec B404 | ||
| import subprocess # nosec B404 - fixed-argv CLI probes are required; shell=True is not used. |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Remove the Bandit # nosec suppressions or route a formal exception.
These changed call sites still suppress Bandit with # nosec; repo policy disallows inline bypasses even for fixed-argv subprocess probes. Remove the suppressions and make Bandit pass, or document the required security exception with setup-codeowner approval before merge. As per coding guidelines, "# nosec comments are not allowed as a bypass for security checks."
Also applies to: 239-239, 293-293, 368-368, 614-614, 642-642, 855-855, 1280-1280, 1459-1459, 1485-1485
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` at line 38, Remove all `# nosec B404`
inline suppression comments from the subprocess import statement and all other
locations where they appear (lines 38, 239, 293, 368, 614, 642, 855, 1280, 1459,
1485 as noted in the comment). According to repository policy, inline Bandit
bypasses are not permitted; either remove the suppressions entirely so Bandit
passes without them, or work with the code owner to formally document and route
a security exception through the proper approval process before merge.
Source: Coding guidelines
|
/ok to test |
Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
|
CI is green on |
Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
|
Added managed source checkout support in |
|
/ok to test aabd9cf |
|
## Summary - make Slurm ModelOpt source overlay mounts conditional on source-checkout mode - force managed-source MCP launches to reinstall `modelopt-launcher` from the selected checkout so source refs do not reuse a stale cached package - add regression coverage for installed-mode Slurm mounts and managed-source launcher argv construction ## Root cause PR #1799 correctly stopped packaging `modules/Model-Optimizer/*` when `modelopt-launcher` runs from an installed package. However, `build_slurm_executor()` still unconditionally added container mounts for `code/modules/Model-Optimizer/modelopt` and `modelopt_recipes`. In installed MCP mode those paths do not exist in the remote package, so the container runtime fails before the job script starts. A second issue appeared during validation: the MCP managed-source path can materialize the right git checkout but still execute a cached `modelopt-launcher` package with the same version. Adding `uv run --reinstall-package modelopt-launcher` ensures the selected source ref is what actually runs. ## Validation - `uv run pytest tests/test_bridge.py -q` from `tools/mcp`: 51 passed - `uv run pytest tests/test_slurm_executor.py tests/test_core.py -q` from `tools/launcher`: 24 passed - `pre-commit run --files tools/launcher/core.py tools/launcher/tests/test_slurm_executor.py tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.py`: passed - Live Slurm GPU smoke validated with patched launcher path; `nvidia-smi` ran successfully and the smoke script completed. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **Refactor** * Restructured container mount assembly for Slurm job execution to conditionally mount ModelOpt directories based on optional source path parameter. * Enhanced launcher command-line generation with package management improvements. * Replaced unconditional mount paths with conditional behavior for more flexible resource utilization. * **Tests** * Expanded container mount scenario test coverage for installed and source execution modes. * Tightened test assertions for comprehensive mount behavior verification. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Summary
slurm_factorycorrectly for task slotsmodelopt-mcpsubmit parsing/status resolution and add regression coverage for launcher false-positive success casesnvidia-smismoke YAML/script and fix launcher packaging so source-backed Slurm jobs package required files recursivelyValidation
uv run pytest tests/test_core.py -quv run pytest tests/test_bridge.py -qcw_dfwnvidia-smiran successfully in-container)Summary by CodeRabbit