Fix ModelOpt MCP Slurm launcher submit by ChenhanYu · Pull Request #1799 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-23T00:35:49Z

Summary

- fix launcher Slurm task annotation patching so nemo-run CLI resolves slurm_factory correctly for task slots
- harden modelopt-mcp submit parsing/status resolution and add regression coverage for launcher false-positive success cases
- add a minimal nvidia-smi smoke YAML/script and fix launcher packaging so source-backed Slurm jobs package required files recursively

Validation

- uv run pytest tests/test_core.py -q
- uv run pytest tests/test_bridge.py -q
- dry-run and live-submit validated through the patched local MCP server on cw_dfw
- interactive smoke job succeeded end-to-end (nvidia-smi ran successfully in-container)

Summary by CodeRabbit

New Features
- Added an NVIDIA SMI GPU smoke test (script and minimal Slurm YAML example) for launcher integration.
Bug Fixes
- Improved detection of fatal launcher errors, including when the launcher exits with code 0.
- Strengthened Slurm experiment/job identifier parsing and added early rejection of unsafe experiment IDs, with clearer “unparsed”/failure behavior.
- Updated sandbox task Slurm config type handling and improved launcher packaging so examples/common are included consistently.
Tests
- Expanded unit and filesystem-based coverage for parsing/validation, dry-run fatal stderr handling, and nested experiment directory layouts.

coderabbitai · 2026-06-23T00:36:03Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds a GPU smoke test script and example YAML for Slurm/CUDA validation. Expands set_slurm_config_type() to patch all SandboxTask0–SandboxTask4 slots. Makes launcher packaging paths filesystem-conditional via new helper functions. Refactors _git_info() to parse git metadata directly from the filesystem instead of subprocess. Hardens MCP bridge Slurm submission parsing with fatal-error detection, multi-pattern experiment/job-id extraction, launch_result_unparsed fallback, and nested experiment directory resolution.

Changes

Launcher and MCP Bridge

| Layer / File(s) | Summary |
| --- | --- |
| GPU smoke test script and example YAML `tools/launcher/common/smoke/nvidia_smi.sh`, `tools/launcher/examples/smoke/nvidia_smi.yaml` | Adds bash smoke script that runs `hostname` and `nvidia-smi` with strict execution and start/end markers, and YAML configuring a 1-node/1-GPU Slurm job running that script inside a CUDA 12.4 container. |
| SandboxTask slurm_config annotation patching for all task slots `tools/launcher/core.py`, `tools/launcher/tests/test_core.py` | `set_slurm_config_type()` now iterates over `SandboxTask` and `SandboxTask0`–`SandboxTask4`, patching `__dataclass_fields__`, `__annotations__`, and `__init__` for each. Tests are extended to assert patching for `SandboxTask2`–`SandboxTask4`. |
| Launcher packaging paths and git-info refactoring `tools/launcher/launch.py`, `tools/launcher/core.py` | Imports `glob`. `_include_pattern` and `_relative_path` are initialized empty and populated by new `_add_package_path()` and `_add_package_glob()` helpers that only append paths present on disk, replacing hard-coded glob patterns. Base `examples` and `common` are always added; Megatron-LM and Model-Optimizer paths added only when modelopt source is detected. `_git_info()` refactored to parse `.git/HEAD`, resolve refs, and read commit hashes directly from git directory without subprocess, falling back to `"unknown"` on error. `--clean` subprocess call reformatted with multi-line argument list. |
| MCP bridge: error detection and experiment-ID validation helpers `tools/mcp/modelopt_mcp/bridge.py` | Adds `_SAFE_EXPERIMENT_ID_RE` and `_validate_experiment_id()` for safe ID validation. Introduces `_LAUNCHER_ERROR_RE` and `_launcher_reported_error()` to detect fatal launcher output independent of exit code. Adds `_find_launcher_package_dir()` for fallback experiment directory lookup. Updates subprocess import `# nosec` annotation. |
| MCP bridge: Slurm submission failure detection and output parsing `tools/mcp/modelopt_mcp/bridge.py` | Extends `submit_job_impl` and `_submit_job_dry_run` to fail on detected fatal launcher errors. Reworks experiment-id extraction to try `Experiment.from_id()` first, then multiple regex patterns; adds case-insensitive `"Job id: ..."` job-id variant; returns `launch_result_unparsed` when neither id is found, preserving any partial `slurm_job_id` and output tails. |
| MCP bridge: experiment directory resolution and job-operation validation `tools/mcp/modelopt_mcp/bridge.py` | `_resolve_experiment_dir` searches direct and nested paths via `glob(f"*/{experiment_id}")` across multiple candidate roots from `_experiment_search_roots()`. `_experiment_not_found_diagnostic()` centralizes search description. `job_status_impl` and `job_logs_impl` validate experiment IDs early, returning structured failures for unsafe IDs. Additional `# nosec` annotation refinements. |
| MCP bridge tests for parsing and validation paths `tools/mcp/tests/test_bridge.py` | Adds tests covering zero-exit-no-ids failure, Nemo "Experiment Status" id extraction, fatal stderr as failure even with exit 0, dry-run fatal-text validation, nested `experiments/<title>/<id>` resolution, launcher-directory fallback, and unsafe experiment-id rejection. |
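
The `_git_info()` row above describes reading git metadata from the filesystem instead of shelling out. The following is a minimal sketch of that idea, not the PR's code: the helper name `read_head_commit` is hypothetical, and it assumes a plain `.git` directory (no worktrees or `gitdir:` indirection).

```python
from pathlib import Path


def read_head_commit(git_dir: Path) -> str:
    """Return the commit hash HEAD points at, or "unknown" on any I/O error.

    Hypothetical helper for illustration; assumes a plain .git directory.
    """
    try:
        head = (git_dir / "HEAD").read_text().strip()
        if not head.startswith("ref:"):
            # A detached HEAD stores the commit hash directly.
            return head or "unknown"
        ref = head.split(":", 1)[1].strip()
        loose = git_dir / ref
        if loose.is_file():
            return loose.read_text().strip() or "unknown"
        # Refs may be packed into one file: "<hash> <refname>" per line.
        packed = git_dir / "packed-refs"
        if packed.is_file():
            for line in packed.read_text().splitlines():
                if line.startswith(("#", "^")):
                    continue  # header comments and peeled-tag lines
                parts = line.split(" ", 1)
                if len(parts) == 2 and parts[1].strip() == ref:
                    return parts[0]
        return "unknown"
    except OSError:
        return "unknown"
```

Falling back to `"unknown"` rather than raising mirrors the behavior the table describes, so a missing or unreadable git directory does not block a launch.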

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant submit_job_impl
  participant subprocess
  participant _launcher_reported_error
  participant ID_Parsers
  
  Client->>submit_job_impl: submit job
  submit_job_impl->>subprocess: run launcher process
  subprocess-->>submit_job_impl: returncode, stdout, stderr
  submit_job_impl->>_launcher_reported_error: check stderr for fatal error
  _launcher_reported_error-->>submit_job_impl: error_detected (bool)
  alt returncode != 0 or error_detected
    submit_job_impl-->>Client: failure (launch_py_failed)
  else
    submit_job_impl->>ID_Parsers: parse experiment_id
    ID_Parsers-->>submit_job_impl: experiment_id or None
    submit_job_impl->>ID_Parsers: parse slurm_job_id
    ID_Parsers-->>submit_job_impl: slurm_job_id or None
    alt both ids found
      submit_job_impl-->>Client: success
    else id not found
      submit_job_impl-->>Client: failure (launch_result_unparsed)
    end
  end
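
The first branch of the diagram (fail on a nonzero exit code or on fatal text in stderr) can be sketched as below. The pattern `FATAL_RE` and the function `launch_failed` are assumptions for illustration; the PR's actual `_LAUNCHER_ERROR_RE` is not shown in this thread and may match different text.

```python
import re

# Hypothetical pattern: a Python traceback header, or a line that starts
# with an exception-style name such as "ValueError: ...".
FATAL_RE = re.compile(
    r"^(?:Traceback \(most recent call last\):|\w*(?:Error|Exception): )",
    re.MULTILINE,
)


def launch_failed(returncode: int, stderr: str) -> bool:
    """Report failure on a nonzero exit or fatal text, even when exit code is 0."""
    if returncode != 0:
        return True
    return FATAL_RE.search(stderr) is not None
```

Checking the output in addition to the exit code is what lets a caller reject a launcher run that printed an error but still exited 0.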

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1766: Both PRs modify MCP bridge experiment directory resolution and launcher output parsing to support more flexible path layouts and error handling.

Suggested labels

cherry-pick-done, cherry-pick-0.45.0

Suggested reviewers

kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | PR violates SECURITY.md policy: 13 `# nosec` comments were added to bypass Bandit checks in bridge.py and launch.py without approval from `@NVIDIA/modelopt-setup-codeowners` or security justification... | Remove all `# nosec` comments and request explicit review/approval from `@NVIDIA/modelopt-setup-codeowners` with security justification in PR description before merging. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main objective: fixing ModelOpt MCP Slurm launcher submit behavior, which encompasses the three core areas documented in PR objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.72% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Comment `@coderabbitai help` to get the list of available commands.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tools/mcp/modelopt_mcp/bridge.py (1)

712-733: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not return Slurm success without an experiment_id.

Line 712 only fails when both IDs are absent, so stdout containing only Submitted batch job 123 returns ok=True with experiment_id=None. The MCP submit_job contract tells clients to poll job_status(experiment_id), so this becomes an unmonitorable success response. Treat missing experiment_id as a distinct partial/unparsed failure, or add a client-facing status path keyed by slurm_job_id.

🐛 Possible shape-preserving fix

-    if not experiment_id and not slurm_job_id:
+    if not experiment_id:
         return {
             "ok": False,
             "executor": "slurm",
             "reason": "launch_result_unparsed",
             "exit_code": 0,
             "stdout_tail": stdout_tail,
             "stderr_tail": stderr_tail,
+            "slurm_job_id": slurm_job_id,
             "diagnostic": (
-                "launch.py exited 0 but did not report an experiment_id "
-                "or Slurm job id. Treating this as failed so callers do "
-                "not assume work was submitted."
+                "launch.py exited 0 but did not report an experiment_id. "
+                "Treating this as failed so callers do not receive a "
+                "Slurm success response that cannot be monitored via "
+                "job_status."
             ),
             "argv": argv,
         }
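
For readers following the contract argument, here is a self-contained sketch of the stricter behavior the suggestion describes: try several patterns for each ID, and fail whenever `experiment_id` is missing while still surfacing any `slurm_job_id`. The pattern lists and function names are assumptions; only the `Job id:` regex and the output phrases come from this thread, and the real launcher output may differ.

```python
import re
from typing import Any, Dict, List, Optional, Pattern

# Hypothetical patterns modeled on the output phrases quoted in this review.
EXPERIMENT_PATTERNS: List[Pattern[str]] = [
    re.compile(r"Experiment\.from_id\(['\"]([^'\"]+)['\"]\)"),
    re.compile(r"Experiment Status for (\S+)"),
]
JOB_PATTERNS: List[Pattern[str]] = [
    re.compile(r"Submitted batch job (\d+)"),
    re.compile(r"Job id:\s*(\d+)", re.IGNORECASE),
]


def first_match(patterns: List[Pattern[str]], text: str) -> Optional[str]:
    """Return the first capture group from the first pattern that matches."""
    for pattern in patterns:
        m = pattern.search(text)
        if m:
            return m.group(1)
    return None


def parse_submit_output(stdout: str) -> Dict[str, Any]:
    experiment_id = first_match(EXPERIMENT_PATTERNS, stdout)
    slurm_job_id = first_match(JOB_PATTERNS, stdout)
    if not experiment_id:
        # Without an experiment_id the job cannot be polled, so report
        # failure but keep the Slurm job id for manual cleanup.
        return {
            "ok": False,
            "reason": "launch_result_unparsed",
            "slurm_job_id": slurm_job_id,
        }
    return {"ok": True, "experiment_id": experiment_id, "slurm_job_id": slurm_job_id}
```

With this shape, output containing only `Submitted batch job 123` yields a failure that still carries `"slurm_job_id": "123"`.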

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 712 - 733, The condition at
line 712 that checks for missing IDs only fails when both experiment_id and
slurm_job_id are absent, but this allows a response with ok=True and
experiment_id=None to be returned when only slurm_job_id is present. Since the
MCP contract requires experiment_id for job status polling, modify the condition
to treat a missing experiment_id as a distinct failure case. Change the
condition from checking "not experiment_id and not slurm_job_id" to also fail
when experiment_id is missing (regardless of whether slurm_job_id exists), and
update the failure response diagnostic message to clearly indicate that
experiment_id was not found or parsed from the launch output.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 905-910: The roots list in the launcher experiment discovery logic
is missing the launcher's own experiments directory that is documented in the
docstring. Add back the root directory for the launcher's experiments folder to
the roots list. This should be included alongside the existing roots for
NEMORUN_HOME/experiments, ./experiments, and ./local_experiments to ensure jobs
created under the launcher's working directory can be properly resolved.
- Around line 905-917: The experiment_id parameter is used unsafely in path
operations without validation, allowing potential path traversal attacks (via
sequences like ..) and glob pattern injection. Before using experiment_id in the
path joins at line 912 (root / experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.
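
A minimal sketch of the validation described in the second item, assuming an allow-list of alphanumerics, hyphens, and underscores. The names `SAFE_ID_RE` and `validate_experiment_id` and the `invalid_experiment_id` reason string are hypothetical, not the PR's identifiers.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical allow-list: one path token made of letters, digits, "_" and "-".
SAFE_ID_RE = re.compile(r"[A-Za-z0-9_-]+")


def validate_experiment_id(experiment_id: Any) -> Optional[Dict[str, Any]]:
    """Return None when the ID is safe, else a structured failure payload."""
    if isinstance(experiment_id, str) and SAFE_ID_RE.fullmatch(experiment_id):
        return None
    return {
        "ok": False,
        "reason": "invalid_experiment_id",
        "diagnostic": "experiment_id must contain only letters, digits, '-' and '_'.",
    }
```

`fullmatch` is used rather than `match` with a trailing `$`, because `$` in Python also matches just before a final newline. An allow-list rejects path separators, `..`, and glob metacharacters without having to enumerate them.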

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e8aa525-4735-4ed1-b0bc-8d5aebf3abf8

📥 Commits

Reviewing files that changed from the base of the PR and between 090b1c5 and 69043a6.

📒 Files selected for processing (7)

tools/launcher/common/smoke/nvidia_smi.sh
tools/launcher/core.py
tools/launcher/examples/smoke/nvidia_smi.yaml
tools/launcher/launch.py
tools/launcher/tests/test_core.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/tests/test_bridge.py

codecov · 2026-06-23T00:55:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.45%. Comparing base (090b1c5) to head (aabd9cf).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1799      +/-   ##
==========================================
- Coverage   77.29%   76.45%   -0.84%     
==========================================
  Files         511      511              
  Lines       56513    58554    +2041     
==========================================
+ Hits        43681    44768    +1087     
- Misses      12832    13786     +954

| Flag | Coverage Δ | |
| --- | --- | --- |
| regression | `14.73% <ø> (+0.06%)` | ⬆️ |
| unit | `54.66% <ø> (+0.13%)` | ⬆️ |

ChenhanYu · 2026-06-23T13:53:22Z

/claude review

claude · 2026-06-23T14:05:30Z

+        if m:
+            slurm_job_id = m.group(1)
+
+    if not experiment_id and not slurm_job_id:


[IMPORTANT Compatibility] This only fails when both ids are missing, so a Slurm submit that prints Submitted batch job 123 / Job id: 123 but whose experiment-id line drifts out of all four regexes returns ok=True with experiment_id=None.

(1) What: The submit_job MCP contract tells callers to poll job_status(experiment_id), and job_status_impl / _resolve_experiment_dir are keyed entirely on experiment_id. A success response with experiment_id=None is unmonitorable — the caller believes work was submitted but can never check its status or fetch logs.

(2) Why it matters: This is exactly the "format may shift across versions" scenario the surrounding comment acknowledges. On any future nemo_run output drift that breaks the experiment-id patterns while leaving the Slurm Job id: line intact, every submit silently becomes a dead-end success.

(3) Fix: Treat a missing experiment_id as the unparsed/partial failure (fail when not experiment_id, surfacing slurm_job_id in the payload for manual scancel), or add a status path keyed by slurm_job_id. This corroborates CodeRabbit's same finding on this line.

claude

Claude review — scope: all 7 changed files reviewed (tools/launcher/core.py, launch.py, examples, common, tests; tools/mcp/modelopt_mcp/bridge.py; tools/mcp/tests/test_bridge.py). This PR is entirely launcher/MCP tooling — no modelopt/ algorithm, mode-registration, config-schema, or export code is touched, so the Algorithm/ModeState/Export categories do not apply.

Findings: CRITICAL: 0 | IMPORTANT: 1 | SUGGESTION: 0

Most impactful finding:

[IMPORTANT Compatibility] bridge.py:712 — submit_job_impl returns ok=True with experiment_id=None whenever the four experiment-id regexes all miss but a Slurm "Job id:"/"Submitted batch job" line is present. Since job_status/job_logs are keyed entirely on experiment_id, that response is an unmonitorable success — the caller can never poll status or fetch logs. Recommend failing on missing experiment_id (surfacing slurm_job_id for manual cleanup) or adding a slurm_job_id-keyed status path. (CodeRabbit independently flagged the same line.)

Notes on things that checked out:

core.py patching of SandboxTask0–SandboxTask4: each is its own `@dataclass`, so per-class `__dataclass_fields__`/`__annotations__`/`__init__.__annotations__` patching is correct, with no shared-mutable aliasing. Test coverage extended appropriately.
launch.py packaging rewrite (relative globs to on-disk-conditional absolute paths): the always-added examples/common paths flow through the exact same _add_package_path mechanism that the author's end-to-end smoke job exercised (common/smoke/nvidia_smi.sh was packaged and ran), giving good evidence that the PatternPackager semantics hold.
Fatal-error-on-exit-0 detection (_launcher_reported_error) and the dry-run mirror are sound and well-tested.
The path-traversal/glob-injection concern on experiment_id is already raised by CodeRabbit; not duplicating it here.

Overall risk: LOW. Tooling-only change, well-tested, with one should-fix contract gap on the submit-to-poll path.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tools/mcp/modelopt_mcp/bridge.py (2)

360-368: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Terminate SSH option parsing before the user-supplied target.

argv list form avoids shell injection, but it does not stop ssh from treating a cluster_host that starts with - as another option; options like -oProxyCommand=... can execute a local command. Add -- before the target and reject whitespace/leading-option targets at the MCP boundary. As per coding guidelines, "Validate external input once at the interface boundary."

🛡️ Proposed hardening

     if identity:
         argv += ["-i", identity]
     target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
-    argv += [target, "whoami && hostname"]
+    if target.startswith("-") or any(ch.isspace() for ch in target):
+        return {
+            "ok": False,
+            "executor": "slurm",
+            "cluster_host": cluster_host,
+            "cluster_user": cluster_user,
+            "reason": "invalid_ssh_target",
+            "diagnostic": "cluster_host/cluster_user must form a single non-option SSH target.",
+        }
+    argv += ["--", target, "whoami && hostname"]
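
A standalone sketch of the same hardening, for illustration only. The function name `build_ssh_probe_argv` and the `BatchMode=yes` option are assumptions, and this version raises `ValueError` where the bridge would return a structured failure; it also checks the host and user separately, as the prompt below recommends.

```python
from typing import List, Optional


def _check_token(name: str, value: str) -> None:
    """Reject values that ssh could parse as an option or split on whitespace."""
    if not value or value.startswith("-") or any(ch.isspace() for ch in value):
        raise ValueError(f"{name} must be a single non-option token")


def build_ssh_probe_argv(
    cluster_host: str,
    cluster_user: Optional[str] = None,
    identity: Optional[str] = None,
) -> List[str]:
    """Build a list-form argv for a connectivity probe (hypothetical helper)."""
    _check_token("cluster_host", cluster_host)
    if cluster_user:
        _check_token("cluster_user", cluster_user)
    argv = ["ssh", "-o", "BatchMode=yes"]
    if identity:
        argv += ["-i", identity]
    target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
    # "--" ends option parsing, so the target is never read as an ssh option.
    argv += ["--", target, "whoami && hostname"]
    return argv
```

The two defenses are complementary: validation rejects bad input at the boundary, and `--` keeps the argv safe even if a future caller bypasses that validation.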

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 360 - 368, The SSH command
construction does not properly terminate option parsing before the user-supplied
target, which allows targets starting with `-` to be interpreted as SSH options
(like `-oProxyCommand=...`) enabling command injection. Add `--` to the argv
list immediately before the target variable to signal the end of SSH options,
and add input validation to reject cluster_host and cluster_user values that
contain whitespace or start with `-` at the point where these values are
initially received from external input, before they are used in the
subprocess.run call.

Source: Coding guidelines

698-757: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the parsed experiment_id before returning submit success.

The new Experiment.from_id(...) / Experiment Status for ... patterns accept arbitrary non-space strings, but job_status_impl and job_logs_impl only accept _SAFE_EXPERIMENT_ID_RE. Without validating here, submit can return ok=True with an ID that callers cannot poll. As per coding guidelines, "Validate external input once at the interface boundary."

🧩 Proposed validation

     else:
         m = re.search(r"Job id:\s*(\d+)", stdout_tail, re.IGNORECASE)
         if m:
             slurm_job_id = m.group(1)
 
+    if experiment_id:
+        invalid = _validate_experiment_id(experiment_id)
+        if invalid:
+            return {
+                **invalid,
+                "executor": "slurm",
+                "exit_code": 0,
+                "stdout_tail": stdout_tail,
+                "stderr_tail": stderr_tail,
+                "argv": argv,
+            }
+
     if not experiment_id:
         return {
             "ok": False,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 698 - 757, The code parses
experiment_id from various regex patterns but does not validate that the parsed
ID conforms to the safety requirements before returning success. Add validation
to check if the parsed experiment_id matches the _SAFE_EXPERIMENT_ID_RE pattern
before accepting it as valid. If the experiment_id exists but does not match the
safe pattern, treat it as a failed parse by returning the same failure response
structure that is currently returned when experiment_id is empty or not found.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/core.py`:
- Around line 399-409: Remove all inline `# nosec` directives from the _git_info
function, including the `# nosec B404` comment on the subprocess import
statement and the `# nosec B603 B607` comments on both subprocess.run calls.
Either refactor the implementation to legitimately pass Bandit security checks
without these suppressions, or if the current subprocess approach is necessary,
route the security exception through the approved security-exception process in
PR metadata instead of using inline code comments that violate repo policy.

In `@tools/mcp/modelopt_mcp/bridge.py`:
- Line 38: Remove all `# nosec B404` inline suppression comments from the
subprocess import statement and all other locations where they appear (lines 38,
239, 293, 368, 614, 642, 855, 1280, 1459, 1485 as noted in the comment).
According to repository policy, inline Bandit bypasses are not permitted; either
remove the suppressions entirely so Bandit passes without them, or work with the
code owner to formally document and route a security exception through the
proper approval process before merge.
- Around line 936-951: The function `_resolve_experiment_dir()` now searches
multiple root directories including the launcher package's experiments directory
(via launcher_dir), but the error diagnostic message shown to users when an
experiment is not found has not been updated to match. Locate the failure
diagnostic in the error handling code that follows the root search loop (around
lines 973-984) and update the error message to include all the roots being
searched: NEMORUN_HOME/experiments, current working directory experiments,
current working directory local_experiments, and the launcher package's
experiments directory. This ensures users see all the locations that were
checked rather than only the older roots.
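
The last item is easier to keep correct if the search and the diagnostic share one list of roots. A minimal sketch under that assumption follows; the function names are hypothetical, and it presumes `experiment_id` has already been validated as a single safe path token.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_experiment_dir(experiment_id: str, roots: Sequence[Path]) -> Optional[Path]:
    """Find <root>/<id> or <root>/<title>/<id> under the first root that has it.

    Assumes experiment_id was validated upstream (no separators or glob characters).
    """
    for root in roots:
        if not root.is_dir():
            continue
        direct = root / experiment_id
        if direct.is_dir():
            return direct
        # Nested layout: experiments/<title>/<id>
        for nested in sorted(root.glob(f"*/{experiment_id}")):
            if nested.is_dir():
                return nested
    return None


def not_found_diagnostic(experiment_id: str, roots: Sequence[Path]) -> str:
    """Describe every root that was searched, built from the same list."""
    searched = ", ".join(str(r) for r in roots)
    return f"experiment {experiment_id!r} not found; searched: {searched}"
```

Because both functions take the same `roots` argument, adding a root (such as the launcher package's experiments directory) updates the error message automatically.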

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 00f70104-4e10-4552-b650-a8cc0c1948f1

📥 Commits

Reviewing files that changed from the base of the PR and between 350fdab and bfcf6fb.

📒 Files selected for processing (7)

tools/launcher/common/smoke/nvidia_smi.sh
tools/launcher/core.py
tools/launcher/examples/smoke/nvidia_smi.yaml
tools/launcher/launch.py
tools/launcher/tests/test_core.py
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/tests/test_bridge.py

✅ Files skipped from review due to trivial changes (1)

tools/launcher/examples/smoke/nvidia_smi.yaml

🚧 Files skipped from review as they are similar to previous changes (3)

tools/launcher/common/smoke/nvidia_smi.sh
tools/launcher/launch.py
tools/launcher/tests/test_core.py

coderabbitai · 2026-06-23T14:37:40Z

 import os
 import re
-import subprocess  # nosec B404
+import subprocess  # nosec B404 - fixed-argv CLI probes are required; shell=True is not used.


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Remove the Bandit # nosec suppressions or route a formal exception.

These changed call sites still suppress Bandit with # nosec; repo policy disallows inline bypasses even for fixed-argv subprocess probes. Remove the suppressions and make Bandit pass, or document the required security exception with setup-codeowner approval before merge. As per coding guidelines, "# nosec comments are not allowed as a bypass for security checks."

Also applies to: 239-239, 293-293, 368-368, 614-614, 642-642, 855-855, 1280-1280, 1459-1459, 1485-1485

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` at line 38, Remove all `# nosec B404` inline suppression comments from the subprocess import statement and all other locations where they appear (lines 38, 239, 293, 368, 614, 642, 855, 1280, 1459, 1485 as noted in the comment). According to repository policy, inline Bandit bypasses are not permitted; either remove the suppressions entirely so Bandit passes without them, or work with the code owner to formally document and route a security exception through the proper approval process before merge.

Source: Coding guidelines

ChenhanYu · 2026-06-23T15:27:46Z

/ok to test

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

copy-pr-bot · 2026-06-23T16:30:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

ChenhanYu · 2026-06-23T17:04:06Z

CI is green on b848c69 and auto-merge is enabled with squash. Direct merge is currently blocked only by required code-owner review from NVIDIA/modelopt-devs; once approved, GitHub should merge the PR automatically.

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

ChenhanYu · 2026-06-23T19:09:22Z

Added managed source checkout support in 7c57d35/aabd9cf: submit_job now accepts source_ref/source_repo, defaults to Model-Optimizer main, materializes a cached recursive-submodule checkout, and runs uv run --project <source_root>/tools/launcher modelopt-launcher .... CI is green on aabd9cf (24 success, 2 skipped); auto-merge remains enabled and is waiting on code-owner review.

ChenhanYu · 2026-06-23T19:13:54Z

/ok to test aabd9cf

github-actions · 2026-06-23T19:36:35Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-23 19:36 UTC

## Summary

- make Slurm ModelOpt source overlay mounts conditional on source-checkout mode
- force managed-source MCP launches to reinstall `modelopt-launcher` from the selected checkout so source refs do not reuse a stale cached package
- add regression coverage for installed-mode Slurm mounts and managed-source launcher argv construction

## Root cause

PR #1799 correctly stopped packaging `modules/Model-Optimizer/*` when `modelopt-launcher` runs from an installed package. However, `build_slurm_executor()` still unconditionally added container mounts for `code/modules/Model-Optimizer/modelopt` and `modelopt_recipes`. In installed MCP mode those paths do not exist in the remote package, so the container runtime fails before the job script starts.

A second issue appeared during validation: the MCP managed-source path can materialize the right git checkout but still execute a cached `modelopt-launcher` package with the same version. Adding `uv run --reinstall-package modelopt-launcher` ensures the selected source ref is what actually runs.

## Validation

- `uv run pytest tests/test_bridge.py -q` from `tools/mcp`: 51 passed
- `uv run pytest tests/test_slurm_executor.py tests/test_core.py -q` from `tools/launcher`: 24 passed
- `pre-commit run --files tools/launcher/core.py tools/launcher/tests/test_slurm_executor.py tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.py`: passed
- Live Slurm GPU smoke validated with patched launcher path; `nvidia-smi` ran successfully and the smoke script completed.

## Summary by CodeRabbit

## Release Notes

* **Refactor**
  * Restructured container mount assembly for Slurm job execution to conditionally mount ModelOpt directories based on optional source path parameter.
  * Enhanced launcher command-line generation with package management improvements.
  * Replaced unconditional mount paths with conditional behavior for more flexible resource utilization.
* **Tests**
  * Expanded container mount scenario test coverage for installed and source execution modes.
  * Tightened test assertions for comprehensive mount behavior verification.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu requested a review from a team as a code owner June 23, 2026 00:35

coderabbitai Bot reviewed Jun 23, 2026

ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch 2 times, most recently from 98b4113 to 350fdab Compare June 23, 2026 01:19

ChenhanYu requested a review from kevalmorabia97 June 23, 2026 13:53

claude Bot reviewed Jun 23, 2026

Fix ModelOpt MCP Slurm launcher submit

bfcf6fb

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch from 350fdab to bfcf6fb Compare June 23, 2026 14:25

coderabbitai Bot reviewed Jun 23, 2026

Address launcher MCP review feedback

98b30ca

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

Fix launcher git info from nested paths

b848c69

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

ChenhanYu enabled auto-merge (squash) June 23, 2026 17:03

ChenhanYu added 2 commits June 23, 2026 11:45

Add managed source checkouts to modelopt MCP

7c57d35

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

Format managed checkout helper

aabd9cf

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>

kevalmorabia97 approved these changes Jun 23, 2026

ChenhanYu merged commit 37dbbda into main Jun 23, 2026
43 checks passed

ChenhanYu deleted the chenhany/fix-modelopt-mcp-slurm-submit branch June 23, 2026 19:36

ChenhanYu mentioned this pull request Jun 24, 2026

Fix launcher Slurm mounts in installed MCP mode #1811

Merged
