Skip to content

Fix ModelOpt MCP Slurm launcher submit#1799

Merged
ChenhanYu merged 5 commits into
mainfrom
chenhany/fix-modelopt-mcp-slurm-submit
Jun 23, 2026
Merged

Fix ModelOpt MCP Slurm launcher submit#1799
ChenhanYu merged 5 commits into
mainfrom
chenhany/fix-modelopt-mcp-slurm-submit

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • fix launcher Slurm task annotation patching so nemo-run CLI resolves slurm_factory correctly for task slots
  • harden modelopt-mcp submit parsing/status resolution and add regression coverage for launcher false-positive success cases
  • add a minimal nvidia-smi smoke YAML/script and fix launcher packaging so source-backed Slurm jobs package required files recursively

Validation

  • uv run pytest tests/test_core.py -q
  • uv run pytest tests/test_bridge.py -q
  • dry-run and live-submit validated through the patched local MCP server on cw_dfw
  • interactive smoke job succeeded end-to-end (nvidia-smi ran successfully in-container)

Summary by CodeRabbit

  • New Features
    • Added an NVIDIA SMI GPU smoke test (script and minimal Slurm YAML example) for launcher integration.
  • Bug Fixes
    • Improved detection of fatal launcher errors, including when the launcher exits with code 0.
    • Strengthened Slurm experiment/job identifier parsing and added early rejection of unsafe experiment IDs, with clearer “unparsed”/failure behavior.
    • Updated sandbox task Slurm config type handling and improved launcher packaging so examples/common are included consistently.
  • Tests
    • Expanded unit and filesystem-based coverage for parsing/validation, dry-run fatal stderr handling, and nested experiment directory layouts.

@ChenhanYu ChenhanYu requested a review from a team as a code owner June 23, 2026 00:35
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a GPU smoke test script and example YAML for Slurm/CUDA validation. Expands set_slurm_config_type() to patch all SandboxTask0SandboxTask4 slots. Makes launcher packaging paths filesystem-conditional via new helper functions. Refactors _git_info() to parse git metadata directly from the filesystem instead of subprocess. Hardens MCP bridge Slurm submission parsing with fatal-error detection, multi-pattern experiment/job-id extraction, launch_result_unparsed fallback, and nested experiment directory resolution.

Changes

Launcher and MCP Bridge

Layer / File(s) Summary
GPU smoke test script and example YAML
tools/launcher/common/smoke/nvidia_smi.sh, tools/launcher/examples/smoke/nvidia_smi.yaml
Adds bash smoke script that runs hostname and nvidia-smi with strict execution and start/end markers, and YAML configuring a 1-node/1-GPU Slurm job running that script inside a CUDA 12.4 container.
SandboxTask slurm_config annotation patching for all task slots
tools/launcher/core.py, tools/launcher/tests/test_core.py
set_slurm_config_type() now iterates over SandboxTask and SandboxTask0SandboxTask4, patching __dataclass_fields__, __annotations__, and __init__ for each. Tests are extended to assert patching for SandboxTask2SandboxTask4.
Launcher packaging paths and git-info refactoring
tools/launcher/launch.py, tools/launcher/core.py
Imports glob. _include_pattern and _relative_path are initialized empty and populated by new _add_package_path() and _add_package_glob() helpers that only append paths present on disk, replacing hard-coded glob patterns. Base examples and common are always added; Megatron-LM and Model-Optimizer paths added only when modelopt source is detected. _git_info() refactored to parse .git/HEAD, resolve refs, and read commit hashes directly from git directory without subprocess, falling back to "unknown" on error. --clean subprocess call reformatted with multi-line argument list.
MCP bridge: error detection and experiment-ID validation helpers
tools/mcp/modelopt_mcp/bridge.py
Adds _SAFE_EXPERIMENT_ID_RE and _validate_experiment_id() for safe ID validation. Introduces _LAUNCHER_ERROR_RE and _launcher_reported_error() to detect fatal launcher output independent of exit code. Adds _find_launcher_package_dir() for fallback experiment directory lookup. Updates subprocess import # nosec annotation.
MCP bridge: Slurm submission failure detection and output parsing
tools/mcp/modelopt_mcp/bridge.py
Extends submit_job_impl and _submit_job_dry_run to fail on detected fatal launcher errors. Reworks experiment-id extraction to try Experiment.from_id() first, then multiple regex patterns; adds case-insensitive "Job id: ..." job-id variant; returns launch_result_unparsed when neither id is found, preserving any partial slurm_job_id and output tails.
MCP bridge: experiment directory resolution and job-operation validation
tools/mcp/modelopt_mcp/bridge.py
_resolve_experiment_dir searches direct and nested paths via glob(f"*/{experiment_id}") across multiple candidate roots from _experiment_search_roots(). _experiment_not_found_diagnostic() centralizes search description. job_status_impl and job_logs_impl validate experiment IDs early, returning structured failures for unsafe IDs. Additional # nosec annotation refinements.
MCP bridge tests for parsing and validation paths
tools/mcp/tests/test_bridge.py
Adds tests covering zero-exit-no-ids failure, Nemo "Experiment Status" id extraction, fatal stderr as failure even with exit 0, dry-run fatal-text validation, nested experiments/<title>/<id> resolution, launcher-directory fallback, and unsafe experiment-id rejection.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant submit_job_impl
  participant subprocess
  participant _launcher_reported_error
  participant ID_Parsers
  
  Client->>submit_job_impl: submit job
  submit_job_impl->>subprocess: run launcher process
  subprocess-->>submit_job_impl: returncode, stdout, stderr
  submit_job_impl->>_launcher_reported_error: check stderr for fatal error
  _launcher_reported_error-->>submit_job_impl: error_detected (bool)
  alt returncode != 0 or error_detected
    submit_job_impl-->>Client: failure (launch_py_failed)
  else
    submit_job_impl->>ID_Parsers: parse experiment_id
    ID_Parsers-->>submit_job_impl: experiment_id or None
    submit_job_impl->>ID_Parsers: parse slurm_job_id
    ID_Parsers-->>submit_job_impl: slurm_job_id or None
    alt both ids found
      submit_job_impl-->>Client: success
    else id not found
      submit_job_impl-->>Client: failure (launch_result_unparsed)
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1766: Both PRs modify MCP bridge experiment directory resolution and launcher output parsing to support more flexible path layouts and error handling.

Suggested labels

cherry-pick-done, cherry-pick-0.45.0

Suggested reviewers

  • kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error PR violates SECURITY.md policy: 13 # nosec comments were added to bypass Bandit checks in bridge.py and launch.py without approval from @NVIDIA/modelopt-setup-codeowners or security justification... Remove all # nosec comments and request explicit review/approval from @NVIDIA/modelopt-setup-codeowners with security justification in PR description before merging.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main objective: fixing ModelOpt MCP Slurm launcher submit behavior, which encompasses the three core areas documented in PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 83.72% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenhany/fix-modelopt-mcp-slurm-submit

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/mcp/modelopt_mcp/bridge.py (1)

712-733: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not return Slurm success without an experiment_id.

Line 712 only fails when both IDs are absent, so stdout containing only Submitted batch job 123 returns ok=True with experiment_id=None. The MCP submit_job contract tells clients to poll job_status(experiment_id), so this becomes an unmonitorable success response. Treat missing experiment_id as a distinct partial/unparsed failure, or add a client-facing status path keyed by slurm_job_id.

🐛 Possible shape-preserving fix
-    if not experiment_id and not slurm_job_id:
+    if not experiment_id:
         return {
             "ok": False,
             "executor": "slurm",
             "reason": "launch_result_unparsed",
             "exit_code": 0,
             "stdout_tail": stdout_tail,
             "stderr_tail": stderr_tail,
+            "slurm_job_id": slurm_job_id,
             "diagnostic": (
-                "launch.py exited 0 but did not report an experiment_id "
-                "or Slurm job id. Treating this as failed so callers do "
-                "not assume work was submitted."
+                "launch.py exited 0 but did not report an experiment_id. "
+                "Treating this as failed so callers do not receive a "
+                "Slurm success response that cannot be monitored via "
+                "job_status."
             ),
             "argv": argv,
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 712 - 733, The condition at
line 712 that checks for missing IDs only fails when both experiment_id and
slurm_job_id are absent, but this allows a response with ok=True and
experiment_id=None to be returned when only slurm_job_id is present. Since the
MCP contract requires experiment_id for job status polling, modify the condition
to treat a missing experiment_id as a distinct failure case. Change the
condition from checking "not experiment_id and not slurm_job_id" to also fail
when experiment_id is missing (regardless of whether slurm_job_id exists), and
update the failure response diagnostic message to clearly indicate that
experiment_id was not found or parsed from the launch output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 905-910: The roots list in the launcher experiment discovery logic
is missing the launcher's own experiments directory that is documented in the
docstring. Add back the root directory for the launcher's experiments folder to
the roots list. This should be included alongside the existing roots for
NEMORUN_HOME/experiments, ./experiments, and ./local_experiments to ensure jobs
created under the launcher's working directory can be properly resolved.
- Around line 905-917: The experiment_id parameter is used unsafely in path
operations without validation, allowing potential path traversal attacks (via
sequences like ..) and glob pattern injection. Before using experiment_id in the
path joins at line 912 (root / experiment_id) and the glob operation at line 915
(root.glob(f"*/{experiment_id}")), add validation to ensure experiment_id is a
single safe path token containing only alphanumeric characters, hyphens, and
underscores. Reject any values containing path separators, dots for traversal,
or glob metacharacters. Perform this validation at the entry point where
experiment_id is received from the MCP API interface (in job_status and job_logs
functions) rather than within this utility function.

---

Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 712-733: The condition at line 712 that checks for missing IDs
only fails when both experiment_id and slurm_job_id are absent, but this allows
a response with ok=True and experiment_id=None to be returned when only
slurm_job_id is present. Since the MCP contract requires experiment_id for job
status polling, modify the condition to treat a missing experiment_id as a
distinct failure case. Change the condition from checking "not experiment_id and
not slurm_job_id" to also fail when experiment_id is missing (regardless of
whether slurm_job_id exists), and update the failure response diagnostic message
to clearly indicate that experiment_id was not found or parsed from the launch
output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e8aa525-4735-4ed1-b0bc-8d5aebf3abf8

📥 Commits

Reviewing files that changed from the base of the PR and between 090b1c5 and 69043a6.

📒 Files selected for processing (7)
  • tools/launcher/common/smoke/nvidia_smi.sh
  • tools/launcher/core.py
  • tools/launcher/examples/smoke/nvidia_smi.yaml
  • tools/launcher/launch.py
  • tools/launcher/tests/test_core.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/tests/test_bridge.py

Comment thread tools/mcp/modelopt_mcp/bridge.py
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.45%. Comparing base (090b1c5) to head (aabd9cf).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1799      +/-   ##
==========================================
- Coverage   77.29%   76.45%   -0.84%     
==========================================
  Files         511      511              
  Lines       56513    58554    +2041     
==========================================
+ Hits        43681    44768    +1087     
- Misses      12832    13786     +954     
Flag Coverage Δ
regression 14.73% <ø> (+0.06%) ⬆️
unit 54.66% <ø> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch 2 times, most recently from 98b4113 to 350fdab Compare June 23, 2026 01:19
@ChenhanYu ChenhanYu requested a review from kevalmorabia97 June 23, 2026 13:53
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
if m:
slurm_job_id = m.group(1)

if not experiment_id and not slurm_job_id:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] This only fails when both ids are missing, so a Slurm submit that prints Submitted batch job 123 / Job id: 123 but whose experiment-id line drifts out of all four regexes returns ok=True with experiment_id=None.

(1) What: The submit_job MCP contract tells callers to poll job_status(experiment_id), and job_status_impl / _resolve_experiment_dir are keyed entirely on experiment_id. A success response with experiment_id=None is unmonitorable — the caller believes work was submitted but can never check its status or fetch logs.

(2) Why it matters: This is exactly the "format may shift across versions" scenario the surrounding comment acknowledges. On any future nemo_run output drift that breaks the experiment-id patterns while leaving the Slurm Job id: line intact, every submit silently becomes a dead-end success.

(3) Fix: Treat a missing experiment_id as the unparsed/partial failure (fail when not experiment_id, surfacing slurm_job_id in the payload for manual scancel), or add a status path keyed by slurm_job_id. This corroborates CodeRabbit's same finding on this line.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review — scope: all 7 changed files reviewed (tools/launcher/core.py, launch.py, examples, common, tests; tools/mcp/modelopt_mcp/bridge.py; tools/mcp/tests/test_bridge.py). This PR is entirely launcher/MCP tooling — no modelopt/ algorithm, mode-registration, config-schema, or export code is touched, so the Algorithm/ModeState/Export categories do not apply.

Findings: CRITICAL: 0 | IMPORTANT: 1 | SUGGESTION: 0

Most impactful finding:

  • [IMPORTANT Compatibility] bridge.py:712 — submit_job_impl returns ok=True with experiment_id=None whenever the four experiment-id regexes all miss but a Slurm "Job id:"/"Submitted batch job" line is present. Since job_status/job_logs are keyed entirely on experiment_id, that response is an unmonitorable success — the caller can never poll status or fetch logs. Recommend failing on missing experiment_id (surfacing slurm_job_id for manual cleanup) or adding a slurm_job_id-keyed status path. (CodeRabbit independently flagged the same line.)

Notes on things that checked out:

  • core.py patching of SandboxTask0–SandboxTask4: each is its own @DataClass, so per-class dataclass_fields/annotations/init.annotations patching is correct — no shared-mutable aliasing. Test coverage extended appropriately.
  • launch.py packaging rewrite (relative globs to on-disk-conditional absolute paths): the always-added examples/common paths flow through the exact same _add_package_path mechanism the author end-to-end smoke job exercised (common/smoke/nvidia_smi.sh packaged + ran), giving good evidence the PatternPackager semantics hold.
  • Fatal-error-on-exit-0 detection (_launcher_reported_error) and the dry-run mirror are sound and well-tested.
  • The path-traversal/glob-injection concern on experiment_id is already raised by CodeRabbit; not duplicating it here.

Overall risk: LOW. Tooling-only change, well-tested, with one should-fix contract gap on the submit-to-poll path.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu force-pushed the chenhany/fix-modelopt-mcp-slurm-submit branch from 350fdab to bfcf6fb Compare June 23, 2026 14:25

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tools/mcp/modelopt_mcp/bridge.py (2)

360-368: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Terminate SSH option parsing before the user-supplied target.

argv list form avoids shell injection, but it does not stop ssh from treating a cluster_host that starts with - as another option; options like -oProxyCommand=... can execute a local command. Add -- before the target and reject whitespace/leading-option targets at the MCP boundary. As per coding guidelines, "Validate external input once at the interface boundary."

🛡️ Proposed hardening
     if identity:
         argv += ["-i", identity]
     target = f"{cluster_user}@{cluster_host}" if cluster_user else cluster_host
-    argv += [target, "whoami && hostname"]
+    if target.startswith("-") or any(ch.isspace() for ch in target):
+        return {
+            "ok": False,
+            "executor": "slurm",
+            "cluster_host": cluster_host,
+            "cluster_user": cluster_user,
+            "reason": "invalid_ssh_target",
+            "diagnostic": "cluster_host/cluster_user must form a single non-option SSH target.",
+        }
+    argv += ["--", target, "whoami && hostname"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 360 - 368, The SSH command
construction does not properly terminate option parsing before the user-supplied
target, which allows targets starting with `-` to be interpreted as SSH options
(like `-oProxyCommand=...`) enabling command injection. Add `--` to the argv
list immediately before the target variable to signal the end of SSH options,
and add input validation to reject cluster_host and cluster_user values that
contain whitespace or start with `-` at the point where these values are
initially received from external input, before they are used in the
subprocess.run call.

Source: Coding guidelines


698-757: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the parsed experiment_id before returning submit success.

The new Experiment.from_id(...) / Experiment Status for ... patterns accept arbitrary non-space strings, but job_status_impl and job_logs_impl only accept _SAFE_EXPERIMENT_ID_RE. Without validating here, submit can return ok=True with an ID that callers cannot poll. As per coding guidelines, "Validate external input once at the interface boundary."

🧩 Proposed validation
     else:
         m = re.search(r"Job id:\s*(\d+)", stdout_tail, re.IGNORECASE)
         if m:
             slurm_job_id = m.group(1)
 
+    if experiment_id:
+        invalid = _validate_experiment_id(experiment_id)
+        if invalid:
+            return {
+                **invalid,
+                "executor": "slurm",
+                "exit_code": 0,
+                "stdout_tail": stdout_tail,
+                "stderr_tail": stderr_tail,
+                "argv": argv,
+            }
+
     if not experiment_id:
         return {
             "ok": False,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 698 - 757, The code parses
experiment_id from various regex patterns but does not validate that the parsed
ID conforms to the safety requirements before returning success. Add validation
to check if the parsed experiment_id matches the _SAFE_EXPERIMENT_ID_RE pattern
before accepting it as valid. If the experiment_id exists but does not match the
safe pattern, treat it as a failed parse by returning the same failure response
structure that is currently returned when experiment_id is empty or not found.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/core.py`:
- Around line 399-409: Remove all inline `# nosec` directives from the _git_info
function, including the `# nosec B404` comment on the subprocess import
statement and the `# nosec B603 B607` comments on both subprocess.run calls.
Either refactor the implementation to legitimately pass Bandit security checks
without these suppressions, or if the current subprocess approach is necessary,
route the security exception through the approved security-exception process in
PR metadata instead of using inline code comments that violate repo policy.

In `@tools/mcp/modelopt_mcp/bridge.py`:
- Line 38: Remove all `# nosec B404` inline suppression comments from the
subprocess import statement and all other locations where they appear (lines 38,
239, 293, 368, 614, 642, 855, 1280, 1459, 1485 as noted in the comment).
According to repository policy, inline Bandit bypasses are not permitted; either
remove the suppressions entirely so Bandit passes without them, or work with the
code owner to formally document and route a security exception through the
proper approval process before merge.
- Around line 936-951: The function `_resolve_experiment_dir()` now searches
multiple root directories including the launcher package's experiments directory
(via launcher_dir), but the error diagnostic message shown to users when an
experiment is not found has not been updated to match. Locate the failure
diagnostic in the error handling code that follows the root search loop (around
lines 973-984) and update the error message to include all the roots being
searched: NEMORUN_HOME/experiments, current working directory experiments,
current working directory local_experiments, and the launcher package's
experiments directory. This ensures users see all the locations that were
checked rather than only the older roots.

---

Outside diff comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 360-368: The SSH command construction does not properly terminate
option parsing before the user-supplied target, which allows targets starting
with `-` to be interpreted as SSH options (like `-oProxyCommand=...`) enabling
command injection. Add `--` to the argv list immediately before the target
variable to signal the end of SSH options, and add input validation to reject
cluster_host and cluster_user values that contain whitespace or start with `-`
at the point where these values are initially received from external input,
before they are used in the subprocess.run call.
- Around line 698-757: The code parses experiment_id from various regex patterns
but does not validate that the parsed ID conforms to the safety requirements
before returning success. Add validation to check if the parsed experiment_id
matches the _SAFE_EXPERIMENT_ID_RE pattern before accepting it as valid. If the
experiment_id exists but does not match the safe pattern, treat it as a failed
parse by returning the same failure response structure that is currently
returned when experiment_id is empty or not found.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 00f70104-4e10-4552-b650-a8cc0c1948f1

📥 Commits

Reviewing files that changed from the base of the PR and between 350fdab and bfcf6fb.

📒 Files selected for processing (7)
  • tools/launcher/common/smoke/nvidia_smi.sh
  • tools/launcher/core.py
  • tools/launcher/examples/smoke/nvidia_smi.yaml
  • tools/launcher/launch.py
  • tools/launcher/tests/test_core.py
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/tests/test_bridge.py
✅ Files skipped from review due to trivial changes (1)
  • tools/launcher/examples/smoke/nvidia_smi.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • tools/launcher/common/smoke/nvidia_smi.sh
  • tools/launcher/launch.py
  • tools/launcher/tests/test_core.py

Comment thread tools/launcher/core.py Outdated
import os
import re
import subprocess # nosec B404
import subprocess # nosec B404 - fixed-argv CLI probes are required; shell=True is not used.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Remove the Bandit # nosec suppressions or route a formal exception.

These changed call sites still suppress Bandit with # nosec; repo policy disallows inline bypasses even for fixed-argv subprocess probes. Remove the suppressions and make Bandit pass, or document the required security exception with setup-codeowner approval before merge. As per coding guidelines, "# nosec comments are not allowed as a bypass for security checks."

Also applies to: 239-239, 293-293, 368-368, 614-614, 642-642, 855-855, 1280-1280, 1459-1459, 1485-1485

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` at line 38, Remove all `# nosec B404`
inline suppression comments from the subprocess import statement and all other
locations where they appear (lines 38, 239, 293, 368, 614, 642, 855, 1280, 1459,
1485 as noted in the comment). According to repository policy, inline Bandit
bypasses are not permitted; either remove the suppressions entirely so Bandit
passes without them, or work with the code owner to formally document and route
a security exception through the proper approval process before merge.

Source: Coding guidelines

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu enabled auto-merge (squash) June 23, 2026 17:03
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

CI is green on b848c69 and auto-merge is enabled with squash. Direct merge is currently blocked only by required code-owner review from NVIDIA/modelopt-devs; once approved, GitHub should merge the PR automatically.

Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
Signed-off-by: Chenhan D. Yu <chenhany@nvidia.com>
@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

Added managed source checkout support in 7c57d35/aabd9cf: submit_job now accepts source_ref/source_repo, defaults to Model-Optimizer main, materializes a cached recursive-submodule checkout, and runs uv run --project <source_root>/tools/launcher modelopt-launcher .... CI is green on aabd9cf (24 success, 2 skipped); auto-merge remains enabled and is waiting on code-owner review.

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/ok to test aabd9cf

@ChenhanYu ChenhanYu merged commit 37dbbda into main Jun 23, 2026
43 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/fix-modelopt-mcp-slurm-submit branch June 23, 2026 19:36
@github-actions

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-23 19:36 UTC

ChenhanYu added a commit that referenced this pull request Jun 24, 2026
## Summary
- make Slurm ModelOpt source overlay mounts conditional on
source-checkout mode
- force managed-source MCP launches to reinstall `modelopt-launcher`
from the selected checkout so source refs do not reuse a stale cached
package
- add regression coverage for installed-mode Slurm mounts and
managed-source launcher argv construction

## Root cause
PR #1799 correctly stopped packaging `modules/Model-Optimizer/*` when
`modelopt-launcher` runs from an installed package. However,
`build_slurm_executor()` still unconditionally added container mounts
for `code/modules/Model-Optimizer/modelopt` and `modelopt_recipes`. In
installed MCP mode those paths do not exist in the remote package, so
the container runtime fails before the job script starts.

A second issue appeared during validation: the MCP managed-source path
can materialize the right git checkout but still execute a cached
`modelopt-launcher` package with the same version. Adding `uv run
--reinstall-package modelopt-launcher` ensures the selected source ref
is what actually runs.

## Validation
- `uv run pytest tests/test_bridge.py -q` from `tools/mcp`: 51 passed
- `uv run pytest tests/test_slurm_executor.py tests/test_core.py -q`
from `tools/launcher`: 24 passed
- `pre-commit run --files tools/launcher/core.py
tools/launcher/tests/test_slurm_executor.py
tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.py`: passed
- Live Slurm GPU smoke validated with patched launcher path;
`nvidia-smi` ran successfully and the smoke script completed.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Refactor**
* Restructured container mount assembly for Slurm job execution to
conditionally mount ModelOpt directories based on optional source path
parameter.
* Enhanced launcher command-line generation with package management
improvements.
* Replaced unconditional mount paths with conditional behavior for more
flexible resource utilization.

* **Tests**
* Expanded container mount scenario test coverage for installed and
source execution modes.
* Tightened test assertions for comprehensive mount behavior
verification.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants