Guard standalone termination against zero job id by shreyaskommuri · Pull Request #931 · NVIDIA/cloudai

shreyaskommuri · 2026-06-16T05:09:32Z

Summary

Guard standalone job termination so job.id == 0 does not shell out to kill -9 0.
Log a warning and skip termination when the standalone job ID is the dry-run sentinel rather than a launched process.
Add regression coverage for integer and string zero job IDs.

Context

Fixes #929.

While using CloudAI as the execution layer for CloudAI Autotune, a wrapper smoke test hit the common standalone sleep dry-run path. CloudAI scheduled dependency termination for job 0, which produced kill -9 0. On POSIX shells that targets the current process group, so the parent wrapper was killed and exited with code 137 before it could report a normal result.

In standalone dry-run mode, StandaloneRunner._submit_test() uses job_id = 0 because no subprocess is launched. That sentinel should not be passed to the shell as a real kill target.

Test Plan

python -m pytest tests/systems/standalone/test_system.py -q
- 5 passed
python -m pytest -q
- 1574 passed, 4 skipped, 463 deselected
uv run pre-commit run --all-files
- all hooks passed

I also attempted the live standalone sleep dry-run on a fresh origin/main branch, but that path is currently blocked earlier by the already-filed abstract-method issue #920 / PR #921. This PR keeps the scope to the zero-job-id termination fix from #929.

coderabbitai · 2026-06-16T05:09:40Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 55539c7c-994d-4d95-9bd3-9f7b16cbfedd

📥 Commits

Reviewing files that changed from the base of the PR and between 7130435 and 030ac94.

📒 Files selected for processing (2)

src/cloudai/systems/standalone/standalone_system.py
tests/systems/standalone/test_system.py

📝 Walkthrough

Walkthrough

In StandaloneSystem.kill, job.id is now parsed as an integer after trimming, and termination is skipped with a warning log if parsing fails or if the PID is <= 0 (including zero, zero-padded variants, and negative values). Valid numeric PIDs are executed with kill -9 using the normalized integer value. Comprehensive test coverage validates rejection of invalid inputs and correct normalization of leading-zero PIDs. The copyright year is updated to 2026.

Changes

PID Validation in Kill Command

Layer / File(s)	Summary
PID parsing and validation in `StandaloneSystem.kill` `src/cloudai/systems/standalone/standalone_system.py`	`StandaloneSystem.kill` now parses `job.id.strip()` to an integer before executing `kill -9`. If parsing fails or the resulting PID is `<= 0`, the method logs a warning and returns early; otherwise it executes `kill -9` with the normalized numeric PID. This prevents accidental termination of process group zero and handles invalid PID inputs safely. Copyright header updated to 2024–2026.
Invalid PID rejection and normalization tests `tests/systems/standalone/test_system.py`	`test_kill_job_skips_invalid_pid` parametrizes multiple invalid job ID inputs (including `"0"`, `"00"`, `"+0"`, `"-0"`, negative numbers, and non-numeric strings) and verifies `CommandShell.execute` is never called. `test_kill_job_normalizes_numeric_pid` confirms that valid numeric PIDs with leading zeros are normalized and executed correctly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

Standalone sleep dry-run can kill parent wrapper when job id is 0 #929: This PR directly addresses the issue by implementing comprehensive PID validation in StandaloneSystem.kill() to prevent kill -9 0 from terminating the parent process group, extending the fix to handle all invalid PID inputs and normalizing numeric strings.

Poem

🐇 A PID once danced without care,
Zero-padded, slipping through air.
Now integer parsing stands guard,
Validates each ID, never hard—
Safe termination, mission starred! ⭐

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly describes the main change: adding a guard to prevent standalone job termination when job ID is zero.
Description check	✅ Passed	The description is clearly related to the changeset, providing context about the bug, the fix, and test coverage for the zero job ID termination issue.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

shreyaskommuri · 2026-06-16T05:12:29Z

@coderabbitai review

coderabbitai · 2026-06-16T05:12:35Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/systems/standalone/standalone_system.py`:
- Around line 81-90: The job ID validation using string-equality check `== "0"`
only blocks the exact string and allows variants like "00", "+0", and "-0" to
pass through, which then parse as integer 0 and signal the entire process group
instead of a specific process. Replace the string comparison with explicit
integer parsing and validation: convert job.id to an integer, validate it is a
positive number (greater than 0), and only proceed with the kill command if that
validation passes. This ensures that all zero variants and invalid PID values
are rejected before constructing the cmd string and calling cmd_shell.execute.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f5b3c7a8-35f8-4530-bdda-63d1d3531de9

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0e8cc and 7130435.

📒 Files selected for processing (2)

src/cloudai/systems/standalone/standalone_system.py
tests/systems/standalone/test_system.py

Signed-off-by: shreyaskommuri <shreyaskommuri@gmail.com>

shreyaskommuri · 2026-06-16T06:29:53Z

Addressed CodeRabbit feedback by parsing standalone job IDs as integers before constructing the kill command. The guard now skips non-numeric IDs and all non-positive PID variants such as 00, +0, -0, and -1, and uses the normalized positive PID for valid kills. Added regression coverage for those cases plus numeric normalization. Re-ran python -m pytest tests/systems/standalone/test_system.py -q, full python -m pytest -q, and uv run pre-commit run --all-files successfully.

shreyaskommuri force-pushed the issue/standalone-kill-zero branch from 68ca9fb to 7130435 Compare June 16, 2026 05:12

shreyaskommuri marked this pull request as ready for review June 16, 2026 05:12

shreyaskommuri requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners June 16, 2026 05:12

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread src/cloudai/systems/standalone/standalone_system.py Outdated

Guard standalone termination against zero job id

030ac94

Signed-off-by: shreyaskommuri <shreyaskommuri@gmail.com>

shreyaskommuri force-pushed the issue/standalone-kill-zero branch from 7130435 to 030ac94 Compare June 16, 2026 06:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guard standalone termination against zero job id#931

Guard standalone termination against zero job id#931
shreyaskommuri wants to merge 1 commit into
NVIDIA:mainfrom
shreyaskommuri:issue/standalone-kill-zero

shreyaskommuri commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Poem

Uh oh!

shreyaskommuri commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

shreyaskommuri commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shreyaskommuri commented Jun 16, 2026

Summary

Context

Test Plan

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Poem

Uh oh!

shreyaskommuri commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shreyaskommuri commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading