Skip to content

Guard standalone termination against zero job id#931

Open
shreyaskommuri wants to merge 1 commit into
NVIDIA:mainfrom
shreyaskommuri:issue/standalone-kill-zero
Open

Guard standalone termination against zero job id#931
shreyaskommuri wants to merge 1 commit into
NVIDIA:mainfrom
shreyaskommuri:issue/standalone-kill-zero

Conversation

@shreyaskommuri

Copy link
Copy Markdown
Contributor

Summary

  • Guard standalone job termination so job.id == 0 does not shell out to kill -9 0.
  • Log a warning and skip termination when the standalone job ID is the dry-run sentinel rather than a launched process.
  • Add regression coverage for integer and string zero job IDs.

Context

Fixes #929.

While using CloudAI as the execution layer for CloudAI Autotune, a wrapper smoke test hit the common standalone sleep dry-run path. CloudAI scheduled dependency termination for job 0, which produced kill -9 0. On POSIX shells that targets the current process group, so the parent wrapper was killed and exited with code 137 before it could report a normal result.

In standalone dry-run mode, StandaloneRunner._submit_test() uses job_id = 0 because no subprocess is launched. That sentinel should not be passed to the shell as a real kill target.

Test Plan

  • python -m pytest tests/systems/standalone/test_system.py -q
    • 5 passed
  • python -m pytest -q
    • 1574 passed, 4 skipped, 463 deselected
  • uv run pre-commit run --all-files
    • all hooks passed

I also attempted the live standalone sleep dry-run on a fresh origin/main branch, but that path is currently blocked earlier by the already-filed abstract-method issue #920 / PR #921. This PR keeps the scope to the zero-job-id termination fix from #929.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 55539c7c-994d-4d95-9bd3-9f7b16cbfedd

📥 Commits

Reviewing files that changed from the base of the PR and between 7130435 and 030ac94.

📒 Files selected for processing (2)
  • src/cloudai/systems/standalone/standalone_system.py
  • tests/systems/standalone/test_system.py

📝 Walkthrough

Walkthrough

In StandaloneSystem.kill, job.id is now parsed as an integer after trimming, and termination is skipped with a warning log if parsing fails or if the PID is <= 0 (including zero, zero-padded variants, and negative values). Valid numeric PIDs are executed with kill -9 using the normalized integer value. Comprehensive test coverage validates rejection of invalid inputs and correct normalization of leading-zero PIDs. The copyright year is updated to 2026.

Changes

PID Validation in Kill Command

Layer / File(s) Summary
PID parsing and validation in StandaloneSystem.kill
src/cloudai/systems/standalone/standalone_system.py
StandaloneSystem.kill now parses job.id.strip() to an integer before executing kill -9. If parsing fails or the resulting PID is <= 0, the method logs a warning and returns early; otherwise it executes kill -9 with the normalized numeric PID. This prevents accidental termination of process group zero and handles invalid PID inputs safely. Copyright header updated to 2024–2026.
Invalid PID rejection and normalization tests
tests/systems/standalone/test_system.py
test_kill_job_skips_invalid_pid parametrizes multiple invalid job ID inputs (including "0", "00", "+0", "-0", negative numbers, and non-numeric strings) and verifies CommandShell.execute is never called. test_kill_job_normalizes_numeric_pid confirms that valid numeric PIDs with leading zeros are normalized and executed correctly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

Poem

🐇 A PID once danced without care,
Zero-padded, slipping through air.
Now integer parsing stands guard,
Validates each ID, never hard—
Safe termination, mission starred! ⭐

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main change: adding a guard to prevent standalone job termination when job ID is zero.
Description check ✅ Passed The description is clearly related to the changeset, providing context about the bug, the fix, and test coverage for the zero job ID termination issue.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@shreyaskommuri shreyaskommuri force-pushed the issue/standalone-kill-zero branch from 68ca9fb to 7130435 Compare June 16, 2026 05:12
@shreyaskommuri shreyaskommuri marked this pull request as ready for review June 16, 2026 05:12
@shreyaskommuri

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/systems/standalone/standalone_system.py`:
- Around line 81-90: The job ID validation using string-equality check `== "0"`
only blocks the exact string and allows variants like "00", "+0", and "-0" to
pass through, which then parse as integer 0 and signal the entire process group
instead of a specific process. Replace the string comparison with explicit
integer parsing and validation: convert job.id to an integer, validate it is a
positive number (greater than 0), and only proceed with the kill command if that
validation passes. This ensures that all zero variants and invalid PID values
are rejected before constructing the cmd string and calling cmd_shell.execute.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f5b3c7a8-35f8-4530-bdda-63d1d3531de9

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0e8cc and 7130435.

📒 Files selected for processing (2)
  • src/cloudai/systems/standalone/standalone_system.py
  • tests/systems/standalone/test_system.py

Comment thread src/cloudai/systems/standalone/standalone_system.py Outdated
Signed-off-by: shreyaskommuri <shreyaskommuri@gmail.com>
@shreyaskommuri shreyaskommuri force-pushed the issue/standalone-kill-zero branch from 7130435 to 030ac94 Compare June 16, 2026 06:29
@shreyaskommuri

Copy link
Copy Markdown
Contributor Author

Addressed CodeRabbit feedback by parsing standalone job IDs as integers before constructing the kill command. The guard now skips non-numeric IDs and all non-positive PID variants such as 00, +0, -0, and -1, and uses the normalized positive PID for valid kills. Added regression coverage for those cases plus numeric normalization. Re-ran python -m pytest tests/systems/standalone/test_system.py -q, full python -m pytest -q, and uv run pre-commit run --all-files successfully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Standalone sleep dry-run can kill parent wrapper when job id is 0

1 participant