Skip to content

feat(OnlineRepair): Add RequestOnlineRepair CLI template; Block instance release when aggregate health includes PreventDeletion#1184

Open
sunilkumar-nvidia wants to merge 11 commits into
NVIDIA:mainfrom
sunilkumar-nvidia:onlinerepair
Open

feat(OnlineRepair): Add RequestOnlineRepair CLI template; Block instance release when aggregate health includes PreventDeletion#1184
sunilkumar-nvidia wants to merge 11 commits into
NVIDIA:mainfrom
sunilkumar-nvidia:onlinerepair

Conversation

@sunilkumar-nvidia
Copy link
Copy Markdown
Contributor

@sunilkumar-nvidia sunilkumar-nvidia commented Apr 28, 2026

Description

This work adds online repair support for instances from the core: operators can mark a machine for handoff to online repair, block normal instance release while that state is present, then clear it when repair is done.

Typical flow

  • Use admin-cli to insert a health report from the request-online-repair template (merge source request-online-repair).
  • Run manual online repair on the instance (outside these changes).
  • When the machine is healthy again, remove that health report entry (same source) so release and health state can proceed.

Code changes

  • Introduce HealthAlertClassification::prevent_deletion(). ReleaseInstance is rejected when aggregate host health includes that classification, with InstanceReleaseBlockedByPreventDeletion. Admin force-delete is not subject to this check.
  • Add the request-online-repair template in admin-cli (classifications include prevent allocations, suppressed external alerting, and prevent deletion), with unit tests.
  • Add an API integration test that merges a health report carrying PreventDeletion, expects release to fail, then expects release to succeed after the entry is removed.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Summary by CodeRabbit

  • New Features

    • Added "request-online-repair" health report template option to the admin CLI.
    • Added new "PreventDeletion" alert classification that prevents instance release operations when active.
  • Tests

    • Added integration test verifying instance release is blocked when prevent_deletion health alerts are present.

@sunilkumar-nvidia sunilkumar-nvidia requested a review from a team as a code owner April 28, 2026 02:22
@sunilkumar-nvidia sunilkumar-nvidia self-assigned this Apr 28, 2026
@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Apr 28, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Apr 28, 2026

I'm testing coderabbit out on this PR @sunilkumar-nvidia :)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

Walkthrough

A new "request-online-repair" health report template and "PreventDeletion" alert classification are introduced, with release-handler logic that blocks instance release operations when the prevent-deletion classification appears in aggregate health.

Changes

Cohort / File(s) Summary
Health Report Template Addition
crates/admin-cli/src/machine/health_report/args.rs, crates/admin-cli/src/machine/health_report/cmd.rs, crates/admin-cli/src/machine/tests.rs
New RequestOnlineRepair template variant added to CLI enum, implemented in get_health_report with dedicated alert source/ID/target and prevent_deletion classification, verified via parsing test.
Alert Classification
crates/health-report/src/lib.rs
New prevent_deletion() constructor added to HealthAlertClassification producing the "PreventDeletion" classification string value, with unit test.
Instance Release Blocking
crates/api-model/src/lib.rs, crates/api/src/handlers/instance.rs
New ConfigValidationError::InstanceReleaseBlockedByPreventDeletion variant introduced; release handler enhanced with guard checking aggregate health for prevent_deletion classification, rejecting release if found.
Release Blocking Integration Test
crates/api/src/tests/instance.rs
New SQLx-backed integration test verifying instance release is rejected when health override contains prevent_deletion, and succeeds after override removal.

Sequence Diagram

sequenceDiagram
    participant Client
    participant APIHandler as Release Handler
    participant DB as Database/Health
    
    Client->>APIHandler: release_instance()
    APIHandler->>APIHandler: Check if marked for deletion
    alt Already deleted
        APIHandler->>Client: Return success
    else Not deleted
        APIHandler->>DB: Load managed-host snapshot<br/>with aggregate_health
        DB->>APIHandler: Return health status
        APIHandler->>APIHandler: Check for prevent_deletion<br/>in classifications
        alt prevent_deletion present
            APIHandler->>Client: Return ConfigValidationError<br/>(InstanceReleaseBlockedByPreventDeletion)
        else prevent_deletion absent
            APIHandler->>APIHandler: Proceed with release
            APIHandler->>Client: Return success
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A request for repair, a gentle care,
Prevention of deletion, beyond compare,
No hasty release when health must mend,
Online we heal, our systems defend! 🌿✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the two main changes: adding a RequestOnlineRepair CLI template and blocking instance release when PreventDeletion alert is present.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/api/src/tests/instance.rs`:
- Around line 4895-4923: The calls to
common::api_fixtures::send_health_report_override and
common::api_fixtures::remove_health_report_override reference a non-existent
module path (causing E0425); fix by either importing/using the correct fixtures
module that actually provides these helpers (e.g., change
common::api_fixtures::... to the external fixtures crate path that exports
send_health_report_override/remove_health_report_override or add pub functions
with those names to crate::tests::common), and update the call sites to use the
correct module (or add a use statement) so send_health_report_override and
remove_health_report_override resolve.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 940bea96-3d86-444a-a05e-55d4b33a0b57

📥 Commits

Reviewing files that changed from the base of the PR and between 696f99d and e4215bb.

📒 Files selected for processing (7)
  • crates/admin-cli/src/machine/health_report/args.rs
  • crates/admin-cli/src/machine/health_report/cmd.rs
  • crates/admin-cli/src/machine/tests.rs
  • crates/api-model/src/lib.rs
  • crates/api/src/handlers/instance.rs
  • crates/api/src/tests/instance.rs
  • crates/health-report/src/lib.rs

Comment thread crates/api/src/tests/instance.rs Outdated
@github-actions
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@williampnvidia williampnvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with feedback.

Comment thread crates/health-report/src/lib.rs
Comment thread crates/api/src/handlers/instance.rs Outdated
];
}
// Same shape as TenantReportedIssue; distinct merge source and probe id for online repair.
// Includes `PreventDeletion` so `ReleaseInstance` is blocked while this merge is present (not admin force-delete).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is really stopping instance release maybe "prevent deletion" isn't the best. If "deletion" is needed, maybe qualify it with "instance". I.e. PreventInstanceDeletion

I'd prefer something like "PreventInstanceRelease"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To align with the existing  PreventAllocation  classification—which prevents a machine from being allocated to any tenant as an instance—I’m proposing  PreventDeletion . This would similarly prevent an instance or machine from being removed or deleted.
Let me know if I’m off track here.

Copy link
Copy Markdown
Contributor

@wminckler wminckler May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment is only about the name PreventDeletion. In core, we don't "delete" instances, we "release" them. To add to the confusion, you can "delete" a machine, which is unrelated to this change


if snapshot
.aggregate_health
.has_classification(&HealthAlertClassification::prevent_deletion())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When tenant marked OnlineRepair=Failed in Instance label, we should allow Instance to be deleted according to design, are we still considering that path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we should allow instance to be deleted when marked with Label OnLineRepair=Failed. However, this happens in two steps:

  1. Call API - ExitOnlineRepair - This API removes the HealthOverride(RequestOnlineRepair) applied on the instance.
  2. Call API - ReleaseInstance - To release/delete the instance from the tenant for repair (offline).

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants