feat(OnlineRepair): Add RequestOnlineRepair CLI template; Block instance release when aggregate health includes PreventDeletion #1184
Introducing new HealthOverride Classification - PreventDeletion
@coderabbitai review
✅ Actions performed: Review triggered.
I'm testing coderabbit out on this PR @sunilkumar-nvidia :)
Walkthrough
A new "request-online-repair" health report template and a "PreventDeletion" alert classification are introduced, along with release-handler logic that blocks instance release when the prevent-deletion classification appears in aggregate health.
Sequence Diagram
sequenceDiagram
participant Client
participant APIHandler as Release Handler
participant DB as Database/Health
Client->>APIHandler: release_instance()
APIHandler->>APIHandler: Check if marked for deletion
alt Already deleted
APIHandler->>Client: Return success
else Not deleted
APIHandler->>DB: Load managed-host snapshot<br/>with aggregate_health
DB->>APIHandler: Return health status
APIHandler->>APIHandler: Check for prevent_deletion<br/>in classifications
alt prevent_deletion present
APIHandler->>Client: Return ConfigValidationError<br/>(InstanceReleaseBlockedByPreventDeletion)
else prevent_deletion absent
APIHandler->>APIHandler: Proceed with release
APIHandler->>Client: Return success
end
end
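The gate shown in the diagram above can be sketched in Rust roughly as follows. This is a minimal illustration, not the code from this PR: `AggregateHealth`, the string form of the classification, and the shape of `ConfigValidationError` are hypothetical stand-ins; only the names `has_classification`, `HealthAlertClassification::prevent_deletion()`, and `InstanceReleaseBlockedByPreventDeletion` come from the conversation.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the classification type named in the PR.
/// The inner string value is an assumption made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct HealthAlertClassification(String);

impl HealthAlertClassification {
    pub fn prevent_deletion() -> Self {
        Self("PreventDeletion".to_string())
    }
}

/// Hypothetical aggregate health: the set of classifications
/// contributed by all merged health reports and overrides.
#[derive(Debug, Default)]
pub struct AggregateHealth {
    classifications: HashSet<HealthAlertClassification>,
}

impl AggregateHealth {
    /// Builder-style helper used only by this sketch.
    pub fn with(mut self, c: HealthAlertClassification) -> Self {
        self.classifications.insert(c);
        self
    }

    pub fn has_classification(&self, c: &HealthAlertClassification) -> bool {
        self.classifications.contains(c)
    }
}

/// Hypothetical error type; only the variant name is taken from the diagram.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigValidationError {
    InstanceReleaseBlockedByPreventDeletion,
}

/// Mirrors the three branches of the diagram:
/// already deleted -> success, prevent_deletion present -> error,
/// otherwise -> proceed with release.
pub fn release_instance(
    already_marked_deleted: bool,
    health: &AggregateHealth,
) -> Result<(), ConfigValidationError> {
    if already_marked_deleted {
        return Ok(());
    }
    if health.has_classification(&HealthAlertClassification::prevent_deletion()) {
        return Err(ConfigValidationError::InstanceReleaseBlockedByPreventDeletion);
    }
    // The real handler would continue with the release here.
    Ok(())
}
```

Note that in this sketch, as in the diagram, the "already deleted" check runs before the health check, so repeating a release on an instance already marked for deletion still succeeds even if the classification is present.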
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/api/src/tests/instance.rs`:
- Around line 4895-4923: The calls to
common::api_fixtures::send_health_report_override and
common::api_fixtures::remove_health_report_override reference a non-existent
module path (causing E0425); fix by either importing/using the correct fixtures
module that actually provides these helpers (e.g., change
common::api_fixtures::... to the external fixtures crate path that exports
send_health_report_override/remove_health_report_override or add pub functions
with those names to crate::tests::common), and update the call sites to use the
correct module (or add a use statement) so send_health_report_override and
remove_health_report_override resolve.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 940bea96-3d86-444a-a05e-55d4b33a0b57
📒 Files selected for processing (7)
crates/admin-cli/src/machine/health_report/args.rs
crates/admin-cli/src/machine/health_report/cmd.rs
crates/admin-cli/src/machine/tests.rs
crates/api-model/src/lib.rs
crates/api/src/handlers/instance.rs
crates/api/src/tests/instance.rs
crates/health-report/src/lib.rs
🌿 Preview your docs: https://nvidia-preview-pull-request-1184.docs.buildwithfern.com/infra-controller
williampnvidia left a comment
Approved with feedback.
];
}
// Same shape as TenantReportedIssue; distinct merge source and probe id for online repair.
// Includes `PreventDeletion` so `ReleaseInstance` is blocked while this merge is present (not admin force-delete).
If this is really stopping instance release, maybe "prevent deletion" isn't the best name. If "deletion" is needed, maybe qualify it with "instance", i.e. PreventInstanceDeletion.
I'd prefer something like "PreventInstanceRelease".
To align with the existing PreventAllocation classification, which prevents a machine from being allocated to any tenant as an instance, I'm proposing PreventDeletion. This would similarly prevent an instance or machine from being removed or deleted.
Let me know if I'm off track here.
My comment is only about the name PreventDeletion. In core, we don't "delete" instances; we "release" them. To add to the confusion, you can "delete" a machine, which is unrelated to this change.
if snapshot
    .aggregate_health
    .has_classification(&HealthAlertClassification::prevent_deletion())
According to the design, when a tenant marks OnlineRepair=Failed in the Instance label, we should allow the Instance to be deleted. Are we still considering that path?
Yes, we should allow the instance to be deleted when it is marked with the label OnlineRepair=Failed. However, this happens in two steps:
- Call API - ExitOnlineRepair - This API removes the HealthOverride (RequestOnlineRepair) applied on the instance.
- Call API - ReleaseInstance - This releases/deletes the instance from the tenant for offline repair.
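The two steps above can be illustrated with a small Rust sketch. Everything here is hypothetical and exists only to show the ordering: the `Instance` struct, its method names, and the string-based override set are assumptions for this example, not the actual ExitOnlineRepair or ReleaseInstance APIs.

```rust
use std::collections::HashSet;

/// Name of the health override discussed in this thread.
const REQUEST_ONLINE_REPAIR: &str = "RequestOnlineRepair";

/// Hypothetical in-memory model of an instance and its health overrides.
#[derive(Debug, Default)]
pub struct Instance {
    overrides: HashSet<String>,
    released: bool,
}

impl Instance {
    /// Applies the online-repair override, which blocks release.
    pub fn request_online_repair(&mut self) {
        self.overrides.insert(REQUEST_ONLINE_REPAIR.to_string());
    }

    /// Step 1: stands in for ExitOnlineRepair, which removes the override.
    pub fn exit_online_repair(&mut self) {
        self.overrides.remove(REQUEST_ONLINE_REPAIR);
    }

    /// Step 2: stands in for ReleaseInstance, which is refused
    /// while the override is still present.
    pub fn release(&mut self) -> Result<(), String> {
        if self.overrides.contains(REQUEST_ONLINE_REPAIR) {
            return Err("release blocked: RequestOnlineRepair override still present".to_string());
        }
        self.released = true;
        Ok(())
    }

    pub fn is_released(&self) -> bool {
        self.released
    }
}
```

In this model, calling `release` before `exit_online_repair` returns an error and leaves the instance unreleased; calling them in the order listed above succeeds.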
…ntDeletion Signed-off-by: Sunil Kumar <sunilkumar@nvidia.com>
92b9535 to 269bae5
Description
This work adds online repair support for instances from the core: operators can mark a machine for handoff to online repair, block normal instance release while that state is present, then clear it when repair is done.
Typical flow
Code changes
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes
Summary by CodeRabbit
New Features
Tests