Skip to content

fix: No-op Instance update when requested NVLink Interfaces matches with current#438

Merged
hwadekar-nv merged 11 commits into
mainfrom
fix/instance-upodate-same-partition
May 7, 2026
Merged

fix: No-op Instance update when requested NVLink Interfaces matches with current#438
hwadekar-nv merged 11 commits into
mainfrom
fix/instance-upodate-same-partition

Conversation

@hwadekar-nv
Copy link
Copy Markdown
Contributor

@hwadekar-nv hwadekar-nv commented Apr 28, 2026

Description

  • Validate Instance NVLink interface update against existing bindings
  • No operation if instance NVLink interface update is identical to existing
  • Treat it the same way as existing NVLink interfaces, but ensure it’s not in the Deleting state.
  • If existing interfaces are pending for more than 90 seconds, we consider them to be different.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hwadekar-nv hwadekar-nv self-assigned this Apr 28, 2026
@hwadekar-nv hwadekar-nv requested a review from a team as a code owner April 28, 2026 22:17
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an order-independent multiset comparison for NVLink interfaces in UpdateInstanceHandler.Handle. If incoming NVLinkInterfaces exactly match an existing status bucket (Pending, Ready, or Deleting) by multiset of (NVLinkLogicalPartitionID, DeviceInstance), the handler treats the NVLink update as a no-op and skips partition fetches, DB writes, deletion marking, and re-appending unchanged NVLink interfaces.

Changes

Instance NVLink fast-path

Layer / File(s) Summary
Data Shape / Read
api/pkg/api/handler/instance.go
Fetch existing NVLink interfaces and group/count them as a multiset keyed by (NVLinkLogicalPartitionID, DeviceInstance), partitioned by status (Pending, Ready, Deleting).
Decision Logic
api/pkg/api/handler/instance.go
Add multiset-match detection: if incoming NVLinkInterfaces length equals exactly one status-bucket and every incoming tuple can be matched by decrementing counts, set nvLinkInterfacesUnchanged=true.
Core Behavior Change
api/pkg/api/handler/instance.go
When nvLinkInterfacesUnchanged is true, skip fetching NVLink logical partitions, skip creating new Pending NVLink DB records, skip marking existing interfaces Deleting, and do not append existing NVLink interfaces into the response. Preserve previous DB/update flow when fast-path does not match.
Tests / Assertions
api/pkg/api/handler/instance_test.go
Add test case "NVLink interfaces unchanged (no-op, multiset order)". Introduce nvLinkGpuConfigsVerifyCountOnly flag to allow count-only verification of Site Controller NVLink GPU configs when order may differ. Modify test-site-controller assertion to conditionally perform per-index tuple checks or only verify length.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Handler
    participant DB
    participant SiteController

    Client->>Handler: UpdateInstance(req with NVLinkInterfaces)
    alt NVLink multiset matches an existing status-bucket
        Handler->>Handler: Build multiset keyed by (PartitionID, DeviceInstance)
        Handler->>Handler: Match incoming tuples against multiset (decrement counts)
        Handler-->>Client: Respond (no NVLink DB fetch/update or SiteController NVLink update)
    else NVLink differs
        Handler->>DB: Fetch NVLinkLogicalPartitions
        Handler->>DB: Create new NVLink interface records (Pending)
        Handler->>DB: Mark unmatched existing NVLink interfaces Deleting
        Handler->>SiteController: Send updated GPU config / NVLink info
        SiteController-->>Handler: Ack
        Handler-->>Client: Respond with updated NVLink interfaces
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description accurately reflects the changeset, describing the NVLink interface no-op optimization and its validation logic against existing bindings.
Title check ✅ Passed The title accurately summarizes the main change: implementing a no-op optimization when requested NVLink interfaces match the current instance state, preventing unnecessary database updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/instance-upodate-same-partition

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-28 22:18:44 UTC | Commit: e13071a

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

Test Results

9 281 tests  +5   9 281 ✅ +5   7m 17s ⏱️ -6s
  154 suites ±0       0 💤 ±0 
   14 files   ±0       0 ❌ ±0 

Results for commit 1cc4217. ± Comparison against base commit 2645be3.

This pull request removes 1 and adds 6 tests. Note that renamed tests count towards both.
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_with_NVLink_interface_update_to_same_partition_(no-op_update_allowed)
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_when_Instance_has_four_NVLink_interfaces_and_update_requests_two_(nvllp1_devices_0_and_1)
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_when_Instance_has_four_NVLink_interfaces_and_update_requests_two_(nvllp2_devices_2_and_3)
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_when_NVLink_interfaces_pending_and_updated_within_90s_—_multiset_no-op_(no_DB_re-issue)
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_when_NVLink_interfaces_unchanged_(no-op,_multiset_order)
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_when_NVLink_pending_rows_older_than_90s_grace_—_re-issued_(stale_Updated_timestamps)
github.com/NVIDIA/infra-controller-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_re-issues_when_four_NVLink_rows_exist_as_Deleting_—_same_multiset_request_(deviceInstance_0–3)_creates_four_new_Pending

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/pkg/api/handler/instance.go (1)

2660-2675: ⚠️ Potential issue | 🟠 Major

Handle explicit empty nvLinkInterfaces updates as no-ops too.

Line 2660 gates this new check on len(apiRequest.NVLinkInterfaces) > 0, so an explicit [] skips the rejection path. If the instance already has no active NVLink interfaces, that request still matches the current binding set and will incorrectly proceed as a no-op update.

Suggested fix
-	if len(apiRequest.NVLinkInterfaces) > 0 {
+	if apiRequest.NVLinkInterfaces != nil {
 		nvlIfcDAO := cdbm.NewNVLinkInterfaceDAO(uih.dbSession)
 		nvlIfcs, _, err := nvlIfcDAO.GetAll(ctx, nil, cdbm.NVLinkInterfaceFilterInput{InstanceIDs: []uuid.UUID{instance.ID}}, cdbp.PageInput{}, nil)
 		if err != nil {
 			logger.Error().Err(err).Msg("error retrieving NVLink Interfaces from DB for Instance")
 			return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to retrieve NVLink Interfaces for Instance", nil)
 		}
 
 		// Discard if VPC has default NVLink Logical Partition specified and NVLink Interfaces are exists
 		if len(nvlIfcs) > 0 && vpc.NVLinkLogicalPartitionID != nil {
 			return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Cannot update NVLink Interfaces if VPC has default NVLink Logical Partition and NVLink Interfaces already exist for the Instance", nil)
 		}
 
 		if nvLinkInterfaceUpdateRequestMatchesExisting(apiRequest.NVLinkInterfaces, nvlIfcs) {
 			return cutil.NewAPIErrorResponse(c, http.StatusBadRequest, "Cannot update NVLink Interfaces because the requested NVLink Interfaces are identical to the Instance's current NVLink Interfaces", nil)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance.go` around lines 2660 - 2675, Remove the outer
len(apiRequest.NVLinkInterfaces) > 0 guard so you always fetch current NVLink
interfaces (nvlIfcDAO.GetAll -> nvlIfcs) and compare them with the request using
nvLinkInterfaceUpdateRequestMatchesExisting; also adjust the VPC
default-partition check to only block when the request is actually trying to set
interfaces (i.e. keep the vpc.NVLinkLogicalPartitionID check conditional on
len(apiRequest.NVLinkInterfaces) > 0 and len(nvlIfcs) > 0) so an explicit empty
slice ([] ) will be treated as a no-op when it matches the existing nvlIfcs and
correctly return the identical-update error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/instance_test.go`:
- Around line 5757-5784: The test currently returns early on non-200 responses
so the Site Controller workflow-call assertion (controlled by
verifySiteControllerRequest) is skipped; move or duplicate the Site Controller
verification out of the early-return path so it always runs after the handler
invocation and explicitly assert no workflow calls for this case using the Site
Controller mock (scp) API (e.g., assert scp shows zero workflow requests / no
workflows invoked) when verifySiteControllerRequest is false in the failing test
case; update the harness logic around verifySiteControllerRequest or add an
explicit check in this specific test case so the "no Site Controller workflow
call" constraint is enforced even on 400 responses.

---

Outside diff comments:
In `@api/pkg/api/handler/instance.go`:
- Around line 2660-2675: Remove the outer len(apiRequest.NVLinkInterfaces) > 0
guard so you always fetch current NVLink interfaces (nvlIfcDAO.GetAll ->
nvlIfcs) and compare them with the request using
nvLinkInterfaceUpdateRequestMatchesExisting; also adjust the VPC
default-partition check to only block when the request is actually trying to set
interfaces (i.e. keep the vpc.NVLinkLogicalPartitionID check conditional on
len(apiRequest.NVLinkInterfaces) > 0 and len(nvlIfcs) > 0) so an explicit empty
slice ([] ) will be treated as a no-op when it matches the existing nvlIfcs and
correctly return the identical-update error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4726590c-f1d0-4e39-b80d-bb8d4503a7f2

📥 Commits

Reviewing files that changed from the base of the PR and between 3616400 and e13071a.

📒 Files selected for processing (2)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go

Comment thread api/pkg/api/handler/instance_test.go Outdated
Comment on lines +5757 to +5784
name: "test Instance update API endpoint fails when NVLink interfaces are unchanged (same multiset as current)",
fields: fields{
dbSession: dbSession,
tc: tc,
scp: scp,
cfg: cfg,
},
args: args{
reqData: &model.APIInstanceUpdateRequest{
IpxeScript: os2.IpxeScript,
// Order differs from DB creation order; must still be rejected as identical multiset.
NVLinkInterfaces: []model.APINVLinkInterfaceCreateOrUpdateRequest{
{NVLinkLogicalPartitionID: nvllp2.ID.String(), DeviceInstance: 3},
{NVLinkLogicalPartitionID: nvllp1.ID.String(), DeviceInstance: 0},
{NVLinkLogicalPartitionID: nvllp2.ID.String(), DeviceInstance: 2},
{NVLinkLogicalPartitionID: nvllp1.ID.String(), DeviceInstance: 1},
},
},
reqInstance: inst13.ID.String(),
reqOrg: tnOrg1,
reqUser: tnu1,
respCode: http.StatusBadRequest,
respMessage: cdb.GetStrPtr("Cannot update NVLink Interfaces because the requested NVLink Interfaces are identical to the Instance's current NVLink Interfaces"),
},
wantErr: false,
verifySiteControllerRequest: false,
verifyChildSpanner: false,
},
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing assertion for “no Site Controller workflow call” in the new rejection scenario.

This case checks 400 + error text, but it does not verify the side-effect constraint. Because non-200 responses return early at Line 5950, workflow-call checks are skipped entirely.

Proposed test-harness enhancement
 type args struct {
   ...
+  expectNoSiteWorkflowCall bool
 }

 {
   name: "test Instance update API endpoint fails when NVLink interfaces are unchanged (same multiset as current)",
   ...
-  verifySiteControllerRequest: false,
+  expectNoSiteWorkflowCall: true,
+  verifySiteControllerRequest: false,
   verifyChildSpanner: false,
 },

 for _, tt := range tests {
   t.Run(tt.name, func(t *testing.T) {
+    var preCalls int
+    var siteMock *tmocks.Client
+    if tt.args.expectNoSiteWorkflowCall {
+      reqID, err := uuid.Parse(tt.args.reqInstance)
+      require.NoError(t, err)
+      ins, err := insDAO.GetByID(context.Background(), nil, reqID, nil)
+      require.NoError(t, err)
+      c, err := tt.fields.scp.GetClientByID(ins.SiteID)
+      require.NoError(t, err)
+      siteMock = c.(*tmocks.Client)
+      preCalls = len(siteMock.Calls)
+    }

     ...
     require.Equal(t, tt.args.respCode, rec.Code)
+    if tt.args.expectNoSiteWorkflowCall {
+      assert.Equal(t, preCalls, len(siteMock.Calls), "site controller workflow should not be invoked")
+    }
     if tt.args.respCode != http.StatusOK {
       return
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance_test.go` around lines 5757 - 5784, The test
currently returns early on non-200 responses so the Site Controller
workflow-call assertion (controlled by verifySiteControllerRequest) is skipped;
move or duplicate the Site Controller verification out of the early-return path
so it always runs after the handler invocation and explicitly assert no workflow
calls for this case using the Site Controller mock (scp) API (e.g., assert scp
shows zero workflow requests / no workflows invoked) when
verifySiteControllerRequest is false in the failing test case; update the
harness logic around verifySiteControllerRequest or add an explicit check in
this specific test case so the "no Site Controller workflow call" constraint is
enforced even on 400 responses.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hwadekar-nv Acknowledged. I’ll ignore this thread and won’t pursue the suggested harness change. If you want me to mark my earlier comment as resolved, just say the word.


🧠 Learnings used
Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.

Learnt from: pbreton
Repo: NVIDIA/ncx-infra-controller-rest PR: 293
File: api/pkg/api/handler/operatingsystem_ownership_test.go:119-139
Timestamp: 2026-03-30T23:24:33.337Z
Learning: In Go handler unit tests that mock Temporal SDK `ExecuteWorkflow` with a variadic signature (`args ...interface{}`), ensure the `testify/mock` expectation uses the exact same number of `mock.Anything` (or other matchers) as the total number of arguments passed at the call site. Testify flattens variadic arguments via `_m.Called(_ca...)`, so a mismatch in matcher count will break the test. If your handler uses a site-scoped Temporal client mock (e.g., `siteMock` from `scp.IDClientMap`) that is invoked with 4 arguments (e.g., `ctx, workflowOptions, workflowName, workflowInput`), keep the corresponding `On("ExecuteWorkflow", ...)` expectation at 4 matchers; do not “convert” it to the main client’s arity (e.g., a 5-arg mock) unless that specific mock is actually called with 5 arguments in the tested code path.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.

Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-nsm 64 2 20 33 9 0
nico-psm 56 4 29 13 2 8
nico-rest-api 57 4 30 13 2 8
nico-rest-cert-manager 54 4 28 13 1 8
nico-rest-db 55 4 28 13 2 8
nico-rest-site-agent 54 4 28 13 1 8
nico-rest-site-manager 54 4 28 13 1 8
nico-rest-workflow 56 4 29 13 2 8
nico-rla 55 4 28 13 2 8
TOTAL 505 34 248 137 22 64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@hwadekar-nv hwadekar-nv force-pushed the fix/instance-upodate-same-partition branch from 08bf661 to b2c7503 Compare April 29, 2026 20:44
@hwadekar-nv hwadekar-nv changed the title fix: Reject Instance update when requested NVLink Interfaces match current fix: No-op Instance update when requested NVLink Interfaces match current Apr 29, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
api/pkg/api/handler/instance.go (1)

3177-3186: 💤 Low value

Multiset consumption does not verify that all existing active entries are accounted for.

After consuming from existingNvlIfcMap for each request interface, there is no subsequent check to ensure the map is fully depleted. If the request contains a subset of the existing keys (but the total count matches), the logic would incorrectly report unchanged.

Example: Existing has {A:0: 2}, request has {A:0, B:0} with count 2. Both consume successfully from a map that initially had {A:0: 2, B:0: 0} (if B:0 doesn't exist, it fails). However, if existing has {A:0: 1, A:1: 1} and request has {A:0, A:0}, the second A:0 would fail to consume. The logic handles this case. But if existing has {A:0: 2} and request has {A:0, A:0}, it works correctly.

Upon further analysis, the existing logic does handle this correctly since consumption failures set nvLinkInterfacesUnchanged = false. However, adding a post-loop verification that remaining counts are zero would strengthen the invariant and improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance.go` around lines 3177 - 3186, After consuming
counts from existingNvlIfcMap while iterating request interfaces, add a
post-loop verification that every remaining value in existingNvlIfcMap is zero;
if any value > 0 set nvLinkInterfacesUnchanged = false. Update the logic around
existingNvlIfcMap / nvLinkInterfacesUnchanged (the block that builds key :=
fmt.Sprintf("%s:%d", nvllPartitionID.String(), apiNvlIfc.DeviceInstance) and
decrements existingNvlIfcMap[key]) to include this final sweep so we only report
unchanged when all multinset counts are fully consumed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/pkg/api/handler/instance.go`:
- Line 3156: Fix the typographical error in the comment that currently reads "If
the number of pending/ready/deeleting NVLink interfaces is unchanged, do
nothing" by changing "deeleting" to "deleting" so the comment reads "If the
number of pending/ready/deleting NVLink interfaces is unchanged, do nothing";
update the comment text exactly where that string appears to correct the typo.
- Around line 3136-3154: existingNvlIfcMap is populated from existingNvlIfcs
without filtering out NVLinkInterfaceStatusDeleting entries, so Deleting
interfaces can incorrectly count toward unchanged checks; update the loop that
builds existingNvlIfcMap (iterating existingNvlIfcs) to skip entries whose
Status == cdbm.NVLinkInterfaceStatusDeleting before incrementing the map and
updating counters (keep updating existingDeletingNvlIfcsCount for bookkeeping),
ensuring only non-deleting interfaces are added to existingNvlIfcMap used for
the unchanged check.
- Around line 3159-3165: The count comparison is wrong because existingNvlIfcMap
aggregates all statuses; change the logic so each branch filters
existingNvlIfcMap to the corresponding-status entries before doing the multiset
match: when len(apiRequest.NVLinkInterfaces) == existingPendingNvlIfcsCount, set
nvLinkInterfacesUnchanged only after building a map of Pending interfaces (from
existingNvlIfcMap) and performing the multiset compare against that filtered
map; do the same for the Ready branch using existingReadyNvlIfcsCount and Ready
entries; remove or disallow treating equality with existingDeletingNvlIfcsCount
as unchanged (do not accept Deleting entries) so the multiset compare never uses
Deleting-status entries.

---

Nitpick comments:
In `@api/pkg/api/handler/instance.go`:
- Around line 3177-3186: After consuming counts from existingNvlIfcMap while
iterating request interfaces, add a post-loop verification that every remaining
value in existingNvlIfcMap is zero; if any value > 0 set
nvLinkInterfacesUnchanged = false. Update the logic around existingNvlIfcMap /
nvLinkInterfacesUnchanged (the block that builds key := fmt.Sprintf("%s:%d",
nvllPartitionID.String(), apiNvlIfc.DeviceInstance) and decrements
existingNvlIfcMap[key]) to include this final sweep so we only report unchanged
when all multinset counts are fully consumed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24e5782f-112f-43d1-a981-1bc4430fe8cf

📥 Commits

Reviewing files that changed from the base of the PR and between b2c7503 and 00ec52a.

📒 Files selected for processing (1)
  • api/pkg/api/handler/instance.go

Comment thread api/pkg/api/handler/instance.go Outdated
Comment on lines +3136 to +3154
// Count the number of NVLink interfaces by NVLink Logical Partition ID and Device Instance
existingNvlIfcMap := make(map[string]int)
existingReadyNvlIfcsCount := 0
existingPendingNvlIfcsCount := 0
existingDeletingNvlIfcsCount := 0
for i := range existingNvlIfcs {
key := fmt.Sprintf("%s:%d", existingNvlIfcs[i].NVLinkLogicalPartitionID.String(), existingNvlIfcs[i].DeviceInstance)
existingNvlIfcMap[key]++
switch existingNvlIfcs[i].Status {
case cdbm.NVLinkInterfaceStatusReady:
existingReadyNvlIfcsCount++
case cdbm.NVLinkInterfaceStatusPending:
existingPendingNvlIfcsCount++
case cdbm.NVLinkInterfaceStatusDeleting:
existingDeletingNvlIfcsCount++
default:
// Provisioning, Error, or other statuses are not counted in these buckets.
}
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Multiset map includes all statuses but only non-Deleting interfaces should participate in the unchanged check.

Per the PR objective, interfaces in Deleting state should not be considered part of the current active binding. Currently, existingNvlIfcMap includes all interfaces regardless of status, which could lead to a request matching against a Deleting interface and incorrectly being treated as a no-op.

Proposed fix: exclude Deleting interfaces from the multiset
 		for i := range existingNvlIfcs {
+			// Skip Deleting interfaces - they should not be considered part of active bindings
+			if existingNvlIfcs[i].Status == cdbm.NVLinkInterfaceStatusDeleting {
+				existingDeletingNvlIfcsCount++
+				continue
+			}
 			key := fmt.Sprintf("%s:%d", existingNvlIfcs[i].NVLinkLogicalPartitionID.String(), existingNvlIfcs[i].DeviceInstance)
 			existingNvlIfcMap[key]++
 			switch existingNvlIfcs[i].Status {
 			case cdbm.NVLinkInterfaceStatusReady:
 				existingReadyNvlIfcsCount++
 			case cdbm.NVLinkInterfaceStatusPending:
 				existingPendingNvlIfcsCount++
-			case cdbm.NVLinkInterfaceStatusDeleting:
-				existingDeletingNvlIfcsCount++
 			default:
 				// Provisioning, Error, or other statuses are not counted in these buckets.
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance.go` around lines 3136 - 3154, existingNvlIfcMap
is populated from existingNvlIfcs without filtering out
NVLinkInterfaceStatusDeleting entries, so Deleting interfaces can incorrectly
count toward unchanged checks; update the loop that builds existingNvlIfcMap
(iterating existingNvlIfcs) to skip entries whose Status ==
cdbm.NVLinkInterfaceStatusDeleting before incrementing the map and updating
counters (keep updating existingDeletingNvlIfcsCount for bookkeeping), ensuring
only non-deleting interfaces are added to existingNvlIfcMap used for the
unchanged check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since existing Interfaces might be in the state one of those including Deleting and if incoming interfaces as in Deleting state then incoming operation would be no-op until existing interfaces completely gone.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hwadekar-nv, thank you for the clarification. I understand the design intent now: if the incoming request exactly matches the currently-Deleting interfaces (same count and same (NVLinkLogicalPartitionID, DeviceInstance) multiset), the operation is intentionally treated as a no-op — the system avoids creating new Pending records on top of an in-progress deletion, and lets the cleanup complete naturally.

Including Deleting-state interfaces in existingNvlIfcMap is therefore deliberate, since it is needed for the multiset match to succeed in that path. That is a valid design choice and I withdraw my earlier concern.

One minor consideration worth confirming: if existing interfaces span mixed statuses (e.g., 1 Ready + 1 Deleting), a count match against existingReadyNvlIfcsCount == 1 could trigger nvLinkInterfacesUnchanged = true, after which the multiset check might consume the Deleting key rather than the Ready key for a given (partition, device) pair — potentially masking a real change. This is only a risk when the same (partition, device) key appears in both Deleting and non-Deleting interfaces simultaneously. If your data model prevents that (e.g., a given GPU slot cannot have two concurrent records in different states), this is a non-issue. Otherwise a small guard per status bucket in the multiset pass might be worth a follow-up.


✏️ Learnings added
Learnt from: hwadekar-nv
Repo: NVIDIA/infra-controller-rest PR: 438
File: api/pkg/api/handler/instance.go:3136-3154
Timestamp: 2026-04-29T23:54:10.850Z
Learning: In api/pkg/api/handler/instance.go (UpdateInstanceHandler.Handle), the NVLink unchanged fast-path intentionally includes NVLink interfaces in the Deleting state inside existingNvlIfcMap when building the multiset. If the incoming request's NVLink interfaces match the count and (NVLinkLogicalPartitionID, DeviceInstance) multiset of currently-Deleting interfaces, the operation is treated as a no-op — the handler avoids creating new Pending records while an in-progress deletion completes. Do not flag the inclusion of Deleting-status interfaces in existingNvlIfcMap as a bug.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.

Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.

Comment thread api/pkg/api/handler/instance.go Outdated
Comment thread api/pkg/api/handler/instance.go Outdated
Comment on lines +3159 to +3165
if len(apiRequest.NVLinkInterfaces) == existingPendingNvlIfcsCount {
nvLinkInterfacesUnchanged = true
} else if len(apiRequest.NVLinkInterfaces) == existingReadyNvlIfcsCount {
nvLinkInterfacesUnchanged = true
} else if len(apiRequest.NVLinkInterfaces) == existingDeletingNvlIfcsCount {
nvLinkInterfacesUnchanged = true
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Logic inconsistency: count check is per-status but multiset matching operates on all statuses combined.

The current implementation checks if the request count equals one of the three status-specific counts (Pending, Ready, or Deleting), but existingNvlIfcMap aggregates interfaces across all statuses. This permits false positives where a request matches a count bucket but consumes entries from different status categories during multiset matching.

Additionally, allowing the request to match existingDeletingNvlIfcsCount contradicts the PR objective: "ensuring those interfaces are not in the Deleting state."

Consider constructing separate maps per status or filtering existingNvlIfcMap to only include interfaces whose status corresponds to the matched count bucket.

Suggested approach
-		// If the number of pending/ready/deeleting NVLink interfaces is unchanged, do nothing
-		// because existing NVLink interfaces includes pending/ready/deleting NVLink interfaces on the same logical partition and device instance
-		// hence we need to verify if incoming request contains the same number of pending/ready/deleting NVLink interfaces
-		if len(apiRequest.NVLinkInterfaces) == existingPendingNvlIfcsCount {
-			nvLinkInterfacesUnchanged = true
-		} else if len(apiRequest.NVLinkInterfaces) == existingReadyNvlIfcsCount {
-			nvLinkInterfacesUnchanged = true
-		} else if len(apiRequest.NVLinkInterfaces) == existingDeletingNvlIfcsCount {
-			nvLinkInterfacesUnchanged = true
-		}
+		// Only consider Ready and Pending interfaces as "active" bindings (exclude Deleting)
+		activeNvlIfcsCount := existingReadyNvlIfcsCount + existingPendingNvlIfcsCount
+		if len(apiRequest.NVLinkInterfaces) == activeNvlIfcsCount {
+			nvLinkInterfacesUnchanged = true
+		}

Then filter existingNvlIfcMap to only include non-Deleting interfaces for consistent multiset matching.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(apiRequest.NVLinkInterfaces) == existingPendingNvlIfcsCount {
nvLinkInterfacesUnchanged = true
} else if len(apiRequest.NVLinkInterfaces) == existingReadyNvlIfcsCount {
nvLinkInterfacesUnchanged = true
} else if len(apiRequest.NVLinkInterfaces) == existingDeletingNvlIfcsCount {
nvLinkInterfacesUnchanged = true
}
// Only consider Ready and Pending interfaces as "active" bindings (exclude Deleting)
activeNvlIfcsCount := existingReadyNvlIfcsCount + existingPendingNvlIfcsCount
if len(apiRequest.NVLinkInterfaces) == activeNvlIfcsCount {
nvLinkInterfacesUnchanged = true
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance.go` around lines 3159 - 3165, The count
comparison is wrong because existingNvlIfcMap aggregates all statuses; change
the logic so each branch filters existingNvlIfcMap to the corresponding-status
entries before doing the multiset match: when len(apiRequest.NVLinkInterfaces)
== existingPendingNvlIfcsCount, set nvLinkInterfacesUnchanged only after
building a map of Pending interfaces (from existingNvlIfcMap) and performing the
multiset compare against that filtered map; do the same for the Ready branch
using existingReadyNvlIfcsCount and Ready entries; remove or disallow treating
equality with existingDeletingNvlIfcsCount as unchanged (do not accept Deleting
entries) so the multiset compare never uses Deleting-status entries.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since existing Interfaces might be in the state one of those including Deleting and if incoming interfaces as in Deleting state then incoming operation would be no-op until existing interfaces completely gone. We can't say always len(apiRequest.NVLinkInterfaces) == activeNvlIfcsCount

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hwadekar-nv, understood — thank you for the clarification. The intent is clear: if a deletion is already in progress (interfaces in Deleting state) and the incoming request specifies the same (NVLinkLogicalPartitionID, DeviceInstance) multiset, the operation should be treated as a no-op to avoid stacking a new Pending creation on top of an in-flight deletion. Matching against existingDeletingNvlIfcsCount is therefore correct and intentional.

I will retract that part of the review concern. I note there is a subtle edge case where the combined existingNvlIfcMap (built across all statuses) could allow a request to satisfy the count check of one status bucket (e.g., Ready) but actually match keys belonging to a different status (e.g., Deleting). However, since the end result is still a safe no-op — no duplicate DB records are created — this is acceptable behavior in your design.


🧠 Learnings used
Learnt from: hwadekar-nv
Repo: NVIDIA/infra-controller-rest PR: 438
File: api/pkg/api/handler/instance.go:3136-3154
Timestamp: 2026-04-29T23:54:10.850Z
Learning: In api/pkg/api/handler/instance.go (UpdateInstanceHandler.Handle), the NVLink unchanged fast-path intentionally includes NVLink interfaces in the Deleting state inside existingNvlIfcMap when building the multiset. If the incoming request's NVLink interfaces match the count and (NVLinkLogicalPartitionID, DeviceInstance) multiset of currently-Deleting interfaces, the operation is treated as a no-op — the handler avoids creating new Pending records while an in-progress deletion completes. Do not flag the inclusion of Deleting-status interfaces in existingNvlIfcMap as a bug.

Learnt from: thossain-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 257
File: api/pkg/api/handler/vpcpeering.go:203-262
Timestamp: 2026-03-18T01:37:21.156Z
Learning: In Go handler code under api/pkg/api/handler, RBAC should not short-circuit on the first matching role (e.g., if isProviderAdmin { ... } else { ... }). For dual-role service accounts (e.g., Latinum), an org may have both ProviderAdmin and TenantAdmin. Evaluate both authorization paths independently, accumulate errors per path, and only reject if all applicable paths fail. Implement a helper that checks each role path and collects errors, then decide based on whether any path grants access. Add tests covering mixed-role scenarios and ensure CI gates enforce this behavior across all relevant handlers in this directory.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:09:55.214Z
Learning: In this codebase, the database interface is intentionally designed to persist the exact value it is given (including values derived from protobuf types, e.g., enum `.String()`). Repo-owned DB constants (e.g., `cdbm.VpcRoutingProfileTypeExternal`) are published for external callers/consumers, not as an indirection layer that the handler should use. Therefore, when reviewing this repository’s handlers, do not flag as a coupling issue the pattern of persisting a protobuf-derived string value directly via calls like `cdb.GetStrPtr(routingProfileTypeValue.String())`, as this is expected behavior rather than an accidental dependency on those string representations.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 350
File: api/pkg/api/handler/vpc.go:303-313
Timestamp: 2026-04-06T21:22:33.123Z
Learning: Do not flag as an unacceptable coupling the persistence of protobuf enum values using the enum’s `.String()` output (e.g., storing `cdb.GetStrPtr(myProtoEnum.String())`) in inventory/round-trip paths. This codebase stores these `.String()` values because the gRPC back-end inventory process returns the same protobuf-derived string format, so DB round-trips must preserve the exact representation. If and when these inventory values change to plain strings with no backing proto enum, reassess the storage approach, but until then treat proto-enum `.String()` persistence as intentional and compatible.

Learnt from: bcavnvidia
Repo: NVIDIA/ncx-infra-controller-rest PR: 367
File: api/pkg/api/handler/machineinstancetype.go:593-606
Timestamp: 2026-04-10T16:39:25.404Z
Learning: In this codebase, Machine IDs are NOT UUIDs, while the deprecated Machine/InstanceType association IDs ARE UUIDs. When reviewing code that parses/validates incoming machine-related identifiers, treat `uuid.Parse` as a reliable discriminator between the deprecated UUID-based association IDs and the non-UUID Machine IDs. Do not raise/flag potential ambiguity between these two ID namespaces when a `uuid.Parse` check is used to separate them.

Learnt from: parmani-nv
Repo: NVIDIA/ncx-infra-controller-rest PR: 372
File: api/pkg/api/handler/fabric.go:137-143
Timestamp: 2026-04-15T23:49:00.903Z
Learning: In this repository, `cdb.ErrDoesNotExist` (from `db/pkg/db`) is a DB-layer sentinel error that is never wrapped. Therefore, in Go code that calls DAO/DB methods (e.g., `TenantSiteDAO.GetByTenantIDAndSiteID`), direct equality checks like `err == cdb.ErrDoesNotExist` / `err != cdb.ErrDoesNotExist` are intentional and safe. Do not flag these comparisons as robustness issues and do not suggest replacing them with `errors.Is()` for this specific sentinel.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/pkg/api/handler/instance.go (1)

3136-3185: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Block the NVLink no-op fast-path when any existing interface is in an unhandled status.

At Line 3159, an explicit clear request ([]) becomes a no-op whenever the counted buckets are zero, even if existingNvlIfcs still contains Provisioning or Error rows. The same issue can happen for mixed-state instances, because those rows are inserted into existingNvlIfcMap but never participate in the bucket check. That can silently ignore a real NVLink update.

Suggested fix
 		// Count the number of NVLink interfaces by NVLink Logical Partition ID and Device Instance
 		existingNvlIfcMap := make(map[string]int)
 		existingReadyNvlIfcsCount := 0
 		existingPendingNvlIfcsCount := 0
 		existingDeletingNvlIfcsCount := 0
+		hasUnhandledNvlIfcStatus := false
 		for i := range existingNvlIfcs {
 			key := fmt.Sprintf("%s:%d", existingNvlIfcs[i].NVLinkLogicalPartitionID.String(), existingNvlIfcs[i].DeviceInstance)
-			existingNvlIfcMap[key]++
 			switch existingNvlIfcs[i].Status {
 			case cdbm.NVLinkInterfaceStatusReady:
 				existingReadyNvlIfcsCount++
+				existingNvlIfcMap[key]++
 			case cdbm.NVLinkInterfaceStatusPending:
 				existingPendingNvlIfcsCount++
+				existingNvlIfcMap[key]++
 			case cdbm.NVLinkInterfaceStatusDeleting:
 				existingDeletingNvlIfcsCount++
+				existingNvlIfcMap[key]++
 			default:
-				// Provisioning, Error, or other statuses are not counted in these buckets.
+				hasUnhandledNvlIfcStatus = true
 			}
 		}
 
 		// If the number of pending/ready/deleting NVLink interfaces is unchanged, do nothing
 		// because existing NVLink interfaces includes pending/ready/deleting NVLink interfaces on the same logical partition and device instance
 		// hence we need to verify if incoming request contains the same number of pending/ready/deleting NVLink interfaces
-		if len(apiRequest.NVLinkInterfaces) == existingPendingNvlIfcsCount {
-			nvLinkInterfacesUnchanged = true
-		} else if len(apiRequest.NVLinkInterfaces) == existingReadyNvlIfcsCount {
-			nvLinkInterfacesUnchanged = true
-		} else if len(apiRequest.NVLinkInterfaces) == existingDeletingNvlIfcsCount {
-			nvLinkInterfacesUnchanged = true
+		if !hasUnhandledNvlIfcStatus {
+			if len(apiRequest.NVLinkInterfaces) == existingPendingNvlIfcsCount {
+				nvLinkInterfacesUnchanged = true
+			} else if len(apiRequest.NVLinkInterfaces) == existingReadyNvlIfcsCount {
+				nvLinkInterfacesUnchanged = true
+			} else if len(apiRequest.NVLinkInterfaces) == existingDeletingNvlIfcsCount {
+				nvLinkInterfacesUnchanged = true
+			}
 		}

Based on learnings: keep Deleting-state interfaces eligible for this no-op path; the issue is that statuses outside Pending/Ready/Deleting can currently trigger it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/pkg/api/handler/instance.go` around lines 3136 - 3185, The no-op
fast-path incorrectly fires when existingNvlIfcs contains rows in unhandled
statuses (e.g., Provisioning or Error) because only Pending/Ready/Deleting were
bucketed; add a guard that detects any unhandled status during the
existingNvlIfcs iteration (e.g., set hasUnhandledNvlIfcStatus = true in the
default switch case) and then only allow nvLinkInterfacesUnchanged to remain
true if hasUnhandledNvlIfcStatus is false (and the multiset counts still match).
Update references around existingNvlIfcs, existingNvlIfcMap,
nvLinkInterfacesUnchanged and the status-counting switch to implement this check
so unhandled statuses block the no-op path while keeping Deleting eligible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@api/pkg/api/handler/instance.go`:
- Around line 3136-3185: The no-op fast-path incorrectly fires when
existingNvlIfcs contains rows in unhandled statuses (e.g., Provisioning or
Error) because only Pending/Ready/Deleting were bucketed; add a guard that
detects any unhandled status during the existingNvlIfcs iteration (e.g., set
hasUnhandledNvlIfcStatus = true in the default switch case) and then only allow
nvLinkInterfacesUnchanged to remain true if hasUnhandledNvlIfcStatus is false
(and the multiset counts still match). Update references around existingNvlIfcs,
existingNvlIfcMap, nvLinkInterfacesUnchanged and the status-counting switch to
implement this check so unhandled statuses block the no-op path while keeping
Deleting eligible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9963253c-fdd4-4d90-a5d9-aa1560d81622

📥 Commits

Reviewing files that changed from the base of the PR and between 00ec52a and cdbeef9.

📒 Files selected for processing (1)
  • api/pkg/api/handler/instance.go

@hwadekar-nv hwadekar-nv force-pushed the fix/instance-upodate-same-partition branch from cdbeef9 to 7bc9293 Compare May 5, 2026 18:21
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/pkg/api/handler/instance.go (1)

3105-3167: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The no-op fast-path never actually skips the NVLink writes.

Even when nvLinkInterfacesUnchanged becomes true, Lines 3105-3167 still create new Pending rows and mark the current rows Deleting whenever apiRequest.NVLinkInterfaces != nil. That defeats the PR fix for identical requests. Branch this block on !nvLinkInterfacesUnchanged, and in the no-op branch preserve existingNvlIfcs as the returned/current state.

Suggested control-flow fix
-		if apiRequest.NVLinkInterfaces != nil {
+		if apiRequest.NVLinkInterfaces != nil {
+			if nvLinkInterfacesUnchanged {
+				newNvlIfcs = existingNvlIfcs
+			} else {
 			nvllpIDs := goset.NewSet[uuid.UUID]()
 			for _, apiNvlIfc := range apiRequest.NVLinkInterfaces {
 				// NVLink Logical Partition
 				nvllPartitionID, err := uuid.Parse(apiNvlIfc.NVLinkLogicalPartitionID)
 				if err != nil {
@@
 				if err != nil {
 					logger.Error().Err(err).Msg("failed to update NVLink Interface records in DB")
 					return cutil.NewAPIError(http.StatusInternalServerError, "Failed to update NVLink Interfaces for Instance, DB error", nil)
 				}
 			}
+			}
 
 		} else {
 			newNvlIfcs = existingNvlIfcs
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/pkg/api/handler/instance.go` around lines 3105 - 3167, The NVLink write
block guarded only by apiRequest.NVLinkInterfaces still runs even when
nvLinkInterfacesUnchanged is true; change the control flow to only execute the
creation+deletion logic when !nvLinkInterfacesUnchanged (i.e., wrap the entire
block that parses NVLinkLogicalPartitionIDs, calls nvlIfcDAO.Create and
UpdateMultiple) and in the else/no-op branch leave existingNvlIfcs unchanged and
set newNvlIfcs (or the returned NVLink list) to the existingNvlIfcs so the
handler returns the current state without creating Pending rows or marking
Deleting.
🧹 Nitpick comments (1)
api/pkg/api/handler/instance_test.go (1)

5664-5691: ⚡ Quick win

Add an explicit no-op DB assertion for NVLink interface state.

This new case targets “unchanged NVLink interfaces => no-op”, but it does not assert the key postcondition that existing NVLink interfaces are not moved to Deleting.

Proposed test-harness addition
 	type args struct {
 		...
 		nvLinkGpuConfigsVerifyCountOnly bool
+		expectNoNVLinkDeleting          bool
 	}
 		{
 			name: "test Instance update API endpoint success when NVLink interfaces unchanged (no-op, multiset order)",
 			...
 			args: args{
 				...
 				nvLinkGpuConfigsVerifyCountOnly: true,
+				expectNoNVLinkDeleting:          true,
 			},
+			if tt.args.expectNoNVLinkDeleting {
+				nvlIfcDAO := cdbm.NewNVLinkInterfaceDAO(tt.fields.dbSession)
+				deleting, _, _ := nvlIfcDAO.GetAll(
+					ec.Request().Context(),
+					nil,
+					cdbm.NVLinkInterfaceFilterInput{
+						InstanceIDs: []uuid.UUID{reqIns.ID},
+						Statuses:    []string{cdbm.NVLinkInterfaceStatusDeleting},
+					},
+					cdbp.PageInput{},
+					nil,
+				)
+				assert.Len(t, deleting, 0)
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/pkg/api/handler/instance_test.go` around lines 5664 - 5691, This test
case ("test Instance update API endpoint success when NVLink interfaces
unchanged (no-op, multiset order)") is missing an explicit DB assertion that
existing NVLink interfaces were not marked Deleting; after calling the handler,
query the NVLink interface rows for the instance (use inst13.ID /
NVLinkInterfaces tied to nvllp1/nvllp2 via DeviceInstance) via the test
dbSession and assert each NVLink interface's State (or Status) field is still
the live state (not "Deleting"); add this assertion to the test’s
verify/tear-down block so the handler no-op path verifies DB rows were not
transitioned to Deleting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/pkg/api/handler/instance_test.go`:
- Around line 6155-6160: When nvLinkGpuConfigsVerifyCountOnly is true, the test
currently only checks length; instead verify the multiset of
(NVLinkLogicalPartitionID, DeviceInstance) pairs regardless of order by
comparing counts of each pair between siteReq.Config.Nvlink.GpuConfigs and
tt.args.reqData.NVLinkInterfaces. Locate the branch guarded by
nvLinkGpuConfigsVerifyCountOnly and replace the length-only assertion with logic
that iterates siteReq.Config.Nvlink.GpuConfigs and
tt.args.reqData.NVLinkInterfaces, constructs a map/counter keyed by the tuple
(LogicalPartitionId.Value, DeviceInstance) and asserts the two maps are equal so
order differences do not cause false positives.

In `@api/pkg/api/handler/instance.go`:
- Around line 3027-3040: The closure is shadowing outer variables causing them
to be inaccessible later: remove the inner declaration of newNvlIfcs and stop
using := for existingNvlIfcs and nvLinkInterfacesUnchanged inside the
transaction closure so they reassign the outer variables instead; specifically,
delete the inner "var newNvlIfcs" declaration, change the GetAll assignment from
":=" to "=" for existingNvlIfcs (handling the err as currently done), and move
the declaration of nvLinkInterfacesUnchanged to the outer scope (e.g., before
the transaction block) so the values set by the NVLinkInterfaceDAO/GetAll flow
are visible after the transaction. Ensure NVLinkInterfaceDAO
(NewNVLinkInterfaceDAO), GetAll, and any error handling continue to work with
the reassigned variables.

---

Outside diff comments:
In `@api/pkg/api/handler/instance.go`:
- Around line 3105-3167: The NVLink write block guarded only by
apiRequest.NVLinkInterfaces still runs even when nvLinkInterfacesUnchanged is
true; change the control flow to only execute the creation+deletion logic when
!nvLinkInterfacesUnchanged (i.e., wrap the entire block that parses
NVLinkLogicalPartitionIDs, calls nvlIfcDAO.Create and UpdateMultiple) and in the
else/no-op branch leave existingNvlIfcs unchanged and set newNvlIfcs (or the
returned NVLink list) to the existingNvlIfcs so the handler returns the current
state without creating Pending rows or marking Deleting.

---

Nitpick comments:
In `@api/pkg/api/handler/instance_test.go`:
- Around line 5664-5691: This test case ("test Instance update API endpoint
success when NVLink interfaces unchanged (no-op, multiset order)") is missing an
explicit DB assertion that existing NVLink interfaces were not marked Deleting;
after calling the handler, query the NVLink interface rows for the instance (use
inst13.ID / NVLinkInterfaces tied to nvllp1/nvllp2 via DeviceInstance) via the
test dbSession and assert each NVLink interface's State (or Status) field is
still the live state (not "Deleting"); add this assertion to the test’s
verify/tear-down block so the handler no-op path verifies DB rows were not
transitioned to Deleting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d5dcb542-360e-4125-974b-9d481aecdfe5

📥 Commits

Reviewing files that changed from the base of the PR and between cdbeef9 and 7bc9293.

📒 Files selected for processing (2)
  • api/pkg/api/handler/instance.go
  • api/pkg/api/handler/instance_test.go

Comment thread api/pkg/api/handler/instance_test.go
Comment thread api/pkg/api/handler/instance.go Outdated
Comment on lines +3027 to +3040
}

// Check if any DPU Extension Service Deployments are being requested to be created or removed
existingDesdMap := map[string]*cdbm.DpuExtensionServiceDeployment{}
for _, desd := range existingDesds {
existingDesdMap[fmt.Sprintf("%s:%s", desd.DpuExtensionServiceID.String(), desd.Version)] = &desd
}

updatedDesdMap := map[string]*cdbm.DpuExtensionServiceDeployment{}

// If DPU extension services were omitted from the request, preserve existing
// services in the config we send to Site Controller.
if apiRequest.DpuExtensionServiceDeployments == nil {
updateDesds = append(updateDesds, existingDesds...)
} else {
for _, adesdr := range apiRequest.DpuExtensionServiceDeployments {
desdID, err := uuid.Parse(adesdr.DpuExtensionServiceID)
if err != nil {
return cutil.NewAPIError(http.StatusBadRequest, fmt.Sprintf("Invalid DPU Extension Service ID: %s specified in request", adesdr.DpuExtensionServiceID), nil)
}

desvID := fmt.Sprintf("%s:%s", desdID.String(), adesdr.Version)
// Create new NVLink Interface records in the DB if specified in request
var newNvlIfcs []cdbm.NVLinkInterface
nvlIfcDAO := cdbm.NewNVLinkInterfaceDAO(uih.dbSession)

existingDesd, exists := existingDesdMap[desvID]
if exists {
updateDesds = append(updateDesds, *existingDesd)
updatedDesdMap[desvID] = existingDesd
} else {
newDesd, serr := desdDAO.Create(ctx, tx, cdbm.DpuExtensionServiceDeploymentCreateInput{
SiteID: site.ID,
TenantID: tenant.ID,
InstanceID: instance.ID,
DpuExtensionServiceID: desdID,
Version: adesdr.Version,
Status: cdbm.DpuExtensionServiceDeploymentStatusPending,
CreatedBy: dbUser.ID,
})
if serr != nil {
logger.Error().Err(serr).Msg("error creating Instance DpuExtensionServiceDeployment record in DB")
return cutil.NewAPIError(http.StatusInternalServerError, "Failed to create DPU Extension Service Deployment for Instance, DB error", nil)
}
des, _ := desIDMap[desdID]
newDesd.DpuExtensionService = des
updateDesds = append(updateDesds, *newDesd)
updatedDesdMap[desvID] = newDesd
}
// OrderAscending is our best-effort to make sure we send Carbide the interfaces in the order it originally received them. so the config doesn't get rejected
existingNvlIfcs, _, err := nvlIfcDAO.GetAll(ctx, tx, cdbm.NVLinkInterfaceFilterInput{InstanceIDs: []uuid.UUID{instanceID}}, cdbp.PageInput{OrderBy: &cdbp.OrderBy{Field: cdbm.NVLinkInterfaceOrderByCreated, Order: cdbp.OrderAscending}}, nil)
if err != nil {
logger.Error().Err(err).Msg("failed to retrieve NVLink Interfaces details for Instance")
return cutil.NewAPIErrorResponse(c, http.StatusInternalServerError, "Failed to retrieve NVLink interfaces for Instance, DB error", nil)
}

nvLinkInterfacesUnchanged := false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path

path = Path("api/pkg/api/handler/instance.go")
lines = path.read_text().splitlines()

for start, end in [(3018, 3042), (3398, 3407)]:
    print(f"\n--- {path}:{start}-{end} ---")
    for i in range(start, end + 1):
        print(f"{i}: {lines[i-1]}")

print("\n--- occurrences of NVLink response-state symbols ---")
for i, line in enumerate(lines, 1):
    if any(tok in line for tok in ("newNvlIfcs", "existingNvlIfcs", "nvLinkInterfacesUnchanged")):
        print(f"{i}: {line}")
PY

Repository: NVIDIA/infra-controller-rest

Length of output: 4075


🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path

path = Path("api/pkg/api/handler/instance.go")
lines = path.read_text().splitlines()

# Show broader context around transaction boundaries
print("--- Transaction entry (around line 2700) ---")
for i in range(2698, 2730):
    print(f"{i}: {lines[i-1]}")

print("\n--- Transaction exit (around line 3410) ---")
for i in range(3400, 3415):
    print(f"{i}: {lines[i-1]}")
PY

Repository: NVIDIA/infra-controller-rest

Length of output: 2392


🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path

path = Path("api/pkg/api/handler/instance.go")
lines = path.read_text().splitlines()

# Look for the closure-ending pattern: likely a }) at the end of the WithTx
# Search from line 3150 onwards for the closing brace of the transaction
print("--- Looking for transaction closure end (line 3100-3300) ---")
for i in range(3100, 3300):
    line = lines[i-1]
    # Show lines with unmatched closing braces that might end the closure
    if i >= 3200 and line.strip() in ["}", "})"] :
        print(f"{i}: {line}")
    elif i >= 3200 and i <= 3230:
        print(f"{i}: {line}")

# Also check around line 3400 to see structure
print("\n--- Around line 3400-3410 (response assembly) ---")
for i in range(3395, 3415):
    print(f"{i}: {lines[i-1]}")
PY

Repository: NVIDIA/infra-controller-rest

Length of output: 2528


🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path
import re

path = Path("api/pkg/api/handler/instance.go")
lines = path.read_text().splitlines()

# Find where the WithTx closure ends by searching for "}) " pattern
# The closure likely ends with "}) " (closing brace-paren-space)
print("--- Searching for WithTx closure end ---")
for i in range(3300, 3420):
    line = lines[i-1]
    # Look for the specific pattern of closure end
    if re.match(r'^\s*\})\s*$', line):
        print(f"FOUND CLOSURE END at {i}: {line}")
        # Print surrounding context
        for j in range(max(3300, i-5), min(3420, i+10)):
            print(f"{j}: {lines[j-1]}")
        break
    elif i <= 3320 or i >= 3380:
        print(f"{i}: {lines[i-1]}")
PY

Repository: NVIDIA/infra-controller-rest

Length of output: 796


🏁 Script executed:

#!/bin/bash
python - <<'PY'
from pathlib import Path

path = Path("api/pkg/api/handler/instance.go")
lines = path.read_text().splitlines()

# Find where the WithTx closure ends by looking for }) at start of line
print("--- Searching for WithTx closure end (line 3300+) ---")
for i in range(3300, 3420):
    line = lines[i-1]
    if line.strip() == '})' or line.rstrip().endswith('})'):
        print(f"FOUND CLOSURE END at {i}: {repr(line)}")
        # Print surrounding context
        for j in range(max(3300, i-3), min(3420, i+5)):
            prefix = ">>>" if j == i else "   "
            print(f"{prefix} {j}: {lines[j-1]}")
        break
    # Also print some context every 20 lines
    if i % 20 == 0:
        print(f"{i}: {lines[i-1]}")
PY

Repository: NVIDIA/infra-controller-rest

Length of output: 691


Fix critical scope error: inner variable declarations shadow and become inaccessible outside the transaction closure.

Line 3030 redeclares newNvlIfcs, and line 3034 shadows existingNvlIfcs with the := operator inside the closure. Additionally, line 3040 declares nvLinkInterfacesUnchanged locally within the closure. These shadowed variables are then referenced outside the closure at line 3405–3407, where they are either empty or out of scope entirely. This is a compile-time error.

Remove the var newNvlIfcs declaration at line 3030. Reassign to the outer variables directly: change line 3034 to use = instead of := (with appropriate error handling), and move nvLinkInterfacesUnchanged declaration outside the closure (before line 2721) or refactor to avoid out-of-scope references.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/pkg/api/handler/instance.go` around lines 3027 - 3040, The closure is
shadowing outer variables causing them to be inaccessible later: remove the
inner declaration of newNvlIfcs and stop using := for existingNvlIfcs and
nvLinkInterfacesUnchanged inside the transaction closure so they reassign the
outer variables instead; specifically, delete the inner "var newNvlIfcs"
declaration, change the GetAll assignment from ":=" to "=" for existingNvlIfcs
(handling the err as currently done), and move the declaration of
nvLinkInterfacesUnchanged to the outer scope (e.g., before the transaction
block) so the values set by the NVLinkInterfaceDAO/GetAll flow are visible after
the transaction. Ensure NVLinkInterfaceDAO (NewNVLinkInterfaceDAO), GetAll, and
any error handling continue to work with the reassigned variables.

@hwadekar-nv hwadekar-nv force-pushed the fix/instance-upodate-same-partition branch 2 times, most recently from bae89d8 to 728c5b1 Compare May 5, 2026 22:09
@hwadekar-nv hwadekar-nv changed the title fix: No-op Instance update when requested NVLink Interfaces match current fix: No-op Instance update when requested NVLink Interfaces matches with current May 5, 2026
@hwadekar-nv hwadekar-nv force-pushed the fix/instance-upodate-same-partition branch 7 times, most recently from 36b6c99 to 94f45cd Compare May 6, 2026 23:43
Copy link
Copy Markdown
Contributor

@thossain-nv thossain-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's reduce the grace period to 90s and fix the final suggestions before merging.

Comment thread api/pkg/api/handler/instance.go Outdated
// This interface is already ready, we don't need to re-issue the NVLink interface
existingReadyNvlIfcsCount++
} else {
if mostRecentNvlIfc.Updated.Before(time.Now().Add(-NVLinkInterfaceStatusSyncGraceWindow)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be mostRecentNvlIfc.Updated.After(time.Now().Add(-NVLinkInterfaceStatusSyncGraceWindow)) { as we want the no-op to happen if the Interfaces were updated during grace period not before.

@hwadekar-nv hwadekar-nv force-pushed the fix/instance-upodate-same-partition branch from 57a9a2c to eecca29 Compare May 7, 2026 01:55
@hwadekar-nv hwadekar-nv force-pushed the fix/instance-upodate-same-partition branch from eecca29 to 1cc4217 Compare May 7, 2026 16:05
@hwadekar-nv hwadekar-nv merged commit 32a4eb7 into main May 7, 2026
104 checks passed
@hwadekar-nv hwadekar-nv deleted the fix/instance-upodate-same-partition branch May 7, 2026 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants