Skip to content

fix: BMC MAC checks when site-explorer is discovering switches and power shelves#1017

Open
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:rack-component-duplicate-defense
Open

fix: BMC MAC checks when site-explorer is discovering switches and power shelves#1017
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:rack-component-duplicate-defense

Conversation

@chet
Copy link
Copy Markdown
Contributor

@chet chet commented Apr 17, 2026

Description

We ran into an issue where we had duplicate switches in an environment, because on the first site-explorer run, it failed to pull the chassis serial number from the switch, which fell back to "NA", and gave us an encoded SwitchId as such. The subsequent run succeeded, which gave us a different SwitchId, so then we had two.

Machines actually do a BMC MAC check. If the BMC MAC already has an associated machine, don't make another one. Switches and power shelves did not have this check.

In general machine discovery:

  • Does an existing BMC MAC check first.
  • Fails if it can't pull the serial (errors with MissingHardwareInfo::Serial).
  • Then looks for a duplicate MachineId.

So, this adds it, and tweaks some things with power shelf bmc_mac_address to match the pattern we have for switches.

Tests added!

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

…wer shelves

We ran into an issue where we had duplicate switches in an environment, because on the first `site-explorer` run, it failed to pull the chassis serial number from the switch, which fell back to "NA", and gave us an encoded `SwitchId` as such. The subsequent run succeeded, which gave us a different `SwitchId`, so then we had two.

Machines actually do a BMC MAC check. If the BMC MAC already has an associated machine, don't make another one. Switches and power shelves did not have this check.

In general machine discovery:
- Does an existing BMC MAC check first.
- Fails if it can't pull the serial (errors with `MissingHardwareInfo::Serial`).
- Then looks for a duplicate `MachineId`.

So, this adds it, and tweaks some things with power shelf `bmc_mac_address` to match the pattern we have for switches.

Tests added.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet requested a review from a team as a code owner April 17, 2026 22:04
@github-actions
Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-04-17 22:07:13 UTC | Commit: 8248be3


let query = sqlx::query_as::<_, PowerShelfId>(
"INSERT INTO power_shelves (id, name, config, controller_state, controller_state_version, description, labels, version, rack_id) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING id",
"INSERT INTO power_shelves (id, name, config, controller_state, controller_state_version, bmc_mac_address, description, labels, version, rack_id) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING id",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do that temporarily. But longer term #925 seems like the better approach.

==> The BMC/object link would always be in table machine_interfaces, and you can check from there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Matthias247 Yeah -- totally agree. I've been fussing with the associations in general, and I know @spydaNVIDIA has too.

/// If a chassis serial number cant be populated, it returns NA.
/// There's probably a better way to do this, but right now I'm
/// fixing some bugs; I'm making it a constant, at least!
const SWITCH_SERIAL_NA: &str = "NA";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's reported by the actual BMC?

For hosts what we are doing is:

  • create the managehost object
  • mark it as unhealthy with SerialMismatch alert

I'd feel like we should be consistent over what we are doing for all objects.
That could mean consistently rejecting objects with mismatching serials.

Copy link
Copy Markdown
Contributor Author

@chet chet Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Matthias247 Yeah, lol -- the BMC reported "NA", and then in a subsequent read, it reported the actual serial number. And yeah, I could switch this to just reject if the serial mismatches.

The other problem though is I do want to guard against duplicate devices -- the first exploration, it reported "NA", the second exploration reported an actual serial number, so then we ended up with two switches.

So we keep the BMC MAC check, but in cases of NA, we let it in, and just mark it unhealthy with a serial mismatch? Or do we just skip it until the serial matches in a subsequent exploration?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Matthias247 Okay so really, the overall approach here isn't "check for NA" and do things. Instead:

If there's already an associated entity with the BMC MAC, don't make another one, else make a new entity (switch, power shelf, etc).

And THEN, like you said, if the chassis serial number doesn't match the expected serial for that entity, immediately mark it unhealthy?

...but would we prefer to not even let it get created to begin with, and log that we noticed it, and the SN doesn't match? I guess the current flow lets us get a managed entity from it, and then we can see that it's there and why it's unhealthy. ...whereas logs would be more difficult to track down in a sense.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And THEN, like you said, if the chassis serial number doesn't match the expected serial for that entity, immediately mark it unhealthy?

That's what we do for machines. But I'm not sure if its the best approach.

...but would we prefer to not even let it get created to begin with, and log that we noticed it, and the SN doesn't match?

That might actually be saner. Maybe we should put it behind a config flag? But if you wanted to get that host ingested, you might as well just modify the serial number. So I'm a bit torn on whether that additional flag is needed.

@ajf
Copy link
Copy Markdown
Collaborator

ajf commented Apr 29, 2026

conflicts @chet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants