fix: BMC MAC checks when site-explorer is discovering switches and power shelves#1017
fix: BMC MAC checks when site-explorer is discovering switches and power shelves#1017chet wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
…wer shelves We ran into an issue where we had duplicate switches in an environment, because on the first `site-explorer` run, it failed to pull the chassis serial number from the switch, which fell back to "NA", and gave us an encoded `SwitchId` as such. The subsequent run succeeded, which gave us a different `SwitchId`, so then we had two. Machines actually do a BMC MAC check. If the BMC MAC already has an associated machine, don't make another one. Switches and power shelves did not have this check. In general machine discovery: - Does an existing BMC MAC check first. - Fails if it can't pull the serial (errors with `MissingHardwareInfo::Serial`). - Then looks for a duplicate `MachineId`. So, this adds it, and tweaks some things with power shelf `bmc_mac_address` to match the pattern we have for switches. Tests added. Signed-off-by: Chet Nichols III <chetn@nvidia.com>
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-04-17 22:07:13 UTC | Commit: 8248be3 |
|
|
||
| let query = sqlx::query_as::<_, PowerShelfId>( | ||
| "INSERT INTO power_shelves (id, name, config, controller_state, controller_state_version, description, labels, version, rack_id) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9) RETURNING id", | ||
| "INSERT INTO power_shelves (id, name, config, controller_state, controller_state_version, bmc_mac_address, description, labels, version, rack_id) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING id", |
There was a problem hiding this comment.
we can do that temporarily. But longer term #925 seems like the better approach.
==> The BMC/object link would always be in table machine_interfaces, and you can check from there
There was a problem hiding this comment.
@Matthias247 Yeah -- totally agree. I've been fussing with the associations in general, and I know @spydaNVIDIA has too.
| /// If a chassis serial number cant be populated, it returns NA. | ||
| /// There's probably a better way to do this, but right now I'm | ||
| /// fixing some bugs; I'm making it a constant, at least! | ||
| const SWITCH_SERIAL_NA: &str = "NA"; |
There was a problem hiding this comment.
that's reported by the actual BMC?
For hosts what we are doing is:
- create the managehost object
- mark it as unhealthy with SerialMismatch alert
I'd feel like we should be consistent over what we are doing for all objects.
That could mean consistently rejecting objects with mismatching serials.
There was a problem hiding this comment.
@Matthias247 Yeah, lol -- the BMC reported "NA", and then in a subsequent read, it reported the actual serial number. And yeah, I could switch this to just reject if the serial mismatches.
The other problem though is I do want to guard against duplicate devices -- the first exploration, it reported "NA", the second exploration reported an actual serial number, so then we ended up with two switches.
So we keep the BMC MAC check, but in cases of NA, we let it in, and just mark it unhealthy with a serial mismatch? Or do we just skip it until the serial matches in a subsequent exploration?
There was a problem hiding this comment.
@Matthias247 Okay so really, the overall approach here isn't "check for NA" and do things. Instead:
If there's already an associated entity with the BMC MAC, don't make another one, else make a new entity (switch, power shelf, etc).
And THEN, like you said, if the chassis serial number doesn't match the expected serial for that entity, immediately mark it unhealthy?
...but would we prefer to not even let it get created to begin with, and log that we noticed it, and the SN doesn't match? I guess the current flow lets us get a managed entity from it, and then we can see that it's there and why it's unhealthy. ...whereas logs would be more difficult to track down in a sense.
There was a problem hiding this comment.
And THEN, like you said, if the chassis serial number doesn't match the expected serial for that entity, immediately mark it unhealthy?
That's what we do for machines. But I'm not sure if its the best approach.
...but would we prefer to not even let it get created to begin with, and log that we noticed it, and the SN doesn't match?
That might actually be saner. Maybe we should put it behind a config flag? But if you wanted to get that host ingested, you might as well just modify the serial number. So I'm a bit torn on whether that additional flag is needed.
|
conflicts @chet |
Description
We ran into an issue where we had duplicate switches in an environment, because on the first
site-explorerrun, it failed to pull the chassis serial number from the switch, which fell back to "NA", and gave us an encodedSwitchIdas such. The subsequent run succeeded, which gave us a differentSwitchId, so then we had two.Machines actually do a BMC MAC check. If the BMC MAC already has an associated machine, don't make another one. Switches and power shelves did not have this check.
In general machine discovery:
MissingHardwareInfo::Serial).MachineId.So, this adds it, and tweaks some things with power shelf
bmc_mac_addressto match the pattern we have for switches.Tests added!
Signed-off-by: Chet Nichols III chetn@nvidia.com
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes