Maintenance Pipeline Proposal #20
Signed-off-by: Nagadeesh Nagaraja <nagadeesh.nagaraja@sap.com>
I like the overall architecture and the given examples. Drift recovery sounds well planned. Some comments:
Server taints integration
IEP-0016 should take an explicit dependency on ironcore-dev/metal-operator#878 (Server taints). The MaintenancePipelineRun controller should directly apply a taint to all servers in serverRefs at run start and remove it on completion or failure — no ServerReadinessRule involved, since maintenance is not optional and not gate-controller-driven.
This also resolves the ServerMaintenance churn rough edge called out in the Alternatives section: instead of each child resource independently cycling maintenance windows per stage, the pipeline holds the taint for the entire run duration.
The taint effect should be operator-configurable since the right answer depends on whether workloads are already running:
spec:
  maintenanceTaintEffect: NoClaim # or: Evict
Add a Dependencies section listing ironcore-dev/metal-operator#878.
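To make the "hold the taint for the entire run" idea concrete, here is a minimal, language-neutral sketch in Python. It is not the metal-operator API: the in-memory `servers` mapping, the `hold_maintenance_taint` helper, and the taint key are all hypothetical, and a real controller would do this across reconcile loops rather than in a single call. The only point it illustrates is that the taint is applied once at run start and removed on both completion and failure.

```python
from contextlib import contextmanager

# Hypothetical taint key, chosen only for this illustration.
MAINTENANCE_TAINT_KEY = "example.invalid/maintenance-pipeline-run"

# Effects named in the review comment above.
ALLOWED_EFFECTS = ("NoClaim", "Evict")


@contextmanager
def hold_maintenance_taint(servers, server_refs, effect="NoClaim"):
    """Taint every referenced server for the duration of the run.

    `servers` is a hypothetical in-memory stand-in for the cluster:
    a dict mapping server name -> set of (key, effect) taints.
    """
    if effect not in ALLOWED_EFFECTS:
        raise ValueError(f"unsupported taint effect: {effect!r}")
    taint = (MAINTENANCE_TAINT_KEY, effect)

    # Run start: taint all servers once, before any stage executes.
    for ref in server_refs:
        servers[ref].add(taint)
    try:
        yield
    finally:
        # Completion or failure: always remove the taint again.
        for ref in server_refs:
            servers[ref].discard(taint)


if __name__ == "__main__":
    cluster = {"server-a": set(), "server-b": set()}
    with hold_maintenance_taint(cluster, ["server-a", "server-b"], "Evict"):
        # All stages would run here while the taint is held.
        print(cluster)
    print(cluster)
```

Because the stages run inside a single taint window, no stage has to open or close its own maintenance window, which is the churn reduction the comment describes.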
Aggregate stage phase is underspecified
The proposal says the aggregate phase for Server-scoped stages "reflects the slowest server" but does not define mixed-state behavior. Suggest stating explicitly:
- Pending: no servers have started
- InProgress: at least one server is progressing
- Failed: at least one server failed
- Completed: all servers completed
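The four rules above can overlap (one server failed while another is still progressing), so an implementation needs a precedence order. The following Python sketch shows one possible reading, not the proposal's definition. It assumes Failed wins over everything else, that a mix of Completed and Pending servers counts as InProgress, and that an empty server list is Pending. The `Phase` enum and `aggregate_phase` function are hypothetical names.

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    IN_PROGRESS = "InProgress"
    FAILED = "Failed"
    COMPLETED = "Completed"


def aggregate_phase(server_phases):
    """Collapse per-server phases into one stage phase.

    Precedence is an assumption of this sketch:
    Failed > all Completed > all Pending > InProgress.
    """
    phases = list(server_phases)
    if not phases:
        # Assumption: a stage with no servers has not started.
        return Phase.PENDING
    if Phase.FAILED in phases:
        return Phase.FAILED
    if all(p == Phase.COMPLETED for p in phases):
        return Phase.COMPLETED
    if all(p == Phase.PENDING for p in phases):
        return Phase.PENDING
    # Anything else means the stage has started but not finished,
    # including the Completed + Pending mix the rules do not cover.
    return Phase.IN_PROGRESS


if __name__ == "__main__":
    print(aggregate_phase([Phase.COMPLETED, Phase.IN_PROGRESS]).value)
    print(aggregate_phase([Phase.FAILED, Phase.IN_PROGRESS]).value)
```

Whichever precedence the proposal adopts, stating it explicitly settles the two ambiguous cases: Failed together with InProgress, and Completed together with Pending.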
Editorial
- Typo: "it also handls the 1-BMC-to-N-servers relationship" → handles
- Typo: username in frontmatter is @nagdeesh, PR author is @nagadeesh-nagaraja
- The drift recovery timeline references bios-fw-v240 at T+6, which does not exist in the pipeline spec example; either add the intermediate hop to the spec or fix the timeline
@xkonni I would keep tainting the server out of scope for this proposal. It does not depend on this use case or controller. Using taints to replace ServerMaintenance is a separate topic.
ironcore-dev/metal-operator#814