Maintenance Pipeline Proposal #20

Open
nagadeesh-nagaraja wants to merge 2 commits into main from maintenancePipeline

@nagadeesh-nagaraja

Signed-off-by: Nagadeesh Nagaraja <nagadeesh.nagaraja@sap.com>

@xkonni left a comment

Contributor

I like the overall architecture and the given examples. Drift recovery sounds well planned. Some comments:

Server taints integration

IEP-0016 should take an explicit dependency on ironcore-dev/metal-operator#878 (Server taints). The MaintenancePipelineRun controller should directly apply a taint to all servers in serverRefs at run start and remove it on completion or failure — no ServerReadinessRule involved, since maintenance is not optional and not gate-controller-driven.

This also resolves the ServerMaintenance churn rough edge called out in the Alternatives section: instead of each child resource independently cycling maintenance windows per stage, the pipeline holds the taint for the entire run duration.

The taint effect should be operator-configurable since the right answer depends on whether workloads are already running:

spec:
  maintenanceTaintEffect: NoClaim  # or: Evict

Add a Dependencies section listing ironcore-dev/metal-operator#878.

Aggregate stage phase is underspecified

The proposal says the aggregate phase for Server-scoped stages "reflects the slowest server" but does not define mixed-state behavior. Suggest stating explicitly:

  • Pending — no servers have started
  • InProgress — at least one server is progressing
  • Failed — at least one server failed
  • Completed — all servers completed
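One way to make the four rules above unambiguous is to fix a precedence order. The sketch below is an assumption, not something the proposal or this review states: it lets `Failed` win over everything else, and it reports `InProgress` for the case the list leaves open, where some servers have completed and the rest have not started. The `Phase` type and the `aggregatePhase` function are hypothetical names.

```go
package main

import "fmt"

// Phase is a hypothetical per-server (and aggregate) stage phase.
type Phase string

const (
	Pending    Phase = "Pending"
	InProgress Phase = "InProgress"
	Failed     Phase = "Failed"
	Completed  Phase = "Completed"
)

// aggregatePhase folds per-server phases into one stage phase.
// Assumed precedence: Failed > InProgress > Completed > Pending.
func aggregatePhase(perServer []Phase) Phase {
	if len(perServer) == 0 {
		return Pending
	}
	var progressing, completed int
	for _, p := range perServer {
		switch p {
		case Failed:
			// Any single failure marks the whole stage as failed.
			return Failed
		case InProgress:
			progressing++
		case Completed:
			completed++
		}
	}
	switch {
	case completed == len(perServer):
		return Completed
	case progressing > 0 || completed > 0:
		// Assumption: a partially completed stage counts as in progress.
		return InProgress
	default:
		return Pending
	}
}

func main() {
	mixed := []Phase{Completed, InProgress, Pending}
	fmt.Println(aggregatePhase(mixed))
}
```

Whichever order the proposal settles on, stating it explicitly removes the ambiguity when one server has failed while others are still progressing.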

Editorial

  • Typo: "it also handls the 1-BMC-to-N-servers relationship" should read "handles"
  • Typo: username in frontmatter is @nagdeesh, PR author is @nagadeesh-nagaraja
  • The drift recovery timeline references bios-fw-v240 at T+6 which does not exist in the pipeline spec example — either add the intermediate hop to the spec or fix the timeline

@nagadeesh-nagaraja

Author

@xkonni I would keep tainting the server out of scope for this proposal. It does not depend on this use case or controller.

The concept of taints replacing ServerMaintenance is, again, a different topic.

3 participants