[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict by stbenjam · Pull Request #583 · openshift/cluster-capi-operator

stbenjam · 2026-06-08T01:26:09Z

The csaupgrade.UpgradeManagedFieldsPatch function generates a JSON patch that includes the object's resourceVersion as an optimistic lock. When the kube-apiserver's CRD controller updates a CRD's status conditions immediately after creation, the resourceVersion is bumped before the field manager migration patch arrives, causing a 409 Conflict.

Without retry logic, this conflict causes the InstallerController to enter an infinite error loop: each reconciliation re-reads from cache (which may still be stale), regenerates the same patch with the old resourceVersion, and hits the same 409.

This was observed as a deterministic bootstrap failure in payload 5.0.0-0.nightly-2026-06-06-100407 where ALL TechPreview jobs across all cloud providers failed to bootstrap. The InstallerController was stuck retrying SSA migration on CAPI CRDs and ValidatingAdmissionPolicies, preventing worker MachineSets from ever being created.

Fix: add a retry loop (max 5 attempts) that re-reads the object from cache after a conflict to pick up the latest resourceVersion before retrying the patch.

openshift-merge-bot · 2026-06-08T01:26:12Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

coderabbitai · 2026-06-08T01:26:17Z

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)

vendor/pkg.package-operator.run/boxcutter/machinery/objects.go is excluded by !**/vendor/**, !vendor/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 928dd8eb-7aa0-44ba-a59d-b7c7b9dec964

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-08T01:26:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign radekmanak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…nager migration patch The JSON patch generated by csaupgrade.UpgradeManagedFieldsPatch includes a resourceVersion optimistic lock that causes a 409 conflict when the kube-apiserver's CRD controller updates the object's status between our read and patch. Since the only available reader is an informer cache that may also be stale, retrying with a cache re-read does not resolve the deadlock. Strip the resourceVersion operation from the patch entirely. This is safe because the managedFields update is idempotent, concurrent status updates do not conflict with our field ownership claim, and boxcutter either just created this object or is adopting it.

…exponential backoff on 409 conflicts Fix v3: The kube-apiserver CRD controller concurrently updates status after CRD creation, bumping resourceVersion. The csaupgrade JSON patch includes the (now-stale) resourceVersion as an optimistic lock, causing a deterministic 409 conflict. Previous fixes failed because: - v1: retried from cache with no delay (cache still stale) - v2: stripped the RV from the patch (etcd own CAS still rejects) This fix adds proper retry with exponential backoff (100ms-5s, max 10 attempts). Between retries, the object is re-read from the informer cache — the backoff delay gives the cache time to sync the latest resourceVersion. The patch is recomputed fresh on each attempt.

mdbooth · 2026-06-08T08:22:25Z

@stbenjam The issue is that every write operation on a CRD sent to the bootstrap KAS suffers a 20s timeout because the bootstrap KAS can't reach webhooks. It will be fixed by openshift/installer#10344.

In the absence of a 20s timeout boxcutter normally wins this race. The addtional retry complexity is not required, as the controller already provides exponential backoff in the (normally) very unlikely event that it loses the race. The 20s timeout is a bug which we already have a fix for, so it's not worth adding code for.

The reason the race exists is that there is no k8s atomic create primitive which populates SSA metadata. The library wants to create an object which it intends to manage with SSA. It wants to return an error if the object unexpectedly already exists and does not match any given adoption criteria.

If the clanker still wants to play with it, boxcutter can be found here: https://github.com/package-operator/boxcutter.

mdbooth · 2026-06-09T08:22:33Z

/close

openshift-ci · 2026-06-09T08:22:50Z

@mdbooth: Closed this PR.

Details

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2026

openshift-ci Bot requested review from damdo and mdbooth June 8, 2026 01:26

stbenjam changed the title ~~[wip] UPSTREAM: <carry>: boxcutter: retry SSA field manager migration on 409 conflict~~ [wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict Jun 8, 2026

redhat-chai-bot force-pushed the 06080114-bot-changes branch from 81a5a13 to d01ccd6 Compare June 8, 2026 02:53

openshift-ci Bot closed this Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict#583

[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict#583
stbenjam wants to merge 2 commits into
openshift:mainfrom
redhat-chai-bot:06080114-bot-changes

stbenjam commented Jun 8, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 8, 2026

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Review skipped

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

mdbooth commented Jun 8, 2026 •

edited

Loading

Uh oh!

mdbooth commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stbenjam commented Jun 8, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 8, 2026

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

openshift-ci Bot commented Jun 8, 2026

Uh oh!

mdbooth commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mdbooth commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

mdbooth commented Jun 8, 2026 •

edited

Loading