Skip to content

[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict#583

Closed
stbenjam wants to merge 2 commits into
openshift:mainfrom
redhat-chai-bot:06080114-bot-changes
Closed

[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict#583
stbenjam wants to merge 2 commits into
openshift:mainfrom
redhat-chai-bot:06080114-bot-changes

Conversation

@stbenjam

@stbenjam stbenjam commented Jun 8, 2026

Copy link
Copy Markdown
Member

The csaupgrade.UpgradeManagedFieldsPatch function generates a JSON patch that includes the object's resourceVersion as an optimistic lock. When the kube-apiserver's CRD controller updates a CRD's status conditions immediately after creation, the resourceVersion is bumped before the field manager migration patch arrives, causing a 409 Conflict.

Without retry logic, this conflict causes the InstallerController to enter an infinite error loop: each reconciliation re-reads from cache (which may still be stale), regenerates the same patch with the old resourceVersion, and hits the same 409.

This was observed as a deterministic bootstrap failure in payload 5.0.0-0.nightly-2026-06-06-100407 where ALL TechPreview jobs across all cloud providers failed to bootstrap. The InstallerController was stuck retrying SSA migration on CAPI CRDs and ValidatingAdmissionPolicies, preventing worker MachineSets from ever being created.

Fix: add a retry loop (max 5 attempts) that re-reads the object from cache after a conflict to pick up the latest resourceVersion before retrying the patch.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • vendor/pkg.package-operator.run/boxcutter/machinery/objects.go is excluded by !**/vendor/**, !vendor/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 928dd8eb-7aa0-44ba-a59d-b7c7b9dec964

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from damdo and mdbooth June 8, 2026 01:26
@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign radekmanak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam stbenjam changed the title [wip] UPSTREAM: <carry>: boxcutter: retry SSA field manager migration on 409 conflict [wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict Jun 8, 2026
…nager migration patch

The JSON patch generated by csaupgrade.UpgradeManagedFieldsPatch includes a
resourceVersion optimistic lock that causes a 409 conflict when the
kube-apiserver's CRD controller updates the object's status between our read
and patch. Since the only available reader is an informer cache that may also
be stale, retrying with a cache re-read does not resolve the deadlock.

Strip the resourceVersion operation from the patch entirely. This is safe
because the managedFields update is idempotent, concurrent status updates
do not conflict with our field ownership claim, and boxcutter either just
created this object or is adopting it.
…exponential backoff on 409 conflicts

Fix v3: The kube-apiserver CRD controller concurrently updates status
after CRD creation, bumping resourceVersion. The csaupgrade JSON patch
includes the (now-stale) resourceVersion as an optimistic lock, causing
a deterministic 409 conflict. Previous fixes failed because:
- v1: retried from cache with no delay (cache still stale)
- v2: stripped the RV from the patch (etcd own CAS still rejects)

This fix adds proper retry with exponential backoff (100ms-5s, max 10
attempts). Between retries, the object is re-read from the informer
cache — the backoff delay gives the cache time to sync the latest
resourceVersion. The patch is recomputed fresh on each attempt.
@mdbooth

mdbooth commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@stbenjam The issue is that every write operation on a CRD sent to the bootstrap KAS suffers a 20s timeout because the bootstrap KAS can't reach webhooks. It will be fixed by openshift/installer#10344.

In the absence of a 20s timeout boxcutter normally wins this race. The addtional retry complexity is not required, as the controller already provides exponential backoff in the (normally) very unlikely event that it loses the race. The 20s timeout is a bug which we already have a fix for, so it's not worth adding code for.

The reason the race exists is that there is no k8s atomic create primitive which populates SSA metadata. The library wants to create an object which it intends to manage with SSA. It wants to return an error if the object unexpectedly already exists and does not match any given adoption criteria.

If the clanker still wants to play with it, boxcutter can be found here: https://github.com/package-operator/boxcutter.

@mdbooth

mdbooth commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

/close

@openshift-ci openshift-ci Bot closed this Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@mdbooth: Closed this PR.

Details

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants