[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict#583
[wip] NO-MERGE: boxcutter: retry SSA field manager migration on 409 conflict#583stbenjam wants to merge 2 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (1)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including ⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
…nager migration patch The JSON patch generated by csaupgrade.UpgradeManagedFieldsPatch includes a resourceVersion optimistic lock that causes a 409 conflict when the kube-apiserver's CRD controller updates the object's status between our read and patch. Since the only available reader is an informer cache that may also be stale, retrying with a cache re-read does not resolve the deadlock. Strip the resourceVersion operation from the patch entirely. This is safe because the managedFields update is idempotent, concurrent status updates do not conflict with our field ownership claim, and boxcutter either just created this object or is adopting it.
81a5a13 to
d01ccd6
Compare
…exponential backoff on 409 conflicts Fix v3: The kube-apiserver CRD controller concurrently updates status after CRD creation, bumping resourceVersion. The csaupgrade JSON patch includes the (now-stale) resourceVersion as an optimistic lock, causing a deterministic 409 conflict. Previous fixes failed because: - v1: retried from cache with no delay (cache still stale) - v2: stripped the RV from the patch (etcd own CAS still rejects) This fix adds proper retry with exponential backoff (100ms-5s, max 10 attempts). Between retries, the object is re-read from the informer cache — the backoff delay gives the cache time to sync the latest resourceVersion. The patch is recomputed fresh on each attempt.
|
@stbenjam The issue is that every write operation on a CRD sent to the bootstrap KAS suffers a 20s timeout because the bootstrap KAS can't reach webhooks. It will be fixed by openshift/installer#10344. In the absence of a 20s timeout boxcutter normally wins this race. The addtional retry complexity is not required, as the controller already provides exponential backoff in the (normally) very unlikely event that it loses the race. The 20s timeout is a bug which we already have a fix for, so it's not worth adding code for. The reason the race exists is that there is no k8s atomic create primitive which populates SSA metadata. The library wants to create an object which it intends to manage with SSA. It wants to return an error if the object unexpectedly already exists and does not match any given adoption criteria. If the clanker still wants to play with it, boxcutter can be found here: https://github.com/package-operator/boxcutter. |
|
/close |
|
@mdbooth: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The csaupgrade.UpgradeManagedFieldsPatch function generates a JSON patch that includes the object's resourceVersion as an optimistic lock. When the kube-apiserver's CRD controller updates a CRD's status conditions immediately after creation, the resourceVersion is bumped before the field manager migration patch arrives, causing a 409 Conflict.
Without retry logic, this conflict causes the InstallerController to enter an infinite error loop: each reconciliation re-reads from cache (which may still be stale), regenerates the same patch with the old resourceVersion, and hits the same 409.
This was observed as a deterministic bootstrap failure in payload 5.0.0-0.nightly-2026-06-06-100407 where ALL TechPreview jobs across all cloud providers failed to bootstrap. The InstallerController was stuck retrying SSA migration on CAPI CRDs and ValidatingAdmissionPolicies, preventing worker MachineSets from ever being created.
Fix: add a retry loop (max 5 attempts) that re-reads the object from cache after a conflict to pick up the latest resourceVersion before retrying the patch.