OCPCLOUD-2664: Aggregate controller statuses in clusteroperator controller#574
OCPCLOUD-2664: Aggregate controller statuses in clusteroperator controller#574mdbooth wants to merge 4 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@mdbooth: This pull request references OCPCLOUD-2664 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughIntroduce a typed Reason enum with stringer, add code generation in the build, refactor operatorstatus to aggregate/merge/write ClusterOperator conditions and optionally update operator versions, export controller ResultGenerator instances, and update controllers and tests to the new status flow. ChangesOperator Status System Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 3 warnings)
✅ Passed checks (11 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controllers/revision/revision_controller.go (1)
48-57:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFix build:
ResultGeneratorcannot be aconst(it’s a function call)
ResultGeneratoris declared inside aconstblock asoperatorstatus.ControllerResultGenerator(controllerName), which is not a compile-time constant, so the build will fail. Switch it to avar(same issue appears inpkg/controllers/installer/installer_controller.gotoo).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/revision/revision_controller.go` around lines 48 - 57, The declaration of ResultGenerator inside the const block is invalid because operatorstatus.ControllerResultGenerator(controllerName) is not a compile-time constant; change ResultGenerator from a const to a var (move it out of the const block and declare: var ResultGenerator = operatorstatus.ControllerResultGenerator(controllerName)) so the function call is allowed at init time; apply the same fix for the identical declaration in pkg/controllers/installer/installer_controller.go referencing the Installer controller name.
🧹 Nitpick comments (1)
pkg/controllers/revision/revision_controller_test.go (1)
353-407: ⚡ Quick winAlign the new Ginkgo specs with the repo’s test style.
These new
It(...)titles should use theshould ...form, and the phase comments are better expressed withBy(...)so the spec output stays structured.As per coding guidelines, "Use descriptive test names in 'should...' format in Ginkgo tests" and "Use
By()for test steps to document test phases in Ginkgo tests".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/revision/revision_controller_test.go` around lines 353 - 407, Rename the two Ginkgo It specs to use "should ..." phrasing (e.g., change It("sets operator version when current revision matches latest", ...) to It("should set operator version when current revision matches latest", ...) and change the other to "should correct stale operator version when current revision matches latest"). Replace inline phase comments with By(...) calls to document steps: convert comments like "// BeforeEach created rev1. Set CurrentRevision to match." and "// Seed an incorrect operator version", "// Trigger a reconcile", "// Wait for the version to appear"/"// Wait for the version to be corrected" into By("...") calls placed immediately before the code that performs each phase (use the existing identifiers clusterAPI, co, coKey, operatorstatus.OperatorVersionKey, kWithCtx, cl, metav1.SetMetaDataAnnotation to find the blocks).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/capi-operator/main.go`:
- Around line 124-130: The code reads infra.Status.PlatformStatus.Type via
util.IsCAPIEnabledForPlatform(...) without guarding that PlatformStatus may be
nil, which can panic at startup; before computing supportedPlatform call the
existing validation/guard that ensures infra.Status.PlatformStatus is non-nil
(or explicitly check infra.Status.PlatformStatus != nil) and only then pass
PlatformStatus.Type to util.IsCAPIEnabledForPlatform, otherwise treat the
platform as unsupported; update the sequence around the
ClusterOperatorController construction (the IsUnsupportedPlatform value) so
supportedPlatform is derived after the nil-check/validation.
In `@pkg/controllers/clusteroperator/clusteroperator_controller.go`:
- Around line 196-197: The user-facing Available message currently says "Cluster
CAPI Operator"; update it to use the user-facing term "Cluster API" instead of
"CAPI". Locate the availableMessage string construction (availableMessage :=
fmt.Sprintf("Cluster CAPI Operator is available at %s", r.ReleaseVersion)) and
change the literal to "Cluster API Operator is available at %s" so the appended
condition (condition(configv1.OperatorAvailable, configv1.ConditionTrue,
operatorstatus.ReasonAsExpected, availableMessage)) reports the correct
user-facing wording.
In `@pkg/operatorstatus/controller_status.go`:
- Around line 340-344: The code appends a new operator version with
status.WithVersions which can duplicate an existing entry; before calling
WithVersions, filter the current status' Versions to remove any entry whose Name
== OperatorVersionKey, then call WithVersions only once with the filtered slice
plus the new OperandVersion configured with OperatorVersionKey and
r.operatorVersion; reference the existing symbols status, r.operatorVersion,
WithVersions, and OperatorVersionKey when implementing this replacement so the
old operator entry is removed before adding the updated one.
---
Outside diff comments:
In `@pkg/controllers/revision/revision_controller.go`:
- Around line 48-57: The declaration of ResultGenerator inside the const block
is invalid because operatorstatus.ControllerResultGenerator(controllerName) is
not a compile-time constant; change ResultGenerator from a const to a var (move
it out of the const block and declare: var ResultGenerator =
operatorstatus.ControllerResultGenerator(controllerName)) so the function call
is allowed at init time; apply the same fix for the identical declaration in
pkg/controllers/installer/installer_controller.go referencing the Installer
controller name.
---
Nitpick comments:
In `@pkg/controllers/revision/revision_controller_test.go`:
- Around line 353-407: Rename the two Ginkgo It specs to use "should ..."
phrasing (e.g., change It("sets operator version when current revision matches
latest", ...) to It("should set operator version when current revision matches
latest", ...) and change the other to "should correct stale operator version
when current revision matches latest"). Replace inline phase comments with
By(...) calls to document steps: convert comments like "// BeforeEach created
rev1. Set CurrentRevision to match." and "// Seed an incorrect operator
version", "// Trigger a reconcile", "// Wait for the version to appear"/"// Wait
for the version to be corrected" into By("...") calls placed immediately before
the code that performs each phase (use the existing identifiers clusterAPI, co,
coKey, operatorstatus.OperatorVersionKey, kWithCtx, cl,
metav1.SetMetaDataAnnotation to find the blocks).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0ee08c70-9976-436d-a70c-d7173fe61749
⛔ Files ignored due to path filters (2)
vendor/golang.org/x/tools/cmd/stringer/stringer.gois excluded by!**/vendor/**,!vendor/**vendor/modules.txtis excluded by!**/vendor/**,!vendor/**
📒 Files selected for processing (17)
Makefilecmd/capi-operator/main.gogo.modpkg/controllers/clusteroperator/clusteroperator_controller.gopkg/controllers/clusteroperator/clusteroperator_controller_test.gopkg/controllers/clusteroperator/suite_test.gopkg/controllers/common_consts.gopkg/controllers/installer/installer_controller.gopkg/controllers/revision/revision_controller.gopkg/controllers/revision/revision_controller_test.gopkg/controllers/secretsync/secret_sync_controller.gopkg/operatorstatus/controller_status.gopkg/operatorstatus/controller_status_test.gopkg/operatorstatus/operator_status.gopkg/operatorstatus/reason_string.gopkg/operatorstatus/watch_predicates.gopkg/test/conditions.go
💤 Files with no reviewable changes (1)
- pkg/controllers/common_consts.go
Add WithUpdateOperatorVersion() to ReconcileResult, allowing controllers to optionally update the operator version in the ClusterOperator status when writing their conditions. Also switches controller_status tests from fake client to envtest for accurate SSA field ownership testing.
Convert Reason from string constants to an ordered iota type, enabling severity-based comparison for condition aggregation. Replaces ReasonSyncFailed with the standardised ReasonEphemeralError.
ClusterOperator version is now written by the revision controller instead of the clusteroperator controller.
aebb6d0 to
c238311
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controllers/clusteroperator/clusteroperator_controller.go`:
- Around line 221-224: The controller currently only uses
operatorstatus.ClusterOperatorConditionsChanged() so Reconcile won't run for
status.Versions-only updates (it owns operatorstatus.OperatorVersionKey),
causing version drift; update the controller builder (the
For(&configv1.ClusterOperator{})/Named(controllerName) chain) to include an
additional predicate that also enqueues on status.Versions changes or when the
OperatorVersionKey value differs (e.g., add a predicate.Funcs with an UpdateFunc
that compares old.Status.Versions vs new.Status.Versions and/or extracts
operatorstatus.OperatorVersionKey to trigger a requeue), so Reconcile runs for
version-only updates as well.
- Around line 70-75: The current logic only calls writeStatus when
operatorstatus.MergeConditions reports changes, which misses updates to
co.Status.Versions; add a small helper (e.g., getOperandVersionByName or
readOperandVersion) to lookup the current operand entry in co.Status.Versions by
name and compare its Version to the desired operand version, and then update the
reconciliation block so that if either MergeConditions(updated)==true OR the
operand version comparison shows a delta you call writeStatus(ctx, co,
conditions) to persist the new Versions entry; reference MergeConditions,
writeStatus, and co.Status.Versions when locating where to add the helper and
the extra comparison.
- Around line 183-190: Filter subcontrollerStatuses to only those whose
progressing condition is true before selecting the max reason: build a slice
(e.g., progressingStatuses) containing entries where entry.progressing.status ==
configv1.ConditionTrue, then use slices.MaxFunc(progressingStatuses, ...) to
compute priorityProgressing; only append the OperatorProgressing condition using
priorityProgressing when progressingStatuses is non-empty (keep existing
isProgressing check) so non-progressing controllers' reasons are not considered;
update references to priorityProgressing accordingly and ensure no
nil/empty-slice access occurs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4661e31d-64e4-47e6-9ddf-49d03b14f257
⛔ Files ignored due to path filters (2)
vendor/golang.org/x/tools/cmd/stringer/stringer.gois excluded by!**/vendor/**,!vendor/**vendor/modules.txtis excluded by!**/vendor/**,!vendor/**
📒 Files selected for processing (17)
Makefilecmd/capi-operator/main.gogo.modpkg/controllers/clusteroperator/clusteroperator_controller.gopkg/controllers/clusteroperator/clusteroperator_controller_test.gopkg/controllers/clusteroperator/suite_test.gopkg/controllers/common_consts.gopkg/controllers/installer/installer_controller.gopkg/controllers/revision/revision_controller.gopkg/controllers/revision/revision_controller_test.gopkg/controllers/secretsync/secret_sync_controller.gopkg/operatorstatus/controller_status.gopkg/operatorstatus/controller_status_test.gopkg/operatorstatus/operator_status.gopkg/operatorstatus/reason_string.gopkg/operatorstatus/watch_predicates.gopkg/test/conditions.go
💤 Files with no reviewable changes (1)
- pkg/controllers/common_consts.go
✅ Files skipped from review due to trivial changes (2)
- pkg/operatorstatus/reason_string.go
- go.mod
🚧 Files skipped from review as they are similar to previous changes (12)
- pkg/operatorstatus/watch_predicates.go
- cmd/capi-operator/main.go
- pkg/test/conditions.go
- pkg/controllers/secretsync/secret_sync_controller.go
- pkg/controllers/revision/revision_controller_test.go
- pkg/controllers/clusteroperator/suite_test.go
- pkg/controllers/revision/revision_controller.go
- pkg/controllers/installer/installer_controller.go
- pkg/operatorstatus/operator_status.go
- Makefile
- pkg/operatorstatus/controller_status_test.go
- pkg/operatorstatus/controller_status.go
c238311 to
a8ab8ad
Compare
|
This unit test job took 38 minutes: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/574/pull-ci-openshift-cluster-capi-operator-main-unit/2061871187276861440 Usual based on job history is 10-20 minutes.
Update: Re-run took 14 minutes. Looks like a one-off. No investigation required. |
|
/pipeline auto |
|
Pipeline controller notification The |
|
/pipeline run |
a8ab8ad to
098f11e
Compare
|
Pipeline controller died: force pushed a no-op commit to retrigger all jobs. |
|
Scheduling tests matching the |
2 similar comments
|
Scheduling tests matching the |
|
Scheduling tests matching the |
|
Flake /test e2e-openstack-capi-techpreview |
| priorityProgressing := slices.MaxFunc(subcontrollerStatuses, func(a, b subcontrollerStatus) int { | ||
| return cmp.Compare(a.progressing.reason, b.progressing.reason) | ||
| }) | ||
|
|
||
| if isProgressing { | ||
| // Report the highest priority progressing reason | ||
| newConditions = append(newConditions, condition(configv1.OperatorProgressing, configv1.ConditionTrue, priorityProgressing.progressing.reason, fmt.Sprintf("%s: %s", priorityProgressing.controller, priorityProgressing.progressing.message))) |
There was a problem hiding this comment.
| priorityProgressing := slices.MaxFunc(subcontrollerStatuses, func(a, b subcontrollerStatus) int { | |
| return cmp.Compare(a.progressing.reason, b.progressing.reason) | |
| }) | |
| if isProgressing { | |
| // Report the highest priority progressing reason | |
| newConditions = append(newConditions, condition(configv1.OperatorProgressing, configv1.ConditionTrue, priorityProgressing.progressing.reason, fmt.Sprintf("%s: %s", priorityProgressing.controller, priorityProgressing.progressing.message))) | |
| if isProgressing { | |
| priorityProgressing := slices.MaxFunc(subcontrollerStatuses, func(a, b subcontrollerStatus) int { | |
| return cmp.Compare(a.progressing.reason, b.progressing.reason) | |
| }) | |
| // Report the highest priority progressing reason | |
| newConditions = append(newConditions, condition(configv1.OperatorProgressing, configv1.ConditionTrue, priorityProgressing.progressing.reason, fmt.Sprintf("%s: %s", priorityProgressing.controller, priorityProgressing.progressing.message))) |
There was a problem hiding this comment.
I had this in an earlier version, but it turns out it's not correct. The reason is that we want to report the highest priority Progressing reason. The top priority is actually a non-retryable error, and that's always accompanied by Progressing=False.
However, now that I look at it again I'm clearly missing some test cases because the progressing=False still has a hardcoded reason 🤦 I need to fix this.
I'll also address the nit.
|
/lgtm |
|
|
Rewrite ClusterOperatorController to aggregate per-controller sub-conditions (Available/Progressing) into top-level ClusterOperator conditions.
098f11e to
7189785
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/controllers/clusteroperator/clusteroperator_controller_test.go (1)
59-69: ⚡ Quick winSeed and assert
Status.Versionsin the supported-platform table.These cases never populate the revision-owned
operatorversion, so they don't cover the key regression risk here: this controller patchesstatusdirectly and must not clobber an existing release version while updating conditions. Please seedStatus.Versionsin the setup and assert it survives each reconciliation.Suggested test adjustment
capiClusterOperator = &configv1.ClusterOperator{ ObjectMeta: metav1.ObjectMeta{ Name: controllers.ClusterOperatorName, }, } Expect(cl.Create(ctx, capiClusterOperator)).To(Succeed()) DeferCleanup(func(ctx context.Context) { testutils.CleanupResources(Default, ctx, testEnv.Config, cl, "", &configv1.ClusterOperator{}) }) + capiClusterOperator.Status.Versions = []configv1.OperandVersion{{ + Name: operatorstatus.OperatorVersionKey, + Version: "previous-release-version", + }} Expect(cl.Status().Update(ctx, capiClusterOperator)).To(Succeed()) }) @@ Eventually(co). WithContext(ctx). WithTimeout(defaultEventuallyTimeout). Should(SatisfyAll( HaveField("Status.Conditions", SatisfyAll( expectedAvailable, expectedProgressing, test.HaveCondition(configv1.OperatorDegraded).WithStatus(configv1.ConditionFalse), test.HaveCondition(configv1.OperatorUpgradeable).WithStatus(configv1.ConditionTrue), )), + HaveField("Status.Versions", ContainElement(SatisfyAll( + HaveField("Name", Equal(operatorstatus.OperatorVersionKey)), + HaveField("Version", Equal("previous-release-version")), + ))), ))Based on learnings "Add appropriate unit tests for all code changes in Gomega".
Also applies to: 80-92
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/clusteroperator/clusteroperator_controller_test.go` around lines 59 - 69, The test creates capiClusterOperator but never seeds Status.Versions, so reconciliation may clobber an existing operator release; update the setup for capiClusterOperator (the variable used in these It/By blocks and controllers.ClusterOperatorName) to populate Status.Versions with an entry for the "operator" (name matching your controller's expected name and a fake Version string), then after each reconciliation assert that capiClusterOperator.Status.Versions still contains that operator entry (i.e., its Version value was not overwritten) — apply the same seeding and post-reconcile assertion to the other test block referenced (the similar case at lines 80-92).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/controllers/clusteroperator/clusteroperator_controller_test.go`:
- Around line 59-69: The test creates capiClusterOperator but never seeds
Status.Versions, so reconciliation may clobber an existing operator release;
update the setup for capiClusterOperator (the variable used in these It/By
blocks and controllers.ClusterOperatorName) to populate Status.Versions with an
entry for the "operator" (name matching your controller's expected name and a
fake Version string), then after each reconciliation assert that
capiClusterOperator.Status.Versions still contains that operator entry (i.e.,
its Version value was not overwritten) — apply the same seeding and
post-reconcile assertion to the other test block referenced (the similar case at
lines 80-92).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b7458e65-ffd0-4ca4-bc61-0ca8bd286a40
📒 Files selected for processing (9)
cmd/capi-operator/main.gopkg/controllers/clusteroperator/clusteroperator_controller.gopkg/controllers/clusteroperator/clusteroperator_controller_test.gopkg/controllers/clusteroperator/suite_test.gopkg/controllers/installer/installer_controller.gopkg/controllers/revision/revision_controller.gopkg/operatorstatus/controller_status.gopkg/operatorstatus/controller_status_test.gopkg/operatorstatus/watch_predicates.go
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/controllers/clusteroperator/suite_test.go
- pkg/controllers/installer/installer_controller.go
- pkg/controllers/revision/revision_controller.go
- pkg/operatorstatus/controller_status.go
- pkg/operatorstatus/controller_status_test.go
|
/lgtm |
|
/test build |
|
/pipeline required |
|
Scheduling tests matching the |
|
/testwith openshift/cluster-capi-operator/main/e2e-aws-ovn-techpreview-upgrade #576 |
|
@mdbooth: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The revision and installer controllers current write 'SubConditions' to the ClusterOperator prefixed
RevisionControllerandInstallerControllerrespectively, but these conditions are not reflected in the primary conditions published on thecluster-apiClusterOperator object.Meanwhile the following controllers use
ClusterOperatorStatusClient, which writes directly to the ClusterOperator's primary conditions:This means that the published primary conditions are currently set non-deterministically by whichever of the above 4 controllers reconciled most recently.
This change if the first of a series of PRs to fix this. It rewrites the clusteroperator controller to aggregate the conditions of subcontrollers. However, the situation will not be fixed until all 4 of the above mentioned controllers have also been updated to use
operatorstatus.ReconcileResultfor condition writing and integrated into the aggregating controller.Additionally, this change also ensures that we don't publish an updated operator version until the CAPI rollout is complete. We do this by moving the responsibility for writing operator version to the revision controller.
I recommend reviewing this change by commit.
Summary by CodeRabbit
Chores
Refactor
Tests
Bug Fixes
Bug Fixes