no-jira: checking operator ready status comprehensive #80136

Open
bmeng wants to merge 1 commit into openshift:main from bmeng:health-check

@bmeng
Contributor

@bmeng bmeng commented Jun 5, 2026

Check operators ready status with more checkpoints

Summary by CodeRabbit

This PR enhances the ROSA cluster operator readiness check step in the OpenShift CI infrastructure to perform comprehensive multi-stage validation of operator health status.

What's changing:

The rosa-cluster-wait-ready-operators step, which is used to verify ROSA clusters are ready for testing, now implements a more thorough readiness verification workflow:

Previous behavior: The step only checked if cluster operators finished progressing (Progressing=false) and immediately branched to error handling if a timeout occurred.

New behavior: The step now validates operator health through sequential checks:

  1. Wait for all operators to complete progressing (Progressing=false, 60m timeout)
  2. Verify all operators are Available=true (10m timeout)
  3. Verify no operators are Degraded=true (10m timeout)
  4. Confirm the ClusterVersion object is Available
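
The four stages above can be sketched roughly as follows. This is a minimal illustration, not the script from the PR: the function names `wait_for_condition` and `check_operators_ready` and the exact failure messages are assumptions, and it stops at the first failing stage. Only the log file names, the timeouts, and the `check_failed` flag come from the summary.

```shell
#!/bin/bash
# Hypothetical helper: wait for one condition on all cluster operators and
# capture the output of `oc wait` in a dedicated log file.
wait_for_condition() {
  local condition="$1" timeout="$2" logfile="$3"
  oc wait clusteroperators --all \
    --for="condition=${condition}" \
    --timeout="${timeout}" >"${logfile}" 2>&1
}

# Hypothetical driver: run the stages in order and remember why a stage failed.
check_operators_ready() {
  local log_dir="${1:-.}"
  local check_failed=""
  local cv_available=""

  if ! wait_for_condition "Progressing=false" "60m" "${log_dir}/co_status.log"; then
    check_failed="operators did not finish progressing within 60m"
  elif ! wait_for_condition "Available=true" "10m" "${log_dir}/co_available.log"; then
    check_failed="not all operators are Available"
  elif ! wait_for_condition "Degraded=false" "10m" "${log_dir}/co_degraded.log"; then
    check_failed="one or more operators are Degraded"
  else
    # Read the Available condition of the ClusterVersion object via jsonpath.
    cv_available="$(oc get clusterversion version \
      -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')" || cv_available=""
    if [ "${cv_available}" != "True" ]; then
      check_failed="ClusterVersion is not Available"
    fi
  fi

  if [ -n "${check_failed}" ]; then
    echo "ERROR: ${check_failed}" >&2
    return 1
  fi
  echo "All readiness checks passed"
}
```

A caller would run `check_operators_ready "${ARTIFACT_DIR}"` (the directory variable is an assumption) and branch on the return code to decide whether to notify and exit.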

Key improvements:

  • Uses a check_failed flag to track failures across all stages rather than handling only the progressing timeout
  • Each check has its own dedicated log file for easier debugging (co_status.log, co_available.log, co_degraded.log)
  • Records the elapsed time for operators to finish progressing
  • Provides more specific error messages indicating which readiness condition failed
  • Enhanced Slack notifications (when configured) include the specific failure reason, allowing faster incident response
  • Falls through gracefully—if any stage fails, the script halts and (if Slack hooks are configured) posts a detailed message and sleeps 3 hours for debugging

The step documentation in the ref.yaml file has also been updated to clearly describe these multi-stage validation requirements for ROSA clusters.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2026
@openshift-ci-robot
Contributor

@bmeng: This pull request explicitly references no jira issue.

Details

In response to this:

Check operators ready status with more checkpoints

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026

Walkthrough

The PR enhances the ROSA cluster readiness check script with a multi-stage verification pipeline. It introduces a check_failed flag to consolidate error handling across operator progression, availability, degradation, and ClusterVersion availability checks. On failure, the script logs cluster state and posts an updated Slack notification with the specific failure reason before sleeping and exiting.

Changes

Cluster Readiness Verification

| Layer / File(s) | Summary |
| --- | --- |
| Multi-stage readiness check implementation: ci-operator/step-registry/rosa/cluster/wait-ready/operators/rosa-cluster-wait-ready-operators-commands.sh | Script introduces check_failed flag initialization and replaces the single "progressing timeout" branch with a unified pipeline that sequentially validates operator progression, checks Available=true and Degraded=false conditions with timeout detection, verifies ClusterVersion availability via jsonpath, and on any failure prints an error, dumps cluster state to logs, posts an updated Slack message containing the computed failure reason, sleeps 3 hours, and exits 1. |
| Step documentation: ci-operator/step-registry/rosa/cluster/wait-ready/operators/rosa-cluster-wait-ready-operators-ref.yaml | Updated step description to explicitly document the multi-stage verification including operator progression completion, operator availability and degradation checks, and ClusterVersion resource availability validation. |

🎯 2 (Simple) | ⏱️ ~10 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| No-Sensitive-Data-In-Logs | ❌ Error | Script exposes webhook URL via curl command at line 103, where slack_hook_url is passed as a command argument and may contain auth tokens accessible in logs/history. | Avoid exposing the webhook URL as a curl command argument. Use input redirection or environment variables to pass the URL safely without exposing it in process listings. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title mentions 'checking operator ready status comprehensive' which directly aligns with the main change: adding more comprehensive operator readiness verification including Available, Degraded, and ClusterVersion checks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR contains only shell script and YAML configuration files, not Ginkgo Go tests. Custom check for Ginkgo test name stability is not applicable. |
| Test Structure And Quality | ✅ Passed | PR contains no Ginkgo test code. Changes are limited to shell scripts and YAML configuration files for OpenShift CI cluster operator readiness checks. The check does not apply to this PR. |
| Microshift Test Compatibility | ✅ Passed | This PR contains only a bash script and YAML configuration file for cluster operator health checks. No Ginkgo e2e tests are added, so the MicroShift compatibility check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies shell scripts and YAML CI configuration files, not Ginkgo e2e tests. The custom check applies only when new e2e tests are added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes are CI diagnostic scripts checking cluster operator readiness, not deployment manifests, operator code, or Kubernetes controllers. No scheduling constraints introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR contains only shell scripts and YAML files, no Go code. OTE Binary Stdout Contract check applies only to Go test binaries, making it inapplicable here. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies CI/CD shell scripts and YAML configuration files, not Ginkgo e2e tests. Check only applies to new e2e tests, making it inapplicable here. |
| No-Weak-Crypto | ✅ Passed | PR modifies cluster health-check scripts with no cryptographic operations; no weak crypto algorithms, custom crypto implementations, or non-constant-time secret comparisons detected. |
| Container-Privileges | ✅ Passed | PR modifies CI/CD step registry scripts and metadata (not K8s manifests); no container privilege escalation keywords found in any modified files. |
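
One way to address the No-Sensitive-Data-In-Logs finding is to hand the webhook URL to curl through a config stream on stdin (`--config -`) instead of as a command-line argument. The sketch below is an assumption about how that could look, not the change made in the PR: the variable name `SLACK_HOOK_URL` and the functions `build_curl_config` and `post_to_slack` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: emit a curl config on stdout. printf is a shell builtin,
# so the URL never appears in the argument list of an external process.
build_curl_config() {
  printf 'url = "%s"\n' "$1"
  printf 'header = "Content-Type: application/json"\n'
}

# Hypothetical helper: post a JSON payload to the webhook named in SLACK_HOOK_URL.
post_to_slack() {
  local payload="$1"
  if [ -z "${SLACK_HOOK_URL:-}" ]; then
    echo "SLACK_HOOK_URL is not set; skipping notification" >&2
    return 1
  fi
  # curl reads the URL and header from stdin; only the payload is an argument.
  build_curl_config "${SLACK_HOOK_URL}" |
    curl --silent --show-error --config - --data "${payload}"
}
```

The payload still travels as an argument here, so it should not contain secrets. Running the script with `set -x` would also echo the URL into the log, so tracing would need to stay off around this call.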

@openshift-ci openshift-ci Bot requested review from olucasfreitas and svetsa-rh June 5, 2026 07:17
@openshift-ci
Contributor

openshift-ci Bot commented Jun 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bmeng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2026
@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@bmeng: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openshift-aws-load-balancer-operator-main-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-release-1.0-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-release-1.2-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-release-1.1-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-CSPI-QE-MSI-single-cluster-smoke-v4.14-single-cluster-rosa-4-14-candidate-smoke | CSPI-QE/MSI | presubmit | Registry content changed |
| pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-classic-smoke | openshift-online/rosa-e2e | presubmit | Registry content changed |
| pull-ci-openshift-online-rosa-e2e-main-e2e-rosa-hcp-smoke | openshift-online/rosa-e2e | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.17-nightly-krkn-tests-rosa-infra | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.19-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.19-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.21-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.21-nightly-krkn-tests-rosa-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa-hog | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.15-nightly-krkn-tests-rosa-infra | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.18-nightly-krkn-rosa-hcp-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-tests-rosa-hcp | redhat-chaos/prow-scripts | presubmit | Registry content changed |
| pull-ci-redhat-chaos-prow-scripts-main-rosa-4.20-nightly-krkn-rosa-hcp-node | redhat-chaos/prow-scripts | presubmit | Registry content changed |

A total of 303 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
ci-operator/step-registry/rosa/cluster/wait-ready/operators/rosa-cluster-wait-ready-operators-commands.sh (1)

59-59: 💤 Low value

Simplify arithmetic expansion syntax.

The ${} expansions are unnecessary within $(( )) arithmetic context. Shellcheck flags this as style issue SC2004.

📝 Suggested simplification
-  record_cluster "timers" "co_wait_time" $(( "${end_time}" - "${start_time}" ))
+  record_cluster "timers" "co_wait_time" $(( end_time - start_time ))
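
For context, a small sketch of the elapsed-time computation in the style the suggestion recommends. The helper name `elapsed_seconds` is hypothetical; inside `$(( ))` shell variables can be referenced by bare name, which is what SC2004 points out.

```shell
#!/bin/bash
# Hypothetical helper: seconds between two epoch timestamps.
elapsed_seconds() {
  local start_time="$1" end_time="$2"
  # No ${} needed: arithmetic context expands variable names directly.
  echo $(( end_time - start_time ))
}

start_time="$(date +%s)"
# ... the wait for cluster operators would run here ...
end_time="$(date +%s)"
echo "Waited $(elapsed_seconds "${start_time}" "${end_time}") seconds"
```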
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/rosa/cluster/wait-ready/operators/rosa-cluster-wait-ready-operators-commands.sh`
at line 59, Replace the unnecessary ${} expansions inside the arithmetic context
in the echo statement: update the expression in the echo call that prints the
duration (the line containing "All cluster operators done progressing after $((
${end_time} - ${start_time} )) seconds") to use $(( end_time - start_time ))
instead of $(( ${end_time} - ${start_time} )); keep the rest of the message
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dc6e9544-cdf8-4341-8ef8-2970bb791fcf

📥 Commits

Reviewing files that changed from the base of the PR and between 13af847 and 0e6b8ac.

📒 Files selected for processing (2)
  • ci-operator/step-registry/rosa/cluster/wait-ready/operators/rosa-cluster-wait-ready-operators-commands.sh
  • ci-operator/step-registry/rosa/cluster/wait-ready/operators/rosa-cluster-wait-ready-operators-ref.yaml

@bmeng
Contributor Author

bmeng commented Jun 5, 2026

/pj-rehearse periodic-ci-openshift-online-rosa-e2e-main-periodics-rosa-classic-sts-e2e-stable-4-21

@openshift-merge-bot
Contributor

@bmeng: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci Bot commented Jun 5, 2026

@bmeng: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dustman9000
Member

Hey Bo, I looked at how other Prow jobs handle this. The standard openshift-e2e-test step used by all OCP conformance, e2e, and upgrade jobs uses a simple three-condition loop:

oc wait clusteroperators --all --for=condition=Available=True
oc wait clusteroperators --all --for=condition=Progressing=False
oc wait clusteroperators --all --for=condition=Degraded=False

No DS/RS checks, no certificate checks, no CVO check. That's the accepted pattern across all of Prow. DS/RS/certificate issues surface through CO conditions since the owning CO reports Degraded when those sub-resources are unhealthy.

I think adding the Available and Degraded checks (to match the OCP e2e pattern) makes sense, but the CVO check is redundant since CVO availability is a prerequisite for COs reporting correctly.

The CAMO PR (#557) already merged and is validated on staging using the same three CO conditions. On ci-rosa-s-4ao6, it correctly detected all 34 COs stable and configured PD while the osd-cluster-ready Job was still stuck 2h+ later.

I'd suggest simplifying this PR to just add the Available and Degraded checks to match the OCP e2e standard, and skip the CVO check.
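
A minimal sketch of what the simplified three-condition check could look like, assuming a single shared timeout. The function name `wait_for_operators` and the `10m` default are illustrative assumptions, not taken from the openshift-e2e-test step or from this PR.

```shell
#!/bin/bash
# Hypothetical helper: wait for the three ClusterOperator conditions in turn
# and stop at the first one that is not reached within the timeout.
wait_for_operators() {
  local timeout="${1:-10m}"
  local condition
  for condition in Available=True Progressing=False Degraded=False; do
    oc wait clusteroperators --all \
      --for="condition=${condition}" \
      --timeout="${timeout}" || return 1
  done
}
```

Calling `wait_for_operators 60m` would give each condition up to 60 minutes; a non-zero return code signals that at least one operator did not reach the expected state.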
