Add qe-agent post-step to rerun, debug and fix distributed tracing test failures#80176
IshwarKanse wants to merge 1 commit into
Assisted by Claude Code
Walkthrough
This PR unifies test runner image targets across OpenShift CI operator configs from product-specific names to
Changes
- QE Agent Infrastructure and Integration
- Unified obs-tests-runner Image Target across CI Configs
Estimated code review effort: 2 (Simple), ~12 minutes
Pre-merge checks: 15 passed
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: IshwarKanse
[REHEARSALNOTIFIER]
A total of 108 jobs have been affected by this change.
@IshwarKanse: all tests passed!
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-commands.sh`:
- Around line 45-47: The SKILL_URL and SKILL_CONTENT usage currently fetches
from the mutable main branch; change SKILL_URL to point to an immutable commit
SHA or tag (not main) and add retrieval+verification of a checksum before using
SKILL_CONTENT (e.g., fetch a known-good checksum file or embed expected hash and
verify curl output), failing fast if verification fails; update the code paths
that use SKILL_CONTENT (the variables SKILL_URL and SKILL_CONTENT and the
subsequent non-interactive Claude invocation) to only proceed after checksum
validation to ensure the qe-agent skill is pinned and integrity-checked.
- Line 47: The SKILL_CONTENT assignment currently uses a plain curl call that
can hang; update the SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true line to
use curl with explicit timeouts and retries (reference SKILL_CONTENT and
SKILL_URL): add --connect-timeout (e.g. 5s), --max-time (e.g. 15s), and retry
flags such as --retry 3 --retry-delay 2 --retry-connrefused while keeping -s and
-f and preserving the trailing || true so the step won’t fail outright; ensure
the new flags are documented in an inline comment near that assignment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: e628dcd4-da2e-4827-adf1-a0c2ac9d1392
📒 Files selected for processing (59)
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-main__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-0.4__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-1.0__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.12__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.15__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.19__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.22__upstream-amd64-aws.yaml
- ci-operator/config/openshift/distributed-tracing-qe/openshift-distributed-tracing-qe-main__ocp-4.16-disconnected.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.12-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.14-arm-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.14-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.16-ibm-z-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.17-fips-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.17-ibm-p-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.19-downstream.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.19-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.20-downstream.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.20-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.21-downstream.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.21-stage.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__upstream-ocp-4.12-amd64.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__upstream-ocp-4.21-amd64.yaml
- ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__upstream-ocp-4.22-amd64.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.12-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.14-arm-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.14-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.16-ibm-z-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.17-fips-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.17-ibm-p-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.19-downstream.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.19-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.20-downstream.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.20-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.21-downstream.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.21-stage.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__upstream-ocp-4.12-amd64.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__upstream-ocp-4.21-amd64.yaml
- ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__upstream-ocp-4.22-amd64.yaml
- ci-operator/step-registry/distributed-tracing/tests/disconnected/distributed-tracing-tests-disconnected-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/disconnected/distributed-tracing-tests-disconnected-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/opentelemetry/downstream/distributed-tracing-tests-opentelemetry-downstream-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/opentelemetry/downstream/distributed-tracing-tests-opentelemetry-downstream-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/opentelemetry/stage/distributed-tracing-tests-opentelemetry-stage-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/opentelemetry/stage/distributed-tracing-tests-opentelemetry-stage-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/opentelemetry/upstream/distributed-tracing-tests-opentelemetry-upstream-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/opentelemetry/upstream/distributed-tracing-tests-opentelemetry-upstream-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/qe-agent/OWNERS
- ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-ref.metadata.json
- ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/tempo/downstream/distributed-tracing-tests-tempo-downstream-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/tempo/downstream/distributed-tracing-tests-tempo-downstream-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/tempo/stage/distributed-tracing-tests-tempo-stage-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/tempo/stage/distributed-tracing-tests-tempo-stage-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/tempo/upstream/distributed-tracing-tests-tempo-upstream-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/tempo/upstream/distributed-tracing-tests-tempo-upstream-ref.yaml
- ci-operator/step-registry/distributed-tracing/tests/tracing-ui/integration/distributed-tracing-tests-tracing-ui-integration-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/tracing-ui/upstream/distributed-tracing-tests-tracing-ui-upstream-commands.sh
- ci-operator/step-registry/distributed-tracing/tests/tracing-ui/upstream/distributed-tracing-tests-tracing-ui-upstream-ref.yaml
SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/main/plugins/qe-agent/skills/SKILL.md"
echo "Fetching qe-agent skill from ${SKILL_URL}..."
SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
Pin the qe-agent skill source to an immutable revision.
Line 45 fetches prompt instructions from a mutable main branch, while Lines 61-63 execute Claude in non-interactive mode with dangerous permissions and shell/write tools. This creates a mutable remote-execution control plane for CI behavior. Pin to a commit/tag (and ideally verify checksum) before execution.
Suggested hardening diff
-SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/main/plugins/qe-agent/skills/SKILL.md"
+SKILL_REF="${QE_AGENT_SKILL_REF:?QE_AGENT_SKILL_REF must be set to an immutable commit SHA}"
+SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/${SKILL_REF}/plugins/qe-agent/skills/SKILL.md"
echo "Fetching qe-agent skill from ${SKILL_URL}..."
-SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
+SKILL_CONTENT=$(curl -fsSL --connect-timeout 10 --max-time 30 --retry 3 "${SKILL_URL}") || true
Also applies to: 61-63
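The diff above pins the revision but does not show the checksum step that the comment also recommends. A minimal sketch of such a check follows, assuming `sha256sum` is available in the step image. `verify_skill` and `QE_AGENT_SKILL_SHA256` are hypothetical names used for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: verify_skill and QE_AGENT_SKILL_SHA256 are hypothetical names.

# Succeeds only when the SHA-256 of the content (arg 1) equals the expected
# hex digest (arg 2). An empty expected digest is treated as a failure.
# Note: $(curl ...) strips trailing newlines, so the expected digest has to be
# computed over the same newline-stripped content, not over the file on disk.
verify_skill() {
  local content="$1" expected="$2" actual
  actual=$(printf '%s' "${content}" | sha256sum | awk '{print $1}')
  [[ -n "${expected}" && "${actual}" == "${expected}" ]]
}

# Possible use after the fetch and before the content reaches the agent:
#   if ! verify_skill "${SKILL_CONTENT}" "${QE_AGENT_SKILL_SHA256:-}"; then
#     echo "qe-agent skill failed the integrity check" >&2
#     exit 1
#   fi
```

Keeping the check in a small function lets the script refuse to invoke the agent on any mismatch, whether the fetch returned nothing or returned altered content.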
# ---------------------------------------------------------------------------
SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/main/plugins/qe-agent/skills/SKILL.md"
echo "Fetching qe-agent skill from ${SKILL_URL}..."
SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
Add network timeout/retry guards for skill retrieval.
Without explicit timeout/retry settings, transient network stalls can consume step time and skip useful analysis unnecessarily.
Suggested reliability diff
-SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
+SKILL_CONTENT=$(curl -fsSL --connect-timeout 10 --max-time 30 --retry 3 "${SKILL_URL}") || true
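The suggested diff and the agent prompt propose slightly different values (10s/30s in the diff, 5s/15s plus `--retry-delay` and `--retry-connrefused` in the prompt). As a rough sketch, the prompt's variant could be wrapped in a helper with the flags documented inline. `fetch_skill` is a hypothetical name, and the numbers are the examples from the prompt rather than tuned values.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_skill is a hypothetical helper, not part of this PR.

# Prints the body of the URL in arg 1. On any curl failure it prints nothing
# and still returns 0, mirroring the trailing `|| true` in the original line.
fetch_skill() {
  local url="$1"
  # --connect-timeout 5   : give up if the connection is not established in 5s
  # --max-time 15         : upper bound in seconds for a transfer
  # --retry 3             : retry transient failures up to three times
  # --retry-delay 2       : wait 2s between retries
  # --retry-connrefused   : also treat "connection refused" as retryable
  curl -sf --connect-timeout 5 --max-time 15 \
       --retry 3 --retry-delay 2 --retry-connrefused \
       "${url}" || true
}

# Possible use:
#   SKILL_CONTENT=$(fetch_skill "${SKILL_URL}")
```

Because the helper never fails, the caller still has to check for empty output before handing the content to the agent.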
Summary
- Adds a distributed-tracing-tests-qe-agent post-step to the OCP CI upstream jobs for OpenTelemetry Operator (4.22), Tempo Operator (4.22), and Tracing UI (main). The step runs autonomously after tests complete, triggered only when JUnit failures are detected.
- Adds grace_period: 2m0s to all 7 distributed-tracing test step refs that use trap for artifact collection (required by ci-operator validation).
- Marks the qe-agent step best_effort: true so its failure never blocks the job.
What the qe-agent does
When test failures are found in $SHARED_DIR/qe-agent/:
- Reruns the failing tests with --skip-delete
- Classifies each failure as PRODUCT_BUG, TEST_ISSUE, or FLAKY and takes appropriate action
- Writes qe-agent-analysis.md to $ARTIFACT_DIR summarising the diagnosis
Exits immediately at no cost when no JUnit failures are present.
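The early-exit behaviour described above could be gated roughly as follows. This is a sketch under assumptions and not the PR's actual script: `has_junit_failures` is a hypothetical helper, and it assumes the stashed results are JUnit XML files in which failures appear as `<failure>` or `<error>` elements.

```shell
#!/usr/bin/env bash
# Sketch only: has_junit_failures is a hypothetical helper, and the *.xml
# naming of the stashed results is an assumption.

# Returns 0 when any XML file under the directory (arg 1) contains a
# <failure> or <error> element, and non-zero otherwise, including when the
# directory does not exist.
has_junit_failures() {
  local dir="$1"
  [[ -d "${dir}" ]] || return 1
  grep -rqE --include='*.xml' '<(failure|error)[ >/]' "${dir}"
}

# Possible use at the top of the post-step:
#   if ! has_junit_failures "${SHARED_DIR}/qe-agent"; then
#     echo "No JUnit failures found; nothing to analyse."
#     exit 0
#   fi
```

Checking with `grep` before anything else keeps the no-failure path free of cluster access and model calls.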
Test plan
- make jobs passes without errors
🤖 Generated with Claude Code
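The Summary above mentions test steps that use trap for artifact collection. One way such a hand-off to the post-step might look is sketched below. `stash_for_qe_agent` and the file layout are hypothetical, and the sketch assumes JUnit results are written as XML files somewhere under the results directory.

```shell
#!/usr/bin/env bash
# Sketch only: stash_for_qe_agent and the file layout are hypothetical.

# Copies every XML file found under the results directory (arg 1) into
# <shared dir>/qe-agent, where the shared directory is arg 2. Errors are
# swallowed so that artifact collection never changes the test step's own
# exit status.
stash_for_qe_agent() {
  local results_dir="$1" shared_dir="$2"
  mkdir -p "${shared_dir}/qe-agent" 2>/dev/null || true
  find "${results_dir}" -type f -name '*.xml' \
       -exec cp {} "${shared_dir}/qe-agent/" \; 2>/dev/null || true
}

# In a test step this would be registered before the tests start:
#   trap 'stash_for_qe_agent "${ARTIFACT_DIR}" "${SHARED_DIR}"' EXIT
```

An EXIT trap runs whether the tests pass or fail, which is why the Summary pairs it with a grace period so the handler has time to finish.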
Summary by CodeRabbit
This PR introduces an autonomous post-step for distributed tracing test failure analysis across three OpenShift CI projects, along with supporting infrastructure changes.
Core Change: New qe-agent Post-Step
A new distributed-tracing-tests-qe-agent post-step is added to the step registry and integrated into three upstream CI jobs.
The qe-agent is configured as a best-effort post-step (marked with best_effort: true) that runs after tests complete and is triggered only when JUnit test failures are detected. When failures are present, the agent re-establishes the test environment, reruns specific failing tests with --skip-delete to preserve cluster state, and uses Claude AI to classify failures as PRODUCT_BUG, TEST_ISSUE, or FLAKY, writing a diagnostic analysis to qe-agent-analysis.md. The agent exits immediately without overhead when no test failures are present.
Image Runner Unification
Across 40 CI configuration files spanning three projects, test image build targets are renamed to a unified obs-tests-runner image:
- tracing-ui-tests-runner → obs-tests-runner (8 distributed-tracing-console-plugin configs)
- tempo-tests-runner → obs-tests-runner (16 grafana-tempo-operator configs)
- opentelemetry-tests-runner → obs-tests-runner (16 open-telemetry-opentelemetry-operator configs)
Test Step Infrastructure Updates
Ten test command scripts are modified to capture JUnit results and setup context for the qe-agent post-step by adding EXIT trap handlers that copy test artifacts and environment metadata to ${SHARED_DIR}/qe-agent/.
Additionally, seven test step registry refs receive a grace_period: 2m0s configuration to satisfy ci-operator validation requirements for trap-based artifact collection.
Key Configuration Details
The qe-agent post-step is configured with: