Add qe-agent post-step to rerun, debug and fix distributed tracing test failures #80176

Open
IshwarKanse wants to merge 1 commit into openshift:main from IshwarKanse:fix-stage-tests-branch-param

Member

IshwarKanse commented Jun 6, 2026

Summary

  • Adds a new distributed-tracing-tests-qe-agent post-step to the OCP CI upstream jobs for OpenTelemetry Operator (4.22), Tempo Operator (4.22), and Tracing UI (main). The step runs autonomously after tests complete, triggered only when JUnit failures are detected.
  • Adds grace_period: 2m0s to all 7 distributed-tracing test step refs that use trap for artifact collection — required by ci-operator validation.
  • The qe-agent runs as the first post-step (before cluster teardown) so it has a live cluster for test reruns. It is marked best_effort: true so its failure never blocks the job.

What the qe-agent does

When test failures are found in $SHARED_DIR/qe-agent/:

  1. Re-establishes the test environment by fetching the original step script from GitHub and running its setup section
  2. Reruns only the specific failing tests (not the full suite) with --skip-delete
  3. Classifies each failure as PRODUCT_BUG, TEST_ISSUE, or FLAKY and takes appropriate action
  4. Writes qe-agent-analysis.md to $ARTIFACT_DIR summarising the diagnosis
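
Step 2 depends on knowing which tests failed. A minimal sketch of how failing test names could be pulled out of JUnit XML follows. The helper name `failing_test_names` is hypothetical and not taken from the PR, and the parsing is a heuristic that assumes conventional `<testcase name="...">` elements with nested `<failure>` or `<error>` children; CDATA sections or unusual attribute content could confuse it.

```shell
# Hypothetical helper (not from the PR): print the name of every <testcase>
# that contains a <failure> or <error> child, one name per line.
failing_test_names() {
  # Splitting records on '<' makes every XML tag start its own record.
  awk 'BEGIN { RS = "<" }
       $0 ~ "^testcase[ \t\n]" {
         name = ""
         # Require whitespace before name= so classname= is not matched.
         if (match($0, "[ \t\n]name=\"[^\"]*\""))
           name = substr($0, RSTART + 7, RLENGTH - 8)
       }
       $0 ~ "^(failure|error)[ \t\n>/]" {
         if (name != "") { print name; name = "" }
       }' "$@"
}
```

The output could then be fed to whatever test selector the suite's runner supports, so that only those tests are rerun.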

The step exits immediately, at no cost, when no JUnit failures are present.
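
The early-exit check can be sketched roughly as below. This is an illustration, not the PR's script: `count_junit_failures` is a hypothetical helper, the `/tmp/shared` fallback exists only so the sketch runs outside CI, and counting `<failure>`/`<error>` elements with grep is an assumed detection method.

```shell
# Hypothetical helper (not from the PR): count <failure>/<error> elements in
# all XML files under a directory. Prints 0 if the directory does not exist.
count_junit_failures() {
  [ -d "$1" ] || { echo 0; return 0; }
  # grep -o prints one line per match, so wc -l yields the number of matches.
  count=$(find "$1" -type f -name '*.xml' \
            -exec grep -ohE '<(failure|error)[ >/]' {} + 2>/dev/null | wc -l) || true
  echo $((count + 0))
}

QE_DIR="${SHARED_DIR:-/tmp/shared}/qe-agent"   # fallback path is an assumption
failures=$(count_junit_failures "${QE_DIR}")
if [ "${failures}" -eq 0 ]; then
  # A real post-step would 'exit 0' here, before any model call is made.
  echo "No JUnit failures found; nothing to analyse."
else
  echo "Found ${failures} failing test case(s); continuing with rerun and analysis."
fi
```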

Test plan

  • Validate ci-operator config parses correctly: make jobs passes without errors
  • Confirm qe-agent post-step appears in the upstream job specs for OTEL 4.22, Tempo 4.22, and Tracing UI main
  • Rehearse one upstream job to verify the post-step fires after test failures

🤖 Generated with Claude Code

Summary by CodeRabbit

This PR introduces an autonomous post-step for distributed tracing test failure analysis across three OpenShift CI projects, along with supporting infrastructure changes.

Core Change: New qe-agent Post-Step

A new distributed-tracing-tests-qe-agent post-step is added to the step registry and integrated into three upstream CI jobs:

  • OpenTelemetry Operator (main, OCP 4.22)
  • Tempo Operator (main, OCP 4.22)
  • Tracing UI console plugin (main)

The qe-agent is configured as a best-effort post-step (marked with best_effort: true) that runs after tests complete and is triggered only when JUnit test failures are detected. When failures are present, the agent re-establishes the test environment, reruns specific failing tests with --skip-delete to preserve cluster state, and uses Claude AI to classify failures as PRODUCT_BUG, TEST_ISSUE, or FLAKY, writing a diagnostic analysis to qe-agent-analysis.md. The agent exits immediately without overhead when no test failures are present.

Image Runner Unification

Across 40 CI configuration files spanning three projects, test image build targets are renamed to a unified obs-tests-runner image:

  • tracing-ui-tests-runner → obs-tests-runner (8 distributed-tracing-console-plugin configs)
  • tempo-tests-runner → obs-tests-runner (16 grafana-tempo-operator configs)
  • opentelemetry-tests-runner → obs-tests-runner (16 open-telemetry-opentelemetry-operator configs)

Test Step Infrastructure Updates

Ten test command scripts are modified to capture JUnit results and setup context for the qe-agent post-step by adding EXIT trap handlers that copy test artifacts and environment metadata to ${SHARED_DIR}/qe-agent/. This includes updates to:

  • OpenTelemetry tests (upstream, stage, downstream)
  • Tempo tests (upstream, stage, downstream)
  • Tracing UI tests (upstream, integration)
  • Disconnected test variant

Additionally, seven test step registry refs receive a grace_period: 2m0s configuration to satisfy ci-operator validation requirements for trap-based artifact collection.
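
An EXIT trap of the kind described above might look roughly like this. It is a sketch under stated assumptions: the handler name `collect_qe_agent_artifacts`, the `junit*.xml` filename pattern, and the fields written to setup-context.json are all illustrative, and the `mktemp` fallbacks exist only so the snippet runs outside ci-operator, which normally provides `SHARED_DIR` and `ARTIFACT_DIR`.

```shell
# Fallbacks for running outside CI; ci-operator normally sets both variables.
: "${SHARED_DIR:=$(mktemp -d)}"
: "${ARTIFACT_DIR:=$(mktemp -d)}"

# Hypothetical handler (not the PR's code): copy JUnit results and record
# minimal context for a later post-step.
collect_qe_agent_artifacts() {
  status=$?                       # status of the last command before the handler ran
  dest="${SHARED_DIR}/qe-agent"
  mkdir -p "${dest}"
  # The junit*.xml filename pattern is an assumption about the test runner.
  find "${ARTIFACT_DIR}" -type f -name 'junit*.xml' \
    -exec cp {} "${dest}/" \; 2>/dev/null || true
  # The field names below are illustrative, not the PR's actual schema.
  printf '{"job":"%s","exit_code":%d}\n' "${JOB_NAME:-unknown}" "${status}" \
    > "${dest}/setup-context.json"
}

# Registered for EXIT so it runs whether the tests pass or fail.
trap collect_qe_agent_artifacts EXIT
```

Because the handler does its copying at exit time, the step needs some time budget after the main command finishes, which is the role the grace_period setting plays here.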

Key Configuration Details

The qe-agent post-step is configured with:

  • Vertex AI/Anthropic integration (Claude Opus 4.6 model)
  • Service account credentials mounted for Google Cloud authentication
  • 1-hour timeout with 2-minute grace period
  • Resource requests of 1 CPU and 3Gi memory
  • Detailed documentation of autonomous failure classification and remediation behavior

Contributor

coderabbitai Bot commented Jun 6, 2026

Walkthrough

This PR unifies test runner image targets across OpenShift CI operator configs from product-specific names to obs-tests-runner, introduces a new Claude-based QE agent post-step for e2e failure analysis, wires artifact collection into test scripts, and integrates the agent into upstream CI workflows.

Changes

QE Agent Infrastructure and Integration

| Layer | File(s) | Summary |
|---|---|---|
| QE agent step definition and ownership | ci-operator/step-registry/distributed-tracing/tests/qe-agent/OWNERS, .../distributed-tracing-tests-qe-agent-ref.metadata.json, .../distributed-tracing-tests-qe-agent-commands.sh, .../distributed-tracing-tests-qe-agent-ref.yaml | Complete QE agent post-step definition with Claude integration for failure analysis, environment/credentials, and ownership metadata. The command script validates artifact dirs, detects JUnit failures, fetches remote skill prompts, and runs Claude analysis in best-effort mode. |
| Artifact collection handlers in test step scripts | ci-operator/step-registry/distributed-tracing/tests/disconnected/...commands.sh, .../opentelemetry/downstream/...commands.sh, .../opentelemetry/stage/...commands.sh, .../opentelemetry/upstream/...commands.sh, .../tempo/downstream/...commands.sh, .../tempo/stage/...commands.sh, .../tempo/upstream/...commands.sh, .../tracing-ui/integration/...commands.sh, .../tracing-ui/upstream/...commands.sh | EXIT trap handlers added to copy JUnit XML results and write setup-context.json to SHARED_DIR/qe-agent for post-step consumption, registered to run on both success and failure. |
| Step reference runner and grace period configuration | ci-operator/step-registry/distributed-tracing/tests/disconnected/...ref.yaml, .../opentelemetry/downstream/...ref.yaml, .../opentelemetry/stage/...ref.yaml, .../opentelemetry/upstream/...ref.yaml, .../tempo/downstream/...ref.yaml, .../tempo/stage/...ref.yaml, .../tempo/upstream/...ref.yaml, .../tracing-ui/upstream/...ref.yaml | Update step references to use obs-tests-runner and add grace_period: 2m0s settings across distributed tracing test step variants. |
| QE agent post-step integration into upstream workflows | ci-operator/config/openshift/grafana-tempo-operator/.../upstream-ocp-4.22-amd64.yaml, .../opentelemetry-operator/.../upstream-ocp-4.22-amd64.yaml, .../distributed-tracing-console-plugin/.../main__upstream-amd64-aws.yaml | Add distributed-tracing-tests-qe-agent and deprovision chain references to post-step sections in upstream test workflows. |

Unified obs-tests-runner Image Target across CI Configs

| Layer | File(s) | Summary |
|---|---|---|
| Distributed Tracing Console Plugin config updates | ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-main__upstream-amd64-aws.yaml, ...release-0.4__upstream-amd64-aws.yaml, ...release-1.0__upstream-amd64-aws.yaml, ...release-coo-ocp-4.12__upstream-amd64-aws.yaml, ...release-coo-ocp-4.15__upstream-amd64-aws.yaml, ...release-coo-ocp-4.19__upstream-amd64-aws.yaml, ...release-coo-ocp-4.22__upstream-amd64-aws.yaml | Rebrand test runner from tracing-ui-tests-runner to obs-tests-runner across release and upstream variants. |
| Distributed Tracing QE config updates | ci-operator/config/openshift/distributed-tracing-qe/openshift-distributed-tracing-qe-main__ocp-4.16-disconnected.yaml | Update OCP 4.16 disconnected config from distributed-tracing-tests-runner to obs-tests-runner. |
| Grafana Tempo Operator config updates | ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.12-stage.yaml, ...tempo-product-ocp-4.14-arm-stage.yaml, ...tempo-product-ocp-4.14-stage.yaml, ...tempo-product-ocp-4.16-ibm-z-stage.yaml, ...tempo-product-ocp-4.17-fips-stage.yaml, ...tempo-product-ocp-4.17-ibm-p-stage.yaml, ...tempo-product-ocp-4.19-downstream.yaml, ...tempo-product-ocp-4.19-stage.yaml, ...tempo-product-ocp-4.20-downstream.yaml, ...tempo-product-ocp-4.20-stage.yaml, ...tempo-product-ocp-4.21-downstream.yaml, ...tempo-product-ocp-4.21-stage.yaml, ...upstream-ocp-4.12-amd64.yaml, ...upstream-ocp-4.21-amd64.yaml, ...upstream-ocp-4.22-amd64.yaml | Rebrand test runner from tempo-tests-runner to obs-tests-runner across all tempo-product and upstream variants (4.12–4.22, multiple architectures/platforms). |
| OpenTelemetry Operator config updates | ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.12-stage.yaml, ...opentelemetry-product-ocp-4.14-arm-stage.yaml, ...opentelemetry-product-ocp-4.14-stage.yaml, ...opentelemetry-product-ocp-4.16-ibm-z-stage.yaml, ...opentelemetry-product-ocp-4.17-fips-stage.yaml, ...opentelemetry-product-ocp-4.17-ibm-p-stage.yaml, ...opentelemetry-product-ocp-4.19-downstream.yaml, ...opentelemetry-product-ocp-4.19-stage.yaml, ...opentelemetry-product-ocp-4.20-downstream.yaml, ...opentelemetry-product-ocp-4.20-stage.yaml, ...opentelemetry-product-ocp-4.21-downstream.yaml, ...opentelemetry-product-ocp-4.21-stage.yaml, ...upstream-ocp-4.12-amd64.yaml, ...upstream-ocp-4.21-amd64.yaml, ...upstream-ocp-4.22-amd64.yaml | Rebrand test runner from opentelemetry-tests-runner to obs-tests-runner across all opentelemetry-product and upstream variants (4.12–4.22, multiple architectures/platforms/deployment modes). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

lgtm, approved, rehearsals-ack

Suggested reviewers

  • pavolloffay
  • andreasgerstmayr
🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and specifically describes the main change: adding a qe-agent post-step for debugging and fixing distributed tracing test failures. It is concise, specific, and directly related to the primary purpose of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR contains only CI configuration changes (YAML, shell scripts, metadata). No Ginkgo test files or test name definitions are present, making this check not applicable. |
| Test Structure And Quality | ✅ Passed | This PR contains no Ginkgo test code (*.go files). Changes are exclusively CI configuration YAML and shell scripts. The custom check for Ginkgo test quality is not applicable. |
| Microshift Test Compatibility | ✅ Passed | PR adds only CI configuration YAML files and infrastructure bash scripts, not Ginkgo e2e tests. The check is not applicable as it targets test code additions only. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds no new Ginkgo e2e tests. Changes are CI infrastructure only: config YAML, step registry YAML/JSON, and Bash scripts. SNO compatibility check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies CI configuration and test scripts (ci-operator/), not deployment manifests, operator code, or Kubernetes controllers. Topology-aware scheduling check doesn't apply. |
| Ote Binary Stdout Contract | ✅ Passed | PR contains no modifications to Go source code; only CI YAML configs and shell scripts modified. OTE Binary Stdout Contract applies to Go test binaries only. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e test files added. PR contains only CI infrastructure changes (YAML configs, shell scripts, metadata). Check is not applicable. |
| No-Weak-Crypto | ✅ Passed | No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in PR changes. |
| Container-Privileges | ✅ Passed | No privileged: true, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation: true, or runAsUser: 0 found in any modified YAML configs or bash scripts. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data exposed in logging. The PR only logs CI paths, status messages, image URLs, and branch parameters, all non-sensitive values appropriate for CI logs. |


Contributor

openshift-ci Bot commented Jun 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IshwarKanse

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jun 6, 2026
@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@IshwarKanse: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.21-amd64-ci-index-opentelemetry-bundle openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.21-amd64-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.21-amd64-opentelemetry-upstream-tests openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.20-downstream-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.14-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.22-amd64-ci-index-opentelemetry-bundle openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.22-amd64-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.22-amd64-opentelemetry-upstream-tests openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.22-amd64-security-sast-otel openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.16-ibm-z-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.12-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.20-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.21-downstream-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.19-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.14-arm-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.21-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.17-fips-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.12-amd64-ci-index-opentelemetry-bundle openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.12-amd64-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.12-amd64-opentelemetry-upstream-tests openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-upstream-ocp-4.12-amd64-security-sast-otel openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.17-ibm-p-stage-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-open-telemetry-opentelemetry-operator-main-opentelemetry-product-ocp-4.19-downstream-images openshift/open-telemetry-opentelemetry-operator presubmit Ci-operator config changed
pull-ci-openshift-distributed-tracing-console-plugin-release-coo-ocp-4.12-upstream-amd64-aws-e2e openshift/distributed-tracing-console-plugin presubmit Ci-operator config changed
pull-ci-openshift-distributed-tracing-console-plugin-release-coo-ocp-4.12-upstream-amd64-aws-fips-image-scan openshift/distributed-tracing-console-plugin presubmit Ci-operator config changed

A total of 108 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Contributor

openshift-ci Bot commented Jun 6, 2026

@IshwarKanse: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-commands.sh`:
- Around line 45-47: The SKILL_URL and SKILL_CONTENT usage currently fetches
from the mutable main branch; change SKILL_URL to point to an immutable commit
SHA or tag (not main) and add retrieval+verification of a checksum before using
SKILL_CONTENT (e.g., fetch a known-good checksum file or embed expected hash and
verify curl output), failing fast if verification fails; update the code paths
that use SKILL_CONTENT (the variables SKILL_URL and SKILL_CONTENT and the
subsequent non-interactive Claude invocation) to only proceed after checksum
validation to ensure the qe-agent skill is pinned and integrity-checked.
- Line 47: The SKILL_CONTENT assignment currently uses a plain curl call that
can hang; update the SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true line to
use curl with explicit timeouts and retries (reference SKILL_CONTENT and
SKILL_URL): add --connect-timeout (e.g. 5s), --max-time (e.g. 15s), and retry
flags such as --retry 3 --retry-delay 2 --retry-connrefused while keeping -s and
-f and preserving the trailing || true so the step won’t fail outright; ensure
the new flags are documented in an inline comment near that assignment.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e628dcd4-da2e-4827-adf1-a0c2ac9d1392

📥 Commits

Reviewing files that changed from the base of the PR and between 9195d8d and 59ce53f.

📒 Files selected for processing (59)
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-main__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-0.4__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-1.0__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.12__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.15__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.19__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-console-plugin/openshift-distributed-tracing-console-plugin-release-coo-ocp-4.22__upstream-amd64-aws.yaml
  • ci-operator/config/openshift/distributed-tracing-qe/openshift-distributed-tracing-qe-main__ocp-4.16-disconnected.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.12-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.14-arm-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.14-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.16-ibm-z-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.17-fips-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.17-ibm-p-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.19-downstream.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.19-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.20-downstream.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.20-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.21-downstream.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__tempo-product-ocp-4.21-stage.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__upstream-ocp-4.12-amd64.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__upstream-ocp-4.21-amd64.yaml
  • ci-operator/config/openshift/grafana-tempo-operator/openshift-grafana-tempo-operator-main__upstream-ocp-4.22-amd64.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.12-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.14-arm-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.14-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.16-ibm-z-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.17-fips-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.17-ibm-p-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.19-downstream.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.19-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.20-downstream.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.20-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.21-downstream.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__opentelemetry-product-ocp-4.21-stage.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__upstream-ocp-4.12-amd64.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__upstream-ocp-4.21-amd64.yaml
  • ci-operator/config/openshift/open-telemetry-opentelemetry-operator/openshift-open-telemetry-opentelemetry-operator-main__upstream-ocp-4.22-amd64.yaml
  • ci-operator/step-registry/distributed-tracing/tests/disconnected/distributed-tracing-tests-disconnected-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/disconnected/distributed-tracing-tests-disconnected-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/opentelemetry/downstream/distributed-tracing-tests-opentelemetry-downstream-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/opentelemetry/downstream/distributed-tracing-tests-opentelemetry-downstream-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/opentelemetry/stage/distributed-tracing-tests-opentelemetry-stage-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/opentelemetry/stage/distributed-tracing-tests-opentelemetry-stage-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/opentelemetry/upstream/distributed-tracing-tests-opentelemetry-upstream-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/opentelemetry/upstream/distributed-tracing-tests-opentelemetry-upstream-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/qe-agent/OWNERS
  • ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-ref.metadata.json
  • ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/tempo/downstream/distributed-tracing-tests-tempo-downstream-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/tempo/downstream/distributed-tracing-tests-tempo-downstream-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/tempo/stage/distributed-tracing-tests-tempo-stage-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/tempo/stage/distributed-tracing-tests-tempo-stage-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/tempo/upstream/distributed-tracing-tests-tempo-upstream-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/tempo/upstream/distributed-tracing-tests-tempo-upstream-ref.yaml
  • ci-operator/step-registry/distributed-tracing/tests/tracing-ui/integration/distributed-tracing-tests-tracing-ui-integration-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/tracing-ui/upstream/distributed-tracing-tests-tracing-ui-upstream-commands.sh
  • ci-operator/step-registry/distributed-tracing/tests/tracing-ui/upstream/distributed-tracing-tests-tracing-ui-upstream-ref.yaml

Comment on lines +45 to +47
SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/main/plugins/qe-agent/skills/SKILL.md"
echo "Fetching qe-agent skill from ${SKILL_URL}..."
SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin the qe-agent skill source to an immutable revision.

Line 45 fetches prompt instructions from a mutable main branch, while Lines 61-63 execute Claude in non-interactive mode with dangerous permissions and shell/write tools. This creates a mutable remote-execution control plane for CI behavior. Pin to a commit/tag (and ideally verify checksum) before execution.

Suggested hardening diff
-SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/main/plugins/qe-agent/skills/SKILL.md"
+SKILL_REF="${QE_AGENT_SKILL_REF:?QE_AGENT_SKILL_REF must be set to an immutable commit SHA}"
+SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/${SKILL_REF}/plugins/qe-agent/skills/SKILL.md"
 echo "Fetching qe-agent skill from ${SKILL_URL}..."
-SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
+SKILL_CONTENT=$(curl -fsSL --connect-timeout 10 --max-time 30 --retry 3 "${SKILL_URL}") || true

Also applies to: 61-63

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-commands.sh`
around lines 45 - 47, The SKILL_URL and SKILL_CONTENT usage currently fetches
from the mutable main branch; change SKILL_URL to point to an immutable commit
SHA or tag (not main) and add retrieval+verification of a checksum before using
SKILL_CONTENT (e.g., fetch a known-good checksum file or embed expected hash and
verify curl output), failing fast if verification fails; update the code paths
that use SKILL_CONTENT (the variables SKILL_URL and SKILL_CONTENT and the
subsequent non-interactive Claude invocation) to only proceed after checksum
validation to ensure the qe-agent skill is pinned and integrity-checked.
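
The integrity check requested above could be sketched as follows. This is an illustration rather than the PR's code: `verify_sha256` is a hypothetical helper, `SKILL_REF` and `SKILL_SHA256` are placeholder variables that would need to hold a real commit SHA and its known-good digest, and `sha256sum` from GNU coreutils is assumed to be present in the step image.

```shell
# Hypothetical helper (not from the PR): succeed only if FILE's SHA-256
# digest equals EXPECTED. Assumes sha256sum (GNU coreutils) is available.
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{print $1}')
  [ "${actual}" = "$2" ]
}

# Usage sketch (not executed here). SKILL_REF and SKILL_SHA256 are
# placeholders for an immutable commit SHA and its expected digest:
#   SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/${SKILL_REF}/plugins/qe-agent/skills/SKILL.md"
#   curl -fsSL "${SKILL_URL}" -o /tmp/SKILL.md
#   verify_sha256 /tmp/SKILL.md "${SKILL_SHA256}" \
#     || { echo "qe-agent skill checksum mismatch" >&2; exit 1; }
```

Writing the download to a file first, rather than into a variable, keeps the bytes that are hashed identical to the bytes that are later passed to the agent.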

# ---------------------------------------------------------------------------
SKILL_URL="https://raw.githubusercontent.com/openshift/distributed-tracing-qe/main/plugins/qe-agent/skills/SKILL.md"
echo "Fetching qe-agent skill from ${SKILL_URL}..."
SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add network timeout/retry guards for skill retrieval.

Without explicit timeout/retry settings, transient network stalls can consume step time and skip useful analysis unnecessarily.

Suggested reliability diff
-SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
+SKILL_CONTENT=$(curl -fsSL --connect-timeout 10 --max-time 30 --retry 3 "${SKILL_URL}") || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true
+SKILL_CONTENT=$(curl -fsSL --connect-timeout 10 --max-time 30 --retry 3 "${SKILL_URL}") || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/distributed-tracing/tests/qe-agent/distributed-tracing-tests-qe-agent-commands.sh`
at line 47, The SKILL_CONTENT assignment currently uses a plain curl call that
can hang; update the SKILL_CONTENT=$(curl -sf "${SKILL_URL}") || true line to
use curl with explicit timeouts and retries (reference SKILL_CONTENT and
SKILL_URL): add --connect-timeout (e.g. 5s), --max-time (e.g. 15s), and retry
flags such as --retry 3 --retry-delay 2 --retry-connrefused while keeping -s and
-f and preserving the trailing || true so the step won’t fail outright; ensure
the new flags are documented in an inline comment near that assignment.
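
A bounded fetch along these lines might look like the sketch below. The wrapper name `fetch_skill` is hypothetical and the timeout and retry values are illustrative; curl takes these durations as plain numbers of seconds. The sketch also adds `-S`, which the original line does not use, so that errors are still printed while progress output stays silent.

```shell
# Hypothetical wrapper (not from the PR): fetch a URL with bounded waiting.
fetch_skill() {
  # --connect-timeout N : give up if no connection is established in N seconds
  # --max-time N        : hard cap, in seconds, on each transfer attempt
  # --retry / --retry-delay / --retry-connrefused : retry transient failures
  curl -fsS --connect-timeout 5 --max-time 15 \
       --retry 3 --retry-delay 2 --retry-connrefused "$1"
}

# '|| true' keeps the step best-effort: a failed fetch leaves SKILL_CONTENT empty.
SKILL_CONTENT=""
if [ -n "${SKILL_URL:-}" ]; then
  SKILL_CONTENT=$(fetch_skill "${SKILL_URL}") || true
fi
if [ -z "${SKILL_CONTENT}" ]; then
  echo "Skill not fetched; analysis would be skipped."
fi
```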

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
