OCPEDGE-1036: fix: latency tuning for the rt-kernel tests on AWS metal #30790

Open
jeff-roche wants to merge 1 commit into openshift:main from jeff-roche:rt-test-latencies

Conversation

@jeff-roche
Contributor

@jeff-roche jeff-roche commented Feb 17, 2026

Summary

Replaces the binary pass/fail latency check with a three-outcome analysis (PASS/WARN/FAIL) that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

  • Soft threshold: the expected maximum latency. Exceeding it on isolated CPUs triggers a warning, but the test still passes
  • Hard threshold: the absolute maximum latency. Exceeding it on any CPU fails the test

Statistical percentage-based detection:

  • If >5% of CPUs exceed the soft threshold, the test fails as a systemic latency issue
  • Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure
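
The decision rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `classify`, `withSpikes`, and `Result` are hypothetical, "exceed" is assumed to mean strictly greater than, and the 40us baseline in the samples is made up.

```go
package main

import "fmt"

// Result is the overall outcome of one latency test run.
type Result string

const (
	Pass Result = "PASS"
	Warn Result = "WARN"
	Fail Result = "FAIL"
)

// systemicPercent is the 5% cutoff named in the PR description.
const systemicPercent = 5

// classify inspects per-CPU maximum latencies (microseconds) and applies the
// soft/hard thresholds. Assumption: "exceed" means strictly greater than.
// An empty input returns PASS here; real code would probably reject it.
func classify(perCPUMaxUs []int, softUs, hardUs int) Result {
	overSoft := 0
	for _, v := range perCPUMaxUs {
		if v > hardUs {
			return Fail // any CPU over the hard threshold fails the run
		}
		if v > softUs {
			overSoft++
		}
	}
	switch {
	case overSoft == 0:
		return Pass
	case overSoft*100 > systemicPercent*len(perCPUMaxUs):
		return Fail // more than 5% of CPUs over soft: systemic issue
	default:
		return Warn // isolated spikes only
	}
}

// withSpikes builds n samples at baseUs, with the first `spikes` entries
// raised to spikeUs. It exists only to generate illustrative input.
func withSpikes(n, baseUs, spikes, spikeUs int) []int {
	s := make([]int, n)
	for i := range s {
		s[i] = baseUs
		if i < spikes {
			s[i] = spikeUs
		}
	}
	return s
}

func main() {
	// Metal oslat/cyclictest thresholds: soft=100us, hard=500us.
	fmt.Println(classify(withSpikes(80, 40, 1, 211), 100, 500))
	fmt.Println(classify(withSpikes(80, 40, 10, 300), 100, 500))
	fmt.Println(classify(withSpikes(80, 40, 1, 600), 100, 500))
}
```

The percentage comparison uses integer arithmetic so that a run with exactly 5% of CPUs over the soft threshold is not pushed over the cutoff by floating-point rounding.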

Structured JSON diagnostic artifacts:

  • Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)
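
As a rough idea of what such an artifact could contain, here is a sketch that computes the listed statistics and serializes them. The struct, its JSON key names, and the helpers `summarize` and `toJSON` are assumptions for illustration; the real artifact schema is not shown in this thread. P99 is computed here with the nearest-rank method over per-CPU maxima, which is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// LatencyAnalysis holds the fields listed in the PR description.
// Field and JSON key names are hypothetical.
type LatencyAnalysis struct {
	Test          string  `json:"test"`
	MaxUs         int     `json:"max_us"`
	AvgUs         float64 `json:"avg_us"`
	P99Us         int     `json:"p99_us"`
	PerCPUMaxUs   []int   `json:"per_cpu_max_us"`
	OverSoftCount int     `json:"over_soft_count"`
	OverHardCount int     `json:"over_hard_count"`
	Result        string  `json:"result"`
}

// summarize computes max, average, and P99 over the per-CPU maxima and counts
// how many CPUs exceed each threshold. The overall result is passed in
// because it is decided by separate logic.
func summarize(test string, perCPUMaxUs []int, softUs, hardUs int, result string) LatencyAnalysis {
	a := LatencyAnalysis{Test: test, PerCPUMaxUs: perCPUMaxUs, Result: result}
	if len(perCPUMaxUs) == 0 {
		return a
	}
	// Sort a copy so the caller's per-CPU ordering is preserved.
	sorted := append([]int(nil), perCPUMaxUs...)
	sort.Ints(sorted)

	sum := 0
	for _, v := range sorted {
		sum += v
		if v > softUs {
			a.OverSoftCount++ // also counts CPUs that are over the hard threshold
		}
		if v > hardUs {
			a.OverHardCount++
		}
	}
	n := len(sorted)
	a.MaxUs = sorted[n-1]
	a.AvgUs = float64(sum) / float64(n)
	rank := (99*n + 99) / 100 // nearest-rank: ceil(0.99 * n)
	a.P99Us = sorted[rank-1]
	return a
}

// toJSON renders the analysis as indented JSON.
func toJSON(a LatencyAnalysis) (string, error) {
	out, err := json.MarshalIndent(a, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	a := summarize("oslat", []int{42, 38, 211, 40}, 100, 500, "WARN")
	s, err := toJSON(a)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s)
}
```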

Thresholds

| Test | Metal Soft | Metal Hard | Non-Metal Soft | Non-Metal Hard |
|---|---|---|---|---|
| oslat | 100us | 500us | 7500us | 10000us |
| cyclictest | 100us | 500us | 7500us | 10000us |
| hwlatdetect | 100us | 200us | 7500us | 10000us |
| deadline_test | 100us | 200us | 7500us | 10000us |

Code cleanup

  • Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function
  • Added unit tests for the new parsing logic

Expected behavior with real job data

| Scenario | Old Behavior | New Behavior |
|---|---|---|
| 1/80 CPUs at 211us | FAIL | WARN (pass) |
| 1/91 CPUs at 241us | FAIL | WARN (pass) |
| 10/80 CPUs at 300us | FAIL | FAIL (systemic) |
| 1/80 CPUs at 600us | FAIL | FAIL (hard threshold) |
| All CPUs under 100us | PASS | PASS |

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 17, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 17, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jeff-roche
Contributor Author

/test ?

@openshift-ci openshift-ci bot requested review from deads2k and sjenning February 17, 2026 00:17
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2026
@jeff-roche
Contributor Author

/test e2e-gcp-ovn-rt-upgrade

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Feb 17, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1e7a2f80-0b96-11f1-9252-e810dd3e02ff-0

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@jeff-roche
Contributor Author

jeff-roche commented Feb 17, 2026

The payload job fails to upgrade, but the RT tests themselves pass. The upgrade failures are being addressed in #30608.

@qJkee
Contributor

qJkee commented Feb 17, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 17, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 17, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jeff-roche, qJkee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Feb 17, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61996750-0c26-11f1-8539-63c794c57c62-0

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Feb 24, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9cdf16c0-11a8-11f1-9f71-c9b6ff3ae133-0

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4d5afc30-1253-11f1-9df8-fe79311f410f-0

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

New changes are detected. LGTM label has been removed.

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@coderabbitai

coderabbitai bot commented Feb 26, 2026

Walkthrough

getRealTimeWorkerNodes now returns a slice of node names and detects non-metal nodes (which forces all real-time thresholds to 7500µs). Multiple real-time test runners were refactored to use a centralized rtTestThresholds map, capture command output, wrap errors, and write timestamped per-test artifacts.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Real-time Worker Node Collection (test/extended/kernel/common.go) | Changed signature to return []string. Builds a pre-sized slice, detects nodes with/without the "metal" instance-type label, logs when non-metal nodes exist, and sets all entries in rtTestThresholds to 7500µs when a non-metal node is found. |
| Latency Thresholds & Test Runners (test/extended/kernel/tools.go) | Introduced/used centralized rtTestThresholds map. Refactored runners (runDeadlineTest, runHwlatdetect, runOslat, runCyclictest, runPiStressFifo, runPiStressRR) to derive testName, use per-test thresholds for parsing/logging, capture command output, wrap errors with contextual messages, and write timestamped per-test log/artifact files (uses time/e2e.TimeNow()). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: fixing latency tuning for real-time kernel tests specifically on AWS metal instances, which aligns with the core modifications in the pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in kernel test files are stable, deterministic static strings with no dynamic content like timestamps, UUIDs, or node names. |
| Test Structure And Quality | ✅ Passed | Modified files are helper utilities containing test infrastructure functions rather than actual test cases with BDD patterns like It blocks or Expect assertions. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/44998af0-130a-11f1-9082-40e95c4d00b8-0

@openshift-ci-robot

openshift-ci-robot commented Feb 26, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

  • Tests
  • Enhanced real-time kernel test infrastructure with adaptive threshold configuration based on hardware type detection.
  • Improved test logging and artifact collection with timestamped output files for better result tracking and debugging.
  • Refactored test execution with better error handling and context reporting.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
test/extended/kernel/common.go (1)

64-68: Side effect in getter function modifies global state.

getRealTimeWorkerNodes modifies the global rtTestThresholds map, which is unexpected for a function with a "get" prefix. This couples threshold configuration to node discovery and makes the behavior harder to reason about.

Consider either:

  1. Renaming the function to reflect it configures thresholds (e.g., setupRealTimeWorkerNodes)
  2. Returning the metal status and handling threshold adjustment at the call site
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/common.go` around lines 64 - 68, getRealTimeWorkerNodes
currently mutates the global rtTestThresholds map (when nodesAreMetal is false),
which is a surprising side effect for a getter; stop modifying rtTestThresholds
inside getRealTimeWorkerNodes and instead either (A) rename
getRealTimeWorkerNodes to setupRealTimeWorkerNodes if you intend it to configure
thresholds, or (B) change getRealTimeWorkerNodes to only return the metal status
(bool nodesAreMetal) and move the rtTestThresholds adjustments out to the call
site so callers can set rtTestThresholds[test] = 7500 when nodesAreMetal is
false; update all callers of getRealTimeWorkerNodes accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/kernel/common.go`:
- Around line 57-68: The metal-detection currently uses node.GetLabels() and
sets nodesAreMetal = false if any worker node isn't metal, which incorrectly
flags clusters where non-RT workers are non-metal; update the logic so the metal
check is only performed for nodes that match the RT kernel condition (the same
condition used to select RT nodes) — i.e., inside the RT kernel match block
iterate those nodes, call node.GetLabels(), and only then modify nodesAreMetal
and adjust rtTestThresholds; reference variables/functions: node.GetLabels(),
nodesAreMetal, rtTestThresholds, and the RT kernel match condition so the
threshold padding runs only when RT nodes are detected as non-metal.
- Line 48: Replace the incorrect capacity argument on the nodes slice
allocation: the current call uses kubeNodes.Size() (which returns protobuf
serialized size) when constructing nodes via make([]string, 0, ...); change it
to use the number of items with len(kubeNodes.Items) so nodes = make([]string,
0, len(kubeNodes.Items)). Update the allocation site that references kubeNodes
and the nodes variable in test/extended/kernel/common.go (search for the
make([]string, 0, kubeNodes.Size()) occurrence).

In `@test/extended/kernel/tools.go`:
- Around line 165-167: The error message in runCyclictest incorrectly references
"oslat test"; update the returned fmt.Errorf string in the runCyclictest
function (where cpuCount is checked) to reference "cyclictest" (or
"runCyclictest") instead and preserve the numeric cpuCount interpolation and
wording; ensure only the test name in the message is changed so the check using
cpuCount and the fmt.Errorf call remain otherwise identical.
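
The two common.go comments above (scoping the metal check to RT nodes, and sizing the slice by item count) could be combined as in this sketch. It uses a local stand-in `node` type instead of the Kubernetes API types so that it runs on its own. The `isRealTime` predicate, the "metal" substring check, and the sample kernel versions are assumptions, not the suite's actual logic.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceTypeLabel is the well-known Kubernetes instance-type label.
const instanceTypeLabel = "node.kubernetes.io/instance-type"

// node is a stand-in for the fields of a Kubernetes Node used here.
type node struct {
	Name          string
	KernelVersion string
	Labels        map[string]string
}

// isRealTime is a hypothetical stand-in for the RT kernel match condition.
func isRealTime(n node) bool {
	return strings.HasSuffix(n.KernelVersion, "+rt")
}

// realTimeNodes returns the names of RT nodes and whether all of them are
// metal. Non-RT workers are skipped before the metal check, so a non-metal
// worker without the RT kernel does not relax the thresholds.
func realTimeNodes(all []node) (names []string, nodesAreMetal bool) {
	// Capacity is the number of items, not a serialized size.
	names = make([]string, 0, len(all))
	nodesAreMetal = true
	for _, n := range all {
		if !isRealTime(n) {
			continue
		}
		names = append(names, n.Name)
		if !strings.Contains(n.Labels[instanceTypeLabel], "metal") {
			nodesAreMetal = false
		}
	}
	return names, nodesAreMetal
}

func main() {
	nodes := []node{
		{Name: "rt-worker", KernelVersion: "5.14.0-570.el9.x86_64+rt",
			Labels: map[string]string{instanceTypeLabel: "c5.metal"}},
		{Name: "plain-worker", KernelVersion: "5.14.0-570.el9.x86_64",
			Labels: map[string]string{instanceTypeLabel: "m5.xlarge"}},
	}
	names, metal := realTimeNodes(nodes)
	fmt.Println(names, metal)
}
```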

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3cec87c and 8de1caf.

📒 Files selected for processing (2)
  • test/extended/kernel/common.go
  • test/extended/kernel/tools.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci-robot

openshift-ci-robot commented Mar 2, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

  • Tests
  • Enhanced real-time kernel test infrastructure with adaptive per-test thresholds when non-metal nodes are detected.
  • Improved test execution with unified per-test names, richer error context, and consistent output capture.
  • Added timestamped artifact logging for each test to simplify result tracking and debugging.

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Mar 2, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1732ad10-165e-11f1-9f18-6b2d93667007-0

@jeff-roche
Contributor Author

/retest

1 similar comment
@jeff-roche
Contributor Author

/retest

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci-robot

openshift-ci-robot commented Mar 4, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

  • 100 microsecond latency cap for metal instances
  • 7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

  • Tests
  • Enhanced real-time kernel test infrastructure with adaptive per-test thresholds when any non-metal node is detected.
  • Unified per-test naming and thresholds across runners for consistent parsing, richer error context, and reliable output capture.
  • Added timestamped per-test artifact logging to simplify result tracking and debugging.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
test/extended/kernel/tools.go (2)

20-25: Consider guarded threshold access instead of raw map indexing.

Using string keys with direct map indexing can silently fall back to 0 if a key drifts, which makes failures harder to diagnose. A small helper with ok checks would make this safer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/tools.go` around lines 20 - 25, rtTestThresholds is
being read with raw map indexing which returns 0 for missing keys; add a guarded
accessor to make missing-key cases explicit. Implement a helper like
getRTTestThreshold(testName string) (int, bool) or GetRTTestThreshold(testName
string) (int, error) that looks up rtTestThresholds, returns the value and an ok
flag (or an error) and use that helper wherever rtTestThresholds is read so
callers can handle missing keys instead of silently getting 0; reference the
rtTestThresholds map and replace any direct accesses with calls to
getRTTestThreshold.
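
A guarded accessor of the kind suggested above might look like this sketch. The map keys and the 100µs values are placeholders for illustration; only the `rtTestThresholds` and `getRTTestThreshold` names come from the review comment.

```go
package main

import "fmt"

// rtTestThresholds maps test names to latency thresholds in microseconds.
// Keys and values here are placeholders, not the suite's actual contents.
var rtTestThresholds = map[string]int{
	"oslat":         100,
	"cyclictest":    100,
	"hwlatdetect":   100,
	"deadline_test": 100,
}

// getRTTestThreshold returns the threshold for testName, or an error when the
// key is missing, instead of the silent zero that raw map indexing yields.
func getRTTestThreshold(testName string) (int, error) {
	us, ok := rtTestThresholds[testName]
	if !ok {
		return 0, fmt.Errorf("no latency threshold configured for test %q", testName)
	}
	return us, nil
}

func main() {
	if us, err := getRTTestThreshold("oslat"); err == nil {
		fmt.Println("oslat threshold:", us)
	}
	// A drifted key is reported instead of silently becoming 0.
	if _, err := getRTTestThreshold("os_lat"); err != nil {
		fmt.Println("error:", err)
	}
}
```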

29-33: Persist command output even on failure paths.

Current flow returns on err before writing artifacts, so failure diagnostics are lost. This same pattern appears in the other runners that capture res.

Proposed adjustment
 	res, err := oc.SetNamespace(rtNamespace).Run("exec").Args(args...).Output()
+	writeTestArtifacts(fmt.Sprintf("%s_%s.log", "pi_stress_standard", e2e.TimeNow().Format(time.RFC3339)), res)
 	if err != nil {
 		// An error here indicates thresholds were exceeded or an issue with the test
 		return errors.Wrap(err, "error running pi_stress with the standard algorithm")
 	}
-
-	writeTestArtifacts(fmt.Sprintf("%s_%s.log", "pi_stress_standard", e2e.TimeNow().Format(time.RFC3339)), res)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/tools.go` around lines 29 - 33, The command output (res)
is discarded when oc.SetNamespace(...).Run("exec").Args(...).Output() returns an
err; update the err != nil branch to persist the captured res before returning.
Specifically, inside the error path of the block that reads res and err, write
res to the test artifacts/logging (using the existing test artifact writer or
logger used elsewhere in this package) with a clear filename/context, then
return errors.Wrap(err, "error running pi_stress with the standard algorithm")
as before. Ensure the same pattern is applied to the other runner blocks that
capture res so failures always include the command output for diagnostics.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b0195779-df43-4df7-8bf9-a8d96aabd777

📥 Commits

Reviewing files that changed from the base of the PR and between 40cfb1d and 1059dd7.

📒 Files selected for processing (2)
  • test/extended/kernel/common.go
  • test/extended/kernel/tools.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5684d340-17de-11f1-9987-ae4c2d50ffbc-0

Replace binary pass/fail latency detection with a three-tier analysis:
- Two-tier thresholds (soft/hard) to distinguish warnings from failures
- Statistical percentage-based detection (>5% CPUs over soft = systemic fail)
- Structured JSON diagnostic artifacts for richer test result analysis

Metal thresholds: oslat/cyclictest soft=100us hard=500us,
hwlatdetect/deadline_test soft=100us hard=200us.
Non-metal thresholds: soft=7500us hard=10000us.

Unifies parseOslatResults and parseCyclictestResults into a single
parseLatencyResults function with comprehensive statistics (max, avg,
P99, per-CPU breakdown). Adds unit tests for the new parsing logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

openshift-ci-robot commented Mar 5, 2026

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

@jeff-roche
Contributor Author

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7e364250-18ad-11f1-9802-7c1b2e727428-0

@jeff-roche
Contributor Author

/test e2e-gcp-ovn-rt-upgrade

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@jeff-roche: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp-ovn | ceb21f5 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | ceb21f5 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-vsphere-ovn-upi | ceb21f5 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-fips | ceb21f5 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/e2e-vsphere-ovn | ceb21f5 | link | true | /test e2e-vsphere-ovn |
| ci/prow/e2e-aws-ovn-microshift | ceb21f5 | link | true | /test e2e-aws-ovn-microshift |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jeff-roche
Contributor Author

/retest

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
