OCPEDGE-1036: fix: latency tuning for the rt-kernel tests on AWS metal by jeff-roche · Pull Request #30790 · openshift/origin

jeff-roche · 2026-02-17T00:15:39Z

Summary

Replaces the binary pass/fail latency detection with a smarter three-tier analysis that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

Soft threshold: expected max latency — exceeding triggers a warning but the test still passes
Hard threshold: absolute max latency — exceeding triggers a test failure

Statistical percentage-based detection:

If >5% of CPUs exceed the soft threshold, the test fails as a systemic latency issue
Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure

Structured JSON diagnostic artifacts:

Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)

Thresholds

Test	Metal Soft	Metal Hard	Non-Metal Soft	Non-Metal Hard
oslat	100us	500us	7500us	10000us
cyclictest	100us	500us	7500us	10000us
hwlatdetect	100us	200us	7500us	10000us
deadline_test	100us	200us	7500us	10000us

Code cleanup

Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function
Added unit tests for the new parsing logic

Expected behavior with real job data

Scenario	Old Behavior	New Behavior
1/80 CPUs at 211us	FAIL	WARN (pass)
1/91 CPUs at 241us	FAIL	WARN (pass)
10/80 CPUs at 300us	FAIL	FAIL (systemic)
1/80 CPUs at 600us	FAIL	FAIL (hard threshold)
All CPUs under 100us	PASS	PASS

openshift-ci-robot · 2026-02-17T00:15:41Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-02-17T00:15:43Z

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

100 microsecond latency cap for metal instances

7500 microsecond latency cap for non-metal instance types (previous default)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jeff-roche · 2026-02-17T00:16:29Z

/test ?

jeff-roche · 2026-02-17T00:17:34Z

/test e2e-gcp-ovn-rt-upgrade

jeff-roche · 2026-02-17T00:18:00Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade

openshift-ci · 2026-02-17T00:18:15Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1e7a2f80-0b96-11f1-9252-e810dd3e02ff-0

openshift-ci-robot · 2026-02-17T00:36:31Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

jeff-roche · 2026-02-17T14:39:11Z

The payload job fails to upgrade but the RT Tests themselves pass. Addressing the upgrade failures on #30608

qJkee · 2026-02-17T15:21:14Z

/lgtm

openshift-ci · 2026-02-17T15:21:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jeff-roche, qJkee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/extended/kernel/OWNERS~~ [jeff-roche,qJkee]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jeff-roche · 2026-02-17T17:30:40Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

openshift-ci · 2026-02-17T17:30:43Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61996750-0c26-11f1-8539-63c794c57c62-0

jeff-roche · 2026-02-24T17:45:30Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

openshift-ci · 2026-02-24T17:45:34Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9cdf16c0-11a8-11f1-9f71-c9b6ff3ae133-0

jeff-roche · 2026-02-25T14:07:20Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

openshift-ci · 2026-02-25T14:07:29Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4d5afc30-1253-11f1-9df8-fe79311f410f-0

openshift-ci · 2026-02-26T11:56:48Z

New changes are detected. LGTM label has been removed.

jeff-roche · 2026-02-26T11:57:03Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

coderabbitai · 2026-02-26T11:57:07Z

Walkthrough

getRealTimeWorkerNodes now returns a slice of node names and detects non-metal nodes (which forces all real-time thresholds to 7500µs). Multiple real-time test runners were refactored to use a centralized rtTestThresholds map, capture command output, wrap errors, and write timestamped per-test artifacts.

Changes

Cohort / File(s)	Summary
Real-time Worker Node Collection `test/extended/kernel/common.go`	Changed signature to return `[]string`. Builds a pre-sized slice, detects nodes with/without the "metal" instance-type label, logs when non-metal nodes exist, and sets all entries in `rtTestThresholds` to 7500µs when a non-metal node is found.
Latency Thresholds & Test Runners `test/extended/kernel/tools.go`	Introduced/used centralized `rtTestThresholds` map. Refactored runners (`runDeadlineTest`, `runHwlatdetect`, `runOslat`, `runCyclictest`, `runPiStressFifo`, `runPiStressRR`) to derive `testName`, use per-test thresholds for parsing/logging, capture command output, wrap errors with contextual messages, and write timestamped per-test log/artifact files (uses time/e2e.TimeNow()).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: fixing latency tuning for real-time kernel tests specifically on AWS metal instances, which aligns with the core modifications in the pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in kernel test files are stable, deterministic static strings with no dynamic content like timestamps, UUIDs, or node names.
Test Structure And Quality	✅ Passed	Modified files are helper utilities containing test infrastructure functions rather than actual test cases with BDD patterns like It blocks or Expect assertions.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-02-26T11:57:26Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/44998af0-130a-11f1-9082-40e95c4d00b8-0

openshift-ci-robot · 2026-02-26T11:58:14Z

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

100 microsecond latency cap for metal instances

7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

Tests

Enhanced real-time kernel test infrastructure with adaptive threshold configuration based on hardware type detection.

Improved test logging and artifact collection with timestamped output files for better result tracking and debugging.

Refactored test execution with better error handling and context reporting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

test/extended/kernel/common.go (1)
64-68: Side effect in getter function modifies global state.

getRealTimeWorkerNodes modifies the global rtTestThresholds map, which is unexpected for a function with a "get" prefix. This couples threshold configuration to node discovery and makes the behavior harder to reason about.

Consider either:

Renaming the function to reflect it configures thresholds (e.g., setupRealTimeWorkerNodes)

Returning the metal status and handling threshold adjustment at the call site
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/common.go` around lines 64 - 68, getRealTimeWorkerNodes
currently mutates the global rtTestThresholds map (when nodesAreMetal is false),
which is a surprising side effect for a getter; stop modifying rtTestThresholds
inside getRealTimeWorkerNodes and instead either (A) rename
getRealTimeWorkerNodes to setupRealTimeWorkerNodes if you intend it to configure
thresholds, or (B) change getRealTimeWorkerNodes to only return the metal status
(bool nodesAreMetal) and move the rtTestThresholds adjustments out to the call
site so callers can set rtTestThresholds[test] = 7500 when nodesAreMetal is
false; update all callers of getRealTimeWorkerNodes accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/kernel/common.go`:
- Around line 57-68: The metal-detection currently uses node.GetLabels() and
sets nodesAreMetal = false if any worker node isn't metal, which incorrectly
flags clusters where non-RT workers are non-metal; update the logic so the metal
check is only performed for nodes that match the RT kernel condition (the same
condition used to select RT nodes) — i.e., inside the RT kernel match block
iterate those nodes, call node.GetLabels(), and only then modify nodesAreMetal
and adjust rtTestThresholds; reference variables/functions: node.GetLabels(),
nodesAreMetal, rtTestThresholds, and the RT kernel match condition so the
threshold padding runs only when RT nodes are detected as non-metal.
- Line 48: Replace the incorrect capacity argument on the nodes slice
allocation: the current call uses kubeNodes.Size() (which returns protobuf
serialized size) when constructing nodes via make([]string, 0, ...); change it
to use the number of items with len(kubeNodes.Items) so nodes = make([]string,
0, len(kubeNodes.Items)). Update the allocation site that references kubeNodes
and the nodes variable in test/extended/kernel/common.go (search for the
make([]string, 0, kubeNodes.Size()) occurrence).

In `@test/extended/kernel/tools.go`:
- Around line 165-167: The error message in runCyclictest incorrectly references
"oslat test"; update the returned fmt.Errorf string in the runCyclictest
function (where cpuCount is checked) to reference "cyclictest" (or
"runCyclictest") instead and preserve the numeric cpuCount interpolation and
wording; ensure only the test name in the message is changed so the check using
cpuCount and the fmt.Errorf call remain otherwise identical.

---

Nitpick comments:
In `@test/extended/kernel/common.go`:
- Around line 64-68: getRealTimeWorkerNodes currently mutates the global
rtTestThresholds map (when nodesAreMetal is false), which is a surprising side
effect for a getter; stop modifying rtTestThresholds inside
getRealTimeWorkerNodes and instead either (A) rename getRealTimeWorkerNodes to
setupRealTimeWorkerNodes if you intend it to configure thresholds, or (B) change
getRealTimeWorkerNodes to only return the metal status (bool nodesAreMetal) and
move the rtTestThresholds adjustments out to the call site so callers can set
rtTestThresholds[test] = 7500 when nodesAreMetal is false; update all callers of
getRealTimeWorkerNodes accordingly.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3cec87c and 8de1caf.

📒 Files selected for processing (2)

test/extended/kernel/common.go
test/extended/kernel/tools.go

test/extended/kernel/common.go

test/extended/kernel/tools.go

openshift-ci-robot · 2026-02-26T12:27:57Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

openshift-ci-robot · 2026-03-02T17:34:23Z

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

100 microsecond latency cap for metal instances

7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

Tests

Enhanced real-time kernel test infrastructure with adaptive per-test thresholds when non-metal nodes are detected.

Improved test execution with unified per-test names, richer error context, and consistent output capture.

Added timestamped artifact logging for each test to simplify result tracking and debugging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jeff-roche · 2026-03-02T17:34:39Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

openshift-ci · 2026-03-02T17:34:48Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1732ad10-165e-11f1-9f18-6b2d93667007-0

jeff-roche · 2026-03-03T16:00:26Z

/retest

jeff-roche · 2026-03-03T23:05:11Z

/retest

openshift-ci-robot · 2026-03-03T23:20:13Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

openshift-ci-robot · 2026-03-04T14:33:22Z

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

100 microsecond latency cap for metal instances

7500 microsecond latency cap for non-metal instance types (previous default)

Summary by CodeRabbit

Tests

Enhanced real-time kernel test infrastructure with adaptive per-test thresholds when any non-metal node is detected.

Unified per-test naming and thresholds across runners for consistent parsing, richer error context, and reliable output capture.

Added timestamped per-test artifact logging to simplify result tracking and debugging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

🧹 Nitpick comments (2)

test/extended/kernel/tools.go (2)

20-25: Consider guarded threshold access instead of raw map indexing.

Using string keys with direct map indexing can silently fall back to 0 if a key drifts, which makes failures harder to diagnose. A small helper with ok checks would make this safer.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/tools.go` around lines 20 - 25, rtTestThresholds is
being read with raw map indexing which returns 0 for missing keys; add a guarded
accessor to make missing-key cases explicit. Implement a helper like
getRTTestThreshold(testName string) (int, bool) or GetRTTestThreshold(testName
string) (int, error) that looks up rtTestThresholds, returns the value and an ok
flag (or an error) and use that helper wherever rtTestThresholds is read so
callers can handle missing keys instead of silently getting 0; reference the
rtTestThresholds map and replace any direct accesses with calls to
getRTTestThreshold.

29-33: Persist command output even on failure paths.

Current flow returns on err before writing artifacts, so failure diagnostics are lost. This same pattern appears in the other runners that capture res.

Proposed adjustment

 	res, err := oc.SetNamespace(rtNamespace).Run("exec").Args(args...).Output()
+	writeTestArtifacts(fmt.Sprintf("%s_%s.log", "pi_stress_standard", e2e.TimeNow().Format(time.RFC3339)), res)
 	if err != nil {
 		// An error here indicates thresholds were exceeded or an issue with the test
 		return errors.Wrap(err, "error running pi_stress with the standard algorithm")
 	}
-
-	writeTestArtifacts(fmt.Sprintf("%s_%s.log", "pi_stress_standard", e2e.TimeNow().Format(time.RFC3339)), res)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/extended/kernel/tools.go` around lines 29 - 33, The command output (res)
is discarded when oc.SetNamespace(...).Run("exec").Args(...).Output() returns an
err; update the err != nil branch to persist the captured res before returning.
Specifically, inside the error path of the block that reads res and err, write
res to the test artifacts/logging (using the existing test artifact writer or
logger used elsewhere in this package) with a clear filename/context, then
return errors.Wrap(err, "error running pi_stress with the standard algorithm")
as before. Ensure the same pattern is applied to the other runner blocks that
capture res so failures always include the command output for diagnostics.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/extended/kernel/tools.go`:
- Around line 20-25: rtTestThresholds is being read with raw map indexing which
returns 0 for missing keys; add a guarded accessor to make missing-key cases
explicit. Implement a helper like getRTTestThreshold(testName string) (int,
bool) or GetRTTestThreshold(testName string) (int, error) that looks up
rtTestThresholds, returns the value and an ok flag (or an error) and use that
helper wherever rtTestThresholds is read so callers can handle missing keys
instead of silently getting 0; reference the rtTestThresholds map and replace
any direct accesses with calls to getRTTestThreshold.
- Around line 29-33: The command output (res) is discarded when
oc.SetNamespace(...).Run("exec").Args(...).Output() returns an err; update the
err != nil branch to persist the captured res before returning. Specifically,
inside the error path of the block that reads res and err, write res to the test
artifacts/logging (using the existing test artifact writer or logger used
elsewhere in this package) with a clear filename/context, then return
errors.Wrap(err, "error running pi_stress with the standard algorithm") as
before. Ensure the same pattern is applied to the other runner blocks that
capture res so failures always include the command output for diagnostics.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b0195779-df43-4df7-8bf9-a8d96aabd777

📥 Commits

Reviewing files that changed from the base of the PR and between 40cfb1d and 1059dd7.

📒 Files selected for processing (2)

test/extended/kernel/common.go
test/extended/kernel/tools.go

openshift-ci-robot · 2026-03-04T14:55:42Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

jeff-roche · 2026-03-04T15:25:11Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

openshift-ci · 2026-03-04T15:25:27Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5684d340-17de-11f1-9987-ae4c2d50ffbc-0

Replace binary pass/fail latency detection with a three-tier analysis: - Two-tier thresholds (soft/hard) to distinguish warnings from failures - Statistical percentage-based detection (>5% CPUs over soft = systemic fail) - Structured JSON diagnostic artifacts for richer test result analysis Metal thresholds: oslat/cyclictest soft=100us hard=500us, hwlatdetect/deadline_test soft=100us hard=200us. Non-metal thresholds: soft=7500us hard=10000us. Unifies parseOslatResults and parseCyclictestResults into a single parseLatencyResults function with comprehensive statistics (max, avg, P99, per-CPU breakdown). Adds unit tests for the new parsing logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot · 2026-03-05T16:06:05Z

@jeff-roche: This pull request references OCPEDGE-1036 which is a valid jira issue.

Details

In response to this:

Summary

Replaces the binary pass/fail latency detection with a smarter three-tier analysis that distinguishes real RT kernel issues from environmental noise (e.g., isolated single-CPU spikes on AWS metal instances).

Changes

Two-tier soft/hard thresholds:

Soft threshold: expected max latency — exceeding triggers a warning but the test still passes

Hard threshold: absolute max latency — exceeding triggers a test failure

Statistical percentage-based detection:

If >5% of CPUs exceed the soft threshold, the test fails as a systemic latency issue

Isolated spikes on a single CPU (e.g., 1 out of 80) produce a warning, not a failure

Structured JSON diagnostic artifacts:

Each latency test now writes an _analysis.json artifact with: max, avg, P99 latency, per-CPU breakdown, soft/hard threshold counts, and overall result (PASS/WARN/FAIL)

Thresholds

Test Metal Soft Metal Hard Non-Metal Soft Non-Metal Hard

oslat 100us 500us 7500us 10000us

cyclictest 100us 500us 7500us 10000us

hwlatdetect 100us 200us 7500us 10000us

deadline_test 100us 200us 7500us 10000us

Code cleanup

Unified parseOslatResults and parseCyclictestResults into a single parseLatencyResults function

Added unit tests for the new parsing logic

Expected behavior with real job data

Scenario Old Behavior New Behavior

1/80 CPUs at 211us FAIL WARN (pass)

1/91 CPUs at 241us FAIL WARN (pass)

10/80 CPUs at 300us FAIL FAIL (systemic)

1/80 CPUs at 600us FAIL FAIL (hard threshold)

All CPUs under 100us PASS PASS

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jeff-roche · 2026-03-05T16:08:04Z

/payload-job periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

openshift-ci · 2026-03-05T16:08:21Z

@jeff-roche: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-metal-ovn-single-node-rt-upgrade-test

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7e364250-18ad-11f1-9802-7c1b2e727428-0

jeff-roche · 2026-03-05T16:09:31Z

/test e2e-gcp-ovn-rt-upgrade

openshift-ci-robot · 2026-03-05T16:35:01Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

openshift-ci · 2026-03-05T20:49:32Z

@jeff-roche: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-ovn	`ceb21f5`	link	true	`/test e2e-gcp-ovn`
ci/prow/e2e-metal-ipi-ovn-ipv6	`ceb21f5`	link	true	`/test e2e-metal-ipi-ovn-ipv6`
ci/prow/e2e-vsphere-ovn-upi	`ceb21f5`	link	true	`/test e2e-vsphere-ovn-upi`
ci/prow/e2e-aws-ovn-fips	`ceb21f5`	link	true	`/test e2e-aws-ovn-fips`
ci/prow/e2e-vsphere-ovn	`ceb21f5`	link	true	`/test e2e-vsphere-ovn`
ci/prow/e2e-aws-ovn-microshift	`ceb21f5`	link	true	`/test e2e-aws-ovn-microshift`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

jeff-roche · 2026-03-05T21:14:15Z

/retest

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 17, 2026

openshift-ci bot requested review from deads2k and sjenning February 17, 2026 00:17

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2026

openshift-ci bot assigned qJkee Feb 17, 2026

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 17, 2026

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026

coderabbitai bot reviewed Feb 26, 2026

View reviewed changes

test/extended/kernel/common.go Outdated Show resolved Hide resolved

test/extended/kernel/common.go Outdated Show resolved Hide resolved

test/extended/kernel/tools.go Show resolved Hide resolved

jeff-roche force-pushed the rt-test-latencies branch from 8de1caf to 40cfb1d Compare March 2, 2026 17:32

jeff-roche force-pushed the rt-test-latencies branch from 40cfb1d to 1059dd7 Compare March 4, 2026 14:31

coderabbitai bot reviewed Mar 4, 2026

View reviewed changes

jeff-roche force-pushed the rt-test-latencies branch from 1059dd7 to ceb21f5 Compare March 5, 2026 16:05

Conversation

jeff-roche commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Thresholds

Code cleanup

Expected behavior with real job data

Uh oh!

openshift-ci-robot commented Feb 17, 2026

Uh oh!

openshift-ci-robot commented Feb 17, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeff-roche commented Feb 17, 2026

Uh oh!

jeff-roche commented Feb 17, 2026

Uh oh!

jeff-roche commented Feb 17, 2026

Uh oh!

openshift-ci bot commented Feb 17, 2026

Uh oh!

openshift-ci-robot commented Feb 17, 2026

Uh oh!

jeff-roche commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qJkee commented Feb 17, 2026

Uh oh!

openshift-ci bot commented Feb 17, 2026

Uh oh!

jeff-roche commented Feb 17, 2026

Uh oh!

openshift-ci bot commented Feb 17, 2026

Uh oh!

jeff-roche commented Feb 24, 2026

Uh oh!

openshift-ci bot commented Feb 24, 2026

Uh oh!

jeff-roche commented Feb 25, 2026

Uh oh!

openshift-ci bot commented Feb 25, 2026

Uh oh!

openshift-ci bot commented Feb 26, 2026

Uh oh!

jeff-roche commented Feb 26, 2026

Uh oh!

coderabbitai bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

openshift-ci bot commented Feb 26, 2026

Uh oh!

openshift-ci-robot commented Feb 26, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

openshift-ci-robot commented Feb 26, 2026

Uh oh!

openshift-ci-robot commented Mar 2, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

jeff-roche commented Mar 2, 2026

Uh oh!

openshift-ci bot commented Mar 2, 2026

Uh oh!

jeff-roche commented Mar 3, 2026

Uh oh!

jeff-roche commented Mar 3, 2026

Uh oh!

jeff-roche commented Feb 17, 2026 •

edited

Loading

openshift-ci-robot commented Feb 17, 2026 •

edited by openshift-ci bot

Loading

jeff-roche commented Feb 17, 2026 •

edited

Loading

coderabbitai bot commented Feb 26, 2026 •

edited

Loading

openshift-ci-robot commented Feb 26, 2026 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Mar 2, 2026 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Mar 4, 2026 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Mar 5, 2026 •

edited by openshift-ci bot

Loading