[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time when no explicit timeout exists, set iteratio by tensorrt-cicd · Pull Request #13691 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-05-01T17:37:23Z

Summary

Root cause: ctx_only mode used hardcoded 5400s health check timeout (insufficient for 384GB model loading on PP4) and 5x iterations multiplier causing 1280 requests of 128k tokens to exceed the CI time budget
Fix: Derive timeout from slurm.job_time when no explicit timeout exists, set iterations=1 for ctx_only (matching gen_only as single-phase benchmark), gate DB upload on OpenSearch credentials
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6130334

Summary by CodeRabbit

Tests
- Enhanced timeout configuration handling with per-test tracking and dynamic fallback logic
- Strengthened database upload validation with stricter requirement conditions
- Improved test iteration behavior configuration for specific test modes

…ob_time and using iterations=1 The ctx_only benchmark mode had three issues causing timeout failures: 1. _get_aggr_commands() hardcoded DEFAULT_TIMEOUT (5400s) instead of using the timeout derived from the YAML config's job_time field (7200s) 2. iterations was set to multi_round (5) instead of 1, generating 1280 requests of 128k tokens instead of 256 - exceeding the time budget 3. upload_to_db was unconditionally True even when OpenSearch credentials are not configured, causing post-benchmark failures Fix: derive health-check timeout from slurm.job_time when no explicit timeout field exists, set iterations=1 for ctx_only (matching gen_only as a single-phase benchmark), and gate DB upload on credential env vars. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

coderabbitai · 2026-05-01T17:42:08Z

📝 Walkthrough

Walkthrough

Updates to the test configuration class to track a configurable timeout field (initialized from disaggregated SLURM config or defaulting to DEFAULT_TIMEOUT), use this computed timeout in aggregated command generation, enforce stricter conditions for database upload (requiring both a test name prefix containing "upload" and OpenSearch environment variables), and expand client iteration selection so that context-only mode shares behavior with generation-only mode by forcing iterations=1.

Changes

Cohort / File(s)	Summary
Timeout and Upload Configuration `tests/integration/defs/perf/test_perf_sanity.py`	Added `_timeout` field tracking initialized to `DEFAULT_TIMEOUT` and conditionally set from SLURM config. Updated aggregated command generation to use computed timeout. Tightened database upload conditions to require test name prefix containing "upload" plus OpenSearch env vars. Extended client iteration selection so `ctx_only` forces `iterations=1` like `gen_only` mode.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title is incomplete and truncated, cutting off mid-word at 'iteratio'. While it references the main fixes (timeout derivation and iteration setting), the truncation makes it unclear and unprofessional.	Complete the truncated title. Consider: '[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time and set iterations=1 for ctx_only mode'
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The description covers the root cause, fixes applied, and test plan. However, it's missing the structured format required by the template (Description section is empty, Test Coverage lacks detail, PR Checklist items are unchecked but not reviewed).	Expand the Description section to explicitly explain the issue and solution. Detail specific test cases in Test Coverage section. Review and check the PR Checklist items.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/defs/perf/test_perf_sanity.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the copyright header year to 2026.

Line 1 still says 2022-2025 although this file is modified in 2026.
Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` at line 1, Update the SPDX
copyright header year range in the file (the top-line comment starting with
"SPDX-FileCopyrightText") from "2022-2025" to "2022-2026" so the modified file
reflects the 2026 update; locate the header comment in
tests/integration/defs/perf/test_perf_sanity.py and change the year range
accordingly.

🧹 Nitpick comments (1)

tests/integration/defs/perf/test_perf_sanity.py (1)
1859-1862: QA list follow-up looks unnecessary for this PR.

This change adjusts perf harness behavior for existing perf sanity cases; no new integration test definition appears to require a new tests/integration/test_lists/qa/* entry.

As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1859 - 1862,
The QA-list follow-up is unnecessary because this change only modifies existing
perf harness behavior (the loop that appends entries to test_cases using
AGG_TEST_TYPES and config_yml) and does not add new integration test defs
requiring tests/integration/test_lists/qa/*; remove the QA follow-up from the PR
description/checklist and add a one-line comment next to the AGG_TEST_TYPES ->
test_cases loop (inside test_perf_sanity.py) stating “no new QA test_list entry
required for this change” so reviewers see the rationale.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1294-1302: The current parsing of slurm_config["job_time"] in the
block that sets timeout assumes a strict "HH:MM:SS" format and can raise on
malformed or day-prefixed values; update the logic in this function to robustly
parse job_time: first check job_time is a non-empty string, strip whitespace,
then handle a day prefix by splitting on "-" for day+time if present (convert
days to seconds), split the time portion on ":" and accept 2- or 3-part forms
(MM:SS or HH:MM:SS) while padding/mapping parts to hours/minutes/seconds,
convert each to int inside a try/except ValueError, and on any parsing error
fall back to DEFAULT_TIMEOUT (assign to timeout) and optionally log or warn;
finally set self._timeout = timeout. Ensure you reference and update the
variables timeout, job_time, DEFAULT_TIMEOUT and self._timeout.

---

Outside diff comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Line 1: Update the SPDX copyright header year range in the file (the top-line
comment starting with "SPDX-FileCopyrightText") from "2022-2025" to "2022-2026"
so the modified file reflects the 2026 update; locate the header comment in
tests/integration/defs/perf/test_perf_sanity.py and change the year range
accordingly.

---

Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1859-1862: The QA-list follow-up is unnecessary because this
change only modifies existing perf harness behavior (the loop that appends
entries to test_cases using AGG_TEST_TYPES and config_yml) and does not add new
integration test defs requiring tests/integration/test_lists/qa/*; remove the QA
follow-up from the PR description/checklist and add a one-line comment next to
the AGG_TEST_TYPES -> test_cases loop (inside test_perf_sanity.py) stating “no
new QA test_list entry required for this change” so reviewers see the rationale.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 197ad5e5-96c3-4f1f-bca4-fcc6b69f2857

📥 Commits

Reviewing files that changed from the base of the PR and between 81b5673 and 5fced2e.

📒 Files selected for processing (1)

tests/integration/defs/perf/test_perf_sanity.py

coderabbitai · 2026-05-01T17:42:11Z

+        timeout = slurm_config.get("timeout", None)
+        if timeout is None:
+            job_time = slurm_config.get("job_time", "")
+            if job_time:
+                parts = job_time.split(":")
+                timeout = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
+            else:
+                timeout = DEFAULT_TIMEOUT
+        self._timeout = timeout


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden slurm.job_time parsing to prevent config-time crashes.

Lines 1298-1299 assume strict HH:MM:SS; unexpected values (e.g., malformed or day-prefixed) can raise and abort test setup.

Suggested fix

- timeout = slurm_config.get("timeout", None) + timeout = slurm_config.get("timeout") if timeout is None: job_time = slurm_config.get("job_time", "") if job_time: - parts = job_time.split(":") - timeout = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2]) + try: + days = 0 + time_part = job_time + if "-" in job_time: + day_part, time_part = job_time.split("-", 1) + days = int(day_part) + hms = [int(part) for part in time_part.split(":")] + if len(hms) != 3: + raise ValueError(f"Unsupported job_time format: {job_time}") + timeout = days * 86400 + hms[0] * 3600 + hms[1] * 60 + hms[2] + except (TypeError, ValueError): + timeout = DEFAULT_TIMEOUT else: timeout = DEFAULT_TIMEOUT

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1294 - 1302, The current parsing of slurm_config["job_time"] in the block that sets timeout assumes a strict "HH:MM:SS" format and can raise on malformed or day-prefixed values; update the logic in this function to robustly parse job_time: first check job_time is a non-empty string, strip whitespace, then handle a day prefix by splitting on "-" for day+time if present (convert days to seconds), split the time portion on ":" and accept 2- or 3-part forms (MM:SS or HH:MM:SS) while padding/mapping parts to hours/minutes/seconds, convert each to int inside a try/except ValueError, and on any parsing error fall back to DEFAULT_TIMEOUT (assign to timeout) and optionally log or warn; finally set self._timeout = timeout. Ensure you reference and update the variables timeout, job_time, DEFAULT_TIMEOUT and self._timeout.

tensorrt-cicd requested a review from a team as a code owner May 1, 2026 17:37

github-actions Bot assigned tensorrt-cicd May 1, 2026

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time when no explicit timeout exists, set iteratio#13691

[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time when no explicit timeout exists, set iteratio#13691
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6130334

tensorrt-cicd commented May 1, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 1, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tensorrt-cicd commented May 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Links

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 1, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tensorrt-cicd commented May 1, 2026 •

edited by coderabbitai Bot

Loading