Skip to content

[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time when no explicit timeout exists, set iteratio#13691

Open
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6130334
Open

[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time when no explicit timeout exists, set iteratio#13691
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6130334

Conversation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

@tensorrt-cicd tensorrt-cicd commented May 1, 2026

Summary

  • Root cause: ctx_only mode used hardcoded 5400s health check timeout (insufficient for 384GB model loading on PP4) and 5x iterations multiplier causing 1280 requests of 128k tokens to exceed the CI time budget
  • Fix: Derive timeout from slurm.job_time when no explicit timeout exists, set iterations=1 for ctx_only (matching gen_only as single-phase benchmark), gate DB upload on OpenSearch credentials
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Tests
    • Enhanced timeout configuration handling with per-test tracking and dynamic fallback logic
    • Strengthened database upload validation with stricter requirement conditions
    • Improved test iteration behavior configuration for specific test modes

…ob_time and using iterations=1

The ctx_only benchmark mode had three issues causing timeout failures:
1. _get_aggr_commands() hardcoded DEFAULT_TIMEOUT (5400s) instead of using
   the timeout derived from the YAML config's job_time field (7200s)
2. iterations was set to multi_round (5) instead of 1, generating 1280
   requests of 128k tokens instead of 256 - exceeding the time budget
3. upload_to_db was unconditionally True even when OpenSearch credentials
   are not configured, causing post-benchmark failures

Fix: derive health-check timeout from slurm.job_time when no explicit
timeout field exists, set iterations=1 for ctx_only (matching gen_only
as a single-phase benchmark), and gate DB upload on credential env vars.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@tensorrt-cicd tensorrt-cicd requested a review from a team as a code owner May 1, 2026 17:37
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

Walkthrough

Updates to the test configuration class to track a configurable timeout field (initialized from disaggregated SLURM config or defaulting to DEFAULT_TIMEOUT), use this computed timeout in aggregated command generation, enforce stricter conditions for database upload (requiring both a test name prefix containing "upload" and OpenSearch environment variables), and expand client iteration selection so that context-only mode shares behavior with generation-only mode by forcing iterations=1.

Changes

Cohort / File(s) Summary
Timeout and Upload Configuration
tests/integration/defs/perf/test_perf_sanity.py
Added _timeout field tracking initialized to DEFAULT_TIMEOUT and conditionally set from SLURM config. Updated aggregated command generation to use computed timeout. Tightened database upload conditions to require test name prefix containing "upload" plus OpenSearch env vars. Extended client iteration selection so ctx_only forces iterations=1 like gen_only mode.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Title check ⚠️ Warning The title is incomplete and truncated, cutting off mid-word at 'iteratio'. While it references the main fixes (timeout derivation and iteration setting), the truncation makes it unclear and unprofessional. Complete the truncated title. Consider: '[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time and set iterations=1 for ctx_only mode'
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description covers the root cause, fixes applied, and test plan. However, it's missing the structured format required by the template (Description section is empty, Test Coverage lacks detail, PR Checklist items are unchecked but not reviewed). Expand the Description section to explicitly explain the issue and solution. Detail specific test cases in Test Coverage section. Review and check the PR Checklist items.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the copyright header year to 2026.

Line 1 still says 2022-2025 although this file is modified in 2026.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` at line 1, Update the SPDX
copyright header year range in the file (the top-line comment starting with
"SPDX-FileCopyrightText") from "2022-2025" to "2022-2026" so the modified file
reflects the 2026 update; locate the header comment in
tests/integration/defs/perf/test_perf_sanity.py and change the year range
accordingly.
🧹 Nitpick comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)

1859-1862: QA list follow-up looks unnecessary for this PR.

This change adjusts perf harness behavior for existing perf sanity cases; no new integration test definition appears to require a new tests/integration/test_lists/qa/* entry.

As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1859 - 1862,
The QA-list follow-up is unnecessary because this change only modifies existing
perf harness behavior (the loop that appends entries to test_cases using
AGG_TEST_TYPES and config_yml) and does not add new integration test defs
requiring tests/integration/test_lists/qa/*; remove the QA follow-up from the PR
description/checklist and add a one-line comment next to the AGG_TEST_TYPES ->
test_cases loop (inside test_perf_sanity.py) stating “no new QA test_list entry
required for this change” so reviewers see the rationale.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1294-1302: The current parsing of slurm_config["job_time"] in the
block that sets timeout assumes a strict "HH:MM:SS" format and can raise on
malformed or day-prefixed values; update the logic in this function to robustly
parse job_time: first check job_time is a non-empty string, strip whitespace,
then handle a day prefix by splitting on "-" for day+time if present (convert
days to seconds), split the time portion on ":" and accept 2- or 3-part forms
(MM:SS or HH:MM:SS) while padding/mapping parts to hours/minutes/seconds,
convert each to int inside a try/except ValueError, and on any parsing error
fall back to DEFAULT_TIMEOUT (assign to timeout) and optionally log or warn;
finally set self._timeout = timeout. Ensure you reference and update the
variables timeout, job_time, DEFAULT_TIMEOUT and self._timeout.

---

Outside diff comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Line 1: Update the SPDX copyright header year range in the file (the top-line
comment starting with "SPDX-FileCopyrightText") from "2022-2025" to "2022-2026"
so the modified file reflects the 2026 update; locate the header comment in
tests/integration/defs/perf/test_perf_sanity.py and change the year range
accordingly.

---

Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1859-1862: The QA-list follow-up is unnecessary because this
change only modifies existing perf harness behavior (the loop that appends
entries to test_cases using AGG_TEST_TYPES and config_yml) and does not add new
integration test defs requiring tests/integration/test_lists/qa/*; remove the QA
follow-up from the PR description/checklist and add a one-line comment next to
the AGG_TEST_TYPES -> test_cases loop (inside test_perf_sanity.py) stating “no
new QA test_list entry required for this change” so reviewers see the rationale.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 197ad5e5-96c3-4f1f-bca4-fcc6b69f2857

📥 Commits

Reviewing files that changed from the base of the PR and between 81b5673 and 5fced2e.

📒 Files selected for processing (1)
  • tests/integration/defs/perf/test_perf_sanity.py

Comment on lines +1294 to +1302
timeout = slurm_config.get("timeout", None)
if timeout is None:
job_time = slurm_config.get("job_time", "")
if job_time:
parts = job_time.split(":")
timeout = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
else:
timeout = DEFAULT_TIMEOUT
self._timeout = timeout
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden slurm.job_time parsing to prevent config-time crashes.

Lines 1298-1299 assume strict HH:MM:SS; unexpected values (e.g., malformed or day-prefixed) can raise and abort test setup.

Suggested fix
-        timeout = slurm_config.get("timeout", None)
+        timeout = slurm_config.get("timeout")
         if timeout is None:
             job_time = slurm_config.get("job_time", "")
             if job_time:
-                parts = job_time.split(":")
-                timeout = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
+                try:
+                    days = 0
+                    time_part = job_time
+                    if "-" in job_time:
+                        day_part, time_part = job_time.split("-", 1)
+                        days = int(day_part)
+                    hms = [int(part) for part in time_part.split(":")]
+                    if len(hms) != 3:
+                        raise ValueError(f"Unsupported job_time format: {job_time}")
+                    timeout = days * 86400 + hms[0] * 3600 + hms[1] * 60 + hms[2]
+                except (TypeError, ValueError):
+                    timeout = DEFAULT_TIMEOUT
             else:
                 timeout = DEFAULT_TIMEOUT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1294 - 1302,
The current parsing of slurm_config["job_time"] in the block that sets timeout
assumes a strict "HH:MM:SS" format and can raise on malformed or day-prefixed
values; update the logic in this function to robustly parse job_time: first
check job_time is a non-empty string, strip whitespace, then handle a day prefix
by splitting on "-" for day+time if present (convert days to seconds), split the
time portion on ":" and accept 2- or 3-part forms (MM:SS or HH:MM:SS) while
padding/mapping parts to hours/minutes/seconds, convert each to int inside a
try/except ValueError, and on any parsing error fall back to DEFAULT_TIMEOUT
(assign to timeout) and optionally log or warn; finally set self._timeout =
timeout. Ensure you reference and update the variables timeout, job_time,
DEFAULT_TIMEOUT and self._timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant