[https://nvbugs/6130334][fix] Derive timeout from slurm.job_time when no explicit timeout exists, set iteratio#13691
Conversation
…ob_time and using iterations=1 The ctx_only benchmark mode had three issues causing timeout failures: 1. _get_aggr_commands() hardcoded DEFAULT_TIMEOUT (5400s) instead of using the timeout derived from the YAML config's job_time field (7200s) 2. iterations was set to multi_round (5) instead of 1, generating 1280 requests of 128k tokens instead of 256 - exceeding the time budget 3. upload_to_db was unconditionally True even when OpenSearch credentials are not configured, causing post-benchmark failures Fix: derive health-check timeout from slurm.job_time when no explicit timeout field exists, set iterations=1 for ctx_only (matching gen_only as a single-phase benchmark), and gate DB upload on credential env vars. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
📝 WalkthroughWalkthroughUpdates to the test configuration class to track a configurable timeout field (initialized from disaggregated SLURM config or defaulting to Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)
1-1:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate the copyright header year to 2026.
Line 1 still says
2022-2025although this file is modified in 2026.Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/test_perf_sanity.py` at line 1, Update the SPDX copyright header year range in the file (the top-line comment starting with "SPDX-FileCopyrightText") from "2022-2025" to "2022-2026" so the modified file reflects the 2026 update; locate the header comment in tests/integration/defs/perf/test_perf_sanity.py and change the year range accordingly.
🧹 Nitpick comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)
1859-1862: QA list follow-up looks unnecessary for this PR.This change adjusts perf harness behavior for existing perf sanity cases; no new integration test definition appears to require a new
tests/integration/test_lists/qa/*entry.As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1859 - 1862, The QA-list follow-up is unnecessary because this change only modifies existing perf harness behavior (the loop that appends entries to test_cases using AGG_TEST_TYPES and config_yml) and does not add new integration test defs requiring tests/integration/test_lists/qa/*; remove the QA follow-up from the PR description/checklist and add a one-line comment next to the AGG_TEST_TYPES -> test_cases loop (inside test_perf_sanity.py) stating “no new QA test_list entry required for this change” so reviewers see the rationale.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1294-1302: The current parsing of slurm_config["job_time"] in the
block that sets timeout assumes a strict "HH:MM:SS" format and can raise on
malformed or day-prefixed values; update the logic in this function to robustly
parse job_time: first check job_time is a non-empty string, strip whitespace,
then handle a day prefix by splitting on "-" for day+time if present (convert
days to seconds), split the time portion on ":" and accept 2- or 3-part forms
(MM:SS or HH:MM:SS) while padding/mapping parts to hours/minutes/seconds,
convert each to int inside a try/except ValueError, and on any parsing error
fall back to DEFAULT_TIMEOUT (assign to timeout) and optionally log or warn;
finally set self._timeout = timeout. Ensure you reference and update the
variables timeout, job_time, DEFAULT_TIMEOUT and self._timeout.
---
Outside diff comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Line 1: Update the SPDX copyright header year range in the file (the top-line
comment starting with "SPDX-FileCopyrightText") from "2022-2025" to "2022-2026"
so the modified file reflects the 2026 update; locate the header comment in
tests/integration/defs/perf/test_perf_sanity.py and change the year range
accordingly.
---
Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1859-1862: The QA-list follow-up is unnecessary because this
change only modifies existing perf harness behavior (the loop that appends
entries to test_cases using AGG_TEST_TYPES and config_yml) and does not add new
integration test defs requiring tests/integration/test_lists/qa/*; remove the QA
follow-up from the PR description/checklist and add a one-line comment next to
the AGG_TEST_TYPES -> test_cases loop (inside test_perf_sanity.py) stating “no
new QA test_list entry required for this change” so reviewers see the rationale.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 197ad5e5-96c3-4f1f-bca4-fcc6b69f2857
📒 Files selected for processing (1)
tests/integration/defs/perf/test_perf_sanity.py
| timeout = slurm_config.get("timeout", None) | ||
| if timeout is None: | ||
| job_time = slurm_config.get("job_time", "") | ||
| if job_time: | ||
| parts = job_time.split(":") | ||
| timeout = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2]) | ||
| else: | ||
| timeout = DEFAULT_TIMEOUT | ||
| self._timeout = timeout |
There was a problem hiding this comment.
Harden slurm.job_time parsing to prevent config-time crashes.
Lines 1298-1299 assume strict HH:MM:SS; unexpected values (e.g., malformed or day-prefixed) can raise and abort test setup.
Suggested fix
- timeout = slurm_config.get("timeout", None)
+ timeout = slurm_config.get("timeout")
if timeout is None:
job_time = slurm_config.get("job_time", "")
if job_time:
- parts = job_time.split(":")
- timeout = int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
+ try:
+ days = 0
+ time_part = job_time
+ if "-" in job_time:
+ day_part, time_part = job_time.split("-", 1)
+ days = int(day_part)
+ hms = [int(part) for part in time_part.split(":")]
+ if len(hms) != 3:
+ raise ValueError(f"Unsupported job_time format: {job_time}")
+ timeout = days * 86400 + hms[0] * 3600 + hms[1] * 60 + hms[2]
+ except (TypeError, ValueError):
+ timeout = DEFAULT_TIMEOUT
else:
timeout = DEFAULT_TIMEOUT🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1294 - 1302,
The current parsing of slurm_config["job_time"] in the block that sets timeout
assumes a strict "HH:MM:SS" format and can raise on malformed or day-prefixed
values; update the logic in this function to robustly parse job_time: first
check job_time is a non-empty string, strip whitespace, then handle a day prefix
by splitting on "-" for day+time if present (convert days to seconds), split the
time portion on ":" and accept 2- or 3-part forms (MM:SS or HH:MM:SS) while
padding/mapping parts to hours/minutes/seconds, convert each to int inside a
try/except ValueError, and on any parsing error fall back to DEFAULT_TIMEOUT
(assign to timeout) and optionally log or warn; finally set self._timeout =
timeout. Ensure you reference and update the variables timeout, job_time,
DEFAULT_TIMEOUT and self._timeout.
Summary
Test plan
Links
Summary by CodeRabbit