[https://nvbugs/6115832][fix] Fix SSE stream parsing in benchmark client to handle split chunks#13686
[https://nvbugs/6115832][fix] Fix SSE stream parsing in benchmark client to handle split chunks#13686tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
…e tolerance for high-concurrency perf tests Fix flaky perf sanity test failures caused by two issues: 1. SSE stream parsing: aiohttp response.content yields arbitrary byte chunks under high concurrency that don't align with SSE event boundaries, causing json.loads failures or LineTooLong errors. Add _iter_sse_lines() buffered parser that accumulates bytes and splits on newline boundaries before decoding. 2. Benchmark error checking: Zero tolerance for any failed request is too strict for 1280-request high-concurrency benchmarks where transient network issues can cause rare individual failures. Add 1% failure rate threshold - requests below this are considered acceptable for performance measurement purposes. Also make upload_to_db conditional on OPEN_SEARCH_DB_BASE_URL being set to avoid RuntimeError when the DB endpoint is unreachable. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
📝 WalkthroughWalkthroughTwo separate changes: (1) Extracts SSE parsing logic into a shared Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/serve/scripts/backend_request_func.py (1)
1-4:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd/update the NVIDIA copyright header on this modified source file.
This file was modified but still lacks the NVIDIA copyright notice/year block required for
.pysources.Suggested header adjustment
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # Adopted from # https://github.com/vllm-project/vllm/blob/200bbf92e8861e2458a6f90bca73f40cc3b1ad1f/benchmarks/backend_request_func.py # SPDX-License-Identifier: Apache-2.0As per coding guidelines: "All source files (.cpp, .h, .cu, .py) should contain an NVIDIA copyright header with the year of latest modification and Apache 2.0 license notice."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/serve/scripts/backend_request_func.py` around lines 1 - 4, The file backend_request_func.py is missing the required NVIDIA copyright/header block; add the standard NVIDIA copyright header comment at the top of backend_request_func.py including the year of the latest modification and the Apache-2.0 license notice consistent with other .py sources in the repo (replace the placeholder year with the current modification year), ensuring the header appears before any code or comments so tools and auditors recognize the file as covered.tests/integration/defs/perf/test_perf_sanity.py (1)
1-1:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate SPDX copyright year for this modified file
This file was modified, but the header still ends at 2025. Please update it to include 2026.
As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/test_perf_sanity.py` at line 1, Update the SPDX copyright header line that currently ends with "2025" to include 2026 (e.g., change "2022-2025" to "2022-2026") so the modified file's header reflects the new year; locate the SPDX header string at the top of tests/integration/defs/perf/test_perf_sanity.py (the line starting with "SPDX-FileCopyrightText") and update the year range accordingly.
🧹 Nitpick comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)
1524-1572: QA list hygiene check: list updates are likely unnecessary, but coverage should be confirmedSince this changes pass/fail behavior inside an existing perf sanity definition (not adding a new test), QA list additions are likely unnecessary. Please confirm current perf list entries already exercise high-concurrency streaming cases where this tolerance is intended.
As per coding guidelines: “If the change adds or materially alters an integration test… call out whether an entry is needed under tests/integration/test_lists/qa/ … and test-db perf lists.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1524 - 1572, This change introduces a 1% tolerance for failed requests (max_failure_rate) and special-case logic around failed_requests_match, explicit markers ("!FAILED REQUESTS!", "!CHECK LOG FOR ERRORS!"), and SA benchmark handling (num_prompts vs Successful requests); verify whether existing QA/perf test lists already cover high-concurrency streaming scenarios targeted by this tolerance and if not, add an entry referencing this behavioral change to the appropriate integration test lists so QA runs exercise these cases; if the lists already cover it, state that in the PR description (no list changes needed) and ensure the test message or doc string near max_failure_rate documents the intended tolerance and why SA benchmark logic (num_prompts handling) is needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/serve/scripts/backend_request_func.py`:
- Around line 29-39: The stream loop drops a final unterminated "data:" line if
the response ends without a trailing newline; after the async for chunk in
response_content loop finishes, check if buf is non-empty and treat it as a
final line: decode buf to a string, strip it, validate it starts with "data:",
strip the "data:" prefix (using line.removeprefix("data:").lstrip()), ignore
"[DONE]" and empty lines, and yield the final payload the same way the in-loop
code does; update the code paths around the existing buf handling/yield logic so
the final payload is emitted exactly once.
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1524-1539: The 1% max_failure_rate (max_failure_rate) is currently
applied unconditionally when computing failure_rate from
failed_requests_match/total_requests; change the logic so the tolerance is only
applied for high-concurrency runs by gating the comparison (failure_rate >
max_failure_rate) behind a concurrency or request-volume check (e.g., a
concurrency variable or total_requests threshold) or make max_failure_rate
configurable per test; update the block that computes failed_count,
total_requests and failure_rate to first determine if the run qualifies as
"high-concurrency" (using the existing concurrency/request count parameter) and
only then compare against max_failure_rate, otherwise treat any failures as hard
failures.
---
Outside diff comments:
In `@tensorrt_llm/serve/scripts/backend_request_func.py`:
- Around line 1-4: The file backend_request_func.py is missing the required
NVIDIA copyright/header block; add the standard NVIDIA copyright header comment
at the top of backend_request_func.py including the year of the latest
modification and the Apache-2.0 license notice consistent with other .py sources
in the repo (replace the placeholder year with the current modification year),
ensuring the header appears before any code or comments so tools and auditors
recognize the file as covered.
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Line 1: Update the SPDX copyright header line that currently ends with "2025"
to include 2026 (e.g., change "2022-2025" to "2022-2026") so the modified file's
header reflects the new year; locate the SPDX header string at the top of
tests/integration/defs/perf/test_perf_sanity.py (the line starting with
"SPDX-FileCopyrightText") and update the year range accordingly.
---
Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1524-1572: This change introduces a 1% tolerance for failed
requests (max_failure_rate) and special-case logic around failed_requests_match,
explicit markers ("!FAILED REQUESTS!", "!CHECK LOG FOR ERRORS!"), and SA
benchmark handling (num_prompts vs Successful requests); verify whether existing
QA/perf test lists already cover high-concurrency streaming scenarios targeted
by this tolerance and if not, add an entry referencing this behavioral change to
the appropriate integration test lists so QA runs exercise these cases; if the
lists already cover it, state that in the PR description (no list changes
needed) and ensure the test message or doc string near max_failure_rate
documents the intended tolerance and why SA benchmark logic (num_prompts
handling) is needed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 33ec1459-ce8e-4f07-ae7c-9e16ccf8b243
📒 Files selected for processing (2)
tensorrt_llm/serve/scripts/backend_request_func.pytests/integration/defs/perf/test_perf_sanity.py
| async for chunk in response_content: | ||
| buf += chunk | ||
| while b"\n" in buf: | ||
| line_bytes, buf = buf.split(b"\n", 1) | ||
| line = line_bytes.decode("utf-8").strip() | ||
| if not line or not line.startswith("data:"): | ||
| continue | ||
| payload = line.removeprefix("data:").lstrip() | ||
| if payload == "[DONE]": | ||
| continue | ||
| yield payload |
There was a problem hiding this comment.
Handle trailing unterminated SSE data at EOF.
If the stream ends without a final newline, the buffered last data: payload is dropped, which can truncate output or miss final usage metadata.
Suggested fix
async def _iter_sse_data(response_content):
@@
async for chunk in response_content:
buf += chunk
while b"\n" in buf:
line_bytes, buf = buf.split(b"\n", 1)
line = line_bytes.decode("utf-8").strip()
if not line or not line.startswith("data:"):
continue
payload = line.removeprefix("data:").lstrip()
if payload == "[DONE]":
continue
yield payload
+
+ if buf:
+ line = buf.decode("utf-8").strip()
+ if line.startswith("data:"):
+ payload = line.removeprefix("data:").lstrip()
+ if payload and payload != "[DONE]":
+ yield payload🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/serve/scripts/backend_request_func.py` around lines 29 - 39, The
stream loop drops a final unterminated "data:" line if the response ends without
a trailing newline; after the async for chunk in response_content loop finishes,
check if buf is non-empty and treat it as a final line: decode buf to a string,
strip it, validate it starts with "data:", strip the "data:" prefix (using
line.removeprefix("data:").lstrip()), ignore "[DONE]" and empty lines, and yield
the final payload the same way the in-loop code does; update the code paths
around the existing buf handling/yield logic so the final payload is emitted
exactly once.
| # Tolerance: allow up to 1% failed requests for high-concurrency benchmarks | ||
| # where transient network issues can cause rare individual request failures. | ||
| max_failure_rate = 0.01 | ||
|
|
||
| # Check for non-zero failed requests (default benchmark) | ||
| failed_requests_match = re.search(r"Failed requests:\s+(\d+)", output) | ||
| if failed_requests_match: | ||
| failed_count = int(failed_requests_match.group(1)) | ||
| if failed_count > 0: | ||
| error_msg = f"Benchmark output contains {failed_count} failed requests." | ||
| raise RuntimeError(error_msg) | ||
| total_match = re.search(r"Successful requests:\s+(\d+)", output) | ||
| total_requests = ( | ||
| int(total_match.group(1)) + failed_count if total_match else failed_count | ||
| ) | ||
| failure_rate = failed_count / total_requests if total_requests > 0 else 1.0 | ||
| if failure_rate > max_failure_rate: | ||
| error_msg = ( |
There was a problem hiding this comment.
Scope mismatch: 1% tolerance is applied to all runs, not just high-concurrency
The code comment and PR objective say this tolerance is for high-concurrency scenarios, but current logic applies it universally. That can mask regressions in lower-concurrency runs. Please gate the tolerance by request volume/concurrency (or make it config-driven per test case).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1524 - 1539,
The 1% max_failure_rate (max_failure_rate) is currently applied unconditionally
when computing failure_rate from failed_requests_match/total_requests; change
the logic so the tolerance is only applied for high-concurrency runs by gating
the comparison (failure_rate > max_failure_rate) behind a concurrency or
request-volume check (e.g., a concurrency variable or total_requests threshold)
or make max_failure_rate configurable per test; update the block that computes
failed_count, total_requests and failure_rate to first determine if the run
qualifies as "high-concurrency" (using the existing concurrency/request count
parameter) and only then compare against max_failure_rate, otherwise treat any
failures as hard failures.
Summary
backend_request_func.pyprocessed raw byte chunks directly fromresponse.contentwithout buffering on newline boundaries. Under high concurrency, TCP chunks could split SSE messages mid-line, causingjson.loads()to fail on partial data. These parse failures cascaded into request timeouts, ultimately making the server health endpoint appear unready during performance sanity tests._iter_sse_dataasync generator that properly buffers incoming bytes until newline boundaries are found, then decodes, filters non-data lines, strips thedata:prefix, and skips[DONE]sentinels before yielding JSON-ready strings. This replaces the inline chunk-processing logic inasync_request_trt_llm,async_request_openai_completions, and the chat completions path, consolidating the fix in one robust helper and eliminating the class of errors caused by chunk-boundary misalignment.Test plan
Links
Summary by CodeRabbit