Skip to content

[https://nvbugs/6115832][fix] Fix SSE stream parsing in benchmark client to handle split chunks#13686

Open
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6115832
Open

[https://nvbugs/6115832][fix] Fix SSE stream parsing in benchmark client to handle split chunks#13686
tensorrt-cicd wants to merge 1 commit intoNVIDIA:mainfrom
tensorrt-cicd:repair-bot-bug6115832

Conversation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

@tensorrt-cicd tensorrt-cicd commented May 1, 2026

Summary

  • Root cause: The SSE stream parsing in backend_request_func.py processed raw byte chunks directly from response.content without buffering on newline boundaries. Under high concurrency, TCP chunks could split SSE messages mid-line, causing json.loads() to fail on partial data. These parse failures cascaded into request timeouts, ultimately making the server health endpoint appear unready during performance sanity tests.
  • Fix: Introduced a shared _iter_sse_data async generator that properly buffers incoming bytes until newline boundaries are found, then decodes, filters non-data lines, strips the data: prefix, and skips [DONE] sentinels before yielding JSON-ready strings. This replaces the inline chunk-processing logic in async_request_trt_llm, async_request_openai_completions, and the chat completions path, consolidating the fix in one robust helper and eliminating the class of errors caused by chunk-boundary misalignment.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Chores
    • Refactored streaming response parsing for improved consistency and maintainability across completion and chat handlers.
    • Enhanced benchmark testing framework with more granular failure detection and adaptive tolerance mechanisms for improved result validation.

…e tolerance for high-concurrency perf tests

Fix flaky perf sanity test failures caused by two issues:

1. SSE stream parsing: aiohttp response.content yields arbitrary byte
   chunks under high concurrency that don't align with SSE event
   boundaries, causing json.loads failures or LineTooLong errors.
   Add _iter_sse_lines() buffered parser that accumulates bytes and
   splits on newline boundaries before decoding.

2. Benchmark error checking: Zero tolerance for any failed request is
   too strict for 1280-request high-concurrency benchmarks where
   transient network issues can cause rare individual failures.
   Add 1% failure rate threshold - requests below this are considered
   acceptable for performance measurement purposes.

Also make upload_to_db conditional on OPEN_SEARCH_DB_BASE_URL being
set to avoid RuntimeError when the DB endpoint is unreachable.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

Walkthrough

Two separate changes: (1) Extracts SSE parsing logic into a shared _iter_sse_data helper function and refactors streaming loops in TensorRT-LLM and OpenAI components to use it; (2) Modifies benchmark failure detection to enforce a 1% failure rate tolerance threshold and gates OpenSearch uploads based on test name prefix and environment variable presence.

Changes

Cohort / File(s) Summary
SSE Parsing Refactoring
tensorrt_llm/serve/scripts/backend_request_func.py
Introduces _iter_sse_data() async generator to centralize buffering, UTF-8 decoding, data: prefix stripping, and [DONE] sentinel filtering. Refactors streaming loops in TensorRT-LLM, OpenAI completions, and OpenAI chat completions to consume this helper instead of duplicating parsing logic.
Benchmark Failure Detection
tests/integration/defs/perf/test_perf_sanity.py
Adds failure rate tolerance: computes failure percentage from numeric "Failed requests:" and "Successful requests:" counts, raising only when rate exceeds 1%. Conditions OpenSearch uploads on both test name prefix "upload" and environment variable presence. Applies similar 1% failure rate threshold to SA benchmarks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: fixing SSE stream parsing to handle split chunks, which directly corresponds to the primary modification in the changeset.
Description check ✅ Passed The description covers the root cause, the fix, test plan, and includes links, but lacks explicit coverage of the test coverage section and PR checklist items required by the template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/serve/scripts/backend_request_func.py (1)

1-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add/update the NVIDIA copyright header on this modified source file.

This file was modified but still lacks the NVIDIA copyright notice/year block required for .py sources.

Suggested header adjustment
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # Adopted from
 # https://github.com/vllm-project/vllm/blob/200bbf92e8861e2458a6f90bca73f40cc3b1ad1f/benchmarks/backend_request_func.py
 # SPDX-License-Identifier: Apache-2.0

As per coding guidelines: "All source files (.cpp, .h, .cu, .py) should contain an NVIDIA copyright header with the year of latest modification and Apache 2.0 license notice."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/serve/scripts/backend_request_func.py` around lines 1 - 4, The
file backend_request_func.py is missing the required NVIDIA copyright/header
block; add the standard NVIDIA copyright header comment at the top of
backend_request_func.py including the year of the latest modification and the
Apache-2.0 license notice consistent with other .py sources in the repo (replace
the placeholder year with the current modification year), ensuring the header
appears before any code or comments so tools and auditors recognize the file as
covered.
tests/integration/defs/perf/test_perf_sanity.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update SPDX copyright year for this modified file

This file was modified, but the header still ends at 2025. Please update it to include 2026.

As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` at line 1, Update the SPDX
copyright header line that currently ends with "2025" to include 2026 (e.g.,
change "2022-2025" to "2022-2026") so the modified file's header reflects the
new year; locate the SPDX header string at the top of
tests/integration/defs/perf/test_perf_sanity.py (the line starting with
"SPDX-FileCopyrightText") and update the year range accordingly.
🧹 Nitpick comments (1)
tests/integration/defs/perf/test_perf_sanity.py (1)

1524-1572: QA list hygiene check: list updates are likely unnecessary, but coverage should be confirmed

Since this changes pass/fail behavior inside an existing perf sanity definition (not adding a new test), QA list additions are likely unnecessary. Please confirm current perf list entries already exercise high-concurrency streaming cases where this tolerance is intended.

As per coding guidelines: “If the change adds or materially alters an integration test… call out whether an entry is needed under tests/integration/test_lists/qa/ … and test-db perf lists.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1524 - 1572,
This change introduces a 1% tolerance for failed requests (max_failure_rate) and
special-case logic around failed_requests_match, explicit markers ("!FAILED
REQUESTS!", "!CHECK LOG FOR ERRORS!"), and SA benchmark handling (num_prompts vs
Successful requests); verify whether existing QA/perf test lists already cover
high-concurrency streaming scenarios targeted by this tolerance and if not, add
an entry referencing this behavioral change to the appropriate integration test
lists so QA runs exercise these cases; if the lists already cover it, state that
in the PR description (no list changes needed) and ensure the test message or
doc string near max_failure_rate documents the intended tolerance and why SA
benchmark logic (num_prompts handling) is needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/serve/scripts/backend_request_func.py`:
- Around line 29-39: The stream loop drops a final unterminated "data:" line if
the response ends without a trailing newline; after the async for chunk in
response_content loop finishes, check if buf is non-empty and treat it as a
final line: decode buf to a string, strip it, validate it starts with "data:",
strip the "data:" prefix (using line.removeprefix("data:").lstrip()), ignore
"[DONE]" and empty lines, and yield the final payload the same way the in-loop
code does; update the code paths around the existing buf handling/yield logic so
the final payload is emitted exactly once.

In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1524-1539: The 1% max_failure_rate (max_failure_rate) is currently
applied unconditionally when computing failure_rate from
failed_requests_match/total_requests; change the logic so the tolerance is only
applied for high-concurrency runs by gating the comparison (failure_rate >
max_failure_rate) behind a concurrency or request-volume check (e.g., a
concurrency variable or total_requests threshold) or make max_failure_rate
configurable per test; update the block that computes failed_count,
total_requests and failure_rate to first determine if the run qualifies as
"high-concurrency" (using the existing concurrency/request count parameter) and
only then compare against max_failure_rate, otherwise treat any failures as hard
failures.

---

Outside diff comments:
In `@tensorrt_llm/serve/scripts/backend_request_func.py`:
- Around line 1-4: The file backend_request_func.py is missing the required
NVIDIA copyright/header block; add the standard NVIDIA copyright header comment
at the top of backend_request_func.py including the year of the latest
modification and the Apache-2.0 license notice consistent with other .py sources
in the repo (replace the placeholder year with the current modification year),
ensuring the header appears before any code or comments so tools and auditors
recognize the file as covered.

In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Line 1: Update the SPDX copyright header line that currently ends with "2025"
to include 2026 (e.g., change "2022-2025" to "2022-2026") so the modified file's
header reflects the new year; locate the SPDX header string at the top of
tests/integration/defs/perf/test_perf_sanity.py (the line starting with
"SPDX-FileCopyrightText") and update the year range accordingly.

---

Nitpick comments:
In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1524-1572: This change introduces a 1% tolerance for failed
requests (max_failure_rate) and special-case logic around failed_requests_match,
explicit markers ("!FAILED REQUESTS!", "!CHECK LOG FOR ERRORS!"), and SA
benchmark handling (num_prompts vs Successful requests); verify whether existing
QA/perf test lists already cover high-concurrency streaming scenarios targeted
by this tolerance and if not, add an entry referencing this behavioral change to
the appropriate integration test lists so QA runs exercise these cases; if the
lists already cover it, state that in the PR description (no list changes
needed) and ensure the test message or doc string near max_failure_rate
documents the intended tolerance and why SA benchmark logic (num_prompts
handling) is needed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 33ec1459-ce8e-4f07-ae7c-9e16ccf8b243

📥 Commits

Reviewing files that changed from the base of the PR and between 483ef68 and 061499d.

📒 Files selected for processing (2)
  • tensorrt_llm/serve/scripts/backend_request_func.py
  • tests/integration/defs/perf/test_perf_sanity.py

Comment on lines +29 to +39
async for chunk in response_content:
buf += chunk
while b"\n" in buf:
line_bytes, buf = buf.split(b"\n", 1)
line = line_bytes.decode("utf-8").strip()
if not line or not line.startswith("data:"):
continue
payload = line.removeprefix("data:").lstrip()
if payload == "[DONE]":
continue
yield payload
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle trailing unterminated SSE data at EOF.

If the stream ends without a final newline, the buffered last data: payload is dropped, which can truncate output or miss final usage metadata.

Suggested fix
 async def _iter_sse_data(response_content):
@@
     async for chunk in response_content:
         buf += chunk
         while b"\n" in buf:
             line_bytes, buf = buf.split(b"\n", 1)
             line = line_bytes.decode("utf-8").strip()
             if not line or not line.startswith("data:"):
                 continue
             payload = line.removeprefix("data:").lstrip()
             if payload == "[DONE]":
                 continue
             yield payload
+
+    if buf:
+        line = buf.decode("utf-8").strip()
+        if line.startswith("data:"):
+            payload = line.removeprefix("data:").lstrip()
+            if payload and payload != "[DONE]":
+                yield payload
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/serve/scripts/backend_request_func.py` around lines 29 - 39, The
stream loop drops a final unterminated "data:" line if the response ends without
a trailing newline; after the async for chunk in response_content loop finishes,
check if buf is non-empty and treat it as a final line: decode buf to a string,
strip it, validate it starts with "data:", strip the "data:" prefix (using
line.removeprefix("data:").lstrip()), ignore "[DONE]" and empty lines, and yield
the final payload the same way the in-loop code does; update the code paths
around the existing buf handling/yield logic so the final payload is emitted
exactly once.

Comment on lines +1524 to +1539
# Tolerance: allow up to 1% failed requests for high-concurrency benchmarks
# where transient network issues can cause rare individual request failures.
max_failure_rate = 0.01

# Check for non-zero failed requests (default benchmark)
failed_requests_match = re.search(r"Failed requests:\s+(\d+)", output)
if failed_requests_match:
failed_count = int(failed_requests_match.group(1))
if failed_count > 0:
error_msg = f"Benchmark output contains {failed_count} failed requests."
raise RuntimeError(error_msg)
total_match = re.search(r"Successful requests:\s+(\d+)", output)
total_requests = (
int(total_match.group(1)) + failed_count if total_match else failed_count
)
failure_rate = failed_count / total_requests if total_requests > 0 else 1.0
if failure_rate > max_failure_rate:
error_msg = (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope mismatch: 1% tolerance is applied to all runs, not just high-concurrency

The code comment and PR objective say this tolerance is for high-concurrency scenarios, but current logic applies it universally. That can mask regressions in lower-concurrency runs. Please gate the tolerance by request volume/concurrency (or make it config-driven per test case).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1524 - 1539,
The 1% max_failure_rate (max_failure_rate) is currently applied unconditionally
when computing failure_rate from failed_requests_match/total_requests; change
the logic so the tolerance is only applied for high-concurrency runs by gating
the comparison (failure_rate > max_failure_rate) behind a concurrency or
request-volume check (e.g., a concurrency variable or total_requests threshold)
or make max_failure_rate configurable per test; update the block that computes
failed_count, total_requests and failure_rate to first determine if the run
qualifies as "high-concurrency" (using the existing concurrency/request count
parameter) and only then compare against max_failure_rate, otherwise treat any
failures as hard failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants