[None][perf] Skip KV cache estimation by default and update disagg pe… by HuiGao-NV · Pull Request #12971 · NVIDIA/TensorRT-LLM

HuiGao-NV · 2026-04-12T22:34:18Z

…rf-sanity fractions

Set TRTLLM_SKIP_KV_CACHE_ESTIMATION default to 1 to skip runtime KV cache memory estimation. Update free_gpu_memory_fraction in 17 disaggregated perf-sanity configs based on profiled memory usage:
ctx_new = KV_cache / (device_total - peak + KV_cache)
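
The formula above can be read as: the profiled peak already includes the KV cache, so the memory free before the KV cache is allocated is `device_total - peak + KV_cache`, and the new fraction is the KV cache size divided by that amount. A minimal sketch of the arithmetic follows. The function name and the sample numbers are hypothetical and are not taken from the profiled runs in this PR.

```python
def new_free_gpu_memory_fraction(kv_cache_gib: float,
                                 device_total_gib: float,
                                 peak_gib: float) -> float:
    """Apply new = KV_cache / (device_total - peak + KV_cache).

    All sizes must use the same unit (GiB here). `peak_gib` is assumed to be
    the profiled peak usage, including the KV cache itself.
    """
    # Memory that is free before the KV cache pool is allocated.
    free_before_kv = device_total_gib - peak_gib + kv_cache_gib
    if free_before_kv <= 0:
        raise ValueError("non-KV memory usage exceeds device memory")
    return kv_cache_gib / free_before_kv


# Hypothetical numbers, for illustration only.
fraction = new_free_gpu_memory_fraction(kv_cache_gib=60.0,
                                        device_total_gib=180.0,
                                        peak_gib=150.0)
print(round(fraction, 2))  # 0.67
```

With these made-up inputs, 90 GiB is free before the KV cache is allocated, so a fraction of about 0.67 reproduces the 60 GiB cache.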

Summary by CodeRabbit

Chores
- KV cache memory estimation is now skipped by default during executor initialization
- Adjusted GPU memory allocation parameters across multiple performance benchmark configurations to optimize memory utilization

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-12T22:38:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aae5e133-d43f-400c-a6bd-7d0aa46eab9e

📥 Commits

Reviewing files that changed from the base of the PR and between 26a28ea and 4f3152b.

📒 Files selected for processing (18)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_128k8k_con1_ctx1_pp8_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1024_ctx1_dep4_gen1_dep8_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con2048_ctx2_dep4_gen1_dep16_eplb288_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_1k1k_con3072_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_8k1k_con1_ctx1_dep4_gen1_tep8_eplb0_mtp3_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_1k1k_con2048_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_1k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_1k1k_con64_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_8k1k_con128_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_8k1k_con4_ctx1_tp1_gen1_tp4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_8k1k_con512_ctx1_tp1_gen1_dep2_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_1k1k_con4_ctx1_dep4_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_8k1k_con1024_ctx1_tp1_gen1_dep8_eplb0_mtp0_ccb-UCX.yaml
tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_8k1k_con64_ctx1_tp1_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml

📝 Walkthrough

Walkthrough

Modified KV-cache estimation default behavior by changing the environment variable fallback from '0' to '1', and updated GPU memory fraction allocations across 17 disaggregated test configuration files for various model deployments.
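
To illustrate what changing the fallback from '0' to '1' means in practice, here is a minimal sketch of an environment-variable flag with a default. It is not the code in `py_executor_creator.py`: the helper name and its `environ`/`default` parameters are hypothetical, and only the variable name and the two default values come from this PR.

```python
import os

ENV_VAR = "TRTLLM_SKIP_KV_CACHE_ESTIMATION"


def should_skip_kv_cache_estimation(environ=None, default="1"):
    """Return True when KV cache estimation should be skipped.

    `default` is the fallback used when the variable is unset; this PR
    changes that fallback from '0' to '1'.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, default) == "1"


# Unset variable: skipped under the new default, not skipped under the old one.
print(should_skip_kv_cache_estimation({}))               # True
print(should_skip_kv_cache_estimation({}, default="0"))  # False
# Setting the variable to 0 explicitly opts back into estimation.
print(should_skip_kv_cache_estimation({ENV_VAR: "0"}))   # False
```

Under this reading, a deployment that relied on the old behavior would need to set `TRTLLM_SKIP_KV_CACHE_ESTIMATION=0` explicitly.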

Changes

Cohort / File(s)	Summary
KV Cache Estimation Default `tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`	Changed `TRTLLM_SKIP_KV_CACHE_ESTIMATION` default from `'0'` to `'1'`, altering the `skip_est` parameter passed to `KvCacheCreator` to enable KV cache estimation skipping by default.
Deepseek-R1 FP4 Configs `tests/scripts/perf-sanity/disaggregated/gb200_deepseek-r1-fp4_*`	Reduced `free_gpu_memory_fraction` values across 7 configuration variants for generator and context worker KV cache allocations (e.g., 0.8→0.62, 0.9→0.61, 0.75→0.43).
Deepseek-V32 FP4 Config `tests/scripts/perf-sanity/disaggregated/gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-UCX.yaml`	Adjusted generator and context KV cache `free_gpu_memory_fraction` from 0.8→0.56 and 0.6→0.42 respectively.
GPT-OSS-120B FP4 Configs `tests/scripts/perf-sanity/disaggregated/gb200_gpt-oss-120b-fp4_*`	Lowered `free_gpu_memory_fraction` across 6 configuration variants, predominantly from 0.9→0.48 and 0.6→0.42 for generator and context workers.
Kimi-K2-Thinking FP4 Config `tests/scripts/perf-sanity/disaggregated/gb200_kimi-k2-thinking-fp4_4k1k_con4_ctx1_dep4_gen1_tep4_eplb0_mtp0_ccb-UCX.yaml`	Adjusted KV cache `free_gpu_memory_fraction` from 0.8→0.66 (gen) and 0.6→0.49 (ctx).
Qwen3-235B FP4 Configs `tests/scripts/perf-sanity/disaggregated/gb200_qwen3-235b-fp4_*`	Updated `free_gpu_memory_fraction` across 2 configuration variants from 0.9→0.59 (gen) and 0.6→0.39 (ctx).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description provides the basic context (TRTLLM_SKIP_KV_CACHE_ESTIMATION change and config updates) but leaves critical template sections incomplete or unfilled.	Complete the 'Description' and 'Test Coverage' sections to explain the rationale and testing approach; verify PR checklist items are addressed appropriately.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: skipping KV cache estimation by default and updating disaggregated perf-sanity configuration fractions.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

HuiGao-NV · 2026-04-12T22:39:27Z

/bot run --stage-list="GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE2-GPU8,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8,GB200-16_GPUs-4_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE2-GPU8-GEN1-NODE2-GPU8,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4" --disable-fail-fast

tensorrt-cicd · 2026-04-12T22:45:31Z

PR_Github #42901 [ run ] triggered by Bot. Commit: 4f3152b

tensorrt-cicd · 2026-04-12T23:22:14Z

PR_Github #42901 [ run ] completed with state FAILURE. Commit: 4f3152b
/LLM/main/L0_MergeRequest_PR pipeline #33563 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again
