
[Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -extra #7136

Open
cloudforge1 wants to merge 26 commits into PaddlePaddle:develop from CloudForge-Solutions:task/049-spec-decode-gpu-kernel-extra

Conversation


@cloudforge1 cloudforge1 commented Apr 1, 2026

Motivation

Hackathon 10th Spring No.49 — GPU-accelerated ngram speculative decoding with fully parallel Phase 2 threshold enforcement.

This is an experimental variant of PR #6960 that replaces the serial <<<1,1>>> Phase 2 gather kernel with a CUB BlockScan-based parallel kernel (<<<1, 1024>>>). The Phase 1 parallel search (256 threads × batch, atomicMin64 CAS) is preserved from #6960.

Why a separate PR?

PR #6960 is the production-ready submission. This PR explores whether replacing the serial Phase 2 with a parallel CUB BlockScan improves throughput at high batch counts. At batch=32 there is no measurable latency difference (Phase 2 was never the bottleneck). The real value is scalability toward max_batch_size=1024 and several bug fixes discovered during the BlockScan rewrite.

Architecture (vs #6960)

| Component | PR #6960 (base) | This PR (-extra) |
|-----------|-----------------|------------------|
| Phase 1 search | `<<<bsz, 256>>>` parallel atomicMin64 | Same + tentative token copy to scratch |
| Phase 2 gather | `<<<1, 1>>>` serial threshold loop | `<<<1, 1024>>>` CUB BlockScan prefix-sum |
| Bit-exact with CPU | Yes | Approximation under threshold pressure |
| Extra GPU memory | NgramMatchResult buffer only | + draft_tokens_copy + seq_lens_this_time_copy scratch |

How BlockScan Phase 2 works

  1. Phase 1 writes tentative seq_lens_this_time_copy[i] and copies matched tokens to draft_tokens_copy scratch buffer
  2. Phase 2 launches 1024 threads, one per batch item (up to max_batch_size)
  3. BlockScan::InclusiveSum computes prefix sums of tentative token counts and active-item indicators (dual scan)
  4. Each thread independently computes its budget: threshold - exclusive_prefix - remaining_active_items
  5. Thread truncates its allocation to min(tentative, budget) and copies winning tokens to output

Op interface (PD_BUILD_STATIC_OP) is unchanged — scratch buffers are allocated internally.
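The steps above can be sketched on the CPU as follows. This is a minimal illustration with hypothetical names (`phase2_truncate` is not in the PR); the real kernel computes the same quantities per thread via CUB `BlockScan::InclusiveSum`, and, as noted in the table above, the result is an approximation under threshold pressure rather than a bit-exact match of the serial loop.

```python
from itertools import accumulate

def phase2_truncate(tentative, threshold):
    """tentative[i]: token count proposed by Phase 1 for batch item i."""
    active = [1 if t > 0 else 0 for t in tentative]
    # Inclusive prefix sums (what BlockScan::InclusiveSum produces); each
    # thread derives its exclusive prefix by subtracting its own value.
    token_incl = list(accumulate(tentative))
    active_incl = list(accumulate(active))
    total_active = active_incl[-1] if active_incl else 0
    out = []
    for i, t in enumerate(tentative):
        token_excl = token_incl[i] - t             # tokens allocated before item i
        remaining = total_active - active_incl[i]  # active items after item i
        # Budget: threshold minus tokens before me, minus a 1-token
        # reservation for each active item after me.
        budget = threshold - token_excl - remaining
        out.append(max(0, min(t, budget)) if active[i] else 0)
    return out
```

Each thread needs only its own exclusive prefix and the block-wide totals, so the whole pass is one dual scan plus independent per-thread arithmetic, with no serial dependency across batch items.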

Modifications

  • ngram_match_common.cuh: Added NGRAM_GATHER_THREADS 1024 define, PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) guards
  • ngram_match.cu:
    • Phase 1: removed dead NgramMatchResult writes; fixed encoder-active init (default seq_lens_this_time_copy=0, set to 1 only for active decoder items)
    • Phase 2: dual BlockScan (token prefix + active-item prefix) with remaining_active budget deduction; encoder-active items skip without modifying seq_lens_this_time
    • Host: removed unused match_buf/match_results allocation; added PD_CHECK guard
  • ngram_match_mixed.cu:
    • Phase 1: removed dead NgramMatchResult writes
    • Phase 2: upgraded from single BlockScan to dual scan with remaining_active budget deduction (matching ngram_match.cu)
    • Host: removed unused match_buf/match_results allocation; added PD_CHECK guard
  • cpp_extensions.cc: No changes (op interface unchanged)
  • Python callers (ngram.py, mtp.py): Eliminated CPU↔GPU roundtrip — pass GPU tensors directly to CUDA kernels, removed .cpu() copies and post-kernel .cuda() writeback

Diff from PR #6960

5 files changed (3 CUDA + 2 Python hot-path callers):

  • ngram_match.cu — serial gather → BlockScan + bug fixes
  • ngram_match_mixed.cu — serial gather → BlockScan + bug fixes
  • ngram_match_common.cuh — added NGRAM_GATHER_THREADS define
  • ngram.py — GPU tensor passthrough (removed .cpu() + .cuda() copies)
  • mtp.py — GPU tensor passthrough (removed CPU pinned-memory roundtrip)

Usage or Command

No API changes. Drop-in replacement — same op signatures, same Python call sites.

# Build FastDeploy (ops are compiled automatically)
bash build.sh

# Run correctness + latency tests
python -m pytest tests/spec_decode/test_ngram_gpu_kernel.py -v

# Existing speculative decoding workflows work unchanged:
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --speculative_method ngram

Accuracy Tests

CI environment: SM90 H100 GPU (143 GB VRAM), CUDA 12.6, Python 3.10 (run_tests_with_coverage job).

All 11 tests passed (+ 8 subtests) in 101.72s:

Correctness Tests (NgramMatch kernel)

| Test | Config | Result |
|------|--------|--------|
| test_correctness_basic | bsz=4, seeds vary | PASSED |
| test_correctness_varied_seeds | seeds=0,7,123,999 | 4/4 PASSED |
| test_large_batch_long_seq | bsz=256, input_len=131072 | PASSED |
| test_many_short_seqs | bsz=256, input_len=1024 | PASSED |
| test_single_batch_long_seq | bsz=1, seq_len=128k | PASSED |

Correctness Tests (HybridMtpNgram kernel)

| Test | Config | Result |
|------|--------|--------|
| test_correctness_basic | bsz=4, seeds vary | PASSED |
| test_correctness_varied_seeds | seeds=0,7,123,999 | 4/4 PASSED |
| test_large_batch_long_seq | bsz=256, input_len=131072 | PASSED |
| test_many_short_seqs | bsz=256, input_len=1024 | PASSED |
| test_single_batch_long_seq | bsz=1, seq_len=128k | PASSED |

Latency Benchmark (CI-verified, SM90 H100)

| Metric | PR #6960 serial `<<<1,1>>>` | This PR BlockScan `<<<1,1024>>>` |
|--------|-----------------------------|-----------------------------------|
| GPU kernel (batch=32, input_len=512, 100 runs) | 0.690 ms | 0.689 ms |
| CPU path (with D2H/H2D) | 0.953 ms | 0.989 ms |
| Speedup vs CPU | 1.38× | 1.44× |
| CUDA sync points per call | 0 | 0 |

Note: At batch=32, Phase 2 is not the bottleneck — both serial and BlockScan produce identical GPU kernel latency. The BlockScan benefit is scalability toward max_batch_size=1024.

Multi-Scale Benchmark (test_latency_scaling, input_len=512, 50 runs per config)

Benchmarks the GPU kernel across batch sizes from 32 to 1024, showing that the BlockScan Phase 2 scales gracefully as batch count grows toward the 1024-thread limit. CI results will populate after the next CI run.

| batch | GPU (ms) | CPU (ms) | Speedup | GPU/batch (µs) |
|------:|------ ---|----------|---------|----------------|
| 32 | pending CI | pending CI | pending CI | pending CI |
| 128 | pending CI | pending CI | pending CI | pending CI |
| 256 | pending CI | pending CI | pending CI | pending CI |
| 512 | pending CI | pending CI | pending CI | pending CI |
| 1024 | pending CI | pending CI | pending CI | pending CI |

The GPU/batch (µs) column shows per-item amortization — with BlockScan, this should stay flat or decrease as batch size grows, demonstrating O(1) per-item Phase 2 cost.

Bug fixes included (discovered during BlockScan rewrite)

  1. Encoder-active init bug: seq_lens_this_time_copy was previously left uninitialized → now defaults to 0 and is set to 1 only for active decoder items
  2. Dead writes removed: Phase 1 was writing to NgramMatchResult.match_results[] but Phase 2 never read it (2 kernels)
  3. Dual BlockScan budget: remaining_active items deducted from budget in mixed gather (previously missing)
  4. PD_CHECK guards: max_batch_size <= NGRAM_GATHER_THREADS enforced at runtime in both host functions
  5. Memory cleanup: Removed unused match_buf/match_results allocation in both host functions

Existing operator tests also passed:

  • test_ngram_match.py::TestNgramMatchOp::test_basic_match
  • test_ngram_match.py::TestNgramMatchOp::test_no_match
  • test_hybrid_mtp_ngram.py::TestNgramMatchMixed::test_ngram_match_mixed

Checklist

Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate
CPU↔GPU data transfer overhead in speculative decoding.

Key changes:
- ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving
  sequential threshold semantics across batch items
- ngram_match_mixed.cu: Replace CPU function with __global__ kernel
- ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly
- mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies

Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM).
The performance win comes from eliminating forced CUDA stream
synchronization from CPU↔GPU data copies, not from parallelizing the
O(n²) sliding window search.
Restore backward compatibility with existing CPU-only operator tests
(test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based
dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original
C++ implementation.
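The device-based dispatch described here can be sketched as a small helper (hypothetical names; `place` mimics the string form of a paddle Tensor's place, and the real call sites dispatch to the registered custom op rather than plain functions):

```python
def dispatch(tensor, cuda_impl, cpu_impl):
    # GPU tensors take the CUDA kernel; CPU tensors keep the original C++ path.
    on_gpu = "gpu" in str(getattr(tensor, "place", "cpu")).lower()
    return (cuda_impl if on_gpu else cpu_impl)(tensor)
```

This keeps the existing CPU-only operator tests green without forking the op's public signature.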

Python's descriptor protocol passes 'self' as the first argument when a
function stored as a class attribute is accessed via an instance. Wrap it
with staticmethod() so paddle custom ops receive the correct tensor
arguments.
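The descriptor pitfall this commit fixes reproduces in a few lines (illustrative names; `fake_custom_op` stands in for a paddle custom op):

```python
def fake_custom_op(*tensors):
    return len(tensors)  # stand-in for a paddle custom op

class ProposerUnwrapped:
    op = fake_custom_op  # plain function: becomes a bound method on access

class ProposerWrapped:
    op = staticmethod(fake_custom_op)  # stays unbound on access

# ProposerUnwrapped().op("t0") silently receives (self, "t0") → 2 args;
# ProposerWrapped().op("t0") receives ("t0",) → 1 arg, as intended.
```

Without the `staticmethod()` wrapper, the op would treat the proposer instance as its first tensor argument.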

Reverts line 39 to match develop (keeps .cpu()) so diff-cover
no longer flags it as an uncovered changed line. The tensor is
moved to GPU via .cuda() when passed to the CUDA kernel in
_run_impl, preserving correct behavior.
…n.cuh)

Per upstream requirement (translated from Chinese): 'The two kernels have
largely similar logic; structure them by extracting the shared matching
logic, with the business logic layered on top.'

The core ngram sliding-window search + token copy logic is now defined
once in ngram_match_common.cuh as two __device__ __forceinline__
functions:
  - ngram_search_and_copy: single-haystack sliding window match
  - ngram_search_batch_item: two-phase search (input_ids then pre_ids)

Both kernels call ngram_search_batch_item with their business-specific
parameters:
  - ngram_match_kernel: write_offset=1, min_ngram_size=1
  - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time,
    min_ngram_size=configurable

No functional change. CPU fallback paths unchanged.
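As a rough illustration, the shared sliding-window search described above behaves like the CPU sketch below (hypothetical signature; the real code is a `__device__ __forceinline__` function in ngram_match_common.cuh, and details such as ngram-length ordering are assumptions):

```python
def ngram_search_and_copy(haystack, needle_end, max_ngram, min_ngram, max_draft):
    """Match the longest suffix n-gram of haystack[:needle_end] earlier in
    the sequence; on the leftmost match, return the tokens that follow it."""
    for n in range(min(max_ngram, needle_end), min_ngram - 1, -1):
        pattern = haystack[needle_end - n:needle_end]
        # Leftmost match wins (mirrors the atomicMin64 semantics on GPU).
        for start in range(0, needle_end - n):
            if haystack[start:start + n] == pattern:
                copy_from = start + n
                return haystack[copy_from:copy_from + max_draft]
    return []  # no match: no draft tokens proposed
```

The two kernels then differ only in their business parameters (write_offset, min_ngram_size), exactly as the bullet list above states.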

Two-phase parallel architecture addressing reviewer feedback:
- Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search
  using atomicMin64 CAS loop for leftmost-match semantics
- Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch
  dependency via running sum of seq_lens_this_time)

Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256
threads.  Phase 2 is O(bsz × max_draft_tokens) — negligible.

Shared code extracted into ngram_match_common.cuh:
  NgramMatchResult struct, atomicMin64, parallel_ngram_search,
  4 kernel functions (search+gather for both kernel types)

Tests: 6 new large-scale correctness tests with env-var threshold
override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k
for both ngram_match and hybrid_mtp_ngram.
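One plausible way an atomicMin64 CAS loop yields leftmost-match semantics is to pack each candidate into a single 64-bit key so that a plain minimum picks the winner. The encoding below is an illustration only (the actual key layout in ngram_match_common.cuh may differ):

```python
UNMATCHED = (1 << 63) - 1  # sentinel larger than any packed key

def pack(position, ngram_len, max_len=1 << 20):
    # High bits: match position (min → leftmost). Low bits: inverted
    # length, so min() prefers longer n-grams at the same position.
    return (position << 21) | (max_len - ngram_len)

def unpack(key, max_len=1 << 20):
    return key >> 21, max_len - (key & ((1 << 21) - 1))

# Each of the 256 threads would atomicMin its candidate into one slot:
best = UNMATCHED
for position, ngram_len in [(40, 2), (12, 3), (12, 5)]:
    best = min(best, pack(position, ngram_len))
```

With this encoding, concurrent threads need no ordering among themselves; the single 64-bit min reduction resolves all races.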
…ultiple-def error)

Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh.
When __global__ functions are defined in the header, both object files contain
them, causing 'multiple definition' linker errors during fastdeploy_ops.so link.

Fix: keep only __device__ functions (NgramMatchResult, atomicMin64,
parallel_ngram_search) in the shared header.  Move __global__ kernel
definitions into each respective .cu file.

Net code change: +304/-304 (zero net lines).

Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu:
- Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time
  (host function does not have seq_lens_encoder tensor)
- Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time
  per-batch from seq_lens_this_time (matches CPU path logic)
- Fix max_draft_tokens computation to match CPU path formula
- Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0
…el threshold

Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB
BlockScan prefix-sum for parallel threshold enforcement, replacing
the serial <<<1,1>>> loop.

Architecture:
- Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies
  matched draft tokens to scratch buffers (draft_tokens_copy) and
  writes tentative seq_lens_this_time to a copy buffer.
- Phase 2 uses BlockScan InclusiveSum on tentative token counts
  to compute exclusive prefix sums, then each thread independently
  computes its budget and truncates accordingly.

Both ngram_match.cu and ngram_match_mixed.cu updated.
Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers
are allocated internally in the host function.
Copilot AI review requested due to automatic review settings April 1, 2026 14:29
@paddle-bot

paddle-bot bot commented Apr 1, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Apr 1, 2026

Copilot AI left a comment


Pull request overview

This PR upgrades the speculative-decoding ngram_match / hybrid_mtp_ngram kernels from a serial Phase 2 threshold pass to a parallel CUB BlockScan Phase 2 (`<<<1,1024>>>`), adjusts the Python call paths to invoke the GPU op directly (avoiding the CPU round-trip), and adds a new correctness/latency test script for the GPU kernels.

Changes:

  • ngram_match.cu: adds the two-phase CUDA implementation (Phase 1 parallel search + Phase 2 BlockScan threshold truncation and copy) while retaining the CPU fallback logic
  • ngram_match_mixed.cu: the hybrid version likewise gains a BlockScan Phase 2, with scratch/orig copies introduced on the GPU path
  • ngram.py / mtp.py: call sites now invoke the GPU op directly, with no explicit .cpu()/.cuda() copy-back of outputs

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Summary per file:

| File | Description |
|------|-------------|
| tests/spec_decode/test_ngram_gpu_kernel.py | New correctness and latency tests for the GPU kernels (currently includes very-large-scale and benchmark logic) |
| fastdeploy/spec_decode/ngram.py | Ngram proposer now calls the GPU op directly (a hot-path CPU→GPU bulk-copy risk remains) |
| fastdeploy/spec_decode/mtp.py | hybrid_mtp_ngram now calls the GPU op directly (same hot-path CPU→GPU bulk-copy risk) |
| custom_ops/gpu_ops/speculate_decoding/ngram_match.cu | New two-phase ngram_match CUDA implementation + BlockScan gather; CPU logic retained |
| custom_ops/gpu_ops/speculate_decoding/ngram_match.cc | Original CPU-only implementation removed (CPU logic migrated into the .cu) |
| custom_ops/gpu_ops/speculate_decoding/ngram_match_common.cuh | Shared device utilities extracted (atomicMin64, parallel_ngram_search, thread-count macro) |
| custom_ops/gpu_ops/speculate_decoding/draft_model/ngram_match_mixed.cu | Hybrid kernel gains the two-phase CUDA implementation + BlockScan gather; CPU logic retained |


@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-01 22:46 CST

📋 Review summary

PR overview: replaces the ngram_match Phase 2 serial gather kernel with a parallel CUB BlockScan implementation while keeping the CPU fallback path.

Scope: CUDA kernel implementations under custom_ops/gpu_ops/speculate_decoding/

Impact tags: [OP] [Speculative Decoding]

📝 PR convention check

The PR title is missing the standard tag format; a rename is suggested.

Suggested title (copy-ready):

  • [Speculative Decoding] GPU ngram_match: parallel BlockScan Phase 2 threshold

Issues

| Level | File | Summary |
|-------|------|---------|
| 🟡 Suggestion | ngram_match_common.cuh:30 | The Phase 2 kernel launches with 1024 threads; when batch_size > 1024, not all items can be processed |
| 🟡 Suggestion | ngram_match_mixed.cu:185 | The mixed version's budget computation differs from the non-mixed version; confirm whether this is intentional |

Overall assessment

The code architecture is clear: converting the .cc to a .cu with dual GPU/CPU paths is a reasonable refactor, and the shared header ngram_match_common.cuh extracts the common logic in line with code-reuse principles. The BlockScan parallelization is correct for batch_size ≤ 1024, but adding a bounds check or documenting the limitation is recommended. The tests cover correctness, yet the PR description notes that threshold-activation scenarios are under-tested; follow-up coverage is suggested.

@cloudforge1 cloudforge1 changed the title 【Hackathon 9th No.49】GPU ngram_match: parallel BlockScan Phase 2 threshold [Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: parallel BlockScan Phase 2 threshold Apr 1, 2026
- Remove dead NgramMatchResult writes from both Phase 1 kernels
- Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active
- Add remaining_active budget deduction to mixed gather kernel (parity)
- Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions
- Remove unused match_buf/match_results allocation from both host functions
- Pass seq_lens_encoder to Phase 2 gather for encoder-active skip
- clang-format applied

Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at
batch sizes 32, 128, 256, 512, 1024 with input_len=512.
Shows Phase 2 BlockScan scaling and per-batch-item amortization.
Copilot AI review requested due to automatic review settings April 1, 2026 18:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

…gate

- Remove unused max_draft_tokens_param from ngram_match_search_kernel
  (draft_token_num[batch_idx] already covers the constraint)
- Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel
  (only used in gather kernel, not search kernel)
- Remove dead NgramMatchResult struct from ngram_match_common.cuh
- Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling
  (prevents benchmark tests from inflating CI runtime)
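The BENCHMARK_NGRAM env gate might look like the sketch below. This is written with stdlib unittest for self-containment; the actual tests may use a pytest marker instead, so the exact decorator is an assumption:

```python
import os
import unittest

# Benchmarks run only when explicitly requested, keeping CI runtime low.
BENCHMARK = os.environ.get("BENCHMARK_NGRAM") == "1"

class TestNgramLatency(unittest.TestCase):
    @unittest.skipUnless(BENCHMARK, "set BENCHMARK_NGRAM=1 to run benchmarks")
    def test_latency_scaling(self):
        pass  # 50-run benchmark across batch sizes 32..1024
```

Running with `BENCHMARK_NGRAM=1 python -m pytest …` enables the benchmark cases; a plain CI run skips them.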
Copilot AI review requested due to automatic review settings April 1, 2026 20:06

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

int remaining_active = s_total_active - active_prefix;

// Budget: threshold minus tokens already allocated before me,
// minus at-least-ori reservation for every active item after me.

Copilot AI Apr 1, 2026


The comment here says "reserve at least ori tokens for every subsequent active item", but the budget actually only subtracts remaining_active (i.e. it reserves just 1 token per subsequent item). To avoid misleading future maintainers, either update the comment to match the implementation (reserve 1 token), or, if reserving ori is truly intended, add an extra scan to compute the minimum ori reservation of the subsequent items.

Suggested change
// minus at-least-ori reservation for every active item after me.
// minus 1-token reservation for every active item after me.

@cloudforge1 cloudforge1 changed the title [Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: parallel BlockScan Phase 2 threshold [Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -extra Apr 1, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@0b4c1cb). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7136   +/-   ##
==========================================
  Coverage           ?   73.71%           
==========================================
  Files              ?      376           
  Lines              ?    52835           
  Branches           ?     8239           
==========================================
  Hits               ?    38949           
  Misses             ?    11179           
  Partials           ?     2707           
| Flag | Coverage Δ |
|------|------------|
| GPU | 73.71% <100.00%> (?) |


☔ View full report in Codecov by Sentry.

Labels

contributor External developers


4 participants