[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache by JulianCloudNTH · Pull Request #20087 · pytorch/executorch

JulianCloudNTH · 2026-06-06T07:15:01Z

Stack from ghstack (oldest at bottom):

-> [ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086

Adds the WebGPU SDPA test coverage as its own diff, stacked on the SDPA op (which already consumes the dynamic `input_pos`) and the SymInt mechanism below it. The suite covers three scenarios: multi-step prefill->mt->decode replay, runtime-dynamic `input_pos` (autoregressive decode), and an in-graph mutable KV cache. Each is compared against a torch `F.scaled_dot_product_attention` golden.
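As a rough illustration of the replay idea (not the code in this PR), the pure-Python sketch below runs a single-head causal attention in chunks through a toy KV cache and compares it with a full-sequence golden. `ToyKVCache`, `replay`, and `pin_decode_pos` are hypothetical names, and the standard library stands in for torch. The values are one-hot so each output row equals the attention weights, which makes a stale `input_pos` easy to spot.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def golden_causal(qs, ks, vs):
    """Reference: token t attends to positions 0..t of the full sequence."""
    return [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]

class ToyKVCache:
    """Hypothetical fixed-size cache; step() writes new rows at input_pos."""

    def __init__(self, max_len, dim):
        self.k = [[0.0] * dim for _ in range(max_len)]
        self.v = [[0.0] * dim for _ in range(max_len)]

    def step(self, qs, ks, vs, input_pos):
        # Write the new keys/values into the cache starting at input_pos.
        for i, (k, v) in enumerate(zip(ks, vs)):
            self.k[input_pos + i] = list(k)
            self.v[input_pos + i] = list(v)
        # Each new token attends causally to everything cached up to itself.
        out = []
        for i, q in enumerate(qs):
            end = input_pos + i + 1
            out.append(attend(q, self.k[:end], self.v[:end]))
        return out

def replay(qs, ks, vs, chunks, pin_decode_pos=None, max_len=16):
    """Feed the sequence chunk by chunk; optionally pin input_pos after prefill."""
    cache = ToyKVCache(max_len, len(qs[0]))
    outs, pos = [], 0
    for step, n in enumerate(chunks):
        write_pos = pos
        if step > 0 and pin_decode_pos is not None:
            write_pos = pin_decode_pos  # negative control: stale position
        outs += cache.step(qs[pos:pos + n], ks[pos:pos + n], vs[pos:pos + n],
                           write_pos)
        pos += n
    return outs

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

T, D = 5, 5
qs = [[math.sin(0.9 * t + 0.5 * d) for d in range(D)] for t in range(T)]
ks = [[math.cos(0.4 * t - 0.3 * d) for d in range(D)] for t in range(T)]
vs = [[1.0 if d == t else 0.0 for d in range(D)] for t in range(T)]  # one-hot

golden = golden_causal(qs, ks, vs)
replayed = replay(qs, ks, vs, chunks=[3, 1, 1])  # prefill 3, then 2 decodes
pinned = replay(qs, ks, vs, chunks=[3, 1, 1], pin_decode_pos=3)

print("replay vs golden:", max_abs_diff(golden, replayed))
print("pinned vs golden:", max_abs_diff(golden, pinned))
```

With the position advanced on every step the replay tracks the golden. With the position pinned at 3, the second decode step overwrites slot 3 and never sees the original entry, so its output drifts away from the golden.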

- `test/ops/sdpa/test_sdpa.py`: `ReplaySeq`/`REPLAY_SEQS` + per-step replay export/golden; `DynamicSdpaModule` + `export_dynamic_decode` (one `.pte`, `input_pos` supplied at runtime as a SymInt); `DecodeCacheModule` + `export_incache_decode` (KV cache as `register_buffer` mutable buffers, so the cache persists in-graph and `forward()` feeds only the new token + `input_pos`).
- `test/test_webgpu_native.cpp`: `test_sdpa_replay`, `test_sdpa_dynamic_decode` (+ negative control: a pinned `input_pos` diverges), `test_sdpa_incache_decode` (+ static control: a fresh Module per step diverges, proving in-graph accumulation is real), `test_symint_roundtrip`, `test_resize_hook`; shared per-element tolerance `sdpa_within_tol` (abs 1e-4 OR rel 1e-3).
- `test/test_build_webgpu.sh`: export the replay / dynamic / in-graph-cache models for the native test.
Authored with assistance from Claude.
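For reference, the "abs 1e-4 OR rel 1e-3" rule can be sketched in Python as below. This is only an illustration: the real `sdpa_within_tol` lives in the C++ test, `within_tol` and `all_within_tol` are hypothetical names, and measuring the relative error against the magnitude of the expected (golden) value is an assumption.

```python
def within_tol(actual, expected, abs_tol=1e-4, rel_tol=1e-3):
    """Pass if the error is small in absolute OR relative terms."""
    diff = abs(actual - expected)
    # Assumption: relative error is taken against |expected| (the golden).
    return diff <= abs_tol or diff <= rel_tol * abs(expected)

def all_within_tol(actual, expected, abs_tol=1e-4, rel_tol=1e-3):
    """Per-element check over two equally sized flat sequences."""
    if len(actual) != len(expected):
        return False
    return all(within_tol(a, e, abs_tol, rel_tol)
               for a, e in zip(actual, expected))

# Small values pass on the absolute bound, large ones on the relative bound.
print(within_tol(1.00005, 1.0))   # absolute bound applies
print(within_tol(100.05, 100.0))  # relative bound applies
print(within_tol(1.01, 1.0))      # neither bound holds
```

The OR keeps near-zero outputs from failing on a meaningless relative error while still scaling the allowance for large-magnitude elements.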

@exported-using-ghexport

Differential Revision: D107595144

pytorch-bot · 2026-06-06T07:15:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20087

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

⏳ 17 Pending, 1 Unrelated Failure

As of commit a8ed610 with merge base 5526971:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / build-android (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-06T07:15:43Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA

Review automatically exported from Phabricator review in Meta.

@JulianCloudNTH

…-graph KV cache (#20260)

This PR was created by the merge bot to help merge the original PR into the main branch.

ghstack PR number: #20087 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/20/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/20/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/19/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/20/orig
@diff-train-skip-merge

---------

Co-authored-by: Julian Ng-Thow-Hing <juliannth@meta.com>

Update a6ccd86

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 6, 2026

JulianCloudNTH closed this Jun 6, 2026

JulianCloudNTH had a problem deploying to cherry-pick-bot June 6, 2026 07:16 — with GitHub Actions Failure

JulianCloudNTH reopened this Jun 9, 2026

Update dc3123d

meta-codesync Bot added the meta-exported label Jun 9, 2026

Update 493b5cc
Update 5c5a727
Update 9656d0e
Update a8d90f7
Update 16fb9ad
Update 7518161
Update 0bab72f
Update a274711

JulianCloudNTH mentioned this pull request Jun 9, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167

Merged

JulianCloudNTH added 2 commits June 9, 2026 17:17

Update 6eab4db
Update 507bafd

JulianCloudNTH mentioned this pull request Jun 10, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Merged

JulianCloudNTH added 4 commits June 10, 2026 14:37

Update 03da102
Update fe230eb
Update 776e190
Update 3ae3e66

This was referenced Jun 11, 2026

[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20226

Merged

[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227

Merged

psiddh approved these changes Jun 12, 2026

View reviewed changes

SS-JIA approved these changes Jun 12, 2026

View reviewed changes

Update a8ed610

meta-codesync Bot merged commit 58963cf into gh/JulianCloudNTH/20/base Jun 12, 2026
177 of 180 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/20/head branch June 12, 2026 23:45

meta-codesync Bot temporarily deployed to cherry-pick-bot June 12, 2026 23:45 Inactive

pytorchbot mentioned this pull request Jun 12, 2026

[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20260

Merged

meta-codesync[bot] merged 17 commits into gh/JulianCloudNTH/20/base from gh/JulianCloudNTH/20/head
