add cpp test for transpose by zonglinpeng · Pull Request #16512 · pytorch/executorch

zonglinpeng · 2026-01-08T22:50:35Z

Summary: titled

Differential Revision: D90288383

Summary: Summary Adding fixes for im2row and conv1d Pull Request resolved: pytorch#16243 Test Plan: Unit tested topic: not user facing Differential Revision: D89198050 Pulled By: zonglinpeng

Summary: titled Differential Revision: D90286146

Summary: titled Differential Revision: D90286169

Summary: titled Differential Revision: D90288367

Summary: titled Differential Revision: D90288383

pytorch-bot · 2026-01-08T22:50:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16512

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 319bc5a with merge base 7815c38 ():

NEW FAILURE - The following job has failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for backends/cadence/hifi/operators/tests/test_op_transpose_copy.cpp:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-llama-runner-linux-android / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / android / run-emulator (gh) (#16137)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-01-08T22:51:10Z

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90288383.

github-actions · 2026-01-08T22:51:48Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…transpose kernel for safe load & store, [ET][Cadence][Hifi] add cpp test for im2row, [ET][Cadence][Hifi] link transpose to internal, [ET][Cadence][Hifi] add cpp... (pytorch#16512) Summary: ...test for transpose Summary Adding fixes for im2row and conv1d Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267 Explanation of Code Changes in xa_nn_transpose_32.c This patch fixes a memory alignment crash issue in the transpose operation for Cadence HiFi DSPs. Let me explain both changes: Change 1: Simplified Inner Loop (Lines 170-191) The Problem The original code had a flawed alignment check: // OLD CODE - BUGGY if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0)) { // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP } else { // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP } The bug: The check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However: ae_int32x2 is a 64-bit (8-byte) SIMD type AE_L32X2_IP / AE_S32X2_IP are aligned load/store intrinsics that require 8-byte alignment The correct check should be & 0x7 (for 8-byte alignment) or at minimum & 0x3 (for 4-byte alignment) Result: When pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path, causing: Crashes on HiFi DSPs due to misaligned memory access Data corruption in some cases The Fix // NEW CODE - SAFE ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4); ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out); ae_valign a_inp = AE_LA64_PP(pae_i); // Prime alignment register for input ae_valign a_out = AE_ZALIGN64(); // Initialize alignment register for output ae_int32x2 d0; for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++) { AE_LA32X2_IP(d0, a_inp, pae_i); // Unaligned load AE_SA32X2_IP(d0, a_out, pae_o); // Unaligned store } AE_SA64POS_FP(a_out, pae_o); // Flush remaining data Why this works: Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP) which work correctly for both aligned and unaligned data Eliminates the faulty branch entirely - no alignment check needed Minimal performance impact - on modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead Simpler code - easier to maintain and less error-prone Intrinsic Type Requirement AE_L32X2_IP Aligned load Requires 8-byte aligned pointer AE_LA32X2_IP Unaligned load Works with any alignment AE_S32X2_IP Aligned store Requires 8-byte aligned pointer AE_SA32X2_IP Unaligned store Works with any alignment Change 2: Fixed Strided Load (Lines 216-222) The Problem // OLD CODE - PROBLEMATIC AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); The AE_L32_XP intrinsic: Loads a 32-bit value from the pointer Modifies the pointer by adding the byte offset Has specific alignment and type requirements Issues: The << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses The intrinsic may have stricter alignment requirements than expected The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior on some platforms The Fix // NEW CODE - EXPLICIT AND SAFE d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment Why this works: AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes), which is correct for WORD32* Clearer semantics - Separates load and pointer update, making behavior predictable Avoids intrinsic quirks - No reliance on how the intrinsic handles pointer modification internally Aspect Old (AE_L32_XP) New (AE_L32_X + manual) Pointer update Inside intrinsic (byte offset) Explicit (element offset) Clarity Implicit behavior Explicit, readable Safety Potential alignment issues Guaranteed correct Summary Change Root Cause Fix Strategy Inner loop simplification Wrong alignment check (& 1 instead of & 0x7) Always use unaligned intrinsics Strided load fix AE_L32_XP intrinsic issues with unaligned pointers Use AE_L32_X + manual pointer increment Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This is safer, simpler, and has minimal performance impact on modern HiFi DSPs. titled titled titled Differential Revision: D92207944

…transpose kernel for safe load & store, [ET][Cadence][Hifi] add cpp test for im2row, [ET][Cadence][Hifi] link transpose to internal, [ET][Cadence][Hifi] add cpp... (#16512) Differential Revision: D92207944 Pull Request resolved: #17180

cad-audio and others added 5 commits January 7, 2026 14:10

Adding fixes for im2row and conv1d (pytorch#16243)

2a9a3b0

Summary: Summary Adding fixes for im2row and conv1d Pull Request resolved: pytorch#16243 Test Plan: Unit tested topic: not user facing Differential Revision: D89198050 Pulled By: zonglinpeng

link im2row to internal

e535ddc

Summary: titled Differential Revision: D90286146

add cpp test for im2row

04e3dae

Summary: titled Differential Revision: D90286169

link transpose to internal

3965604

Summary: titled Differential Revision: D90288367

add cpp test for transpose

319bc5a

Summary: titled Differential Revision: D90288383

zonglinpeng requested review from kirklandsign and larryliu0820 as code owners January 8, 2026 22:50

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 8, 2026

meta-codesync bot added fb-exported meta-exported labels Jan 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add cpp test for transpose#16512

add cpp test for transpose#16512
zonglinpeng wants to merge 5 commits intopytorch:mainfrom
zonglinpeng:export-D90288383

zonglinpeng commented Jan 8, 2026

Uh oh!

pytorch-bot bot commented Jan 8, 2026 •

edited

Loading

Uh oh!

meta-codesync bot commented Jan 8, 2026

Uh oh!

github-actions bot commented Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zonglinpeng commented Jan 8, 2026

Uh oh!

pytorch-bot bot commented Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16512

❌ 1 New Failure, 2 Unrelated Failures

Uh oh!

meta-codesync bot commented Jan 8, 2026

Uh oh!

github-actions bot commented Jan 8, 2026

This PR needs a release notes: label

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pytorch-bot bot commented Jan 8, 2026 •

edited

Loading

This PR needs a `release notes:` label