Conversation
Summary: Summary Adding fixes for im2row and conv1d Pull Request resolved: pytorch#16243 Test Plan: Unit tested topic: not user facing Differential Revision: D89198050 Pulled By: zonglinpeng
Summary: titled Differential Revision: D90286146
Summary: titled Differential Revision: D90286169
Summary: titled Differential Revision: D90288367
Summary: titled Differential Revision: D90288383
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16512
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 2 Unrelated FailuresAs of commit 319bc5a with merge base 7815c38 ( NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90288383. |
This PR needs a
|
…transpose kernel for safe load & store, [ET][Cadence][Hifi] add cpp test for im2row, [ET][Cadence][Hifi] link transpose to internal, [ET][Cadence][Hifi] add cpp... (pytorch#16512) Summary: ...test for transpose Summary Adding fixes for im2row and conv1d Original patch: cad-audio@1cb9ee7#diff-72cb6b6b59f6d8be4021885fa14490cd182776f883b6d2132678968b9e7cb267 Explanation of Code Changes in xa_nn_transpose_32.c This patch fixes a memory alignment crash issue in the transpose operation for Cadence HiFi DSPs. Let me explain both changes: Change 1: Simplified Inner Loop (Lines 170-191) The Problem The original code had a flawed alignment check: // OLD CODE - BUGGY if((((unsigned)p_inp4 & 1) == 0) && (((unsigned)p_out & 1) == 0)) { // "Aligned" path using AE_L32X2_IP / AE_S32X2_IP } else { // Unaligned path using AE_LA32X2_IP / AE_SA32X2_IP } The bug: The check & 1 only verifies 2-byte alignment (checking if the least significant bit is 0). However: ae_int32x2 is a 64-bit (8-byte) SIMD type AE_L32X2_IP / AE_S32X2_IP are aligned load/store intrinsics that require 8-byte alignment The correct check should be & 0x7 (for 8-byte alignment) or at minimum & 0x3 (for 4-byte alignment) Result: When pointers were 2-byte aligned but NOT 8-byte aligned, the code incorrectly took the "aligned" path, causing: Crashes on HiFi DSPs due to misaligned memory access Data corruption in some cases The Fix // NEW CODE - SAFE ae_int32x2 *__restrict__ pae_i = (ae_int32x2 *)(p_inp4); ae_int32x2 *__restrict__ pae_o = (ae_int32x2 *)(p_out); ae_valign a_inp = AE_LA64_PP(pae_i); // Prime alignment register for input ae_valign a_out = AE_ZALIGN64(); // Initialize alignment register for output ae_int32x2 d0; for(itr4 = 0; itr4 < (out_dim4 >> 1); itr4++) { AE_LA32X2_IP(d0, a_inp, pae_i); // Unaligned load AE_SA32X2_IP(d0, a_out, pae_o); // Unaligned store } AE_SA64POS_FP(a_out, pae_o); // Flush remaining data Why this works: Always uses unaligned intrinsics (AE_LA32X2_IP / AE_SA32X2_IP) which work correctly for both aligned and unaligned data Eliminates the faulty branch entirely - no alignment check needed Minimal performance impact - on modern HiFi DSPs, unaligned intrinsics on aligned data have negligible overhead Simpler code - easier to maintain and less error-prone Intrinsic Type Requirement AE_L32X2_IP Aligned load Requires 8-byte aligned pointer AE_LA32X2_IP Unaligned load Works with any alignment AE_S32X2_IP Aligned store Requires 8-byte aligned pointer AE_SA32X2_IP Unaligned store Works with any alignment Change 2: Fixed Strided Load (Lines 216-222) The Problem // OLD CODE - PROBLEMATIC AE_L32_XP(d0, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); AE_L32_XP(d1, (ae_int32 *)p_inp4, inp_stride[p_5D_permute_vec[4]] << 2); The AE_L32_XP intrinsic: Loads a 32-bit value from the pointer Modifies the pointer by adding the byte offset Has specific alignment and type requirements Issues: The << 2 converts stride to bytes, but the pointer increment behavior inside the macro can be problematic with unaligned addresses The intrinsic may have stricter alignment requirements than expected The cast (ae_int32 *)p_inp4 combined with the post-increment behavior can cause undefined behavior on some platforms The Fix // NEW CODE - EXPLICIT AND SAFE d0 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment d1 = AE_L32_X((ae_int32 *)p_inp4, 0); // Load with explicit offset p_inp4 += inp_stride[p_5D_permute_vec[4]]; // Manual pointer increment Why this works: AE_L32_X(ptr, offset) - Loads from ptr + offset without modifying the pointer Explicit pointer arithmetic - p_inp4 += stride increments by elements (not bytes), which is correct for WORD32* Clearer semantics - Separates load and pointer update, making behavior predictable Avoids intrinsic quirks - No reliance on how the intrinsic handles pointer modification internally Aspect Old (AE_L32_XP) New (AE_L32_X + manual) Pointer update Inside intrinsic (byte offset) Explicit (element offset) Clarity Implicit behavior Explicit, readable Safety Potential alignment issues Guaranteed correct Summary Change Root Cause Fix Strategy Inner loop simplification Wrong alignment check (& 1 instead of & 0x7) Always use unaligned intrinsics Strided load fix AE_L32_XP intrinsic issues with unaligned pointers Use AE_L32_X + manual pointer increment Both fixes follow the same principle: Instead of trying to detect alignment and use different code paths, use intrinsics that handle unaligned data correctly in all cases. This is safer, simpler, and has minimal performance impact on modern HiFi DSPs. titled titled titled Differential Revision: D92207944
Summary: titled
Differential Revision: D90288383