
[NPU] Add group norm support on NPU#1144

Merged
Tcc0403 merged 2 commits into linkedin:main from orangeH25:group-norm/1
Mar 17, 2026

Conversation

@orangeH25
Contributor

Summary

This PR introduces a functional GroupNorm operator for Ascend NPU.

Key improvements:

  • Fixes the runtime error `grid should be less than 65536!` and the UB (unified buffer) overflow that occur when the original GPU-oriented liger-kernel GroupNorm implementation is executed on NPU.
  • Adjusts the kernel launch and tiling strategy to comply with Ascend NPU execution constraints.
  • Resolves numerical accuracy mismatches against PyTorch reference outputs.

While the current implementation is still slower than the HuggingFace implementation in end-to-end benchmarks, it provides a stable and functional GroupNorm path for Ascend NPU.

This PR mainly focuses on correctness and NPU compatibility. Further kernel-level optimizations will be explored in follow-up work.
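The tiling adjustment described above can be illustrated with a minimal, hypothetical sketch (the names `MAX_GRID`, `launch_grid`, and `rows_for_program` are illustrative, not taken from the PR): instead of launching one program per row, the launch grid is capped and each program loops over a strided range of rows, which keeps any problem size under a 65536-program grid limit.

```python
# Hypothetical sketch of a capped-grid tiling strategy: cap the launch
# grid at MAX_GRID and have each "program" process rows with a stride,
# so any number of rows fits under the grid-size constraint.
MAX_GRID = 65535  # assumed largest grid the runtime accepts

def launch_grid(n_rows: int) -> int:
    """Number of programs to launch for n_rows rows."""
    return min(n_rows, MAX_GRID)

def rows_for_program(pid: int, n_rows: int, grid: int):
    """Rows handled by program pid: pid, pid + grid, pid + 2*grid, ..."""
    return list(range(pid, n_rows, grid))

# Every row is covered exactly once even when n_rows > MAX_GRID.
n_rows = 70_000
grid = launch_grid(n_rows)
covered = sorted(r for pid in range(grid) for r in rows_for_program(pid, n_rows, grid))
assert covered == list(range(n_rows))
```

The same pattern is a common way to respect launch-dimension limits on accelerators: the grid stays bounded while per-program loops absorb the remaining work.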

Testing Done

(benchmark screenshot)
  • Hardware Type: Atlas 800I A2
  • Ran `make test` to ensure correctness
  • Ran `make checkstyle` to ensure code style
  • Ran `make test-convergence` to ensure convergence

@orangeH25
Contributor Author

Hi @Tcc0403 , please take a look. Thanks!

Comment on lines +243 to +247
else:
    dW_block = tl.where(mask, DY_block * x_hat, 0.0)
    dB_block = tl.where(mask, DY_block, 0.0)
    tl.atomic_add(DW_scratch_base + global_channel, dW_block, mask=mask)
    tl.atomic_add(DB_scratch_base + global_channel, dB_block, mask=mask)
Collaborator


L377 says

# Placeholder buffers (unused in kernel when COMPUTE_PARAM_GRAD=False)

which contradicts what this block does. Shouldn't this be a no-op here?

Contributor Author


Removed.

I originally kept it to preserve the kernel structure for potential future experiments, but it’s not needed in the current implementation.

Comment on lines +377 to +379
# Placeholder buffers (unused in kernel when COMPUTE_PARAM_GRAD=False)
DW_scratch = torch.empty((1, 1), dtype=torch.float32, device=W.device)
DB_scratch = torch.empty((1, 1), dtype=torch.float32, device=W.device)
Collaborator


Can the placeholder buffers be set to None in triton-ascend, to avoid accidental access in device code?

Contributor Author


Addressed.
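The host-side change can be sketched roughly as follows (a hypothetical stand-in; `make_scratch` and the allocator callback are illustrative names, not the PR's actual code): scratch buffers are only allocated when parameter gradients are computed, and None is passed otherwise so any accidental device-side access fails loudly instead of silently writing to a dummy buffer.

```python
# Hypothetical host-side sketch: allocate dW/dB scratch buffers only
# when COMPUTE_PARAM_GRAD is true; otherwise pass None so misuse is
# caught immediately rather than hidden by a (1, 1) placeholder.
def make_scratch(compute_param_grad: bool, alloc):
    """Return (dW_scratch, dB_scratch), or (None, None) when unused."""
    if compute_param_grad:
        return alloc(), alloc()
    return None, None

# alloc would be e.g. a torch.zeros(...) closure in the real code;
# a plain list stands in for a buffer here.
dW_scratch, dB_scratch = make_scratch(False, lambda: [0.0])
assert dW_scratch is None and dB_scratch is None
```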

@orangeH25
Contributor Author

Hi @Tcc0403, changes applied. Appreciate another review when you have time, thanks!

Collaborator

@Tcc0403 left a comment


LGTM! I left a comment about a potential improvement, but it can be done in another PR!

Comment on lines +239 to +244
if COMPUTE_PARAM_GRAD:
    if SINGLE_CHANNEL_TILE:
        dW_partial = tl.sum(tl.where(mask, DY_block * x_hat, 0.0), axis=1)
        dB_partial = tl.sum(tl.where(mask, DY_block, 0.0), axis=1)
        tl.atomic_add(DW_scratch_base + global_channel, dW_partial, mask=row_mask)
        tl.atomic_add(DB_scratch_base + global_channel, dB_partial, mask=row_mask)
Collaborator


I wonder if we can accumulate dW and dB over the grid loop and store them afterwards, similar to

dW_row += tl.sum(dY_row * (X_row * rstd_row[:, None]), 0)

With this approach we can avoid atomic_add and potentially handle the scenario where `num_col_blocks > 1`.

The solution is not trivial and not guaranteed to improve performance; I'm leaving the comment here as a direction for future work.
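The suggested pattern can be sketched in plain Python (a stand-in for the Triton kernel; the data, shapes, and the two-program split are made up for illustration): each program keeps a private accumulator across its grid loop over row blocks and writes a single partial result at the end, and a cheap final reduction replaces the per-block atomic_add into a shared buffer.

```python
# Illustrative comparison of the two accumulation strategies for dW.
# Data and shapes are invented; both paths must produce the same sums.
rows, cols = 6, 4
dY = [[0.1 * (r + c) for c in range(cols)] for r in range(rows)]
x_hat = [[0.5 - 0.1 * c for c in range(cols)] for r in range(rows)]

n_programs = 2

# atomic-add style: every program read-modify-writes one shared buffer,
# once per row block it processes.
shared = [0.0] * cols
for pid in range(n_programs):
    for r in range(pid, rows, n_programs):      # this program's row blocks
        for c in range(cols):
            shared[c] += dY[r][c] * x_hat[r][c]  # one atomic_add per block

# accumulate-then-store style: each program keeps a private accumulator
# over its grid loop and stores one partial; a final reduction sums them.
partials = []
for pid in range(n_programs):
    acc = [0.0] * cols
    for r in range(pid, rows, n_programs):
        for c in range(cols):
            acc[c] += dY[r][c] * x_hat[r][c]
    partials.append(acc)                         # single store per program
reduced = [sum(p[c] for p in partials) for c in range(cols)]

assert all(abs(a - b) < 1e-12 for a, b in zip(shared, reduced))
```

The private-accumulator path trades extra scratch memory (one partial per program) for the removal of atomic contention, which is why it is plausible but not guaranteed to be faster.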

Contributor Author


Thanks for the suggestion! I’ll look into this.

@Tcc0403 added this pull request to the merge queue Mar 17, 2026
Merged via the queue into linkedin:main with commit 68a7489 Mar 17, 2026
5 of 7 checks passed

2 participants