
Conversation

@zhongbozhu
Collaborator

@zhongbozhu zhongbozhu commented Jan 3, 2026

Description

Previously, a similar optimization was applied to the MoE grouped quantize with RHT in #2411. This PR targets dense linear layers and shared experts when they are quantized to NVFP4. With this fusion, the high-precision input only needs to be read once; without it, the input is read twice.

Similarly, the env var NVTE_USE_FAST_MATH controls the numerical behavior of the RHT quant fusion kernel to accelerate it further. Fast math is only applied to the high-precision math, so it should have minimal impact on training convergence.

What the fast-math toggle controls (a rough sketch follows this list):

  1. Replace x / y with x * (1/y).
  2. Replace 1 / x with reciprocal_approximate_ftz(x).
  3. When the RHT cast fusion is available, the fusion allows the NVFP4 quantize to be performed directly on FP32 data in register files, which removes a round trip from FP32 to BF16 and back to FP32.
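
For illustration, the first two substitutions look roughly like the following CUDA device code. This is a hedged sketch, not the kernel's actual code; `rcp.approx.ftz.f32` is the PTX instruction commonly used for an approximate flush-to-zero reciprocal, and the function names here are illustrative.

```cuda
// Illustrative sketch of fast-math substitutions 1 and 2 above.
// Names are hypothetical; the real kernel lives in
// row_cast_col_hadamard_transform_cast_fusion.cu.
__device__ __forceinline__ float reciprocal_approximate_ftz(float x) {
  // PTX approximate reciprocal with flush-to-zero: much cheaper than an
  // IEEE-accurate division, at the cost of a small relative error.
  float y;
  asm volatile("rcp.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
  return y;
}

__device__ __forceinline__ float fast_div(float x, float y) {
  // x / y is replaced by x * (1/y), with 1/y computed approximately.
  return x * reciprocal_approximate_ftz(y);
}
```

Substitution 3 is structural rather than arithmetic: the fused kernel can quantize straight from FP32 values already in registers instead of materializing a BF16 intermediate.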

Therefore, I DO recommend turning it on, since it significantly improves RHT kernel performance.

The only reason it is still not on by default is that we want ZERO-TOLERANCE tests between our CUDA quantize kernels and our PyTorch-based emulated quantize references. With the fast-math toggle turned on, it is hard to pass the tests with zero tolerance without further investigation into how to relax the test conditions while still keeping high confidence in the test cases.
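
For context on what "zero tolerance" means here: the tests require bit-exact agreement between the kernel output and the emulated reference, conceptually like the sketch below (a hypothetical helper, not the real test, which is PyTorch-based, e.g. test_nvfp4_rht_quantize_exact.py).

```cuda
#include <cstdint>
#include <cstring>

// Conceptual zero-tolerance check (hypothetical helper, not the real test):
// the packed FP4 bytes from the CUDA kernel and from the emulated reference
// must agree bit-for-bit. Two FP4 values are packed per byte, so byte
// equality is exact elementwise equality with zero tolerance.
bool outputs_match_exactly(const uint8_t* kernel_out,
                           const uint8_t* reference_out, size_t nbytes) {
  return std::memcmp(kernel_out, reference_out, nbytes) == 0;
}
```

With fast math enabled, the approximate reciprocal can change the last bit of a computed scale or quantized value, which is enough to fail such a bit-exact comparison.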

TODO items:

  • Some deprecated Cutlass APIs are being used, which emit many warnings
  • Maybe turn fast math on by default and switch to NVTE_DISABLE_RHT_FAST_MATH instead of NVTE_USE_FAST_MATH? @timmoon10 for opinions.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Zhongbo Zhu <[email protected]>
@zhongbozhu zhongbozhu requested a review from timmoon10 January 3, 2026 01:23
@zhongbozhu zhongbozhu self-assigned this Jan 3, 2026
@zhongbozhu zhongbozhu added the fp4 label Jan 3, 2026
@greptile-apps
Contributor

greptile-apps bot commented Jan 3, 2026

Greptile Summary

This PR integrates a new Cutlass-based CUDA kernel that fuses Row-Cast and Column-RHT-Transpose-Cast operations for NVFP4 quantization in dense linear layers. The optimization reduces memory traffic by reading the high-precision input data only once instead of twice.

  • Added nvte_hadamard_transform_cast_fusion API function that performs rowwise quantization and columnwise RHT+quantization+transpose in a single kernel
  • Kernel uses MMA hardware for efficient Hadamard transform computation and is eligible when the input is BF16 with rows divisible by 64 and columns divisible by 128
  • Refactored NVFP4Quantizer::quantize_impl() to use the fused kernel when eligible, with extracted helper method for unfused fallback path
  • Added NVTE_USE_FAST_MATH environment variable support to accelerate RHT operations (replaces division with multiplication by reciprocal, uses approximate reciprocal)
  • Extended test coverage to include columnwise-only quantization mode
  • Added benchmark script for profiling linear layer performance across quantization recipes

Confidence Score: 4/5

  • This PR is safe to merge - it adds a new performance optimization path with proper fallback to existing behavior for unsupported shapes.
  • Score of 4 reflects well-structured code with proper compile guards, shape validation checks, and fallback paths. The CUDA kernel follows established patterns in the codebase. Minor unused variables exist but do not affect functionality.
  • The new CUDA kernel file (row_cast_col_hadamard_transform_cast_fusion.cu) is the most complex addition and should be reviewed for numerical correctness in production workloads.

Important Files Changed

Filename Overview
transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu New CUDA kernel implementing fused Row-Cast-Col-RHT-Transpose-Cast for NVFP4 quantization, leveraging MMA hardware for Hadamard transform. Contains minor unused variables but no functional issues.
transformer_engine/pytorch/csrc/quantizer.cpp Refactored quantization logic to use fused kernel when eligible (rows%64==0, cols%128==0), with unfused fallback. Added fast math toggle support and extracted unfused helper method.
transformer_engine/common/include/transformer_engine/hadamard_transform.h Added new API function nvte_hadamard_transform_cast_fusion for row-cast and column-RHT-transpose-cast fusion; marked old columnwise-only function for deprecation.
tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Extended test coverage to support columnwise-only quantization mode in addition to existing rowwise and combined modes.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant NVFP4Q as NVFP4Quantizer
    participant QImpl as quantize_impl()
    participant FusedK as nvte_hadamard_transform_cast_fusion
    participant UnfusedH as quantize_with_rht_unfused_helper
    participant RowQuant as nvte_quantize_v2 (rowwise)
    participant RHT as nvte_hadamard_transform
    participant ColQuant as nvte_quantize_v2 (columnwise)

    User->>NVFP4Q: quantize(input, output)
    NVFP4Q->>QImpl: quantize_impl(input, output)
    
    alt eligible_for_rht_cast_fusion (BF16, rows%64==0, cols%128==0)
        QImpl->>FusedK: nvte_hadamard_transform_cast_fusion()
        Note over FusedK: Single kernel does:<br/>1. Rowwise quantization<br/>2. RHT + columnwise quant + transpose
        FusedK-->>QImpl: rowwise + columnwise output
    else not eligible (irregular shapes)
        QImpl->>UnfusedH: quantize_with_rht_unfused_helper()
        alt rowwise_usage
            UnfusedH->>RowQuant: nvte_quantize_v2()
            RowQuant-->>UnfusedH: rowwise output
        end
        alt columnwise_usage
            UnfusedH->>RHT: nvte_hadamard_transform()
            RHT-->>UnfusedH: RHT(input.T)
            UnfusedH->>ColQuant: nvte_quantize_v2()
            ColQuant-->>UnfusedH: columnwise output
        end
        UnfusedH-->>QImpl: combined output
    end
    
    QImpl-->>NVFP4Q: quantized tensor
    NVFP4Q-->>User: NVFP4Tensor
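
In code form, the dispatch shown in the diagram above is roughly the following. This is a simplified sketch with stub types and declarations; only the eligibility condition (BF16, rows % 64 == 0, cols % 128 == 0) and the two call targets come from this PR, and the real logic lives in transformer_engine/pytorch/csrc/quantizer.cpp.

```cuda
#include <cstddef>

// Stand-in types and declarations for the sketch; not the real TE interfaces.
struct TensorView { bool is_bf16; size_t rows, cols; };
void call_hadamard_transform_cast_fusion(const TensorView&);   // fused path
void call_quantize_with_rht_unfused_helper(const TensorView&); // fallback path

void quantize_impl_sketch(const TensorView& input) {
  const bool eligible_for_rht_cast_fusion =
      input.is_bf16 && input.rows % 64 == 0 && input.cols % 128 == 0;
  if (eligible_for_rht_cast_fusion) {
    // Single kernel: rowwise quantize + columnwise RHT + quantize + transpose,
    // reading the high-precision input once.
    call_hadamard_transform_cast_fusion(input);
  } else {
    // Unfused fallback: rowwise nvte_quantize_v2, then nvte_hadamard_transform
    // and a second nvte_quantize_v2 for the columnwise output.
    call_quantize_with_rht_unfused_helper(input);
  }
}
```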

@zhongbozhu zhongbozhu changed the title [NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose Fusion Kernel [NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel Jan 3, 2026
Signed-off-by: Zhongbo Zhu <[email protected]>
@zhongbozhu zhongbozhu force-pushed the zhongbo/dense_row_col_rht_fp4_quant branch from c80932f to fc42825 on January 3, 2026 04:16
@zhongbozhu
Collaborator Author

/te-ci arm L1

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. benchmarks/linear/benchmark_linear.py, line 141 (link)

    logic: NVTX range is pushed but never popped in the benchmark function (a balanced-range sketch follows this list)

  2. transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu, line 346 (link)

    syntax: Typo in comment: 'SMEMork' should be 'SMEM work'
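
Regarding comment 1: NVTX ranges must be balanced. The benchmark uses torch.cuda.nvtx from Python, but the required pattern is the same as this C++ sketch using nvToolsExt (function name is illustrative):

```cuda
#include <nvToolsExt.h>

// Balanced NVTX range: every push must be matched by a pop, or the
// profiler's range stack leaks (the issue flagged in comment 1).
void profiled_iteration() {
  nvtxRangePushA("linear_forward");
  // ... work being profiled ...
  nvtxRangePop();
}
```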

8 files reviewed, 2 comments


Signed-off-by: Zhongbo Zhu <[email protected]>