
Conversation

@zhongbozhu
Collaborator

@zhongbozhu zhongbozhu commented Jan 3, 2026

Description

Previously, a similar optimization was applied to the MoE grouped quantize with RHT in #2411. This PR targets dense linear layers and shared experts when they are quantized to NVFP4. With this fusion, the high-precision input only needs to be read once; without it, the input is read twice.

Similarly, the env var NVTE_USE_FAST_MATH controls the numerical behavior of the RHT quant fusion kernel to accelerate it further. Fast math is only applied to the high-precision math, so it should have minimal impact on training convergence.

What the fast-math toggle controls (a rough sketch follows this list):

  1. Replace x / y with x * (1/y).
  2. Replace 1 / x with reciprocal_approximate_ftz(x).
  3. When the RHT cast fusion is available, the fusion allows the NVFP4 quantize to be performed directly on FP32 data in register files, which removes a round trip from FP32 to BF16 and back to FP32.
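
For illustration, the first two substitutions look roughly like the following CUDA device code. This is a hedged sketch, not the kernel's actual code; `rcp.approx.ftz.f32` is the PTX instruction commonly used for an approximate flush-to-zero reciprocal, and the function names here are illustrative.

```cuda
// Illustrative sketch of fast-math substitutions 1 and 2 above.
// Names are hypothetical; the real kernel lives in
// row_cast_col_hadamard_transform_cast_fusion.cu.
__device__ __forceinline__ float reciprocal_approximate_ftz(float x) {
  // PTX approximate reciprocal with flush-to-zero: much cheaper than an
  // IEEE-accurate division, at the cost of a small relative error.
  float y;
  asm volatile("rcp.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
  return y;
}

__device__ __forceinline__ float fast_div(float x, float y) {
  // x / y is replaced by x * (1/y), with 1/y computed approximately.
  return x * reciprocal_approximate_ftz(y);
}
```

Substitution 3 is structural rather than arithmetic: the fused kernel can quantize straight from FP32 values already in registers instead of materializing a BF16 intermediate.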

Therefore, I DO recommend turning it on, since it significantly improves RHT kernel performance.

The only reason it is still not on by default is that we want ZERO-TOLERANCE tests between our CUDA quantize kernels and our PyTorch-based emulated quantize references. With the fast-math toggle turned on, it is hard to pass the tests with zero tolerance without further investigation into how to relax the test conditions while still keeping high confidence in the test cases.
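
For context on what "zero tolerance" means here: the tests require bit-exact agreement between the kernel output and the emulated reference, conceptually like the sketch below (a hypothetical helper, not the real test, which is PyTorch-based, e.g. test_nvfp4_rht_quantize_exact.py).

```cuda
#include <cstdint>
#include <cstring>

// Conceptual zero-tolerance check (hypothetical helper, not the real test):
// the packed FP4 bytes from the CUDA kernel and from the emulated reference
// must agree bit-for-bit. Two FP4 values are packed per byte, so byte
// equality is exact elementwise equality with zero tolerance.
bool outputs_match_exactly(const uint8_t* kernel_out,
                           const uint8_t* reference_out, size_t nbytes) {
  return std::memcmp(kernel_out, reference_out, nbytes) == 0;
}
```

With fast math enabled, the approximate reciprocal can change the last bit of a computed scale or quantized value, which is enough to fail such a bit-exact comparison.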

TODO items:

  • Some deprecated Cutlass APIs are being used, which emit many warnings
  • Maybe turn fast math on by default and switch to NVTE_DISABLE_RHT_FAST_MATH instead of NVTE_USE_FAST_MATH? @timmoon10 for opinions.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Zhongbo Zhu <[email protected]>
@zhongbozhu zhongbozhu requested a review from timmoon10 January 3, 2026 01:23
@zhongbozhu zhongbozhu self-assigned this Jan 3, 2026
@zhongbozhu zhongbozhu added the fp4 label Jan 3, 2026
@greptile-apps
Contributor

greptile-apps bot commented Jan 3, 2026

Greptile Summary

This PR integrates a new Cutlass-based CUDA kernel that fuses Row-Cast and Column-RHT-Transpose-Cast operations for NVFP4 quantization in dense linear layers. The optimization reduces memory traffic by reading the high-precision input data only once instead of twice.

  • Added nvte_hadamard_transform_cast_fusion API function that performs rowwise quantization and columnwise RHT+quantization+transpose in a single kernel
  • Kernel uses MMA hardware for efficient Hadamard transform computation and is eligible when the input is BF16 with rows divisible by 64 and columns divisible by 128
  • Refactored NVFP4Quantizer::quantize_impl() to use the fused kernel when eligible, with extracted helper method for unfused fallback path
  • Added NVTE_USE_FAST_MATH environment variable support to accelerate RHT operations (replaces division with multiplication by reciprocal, uses approximate reciprocal)
  • Extended test coverage to include columnwise-only quantization mode
  • Added benchmark script for profiling linear layer performance across quantization recipes

Confidence Score: 4/5

  • This PR is safe to merge - it adds a new performance optimization path with proper fallback to existing behavior for unsupported shapes.
  • Score of 4 reflects well-structured code with proper compile guards, shape validation checks, and fallback paths. The CUDA kernel follows established patterns in the codebase. Minor unused variables exist but do not affect functionality.
  • The new CUDA kernel file (row_cast_col_hadamard_transform_cast_fusion.cu) is the most complex addition and should be reviewed for numerical correctness in production workloads.

Important Files Changed

Filename Overview
transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu New CUDA kernel implementing fused Row-Cast-Col-RHT-Transpose-Cast for NVFP4 quantization, leveraging MMA hardware for Hadamard transform. Contains minor unused variables but no functional issues.
transformer_engine/pytorch/csrc/quantizer.cpp Refactored quantization logic to use fused kernel when eligible (rows%64==0, cols%128==0), with unfused fallback. Added fast math toggle support and extracted unfused helper method.
transformer_engine/common/include/transformer_engine/hadamard_transform.h Added new API function nvte_hadamard_transform_cast_fusion for row-cast and column-RHT-transpose-cast fusion; marked old columnwise-only function for deprecation.
tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Extended test coverage to support columnwise-only quantization mode in addition to existing rowwise and combined modes.

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant NVFP4Q as NVFP4Quantizer
    participant QImpl as quantize_impl()
    participant FusedK as nvte_hadamard_transform_cast_fusion
    participant UnfusedH as quantize_with_rht_unfused_helper
    participant RowQuant as nvte_quantize_v2 (rowwise)
    participant RHT as nvte_hadamard_transform
    participant ColQuant as nvte_quantize_v2 (columnwise)

    User->>NVFP4Q: quantize(input, output)
    NVFP4Q->>QImpl: quantize_impl(input, output)
    
    alt eligible_for_rht_cast_fusion (BF16, rows%64==0, cols%128==0)
        QImpl->>FusedK: nvte_hadamard_transform_cast_fusion()
        Note over FusedK: Single kernel does:<br/>1. Rowwise quantization<br/>2. RHT + columnwise quant + transpose
        FusedK-->>QImpl: rowwise + columnwise output
    else not eligible (irregular shapes)
        QImpl->>UnfusedH: quantize_with_rht_unfused_helper()
        alt rowwise_usage
            UnfusedH->>RowQuant: nvte_quantize_v2()
            RowQuant-->>UnfusedH: rowwise output
        end
        alt columnwise_usage
            UnfusedH->>RHT: nvte_hadamard_transform()
            RHT-->>UnfusedH: RHT(input.T)
            UnfusedH->>ColQuant: nvte_quantize_v2()
            ColQuant-->>UnfusedH: columnwise output
        end
        UnfusedH-->>QImpl: combined output
    end
    
    QImpl-->>NVFP4Q: quantized tensor
    NVFP4Q-->>User: NVFP4Tensor
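
In code form, the dispatch shown in the diagram above is roughly the following. This is a simplified sketch with stub types and declarations; only the eligibility condition (BF16, rows % 64 == 0, cols % 128 == 0) and the two call targets come from this PR, and the real logic lives in transformer_engine/pytorch/csrc/quantizer.cpp.

```cuda
#include <cstddef>

// Stand-in types and declarations for the sketch; not the real TE interfaces.
struct TensorView { bool is_bf16; size_t rows, cols; };
void call_hadamard_transform_cast_fusion(const TensorView&);   // fused path
void call_quantize_with_rht_unfused_helper(const TensorView&); // fallback path

void quantize_impl_sketch(const TensorView& input) {
  const bool eligible_for_rht_cast_fusion =
      input.is_bf16 && input.rows % 64 == 0 && input.cols % 128 == 0;
  if (eligible_for_rht_cast_fusion) {
    // Single kernel: rowwise quantize + columnwise RHT + quantize + transpose,
    // reading the high-precision input once.
    call_hadamard_transform_cast_fusion(input);
  } else {
    // Unfused fallback: rowwise nvte_quantize_v2, then nvte_hadamard_transform
    // and a second nvte_quantize_v2 for the columnwise output.
    call_quantize_with_rht_unfused_helper(input);
  }
}
```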

@zhongbozhu zhongbozhu changed the title [NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose Fusion Kernel [NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel Jan 3, 2026
Signed-off-by: Zhongbo Zhu <[email protected]>
@zhongbozhu zhongbozhu force-pushed the zhongbo/dense_row_col_rht_fp4_quant branch from c80932f to fc42825 on January 3, 2026 04:16
@zhongbozhu
Collaborator Author

/te-ci arm L1

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. benchmarks/linear/benchmark_linear.py, line 141 (link)

    logic: NVTX range is pushed but never popped in the benchmark function (a balanced-range sketch follows this list)

  2. transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu, line 346 (link)

    syntax: Typo in comment: 'SMEMork' should be 'SMEM work'
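
Regarding comment 1: NVTX ranges must be balanced. The benchmark uses torch.cuda.nvtx from Python, but the required pattern is the same as this C++ sketch using nvToolsExt (function name is illustrative):

```cuda
#include <nvToolsExt.h>

// Balanced NVTX range: every push must be matched by a pop, or the
// profiler's range stack leaks (the issue flagged in comment 1).
void profiled_iteration() {
  nvtxRangePushA("linear_forward");
  // ... work being profiled ...
  nvtxRangePop();
}
```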

8 files reviewed, 2 comments


Signed-off-by: Zhongbo Zhu <[email protected]>