[NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel #2555
Conversation
Signed-off-by: Zhongbo Zhu <[email protected]>
Greptile Summary

This PR integrates a new Cutlass-based CUDA kernel that fuses the Row-Cast and Column-RHT-Transpose-Cast operations for NVFP4 quantization in dense linear layers. The optimization reduces memory bandwidth by reading the high-precision input data only once instead of twice.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User Code
    participant NVFP4Q as NVFP4Quantizer
    participant QImpl as quantize_impl()
    participant FusedK as nvte_hadamard_transform_cast_fusion
    participant UnfusedH as quantize_with_rht_unfused_helper
    participant RowQuant as nvte_quantize_v2 (rowwise)
    participant RHT as nvte_hadamard_transform
    participant ColQuant as nvte_quantize_v2 (columnwise)
    User->>NVFP4Q: quantize(input, output)
    NVFP4Q->>QImpl: quantize_impl(input, output)
    alt eligible_for_rht_cast_fusion (BF16, rows%64==0, cols%128==0)
        QImpl->>FusedK: nvte_hadamard_transform_cast_fusion()
        Note over FusedK: Single kernel does:<br/>1. Rowwise quantization<br/>2. RHT + columnwise quant + transpose
        FusedK-->>QImpl: rowwise + columnwise output
    else not eligible (irregular shapes)
        QImpl->>UnfusedH: quantize_with_rht_unfused_helper()
        alt rowwise_usage
            UnfusedH->>RowQuant: nvte_quantize_v2()
            RowQuant-->>UnfusedH: rowwise output
        end
        alt columnwise_usage
            UnfusedH->>RHT: nvte_hadamard_transform()
            RHT-->>UnfusedH: RHT(input.T)
            UnfusedH->>ColQuant: nvte_quantize_v2()
            ColQuant-->>UnfusedH: columnwise output
        end
        UnfusedH-->>QImpl: combined output
    end
    QImpl-->>NVFP4Q: quantized tensor
    NVFP4Q-->>User: NVFP4Tensor
```
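The dispatch condition in the diagram can be restated as a small predicate. This is an illustrative Python sketch of the eligibility rule (BF16 input, rows divisible by 64, columns divisible by 128), not the actual TransformerEngine implementation; only the name `eligible_for_rht_cast_fusion` comes from the diagram.

```python
def eligible_for_rht_cast_fusion(dtype: str, rows: int, cols: int) -> bool:
    """Illustrative sketch: the fused kernel requires BF16 input with
    rows divisible by 64 and columns divisible by 128."""
    return dtype == "bfloat16" and rows % 64 == 0 and cols % 128 == 0

# Regular shapes take the fused path; irregular shapes fall back to
# quantize_with_rht_unfused_helper.
print(eligible_for_rht_cast_fusion("bfloat16", 4096, 4096))  # True
print(eligible_for_rht_cast_fusion("bfloat16", 100, 128))    # False
```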
Force-pushed from c80932f to fc42825.
/te-ci arm L1
Additional Comments (2)
8 files reviewed, 2 comments
Description
A similar optimization was previously applied to MoE grouped quantize with RHT in #2411. This PR targets dense linear layers and shared experts when quantizing to NVFP4. With this fusion, the high-precision input only needs to be read once; without it, it must be read twice.
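As a rough back-of-the-envelope illustration of the saving (assumed arithmetic, not a measurement from this PR), halving the number of passes over a BF16 input halves the high-precision read traffic:

```python
BF16_BYTES = 2  # bytes per bfloat16 element

def hp_input_bytes_read(rows: int, cols: int, fused: bool) -> int:
    # Unfused path reads the high-precision input twice (rowwise
    # quantize pass + RHT/columnwise pass); the fused kernel reads once.
    passes = 1 if fused else 2
    return passes * rows * cols * BF16_BYTES

# e.g. a hypothetical 4096x4096 BF16 activation:
unfused = hp_input_bytes_read(4096, 4096, fused=False)  # 64 MiB
fused = hp_input_bytes_read(4096, 4096, fused=True)     # 32 MiB
```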
Similarly, the env var NVTE_USE_FAST_MATH controls the numerical behavior of the RHT quant fusion kernel to accelerate it further. Fast math is applied only to the high-precision math, so it has minimal impact on training convergence.
What fast-math toggle controls:
Therefore, I DO recommend turning it on, since it significantly improves RHT kernel performance.
The only reason it is still not enabled by default is that we want ZERO TOLERANCE tests between our CUDA quantize kernels and our PyTorch-based emulated quantize references. With the fast-math toggle turned on, it is hard to pass the tests with zero tolerance without further investigating how to relax the test conditions while still keeping high confidence in the test cases.
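A minimal sketch of flipping the toggle from Python before kernel selection happens; the variable name NVTE_USE_FAST_MATH comes from the description above, while the read-back helper is hypothetical:

```python
import os

# Opt in to fast math for the RHT quant fusion kernel
# (set before the library picks its kernels).
os.environ["NVTE_USE_FAST_MATH"] = "1"

def fast_math_enabled() -> bool:
    # Hypothetical helper mirroring a typical env-var check.
    return os.environ.get("NVTE_USE_FAST_MATH", "0") == "1"

print(fast_math_enabled())  # True
```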
TODO items:
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: