[https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend #7286
Conversation
📝 Walkthrough
Adds per-expert pre-quantization: device utility, per-expert CUDA kernels and host launchers, extended applyPrequantScale signature and GEMM2 call sites, PyTorch MoE quantization switched to per-expert shapes, and a unit test updated to use randomized pre-quant scales.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    actor GEMM2 as GEMM2 Path
    participant Runner as CutlassMoeFCRunner
    participant Launcher as Per-Expert Launcher
    participant Kernel as CUDA Kernel
    Note over Runner,Launcher: Decision: use_awq && expert mapping provided
    GEMM2->>Runner: applyPrequantScale(..., stream, expert_offsets, num_experts)
    alt per-expert path
        Runner->>Launcher: apply_per_channel_scale_per_expert(..., expert_offsets, num_experts, stream)
        Launcher->>Kernel: launch<<<grid,block>>>(..., expert_offsets, num_experts)
        Kernel-->>Launcher: writes smoothed_act (per-expert scaled)
        Launcher-->>Runner: returns smoothed_act pointer
        Runner-->>GEMM2: proceed with smoothed_act as gemm input
    else fallback path
        Runner-->>GEMM2: proceed with permuted_data (existing path)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (14)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (6)
955-966: Per-expert shapes look good; consider clarifying the comment.
fc31_act_scale and fc2_act_scale now carry per-expert factors. The trailing singleton dim on fc2_act_scale is fine; please tweak the comment to explicitly say shapes are [E, K] and [E, I, 1].
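To make the shapes concrete, here is a minimal, hypothetical sketch of how such per-expert parameters could be registered; the sizes (E, K, I) and the use of nn.Parameter are illustrative assumptions, and only the attribute names come from the module.
```python
import torch
import torch.nn as nn

# Hypothetical sketch of the per-expert activation-scale shapes discussed above.
# E = experts per partition, K = hidden size, I = intermediate size (example values).
E, K, I = 8, 128, 256
fc31_act_scale = nn.Parameter(torch.ones(E, K), requires_grad=False)    # [E, K]
fc2_act_scale = nn.Parameter(torch.ones(E, I, 1), requires_grad=False)  # [E, I, 1]
# Row e of each tensor now holds the pre-quant scales for expert e; previously a
# single per-layer vector was shared by all experts.
```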
1187-1203: Avoid extra stacks/permutes; compute in fp32 and use torch.maximum.
This reduces temporary allocations, improves numeric stability, and matches the target [E, K] layout directly.
```diff
- all_w3_w1_pre_quant_scales_greater = torch.max(
-     torch.stack([torch.stack(all_w3_pre_quant_scales),
-                  torch.stack(all_w1_pre_quant_scales)]).to(module.dtype),
-     dim=0,).values.permute(1, 0)
-
- all_w3_w1_input_scales_greater = torch.max(
-     torch.stack([torch.stack(all_w3_input_scales),
-                  torch.stack(all_w1_input_scales)]).to(module.dtype),
-     dim=0,).values
-
- all_w3_w1_pre_quant_scales_div_input_scales = (
-     all_w3_w1_pre_quant_scales_greater *
-     (1 / all_w3_w1_input_scales_greater.reshape(1, module.expert_size_per_partition).float())
- )
-
- module.fc31_act_scale.data.copy_(all_w3_w1_pre_quant_scales_div_input_scales.permute(1, 0))
+ # [E, K] tensors; keep math in fp32 for stability
+ w3_pre = torch.stack(all_w3_pre_quant_scales).to(torch.float32)
+ w1_pre = torch.stack(all_w1_pre_quant_scales).to(torch.float32)
+ pre_max = torch.maximum(w3_pre, w1_pre)  # [E, K]
+
+ in_scales = torch.maximum(
+     torch.stack(all_w3_input_scales).to(torch.float32),
+     torch.stack(all_w1_input_scales).to(torch.float32),
+ )  # [E]
+
+ fc31_act = pre_max / in_scales.unsqueeze(1)  # [E, K]
+ module.fc31_act_scale.data.copy_(fc31_act.to(module.dtype))
```
Also applies to: 1192-1196
1214-1224: Same simplification for weight_scale_2 and alpha; reuse in_scales.
Saves memory and keeps compute in fp32.
```diff
- all_w3_w1_weight_scale_2 = torch.stack(
-     [torch.stack(all_w3_weight_scale_2), torch.stack(all_w1_weight_scale_2)]).to(
-         module.dtype)
- all_w3_w1_weight_scale_2_greater = torch.max(all_w3_w1_weight_scale_2, dim=0).values
-
- all_w3_w1_weight_scale_2_mul_input_scales = (
-     all_w3_w1_weight_scale_2_greater.reshape(module.expert_size_per_partition, 1).float() *
-     all_w3_w1_input_scales_greater.reshape(module.expert_size_per_partition, 1).float()
- )
- module.fc31_alpha.data.copy_(all_w3_w1_weight_scale_2_mul_input_scales.reshape(module.expert_size_per_partition, 1).float())
+ w3_ws2 = torch.stack(all_w3_weight_scale_2).to(torch.float32).squeeze(-1)  # [E]
+ w1_ws2 = torch.stack(all_w1_weight_scale_2).to(torch.float32).squeeze(-1)  # [E]
+ all_w3_w1_weight_scale_2_greater = torch.maximum(w3_ws2, w1_ws2)  # [E]
+
+ module.fc31_alpha.data.copy_((all_w3_w1_weight_scale_2_greater * in_scales).unsqueeze(1))
```
1251-1254: Normalize per-group scales without permutes.
Broadcast over the expert axis; cheaper and clearer.
```diff
- w3_w1_scales = w3_w1_scales.permute(1, 2, 0)
- w3_w1_scales /= all_w3_w1_weight_scale_2_greater.reshape(module.expert_size_per_partition).float()
- w3_w1_scales = w3_w1_scales.permute(2, 0, 1)
+ w3_w1_scales = w3_w1_scales / all_w3_w1_weight_scale_2_greater.view(-1, 1, 1).float()
```
1294-1313: Simplify fc2 per-expert scales; avoid permutes and keep fp32 math.
Mirrors the fc31 refactor for consistency and stability.
```diff
- all_w2_pre_quant_scales = torch.stack(all_w2_pre_quant_scales).to(module.dtype)
- all_w2_input_scales = torch.stack(all_w2_input_scales).to(module.dtype)
- all_w2_pre_quant_scales_div_input_scales = (
-     all_w2_pre_quant_scales.permute(1, 0) *
-     (1 / (all_w2_input_scales.reshape(module.expert_size_per_partition).float()))
- ).permute(1, 0)
- module.fc2_act_scale.data.copy_(all_w2_pre_quant_scales_div_input_scales.reshape(module.fc2_act_scale.shape))
+ fc2_act = (torch.stack(all_w2_pre_quant_scales).to(torch.float32)
+            / torch.stack(all_w2_input_scales).to(torch.float32).unsqueeze(1))  # [E, I]
+ module.fc2_act_scale.data.copy_(fc2_act.unsqueeze(-1).to(module.dtype))
```
```diff
- all_w2_weight_scale_2 = torch.stack(all_w2_weight_scale_2).to(module.dtype)
- all_w2_weight_scale_2_mul_input_scales = (
-     all_w2_weight_scale_2.reshape(module.expert_size_per_partition, 1) *
-     all_w2_input_scales.reshape(module.expert_size_per_partition, 1)
- )
- module.fc2_alpha.data.copy_(all_w2_weight_scale_2_mul_input_scales)
+ all_w2_weight_scale_2 = torch.stack(all_w2_weight_scale_2).to(torch.float32)  # [E]
+ module.fc2_alpha.data.copy_(
+     (all_w2_weight_scale_2 * torch.stack(all_w2_input_scales).to(torch.float32)).unsqueeze(1))
```
1331-1335: Same broadcast trick here; drop permutes.
```diff
- w2_scales = w2_scales.permute(1, 2, 0)
- all_w2_weight_scale_2 = all_w2_weight_scale_2.reshape(module.expert_size_per_partition)
- w2_scales /= (all_w2_weight_scale_2.float())
- w2_scales = w2_scales.permute(2, 0, 1)
+ w2_scales = w2_scales / all_w2_weight_scale_2.view(-1, 1, 1).float()
```
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1)
42-46: Make expert offsets const and match defaults with the non-expert launcher; add brief API doc.
The kernel does not modify offsets; aligning defaults simplifies callsites and mirrors the existing launcher.
```diff
-template <typename T_in, typename T_out = T_in>
-void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
-    int rows, int cols, int64_t* expert_first_token_offset, int const num_experts_per_node,
-    int64_t const* num_valid_tokens_ptr, cudaStream_t stream);
+//! Applies per-channel pre-quant scales per expert to permuted activations.
+template <typename T_in, typename T_out = T_in>
+void apply_per_channel_scale_per_expert_kernel_launcher(
+    T_out* smoothed_act,
+    T_in const* act,
+    T_in const* per_channel_scale,
+    int rows,
+    int cols,
+    int64_t const* expert_first_token_offset,
+    int const num_experts_per_node,
+    int64_t const* num_valid_tokens_ptr = nullptr,
+    cudaStream_t stream = 0);
```
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
863-866: Const-correct expert offsets; add a short doc to the helper.
Keeps signature consistent with the kernel launcher and clarifies behavior.
```diff
-    T const* applyPrequantScalePerExpert(void* smoothed_act, void const* permuted_data, void const* prequant_scales,
-        int64_t const* num_valid_tokens_ptr, int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq,
-        cudaStream_t stream, int64_t* expert_first_token_offset, int const num_experts_per_node);
+    //! Applies pre-quant scales per expert and returns the GEMM input pointer
+    T const* applyPrequantScalePerExpert(void* smoothed_act, void const* permuted_data, void const* prequant_scales,
+        int64_t const* num_valid_tokens_ptr, int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq,
+        cudaStream_t stream, int64_t const* expert_first_token_offset, int const num_experts_per_node);
```
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
3049-3076: Make expert_first_token_offset const; it’s read-only.
Minor const-correctness + clarity for the new API.
Apply this diff:
```diff
-template <class T, class WeightType, class OutputType, class InputType, class ScaleBiasType, class Enable>
-T const* CutlassMoeFCRunner<T, WeightType, OutputType, InputType, ScaleBiasType, Enable>::applyPrequantScalePerExpert(
-    void* smoothed_act, void const* permuted_data, void const* prequant_scales, int64_t const* num_valid_tokens_ptr,
-    int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq, cudaStream_t stream,
-    int64_t* expert_first_token_offset, int const num_experts_per_node)
+template <class T, class WeightType, class OutputType, class InputType, class ScaleBiasType, class Enable>
+T const* CutlassMoeFCRunner<T, WeightType, OutputType, InputType, ScaleBiasType, Enable>::applyPrequantScalePerExpert(
+    void* smoothed_act, void const* permuted_data, void const* prequant_scales, int64_t const* num_valid_tokens_ptr,
+    int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq, cudaStream_t stream,
+    int64_t const* expert_first_token_offset, int const num_experts_per_node)
 {
@@
-        tensorrt_llm::kernels::apply_per_channel_scale_per_expert_kernel_launcher<UnfusedGemmOutputType, T>(
+        tensorrt_llm::kernels::apply_per_channel_scale_per_expert_kernel_launcher<UnfusedGemmOutputType, T>(
             reinterpret_cast<T*>(smoothed_act), reinterpret_cast<UnfusedGemmOutputType const*>(permuted_data),
             reinterpret_cast<UnfusedGemmOutputType const*>(prequant_scales), expanded_num_rows, seq_len,
-            expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
+            const_cast<int64_t*>(expert_first_token_offset), num_experts_per_node, num_valid_tokens_ptr, stream);
```
Note: header declaration and the kernel launcher signature should be updated to accept int64_t const* as well (see companion comments).
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (5)
99-118: Avoid duplicating findTotalEltsLessThanTarget across compilation units.
This device helper exists in moe_kernels.cu too. Consider moving it to a shared header (e.g., moe_util_kernels.h) and including it here to prevent divergence.
120-178: Const-correctness and type width in per-expert kernel.
- expert_first_token_offset is read-only; take it as int64_t const*.
- Avoid narrowing: store expert as int64_t.
Apply this diff:
```diff
-template <typename T_in, typename T_out, int kProcessRows, typename AccessType>
-__global__ void apply_per_channel_scale_per_expert(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
-    int cols, int64_t* expert_first_token_offset, int const num_experts_per_node, int64_t const* num_valid_tokens_ptr)
+template <typename T_in, typename T_out, int kProcessRows, typename AccessType>
+__global__ void apply_per_channel_scale_per_expert(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
+    int cols, int64_t const* expert_first_token_offset, int const num_experts_per_node, int64_t const* num_valid_tokens_ptr)
 {
@@
-    for (int i = 0; i < kProcessRows; ++i)
+    for (int i = 0; i < kProcessRows; ++i)
     {
-        int expert = findTotalEltsLessThanTarget(expert_first_token_offset, num_experts_per_node, (int64_t) row_offset * kProcessRows + i + 1) - 1;
+        int64_t expert = findTotalEltsLessThanTarget(
+            expert_first_token_offset, num_experts_per_node, (int64_t) row_offset * kProcessRows + i + 1) - 1;
         *reinterpret_cast<AccessType*>(scale)
             = reinterpret_cast<AccessType const*>(per_channel_scale)[expert * cols / kElems + col_offset];
```
190-202: Launcher: pass expert_first_token_offset as const.
Keep API consistent with the kernel.
Apply this diff:
```diff
-template <typename T_in, typename T_out, int kProcessRows, typename AccessType = float4>
-void apply_per_channel_scale_per_expert_kernel_launcher_(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
-    int rows, int cols, int64_t* expert_first_token_offset, int const num_experts_per_node,
-    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
+template <typename T_in, typename T_out, int kProcessRows, typename AccessType = float4>
+void apply_per_channel_scale_per_expert_kernel_launcher_(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
+    int rows, int cols, int64_t const* expert_first_token_offset, int const num_experts_per_node,
+    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
 {
@@
-    apply_per_channel_scale_per_expert<T_in, T_out, kProcessRows, AccessType>
+    apply_per_channel_scale_per_expert<T_in, T_out, kProcessRows, AccessType>
         <<<grid, block, 0, stream>>>(smoothed_act, act, per_channel_scale, rows, cols,
-            expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr);
+            expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr);
```
230-256: Top-level launcher: const-correct expert_first_token_offset.
Mirror the change through the public launcher.
Apply this diff:
```diff
-template <typename T_in, typename T_out>
-void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
-    int rows, int cols, int64_t* expert_first_token_offset, int const num_experts_per_node,
-    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
+template <typename T_in, typename T_out>
+void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
+    int rows, int cols, int64_t const* expert_first_token_offset, int const num_experts_per_node,
+    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
 {
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 1, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 1, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 4, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 4, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 8, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 8, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 16, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 16, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
```
274-290: Instantiate const-correct launcher prototypes.
If you adopt const in the launcher signatures, these instantiations remain unchanged; ensure the corresponding header declarations are updated.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hh,hpp,hxx,cuh}
: Closing braces of C++ namespaces in headers must include a trailing comment naming the namespace
Use Allman brace style and always use braces for control statements in headers as well
C++ header filenames must be lowerCamelCase and case-insensitively unique within a compilation target
Document public C++ interfaces with Doxygen using //! and //!<; C-style comments are not allowed except inline special cases; single-line comments should use // and be properly capitalized and punctuated if full sentences
Avoid assignment in subexpressions within header inline/template code as well
All class/function templates and their members should be instantiated at least once; if a class is not POD, its data members should be private
Use header include guards; name as TRTLLM__H (all caps of filename only, no dirs), no leading underscore and no trailing underscore
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧠 Learnings (3)
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧬 Code graph analysis (5)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4)
  - void (43-96)
  - void (121-176)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-256)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-233)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
  - applyPrequantScalePerExpert (3050-3076)
  - applyPrequantScalePerExpert (3050-3053)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (1)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
  - findTotalEltsLessThanTarget (902-920)
  - findTotalEltsLessThanTarget (902-902)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4)
  - findTotalEltsLessThanTarget (100-118)
  - findTotalEltsLessThanTarget (100-100)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-256)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-233)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
  - smoothed_act_ (896-896)
  - fc1_result_ (886-886)
  - expert_first_token_offset_ (882-882)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
- tensorrt_llm/module.py (1)
  - register_parameter (186-190)
- tensorrt_llm/_torch/modules/linear.py (1)
  - load_weight_shard (58-102)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/quantization.py
1223-1223: Line too long (136 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/_torch/modules/test_fused_moe.py (1)
1331-1340: Good: randomized pre-quant scales exercise per-expert paths.
Seeds are set earlier, and the 0.95–1.05 range avoids degenerate cases while keeping stability. LGTM.
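For illustration, a hedged sketch of the kind of randomization this refers to; the tensor name and shape are assumptions rather than the actual test code.
```python
import torch

torch.manual_seed(0)  # seeds are set earlier in the real test
num_experts, hidden_size = 8, 256
# Draw per-expert pre-quant scales uniformly from [0.95, 1.05] so the
# per-expert path is exercised without destabilizing the reference check.
pre_quant_scale = 0.95 + 0.1 * torch.rand(num_experts, hidden_size, dtype=torch.float32)
```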
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
1613-1625: Per-expert prequant scaling path looks correct; verify scale layout/alignments.
- Row→expert mapping via expert_first_token_offset and the binary search is sound.
- Indexing into prequant_scales assumes a contiguous [num_experts_per_node, hidden_size] layout and hidden_size divisible by ELEM_PER_THREAD/Vec width. Please confirm these invariants at call sites and in tests.
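To make the two points above concrete, here is a small PyTorch emulation of the row→expert mapping and the per-expert scale lookup; it is an illustrative reference under the stated layout assumptions, not the CUDA kernel itself.
```python
import torch

def scale_rows_per_expert(permuted_act, prequant_scales, expert_first_token_offset):
    """Emulate per-expert pre-quant scaling on permuted activations.

    permuted_act:              [num_rows, hidden_size]
    prequant_scales:           [num_experts, hidden_size], contiguous
    expert_first_token_offset: [num_experts + 1] prefix offsets, offset[0] == 0
    """
    rows = torch.arange(permuted_act.shape[0], device=permuted_act.device)
    # Row r belongs to expert e iff offset[e] <= r < offset[e + 1]; this mirrors
    # the "count of offsets <= r" binary search the device helper performs.
    expert_ids = torch.searchsorted(expert_first_token_offset, rows, right=True) - 1
    return permuted_act * prequant_scales[expert_ids]

# Example: 2 experts, rows 0-2 -> expert 0, rows 3-4 -> expert 1.
act = torch.ones(5, 4)
scales = torch.tensor([[2.0] * 4, [3.0] * 4])
offsets = torch.tensor([0, 3, 5])
print(scale_rows_per_expert(act, scales, offsets))
```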
3854-3855: Check per-expert scale tensor shape and dtype at call-site.
Ensure quant_params.groupwise.fc2.act_scales is laid out [num_experts_per_node, inter_size] and matches UnfusedGemmOutputType; otherwise scaling or vectorized loads can misbehave.
eb370a5 to e417e05 (Compare)
e417e05 to a062f81 (Compare)
/bot run
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (1)
51-54: Fix vector tail handling and grid-y computation in preQuantScaleKernel
The current grid-y setup uses integer division (cols / kElems), allowing threads that begin a partial vector to pass the col_offset * kElems >= cols guard and then read/write past the end when cols % kElems != 0. To prevent out-of-bounds accesses:
• Compute the number of vector columns with a ceiling:
```cpp
const int vecCols = (cols + kElems - 1) / kElems; // ceil(cols/kElems)
dim3 grid(
    (rows + kProcessRows - 1) / kProcessRows,
    (vecCols + block.x - 1) / block.x);
```
(Replace the existing grid definition at line 115.)
• Tighten the in-kernel guard so any thread whose vector would overflow returns early:
```diff
- if (col_offset * kElems >= cols || row_offset * kProcessRows >= rows)
+ if ((col_offset + 1) * kElems > cols || row_offset * kProcessRows >= rows)
      return;
```
(Update the check at line 51, and similarly in the overload at lines 108–118.)
• Add a masked “tail” path for the last partial vector when (col_offset + 1) * kElems > cols, loading and storing elements one by one to handle the remainder safely.
These changes eliminate any partial-vector overreads or overwrites when cols is not an exact multiple of kElems.
🧹 Nitpick comments (11)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
466-472: Gate W4A8_CUSTOM by algo as well, not only by “int4 per-group” mode.
layer_quant_mode.is_int4_weight_only_per_group() may be true for non‑AWQ quant paths (e.g., other int4 per‑group modes). To avoid routing unrelated configs into the “custom W4A8” loader, also check the algo is W4A8_AWQ.
Apply:
```diff
- weight_loading_mode=(
-     MoEWeightLoadingMode.W4A8_CUSTOM
-     if self._get_experts_quant_config(
-         model_config,
-         layer_idx).layer_quant_mode.is_int4_weight_only_per_group()
-     else MoEWeightLoadingMode.VANILLA),
+ weight_loading_mode=(
+     MoEWeightLoadingMode.W4A8_CUSTOM
+     if (lambda qc: qc.quant_algo == QuantAlgo.W4A8_AWQ
+         and qc.layer_quant_mode.is_int4_weight_only_per_group()
+         )(self._get_experts_quant_config(model_config, layer_idx))
+     else MoEWeightLoadingMode.VANILLA),
```
537-543: Document and type-annotate helper for maintainability.
Add a brief docstring and annotate the argument to clarify expected keys and fallback behavior.
```diff
 @staticmethod
-def _get_experts_quant_config(model_config, layer_idx: int) -> QuantConfig:
-    if getattr(model_config, "quant_config_dict", None) is None:
-        return model_config.quant_config
-    return model_config.quant_config_dict.get(
-        f"model.layers.{layer_idx}.mlp.experts", model_config.quant_config)
+def _get_experts_quant_config(model_config: ModelConfig, layer_idx: int) -> QuantConfig:
+    """Return experts' QuantConfig for a given layer, falling back to the global config."""
+    if getattr(model_config, "quant_config_dict", None) is None:
+        return model_config.quant_config
+    return model_config.quant_config_dict.get(
+        f"model.layers.{layer_idx}.mlp.experts", model_config.quant_config)
```
654-656: Nit: prefer “!=” for enum comparison.
QuantAlgo is an enum; != reads clearer than identity checks.
```diff
- assert (
-     quant_config.quant_algo
-     is not QuantAlgo.MIXED_PRECISION), "MIXED_PRECISION is ambiguous"
+ assert quant_config.quant_algo != QuantAlgo.MIXED_PRECISION, "MIXED_PRECISION is ambiguous"
```
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
1593-1596: Per-expert AWQ prequant scale indexing: check shape/stride.
Indexing assumes prequant_scales is laid out as [num_experts_per_node, hidden_size] contiguous. Verify producer matches this layout and dtype matches InputActivationsType to avoid silent precision loss.
If scales are produced as float regardless of activation dtype, consider casting once at load to avoid repeated implicit conversions inside the loop.
Also applies to: 1603-1605
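If the producer does emit float32 scales, a one-time cast at load time would avoid repeated conversions; a hedged sketch with assumed tensor names and sizes:
```python
import torch

# Hypothetical load-time preparation of per-expert pre-quant scales.
num_experts, hidden_size = 8, 1024
scales_fp32 = torch.rand(num_experts, hidden_size, dtype=torch.float32)
activation_dtype = torch.bfloat16  # assumed activation dtype for this layer
# Cast once and keep the [num_experts, hidden_size] layout contiguous so the
# kernel's vectorized loads see the expected dtype and stride.
prequant_scales = scales_fp32.to(activation_dtype).contiguous()
```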
3005-3009: API extension: add defaults carefully and const-correctness.
The new parameters have defaults, preserving ABI at call sites. Since expert_first_token_offset is read-only, prefer a const int64_t* to better express intent.
```diff
-T const* CutlassMoeFCRunner<...>::applyPrequantScale(
+T const* CutlassMoeFCRunner<...>::applyPrequantScale(
     void* smoothed_act, void const* permuted_data, void const* prequant_scales, int64_t const* num_valid_tokens_ptr,
     int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq, cudaStream_t stream,
-    int64_t* expert_first_token_offset, int const num_experts_per_node)
+    int64_t const* expert_first_token_offset, int const num_experts_per_node)
```
Note: this change ripples to declarations in the public headers.
: Plumbed per-expert offsets into GEMM2 prequant path.This wires up the new capability on the main path. For consistency, consider passing the offsets in the min-latency branch too (it’s currently FP4-only, so this is a non-functional nit).
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (5)
44-46: Const-qualify expert_first_token_offset through the kernel and launchers.
The pointer is read-only. Making it const avoids accidental writes and qualifier drops at call sites.
```diff
-__global__ void apply_per_channel_scale(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
-    int cols, int64_t const* num_valid_tokens_ptr, int64_t* expert_first_token_offset, int const num_experts_per_node)
+__global__ void apply_per_channel_scale(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
+    int cols, int64_t const* num_valid_tokens_ptr, int64_t const* expert_first_token_offset, int const num_experts_per_node)
```
Apply the same change to the launcher signatures/usages below (Lines 111, 149, 193).
68-70
: Per-channel scale indexing: avoid hidden assumptions and integer-division pitfalls.
- Make numVecsPerExpert explicit to document the assumption and help static analyzers.
- If cols is not a multiple of kElems, current vector load/store paths can read beyond the row boundary.
```diff
- *reinterpret_cast<AccessType*>(scale)
-     = reinterpret_cast<AccessType const*>(per_channel_scale)[expert * cols / kElems + col_offset];
+ const int numVecsPerExpert = cols / kElems; // assumes cols % kElems == 0
+ *reinterpret_cast<AccessType*>(scale)
+     = reinterpret_cast<AccessType const*>(per_channel_scale)[expert * numVecsPerExpert + col_offset];
```
If cols may not be divisible by kElems in any path, add a masked tail for the last vector or assert the divisibility precondition.
110-112: Prefer nullptr for stream default and const-qualify expert offsets.
Minor API hygiene.
```diff
-    int rows, int cols, int64_t const* num_valid_tokens_ptr = nullptr, cudaStream_t stream = 0,
-    int64_t* expert_first_token_offset = nullptr, int const num_experts_per_node = 0)
+    int rows, int cols, int64_t const* num_valid_tokens_ptr = nullptr, cudaStream_t stream = nullptr,
+    int64_t const* expert_first_token_offset = nullptr, int const num_experts_per_node = 0)
```
147-174: Public per-expert launcher: mirror the const and stream defaults; otherwise LGTM.
API shape and dispatch thresholds match the non-expert path.
```diff
-void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act,
-    T_in const* per_channel_scale, int rows, int cols, int64_t* expert_first_token_offset,
-    int const num_experts_per_node, int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
+void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act,
+    T_in const* per_channel_scale, int rows, int cols, int64_t const* expert_first_token_offset,
+    int const num_experts_per_node, int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
```
191-207: Template instantiations: keep in sync with const changes.
Update the declaration used in the macro to reflect the const pointer type; otherwise these are fine.
```diff
-    template void apply_per_channel_scale_per_expert_kernel_launcher<T_in, T_out>(T_out * smoothed_act,            \
-        const T_in* act, const T_in* per_channel_scale, int rows, int cols, int64_t* expert_first_token_offset,    \
+    template void apply_per_channel_scale_per_expert_kernel_launcher<T_in, T_out>(T_out * smoothed_act,            \
+        const T_in* act, const T_in* per_channel_scale, int rows, int cols, const int64_t* expert_first_token_offset, \
         int const num_experts_per_node, int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (5 hunks)
- cpp/tensorrt_llm/kernels/moe_utils.h (1 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (6 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/kernels/moe_utils.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
- tests/unittest/_torch/modules/test_fused_moe.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
- cpp/tensorrt_llm/kernels/moe_utils.h (1)
  - findTotalEltsLessThanTarget (27-45)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4)
  - apply_per_channel_scale_per_expert_kernel_launcher (148-173)
  - apply_per_channel_scale_per_expert_kernel_launcher (148-150)
  - apply_per_channel_scale_kernel_launcher (121-145)
  - apply_per_channel_scale_kernel_launcher (121-122)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
  - expert_first_token_offset_ (878-878)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
  - expert_first_token_offset_ (825-825)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (1)
- cpp/tensorrt_llm/kernels/moe_utils.h (1)
  - findTotalEltsLessThanTarget (27-45)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)
- tensorrt_llm/quantization/mode.py (1)
  - QuantAlgo (23-46)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  - MoEWeightLoadingMode (13-19)
- tensorrt_llm/models/modeling_utils.py (4)
  - layer_quant_mode (166-170)
  - layer_quant_mode (306-312)
  - QuantConfig (128-268)
  - quant_algo (547-548)
- tensorrt_llm/llmapi/llm_args.py (2)
  - quant_config (2316-2319)
  - quant_config (2322-2323)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
48-48
: Import of QuantAlgo is correct and scoped.
Needed for the new runtime guard; no issues.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
3018-3031
: Good: per-expert path selects expert-aware kernel.
Conditional dispatch to per-expert launcher when offsets are provided is clear and keeps the legacy path intact.
55-55
: Confirmed: findTotalEltsLessThanTarget
Is Device-Callable
I’ve verified that in
cpp/tensorrt_llm/kernels/moe_utils.h
the function is declared as:
__device__ inline int64_t findTotalEltsLessThanTarget(…)
This
__device__
annotation ensures it can be called from your CUDA kernels, so no link-time failures will occur when used in device code.
No further changes are needed here.
3005-3009
: No staleapplyPrequantScale
calls found—signature usage is consistentAll call sites in
moe_kernels.cu
either supply both new parameters or correctly rely on the header’s defaults:
- Line 3712: called with 8 args, using defaults for
expert_first_token_offset
andnum_experts_per_node
- Line 3800: called with 8 args, again using defaults
- Line 3819: called with all 10 args, explicitly passing
expert_first_token_offset_
andnum_experts_per_node
Since the header in
cutlass_kernels/include/moe_kernels.h
provides default values for the two new parameters, existing 8-argument calls remain valid and no updates are required.cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (3)
17-17
: Include of moe_utils.h looks correct.
Needed for findTotalEltsLessThanTarget; no issues.
64-66
: Device-qualification of findTotalEltsLessThanTarget.
Ensure findTotalEltsLessThanTarget in moe_utils.h is marked __device__ (or __host__ __device__). Otherwise calling it from a __global__ kernel will fail or cause host/device linkage issues.
Would you like me to scan the repo for its definition and open a follow-up if it’s not device-qualified?
61-67: Refine expert‐index calculation for safety and correctness
- In cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (around lines 61–67), the current code computes
```cpp
expert = findTotalEltsLessThanTarget(
    expert_first_token_offset, num_experts_per_node, tokenIdx + 1) - 1;
```
Since expert_first_token_offset[0] is always initialized to 0, underflow to −1 cannot occur at runtime. Instead of clamping here (which may hide upstream bugs), add a sanity check that the offset array is well-formed:
```cpp
int64_t arr_len = static_cast<int64_t>(num_experts_per_node) + 1; // include sentinel
int64_t raw = findTotalEltsLessThanTarget(
    expert_first_token_offset, arr_len, tokenIdx + 1) - 1;
TORCH_CHECK(expert_first_token_offset[0] == 0,
    "expert_first_token_offset[0] must be zero");
TORCH_CHECK(raw >= 0 && raw < num_experts_per_node,
    "computed expert index out of range");
expert = static_cast<int>(raw);
```
- By passing arr_len = num_experts_per_node + 1, you include the sentinel at the end of the offsets array, ensuring full coverage of all token ranges.
- The TORCH_CHECK macros guarantee that any corruption or misalignment in the offset array is caught early, rather than silently clamped.
- If you still prefer a fail-safe clamp for release builds, wrap it in #ifndef NDEBUG or a function-level flag, but avoid hiding data-corruption issues in debug/development modes.
PR_Github #16770 [ run ] triggered by Bot |
PR_Github #16770 [ run ] completed with state |
Looks good to me.
/bot run |
d0a4c05 to 6dc6f97 (Compare)
/bot run |
1 similar comment
/bot run |
6dc6f97 to 2c8af28 (Compare)
/bot run |
/bot kill |
/bot run |
1 similar comment
/bot run |
PR_Github #17659 [ run ] triggered by Bot |
PR_Github #17659 [ run ] completed with state |
2c8af28 to 616d5f2 (Compare)
PR_Github #18146 [ run ] triggered by Bot |
PR_Github #18146 [ run ] completed with state |
/bot run |
PR_Github #18280 [ run ] triggered by Bot |
/bot run |
2 similar comments
/bot run |
/bot run |
PR_Github #19520 [ run ] triggered by Bot |
cd4d8de to 2d5b9d1 (Compare)
/bot run |
PR_Github #19524 [ run ] triggered by Bot |
PR_Github #19520 [ run ] completed with state |
PR_Github #19524 [ run ] completed with state |
/bot run |
PR_Github #19676 [ run ] triggered by Bot |
PR_Github #19676 [ run ] completed with state |
Do you think any model-level e2e accuracy tests should be added?
Signed-off-by: Min Yu <[email protected]>
Signed-off-by: Min Yu <[email protected]>
Signed-off-by: Min Yu <[email protected]>
2d5b9d1 to 8ac2332 (Compare)
I will add this checkpoint for the e2e accuracy test according to @Barry-Delaney's suggestion: https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8/tree/main
/bot run |
PR_Github #19876 [ run ] triggered by Bot |
PR_Github #19876 [ run ] completed with state |
Description
This PR depends on #6200 to support ModelOpt W4A8 checkpoint.
Current W4A8 AWQ only accepts
per-layer
pre-quant scale factors (for both FC1 and FC2). However, ModelOpt generates pre-quant scale factors for each expert.
This PR makes W4A8 AWQ accept
per-expert
pre-quant scale factors.
Note:
The signature of
tensorrt_llm::kernels::apply_per_channel_scale_kernel_launcher
could not be modified, because it must be aligned with the signature in
. Otherwise, undefined reference is reported in link stage.As a result, I have implemented
tensorrt_llm::kernels::apply_per_channel_scale_per_expert_kernel_launcher
as a workaround.
Cherry-pick #7123 to this PR to run CI together.
Test Coverage
pytest tests/unittest/_torch/modules/test_fused_moe.py -k w4afp8 -s
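As a rough illustration of what the per-expert change means at the tensor level (independent of the actual module code; names and shapes here are assumptions):
```python
import torch

tokens, num_experts, hidden_size = 16, 4, 64
act = torch.randn(tokens, hidden_size)
expert_of_token = torch.randint(0, num_experts, (tokens,))

# Old behaviour: one pre-quant scale vector shared by every expert (per-layer).
per_layer_scale = torch.rand(hidden_size)                 # [K]
smoothed_old = act * per_layer_scale

# New behaviour: ModelOpt emits one scale vector per expert (per-expert).
per_expert_scale = torch.rand(num_experts, hidden_size)   # [E, K]
smoothed_new = act * per_expert_scale[expert_of_token]
```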
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.