[TRTLLM-7319][perf] Fuse slicing into MoE. #6728
📝 Walkthrough

Threaded padded vs. unpadded hidden/column sizes through the MoE codepaths.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Py as Python API
participant Op as Torch Op
participant Cpp as C++ MoE Runner
participant Prof as GemmProfilerBackend
participant CUDA as CUDA Kernel
Py->>Op: fused_moe(..., hidden_size(padded), unpadded_hidden_size)
Op->>Cpp: runMoe(..., hidden_size(padded), unpadded_hidden_size, ...)
alt profiling path
Cpp->>Prof: init/runProfiler(..., expert_hidden_size=padded, expert_unpadded_hidden_size=unpadded, ...)
Prof->>Cpp: profiler results
end
Cpp->>CUDA: finalizeMoeRoutingKernelLauncher(..., padded_cols, unpadded_cols, ...)
CUDA-->>Cpp: final_output (shape uses unpadded_hidden_size)
Cpp-->>Op: final_output
Op-->>Py: final_output
```

```mermaid
sequenceDiagram
autonumber
participant Runner as CutlassMoeFCRunner
participant Kernel as GEMM2 Kernel
Runner->>Kernel: gemm2(..., hidden_size=padded, orig_hidden_size=unpadded, ...)
Kernel-->>Runner: computed outputs (padding-aware reads/writes)
Runner-->>Caller: merged/blocked outputs (finalization uses unpadded cols/hidden)
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
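To make the padding relationship this PR threads through the stack concrete, here is a small self-contained sketch; all names and the 128-element alignment are assumptions for illustration, not the repository's actual constants:

```cpp
// Sketch of the padded vs. unpadded relationship (hypothetical names/values):
// hidden_size is the padded width used for strides, unpadded_hidden_size is the
// real width used for output sizing, so the manual output slice can be dropped.
#include <cassert>
#include <cstdint>

constexpr int64_t roundUp(int64_t x, int64_t multiple)
{
    return (x + multiple - 1) / multiple * multiple;
}

int main()
{
    int64_t const unpadded_hidden_size = 7160;                      // example real width
    int64_t const hidden_size = roundUp(unpadded_hidden_size, 128); // padded for alignment
    assert(unpadded_hidden_size <= hidden_size);
    // The MoE output is now allocated as [num_tokens, unpadded_hidden_size]; kernels
    // stride intermediate buffers by hidden_size and bound final writes by
    // unpadded_hidden_size, instead of returning a padded tensor to slice.
    return 0;
}
```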
Actionable comments posted: 1
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
Line 2: Update copyright year to include 2025

According to the coding guidelines, all TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. Please update the copyright notice to include 2025.

```diff
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
```
♻️ Duplicate comments (1)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (1)
Line 979: Same TODO comment issue as line 966.

This line has the same duplicated parameter and TODO comment as line 966. Ensure both code paths receive the same fix when implementing proper original hidden size handling.
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
Line 18: Use a preprocessor guard instead of `#pragma once`.

The coding guidelines specify using a preprocessor guard with the format `TRTLLM_<FILENAME>` instead of `#pragma once`. Replace with:

```diff
-#pragma once
+#ifndef TRTLLM_MOE_KERNELS_H
+#define TRTLLM_MOE_KERNELS_H
```

And add the closing directive at the end of the file:

```cpp
#endif // TRTLLM_MOE_KERNELS_H
```
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
Lines 1777-1848: Well-implemented padded/original column size separation

The implementation correctly distinguishes between padded and original hidden sizes throughout the kernel. The assertions ensure validity, and the boundary check prevents out-of-bounds writes.

Consider adding a comment explaining why padded columns are used for strides while original columns determine write boundaries, as this distinction is critical for memory safety.

```diff
 __global__ void finalizeMoeRoutingKernel(GemmOutputType const* expanded_permuted_rows,
     OutputType* reduced_unpermuted_output, ScaleBiasType const* bias, float const* scales,
     int const* unpermuted_row_to_permuted_row, int const* token_selected_experts,
     int64_t const padded_cols, int64_t const orig_cols, int64_t const experts_per_token,
     int const num_experts_per_node, int const start_expert_id)
 {
+    // padded_cols is used for memory strides to ensure alignment
+    // orig_cols is the actual data size to prevent out-of-bounds writes
     assert(padded_cols % 4 == 0);
     assert(orig_cols % 4 == 0);
     assert(orig_cols <= padded_cols);
```
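To make the stride-vs-boundary distinction concrete, here is a minimal standalone sketch of the access pattern; this is a hypothetical kernel, not the actual finalize kernel:

```cuda
// Hypothetical kernel illustrating the pattern above: rows are laid out with a
// stride of padded_cols so each row stays aligned, but only the first orig_cols
// elements carry real data and may be written to the compact output.
#include <cstdint>

__global__ void copyUnpaddedRows(
    float const* src, float* dst, int64_t padded_cols, int64_t orig_cols)
{
    int64_t const row = blockIdx.x;
    for (int64_t col = threadIdx.x; col < padded_cols; col += blockDim.x)
    {
        if (col >= orig_cols)
        {
            continue; // Padding columns are never written to the output.
        }
        // Read with the padded stride, write with the compact (orig) stride.
        dst[row * orig_cols + col] = src[row * padded_cols + col];
    }
}
```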
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (21 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
- cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/thop/moeOp.cpp
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/thop/moeOp.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (3)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Applied to files:
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🔇 Additional comments (36)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1)

Lines 238-239: LGTM! Proper implementation of the original hidden size parameter. The addition of `self.original_hidden_size` alongside the existing `x.shape[1]` parameter correctly implements the interface update to distinguish between padded and original hidden sizes. The comments clearly indicate the purpose of each parameter.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
Lines 474-475: LGTM! Consistent implementation across MoE backends. This change mirrors the implementation in the CuteDSL backend, correctly adding `self.original_hidden_size` as an additional parameter to distinguish from the padded hidden size. The consistency across different MoE backends is excellent for maintainability.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
Line 148: LGTM! Clean parameter addition for original hidden size support. The new `original_hidden_size` parameter is properly added to both the function signature and fake registration with appropriate defaults, maintaining backward compatibility. The parameter is correctly passed through to the underlying MoE runner. Also applies to: 229-229, 262-262.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
Lines 89-91: LGTM! Proper storage of original hidden size before padding. The constructor correctly stores the original hidden size before any potential padding adjustments, which enables the fused MoE operation to handle output sizing correctly without manual slicing.

Line 424: LGTM! Correct parameter passing to eliminate manual slicing. The stored `original_hidden_size` is properly passed to the fused MoE custom operator, enabling the kernel to handle output tensor sizing correctly and eliminating the need for manual slicing operations.

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
Line 466: LGTM! Correct fake implementation update for original hidden size. The fake registration properly adds the `orig_hidden_size` parameter and uses it for output tensor sizing instead of the padded `hidden_size`. This ensures consistency with the actual kernel implementation that now handles padded vs. original sizes explicitly. Also applies to: 475-477.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)

Lines 68-70: All calls updated to new signature. I've verified that every invocation of `finalizeMoeRoutingKernelLauncher` now supplies both `padded_cols` and `orig_cols`:

- In cpp/tensorrt_llm/thop/moeUtilOp.cpp (around line 241), the call passes `hidden_size` and `orig_hidden_size`.
- In cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (lines 2988-2995 and 3306-3314), each call includes `hidden_size` (as `padded_cols`) and `orig_hidden_size` (as `orig_cols`).

No calls remain using the old single-`cols` parameter. This change is fully propagated; no further updates are needed here.

cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (3)
Lines 1-16: LGTM! Copyright header is compliant. The NVIDIA copyright header includes the current year (2025) as required by coding guidelines.

Line 25: LGTM! Proper use of C++ standard header. Adding `<cstdint>` follows the coding guideline to use C++ standard headers instead of C headers like `<stdint.h>`. This is likely needed for the `int64_t` type used in the new parameter.

Lines 797-799: Approved: `orig_hidden_size` addition is consistent across MoE code. Verified that the new `orig_hidden_size` parameter appears in all relevant MoE interfaces and matches the ordering in:

- cpp/tensorrt_llm/thop/moeUtilOp.cpp
- cpp/tensorrt_llm/thop/moeOp.cpp
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h

No other `BlockScaleFC2` overloads or call sites were found, so this single signature update aligns with the broader architectural change and requires no further modifications.

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
Lines 1860-1955: Consistent implementation in NoFilling variant. The changes to `finalizeMoeRoutingNoFillingKernel` correctly mirror those in the regular kernel, maintaining consistency across both implementations.

Lines 1966-2014: Correct parameter propagation in launcher. The launcher function properly handles the new `padded_cols` and `orig_cols` parameters, passing them consistently to both kernel variants.

Lines 2021-2024: Template instantiation correctly updated. The macro instantiation properly includes the new column size parameters in the function signature.

Lines 2236-2239: Comprehensive and consistent function call updates. All function calls throughout the file have been correctly updated to pass both `hidden_size` (padded) and `orig_hidden_size` (original) parameters. This ensures memory operations use appropriate strides while respecting actual data boundaries. Also applies to: 2989-2993, 3215-3218, 3308-3311, 3695-3700, 3800-3805, 4768-4771.
cpp/tensorrt_llm/thop/moeUtilOp.cpp (7)
Line 237: LGTM: Parameter addition follows consistent pattern. The `orig_hidden_size` parameter is correctly positioned in the function signature, following the established pattern of placing it after `hidden_size` and before `experts_per_token`.

Lines 244-245: LGTM: Kernel launcher call correctly updated. The `finalizeMoeRoutingKernelLauncher` call is properly updated to include the `orig_hidden_size` parameter in the correct position.

Lines 252-254: LGTM: Function signature correctly updated. The function signature is properly updated to include the `orig_hidden_size_param` parameter, maintaining consistency with the Torch binding pattern.

Line 258: LGTM: Parameter extraction follows established pattern. The `orig_hidden_size` extraction using `guard_int` is consistent with how other SymInt parameters are handled in the function.

Line 282: LGTM: Output tensor shape correctly updated. The output tensor allocation now properly uses `orig_hidden_size` for the second dimension, which is the key improvement allowing the operator to produce correctly sized outputs without manual slicing.

Lines 296-297: LGTM: All function calls consistently updated. All calls to `runMoEFinalizeScaleOp` across different data types (float32, bfloat16, half) are correctly updated to include the `orig_hidden_size` parameter. Also applies to: 308-309, 319-320.

Lines 343-345: LGTM: Torch library binding correctly updated. The library binding signature properly includes the `orig_hidden_size` parameter in the correct position, ensuring the new parameter is accessible from Python code.

cpp/tensorrt_llm/thop/moeOp.cpp (8)
Lines 242-243: LGTM: Optional parameter correctly added. The `original_hidden_size` parameter is properly added as an optional parameter, maintaining backward compatibility while enabling the new functionality.

Line 316: LGTM: Original hidden size calculation is correct. The calculation properly handles the optional parameter, defaulting to `hidden_size` when not provided, ensuring backward compatibility.

Line 370: LGTM: Output tensor shape correctly updated. The output tensor shape now uses `orig_hidden_size` for the second dimension, which is the key improvement allowing proper output sizing without manual slicing.

Line 391: LGTM: Kernel runner calls consistently updated. Both conditional branches (OSS and non-OSS Cutlass) properly pass the `orig_hidden_size` parameter to the kernel runner, ensuring consistent behavior across implementations. Also applies to: 405-405.

Lines 423-424: LGTM: Optional parameter correctly added to min latency method. The `original_hidden_size` parameter is consistently added to the `runMoeMinLantency` method, maintaining the same pattern as the regular `runMoe` method.

Line 483: LGTM: Consistent original hidden size calculation. The calculation follows the same pattern as in `runMoe`, properly handling the optional parameter with a sensible default.

Line 520: LGTM: Output tensor shape correctly calculated for min latency mode. The output tensor shape correctly uses `orig_hidden_size` while maintaining the proper multiplier for min latency mode (`num_rows * num_experts_on_rank`), which is different from the regular mode but appropriate for this method's semantics.

Line 550: LGTM: Kernel runner calls consistently updated in min latency method. Both conditional branches properly pass the `orig_hidden_size` parameter to the kernel runner, maintaining consistency across OSS and non-OSS implementations. Also applies to: 564-564.
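A condensed sketch of the defaulting and output-sizing pattern these comments describe; the signature is simplified and hypothetical (the real runMoe threads many more arguments):

```cpp
// Hypothetical helper showing the pattern: the optional original_hidden_size falls
// back to the padded hidden_size, and the output tensor is allocated at the
// original width so no slice is needed afterwards.
#include <torch/torch.h>

torch::Tensor allocateMoeOutput(torch::Tensor const& input, int64_t hidden_size,
    c10::optional<int64_t> original_hidden_size)
{
    // Default: no padding, so the original width equals the padded width.
    int64_t const orig_hidden_size = original_hidden_size.value_or(hidden_size);
    int64_t const num_rows = input.size(0);
    // The kernel writes only orig_hidden_size columns per row into this tensor.
    return torch::empty({num_rows, orig_hidden_size}, input.options());
}
```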
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (7)
Line 24: LGTM! Proper use of C++ standard header. The addition of `<cstdint>` follows the coding guidelines by using the C++ standard header instead of the C header `<stdint.h>`.

Lines 455-459: Interface extension properly handles original vs. padded hidden size. The addition of the `orig_hidden_size` parameter to the virtual `runMoe` method is well-placed and follows the existing parameter conventions. This change aligns with the PR objective of distinguishing between padded and original hidden sizes.

Lines 484-489: Consistent parameter addition to gemm2 interface. The `orig_hidden_size` parameter addition to `gemm2` maintains consistency with the `runMoe` method signature change.

Lines 611-615: Correct implementation of interface override. The `runMoe` override properly implements the updated interface with the new `orig_hidden_size` parameter.

Lines 646-650: Consistent parameter propagation in gemm2 methods. Both the static `gemm2` method and its override wrapper correctly propagate the `orig_hidden_size` parameter, maintaining consistency throughout the call chain. Also applies to: 683-699.

Lines 836-838: BlockScaleFC2 signature properly updated. The static `BlockScaleFC2` method for FP8 block scaling correctly includes the `orig_hidden_size` parameter in the same position as other methods.

Line 959: Verify initialization of `mExpertOrigHiddenSize` in `GemmProfilerBackend::init()`. The new member `mExpertOrigHiddenSize` is named and zero-initialized correctly according to our C++ coding guidelines. However, I wasn't able to locate any assignment to it in the implementation of `GemmProfilerBackend::init()`. Please manually verify in your C++ sources that `GemmProfilerBackend::init()` includes something like `mExpertOrigHiddenSize = /* appropriate hidden size value */;`:

- Search for `GemmProfilerBackend::init()` across your .cpp/.cc files.
- Confirm `mExpertOrigHiddenSize` is initialized alongside `mExpertHiddenSize`.
LGTM, a couple of suggestions and comments.
It might also be valuable to consider adding an option to also fuse the padding with expandInputRowsKernel, but this is not really required for this MR.
Would it be possible to update the C++ tests to add coverage for this case?
Actionable comments posted: 4
🔭 Outside diff range comments (8)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

Lines 167-183: Add runtime validation for unpadded_hidden_size. Guard against invalid values early to surface clear errors and avoid mismatches downstream.

```diff
     else:
         assert tuner_num_tokens is None
         assert tuner_top_k is None
         tuner_input = input
         tuner_top_k = token_selected_experts.size(1)

+    # Validate unpadded_hidden_size if provided
+    if unpadded_hidden_size is not None:
+        if not isinstance(unpadded_hidden_size, int):
+            raise TypeError("unpadded_hidden_size must be an int when provided")
+        max_hidden = int(fc2_expert_weights.shape[1])
+        if not (0 < unpadded_hidden_size <= max_hidden):
+            raise ValueError(
+                f"unpadded_hidden_size ({unpadded_hidden_size}) must be in (0, {max_hidden}]"
+            )
+
     # allocate workspace for profiling
     moe_runner = MoERunner(
```
Lines 237-266: Fake schema is missing tuner_num_tokens and tuner_top_k; must match custom op exactly. The fake registration omits tuner_num_tokens and tuner_top_k, which exist in the real op. The Torch fake must match the op schema positionally and in defaults. This will break torch.compile and fake mode.

Apply this diff to align the schema and prepare for using unpadded_hidden_size in shape inference:

```diff
 @torch.library.register_fake("trtllm::fused_moe")
 def _(
     input: torch.Tensor,
     token_selected_experts: torch.Tensor,
     token_final_scales: torch.Tensor,
     fc1_expert_weights: torch.Tensor,
     fc1_expert_biases: Optional[torch.Tensor],
     fc2_expert_weights: torch.Tensor,
     fc2_expert_biases: Optional[torch.Tensor],
     output_dtype: torch.dtype,
     quant_scales: List[torch.Tensor],
     input_sf: Optional[torch.Tensor] = None,
     swizzled_input_sf: bool = True,
     swiglu_alpha: Optional[torch.Tensor] = None,
     swiglu_beta: Optional[torch.Tensor] = None,
     swiglu_limit: Optional[torch.Tensor] = None,
     tp_size: int = 1,
     tp_rank: int = 0,
     ep_size: int = 1,
     ep_rank: int = 0,
     cluster_size: int = 1,
     cluster_rank: int = 0,
     enable_alltoall: bool = False,
     use_deepseek_fp8_block_scale: bool = False,
     use_w4_group_scaling: bool = False,
     use_mxfp8_act_scaling: bool = False,
     min_latency_mode: bool = False,
     tune_max_num_tokens: int = 8192,
+    tuner_num_tokens: Optional[int] = None,
+    tuner_top_k: Optional[int] = None,
     unpadded_hidden_size: Optional[int] = None,
 ):
```
Lines 267-283: Use unpadded_hidden_size in fake tensor shape inference. When slicing is fused, the fake should reflect the unpadded output hidden size to keep compile/FX shapes consistent. Otherwise, the graph shape will be padded while the runtime returns sliced tensors.

Apply this diff to make shapes consistent:

```diff
 def _(
     @@
 ):
-    seq_len = input.shape[0]
-    hidden_size = fc2_expert_weights.shape[1]
+    seq_len = int(input.shape[0])
+    padded_hidden_size = int(fc2_expert_weights.shape[1])
+    if unpadded_hidden_size is not None:
+        if not (0 < int(unpadded_hidden_size) <= padded_hidden_size):
+            raise ValueError(
+                f"unpadded_hidden_size ({unpadded_hidden_size}) must be in (0, {padded_hidden_size}]"
+            )
+    out_hidden_size = int(unpadded_hidden_size) if unpadded_hidden_size is not None else padded_hidden_size
     if min_latency_mode:
         num_experts_on_rank = fc2_expert_weights.shape[0]
-        output_shape = [seq_len * num_experts_on_rank, hidden_size]
+        output_shape = [seq_len * num_experts_on_rank, out_hidden_size]
         experts_to_token_score_shape = [num_experts_on_rank, seq_len]
         active_expert_global_ids_shape = [num_experts_on_rank]
         return [
             input.new_empty(output_shape, dtype=output_dtype),
             input.new_empty([1], dtype=torch.int32),
             input.new_empty(experts_to_token_score_shape, dtype=torch.float32),
             input.new_empty(active_expert_global_ids_shape, dtype=torch.int32),
         ]
     else:
-        return [input.new_empty([seq_len, hidden_size], dtype=output_dtype)]
+        return [input.new_empty([seq_len, out_hidden_size], dtype=output_dtype)]
```
Lines 692-702: Bug: register_fake references undefined variable act_fp8. The fake for w4a8_mxfp4_fp8_gemm defines act_fp4 but returns act_fp8.new_empty(...), which will raise NameError when invoked.

Apply this diff to align the parameter name with the custom op and fix the reference:

```diff
 @w4a8_mxfp4_fp8_gemm.register_fake
 def _(
-    act_fp4: torch.Tensor,
+    act_fp8: torch.Tensor,
     weight: torch.Tensor,
     act_sf: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output_dtype: torch.dtype,
     to_userbuffers: bool = False,
 ) -> torch.Tensor:
     return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
                              dtype=output_dtype)
```
Line 1: Add NVIDIA SPDX copyright header to Python source. The file tensorrt_llm/_torch/custom_ops/torch_custom_ops.py is missing the required NVIDIA header. Insert the canonical two-line SPDX header above all code (i.e. before the first `from`):

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 from functools import lru_cache
```

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
Lines 1791-1810: Finalize kernel: alignment assertions are too weak; tie them to the vectorization width. padded/orig are asserted to be multiples of 4, but the kernel computes FINALIZE_ELEM_PER_THREAD as 128 / min(sizeof(OutputType), sizeof(GemmOutputType)). For half/bf16 it's 8, for float it's 4. If padded/orig are multiples of 4 but not 8 (half path), vectorized loads/stores will silently truncate columns due to integer division.

Strengthen the assertions to match FINALIZE_ELEM_PER_THREAD (and place them after it's computed). Apply:

```diff
-    assert(padded_cols % 4 == 0);
-    assert(orig_cols % 4 == 0);
-    assert(orig_cols <= padded_cols);
+    // After FINALIZE_ELEM_PER_THREAD is defined below, assert col counts align with the vector width
```

Then immediately after FINALIZE_ELEM_PER_THREAD is defined:

```diff
     constexpr int64_t FINALIZE_ELEM_PER_THREAD
         = 128 / std::min(sizeof_bits<OutputType>::value, sizeof_bits<GemmOutputType>::value);
+    assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols <= padded_cols);
```
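For concreteness, here is the width arithmetic behind this concern as a standalone sketch; it is my own worked example using the 128-bit vector width stated above, not the kernel's actual sizeof_bits machinery:

```cpp
// A 128-bit vector holds 8 half/bf16 elements but only 4 float elements, so a
// column count that is a multiple of 4 can still fail to divide evenly on the
// half path.
#include <cstdint>

constexpr int64_t elemsPerThread(int64_t elemBits)
{
    return 128 / elemBits; // elements covered by one 128-bit vectorized access
}

static_assert(elemsPerThread(16) == 8, "half/bf16: 8 elements per 128-bit vector");
static_assert(elemsPerThread(32) == 4, "float: 4 elements per 128-bit vector");
// E.g. orig_cols == 20 passes a %4 check, but 20 % 8 != 0: the half path would
// process only 20 / 8 == 2 full vectors, i.e. 16 of the 20 columns.
static_assert(20 % 4 == 0 && 20 % 8 != 0, "multiple of 4 but not of 8");
```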
Lines 1870-1880: No-Filling finalize kernel: align the assertions with vectorization width. Same concern as the filling kernel: replace the hard-coded `% 4 == 0` checks with `% FINALIZE_ELEM_PER_THREAD == 0`, placed after FINALIZE_ELEM_PER_THREAD is computed. Apply:

```diff
-    assert(padded_cols % 4 == 0);
-    assert(orig_cols % 4 == 0);
-    assert(orig_cols <= padded_cols);
+    // See below after FINALIZE_ELEM_PER_THREAD for alignment checks
```

And after FINALIZE_ELEM_PER_THREAD is defined:

```diff
+    assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(orig_cols <= padded_cols);
```
Lines 3216-3331: gemm2: add runtime validation for hidden sizes, and mirror the stronger col-alignment checks. Before launching finalize kernels, guard that orig_hidden_size <= hidden_size. Optionally, when not using TMA fused finalize, assert 4-element alignment for both (mirrors the launcher checks). Apply:

```diff
 void CutlassMoeFCRunner<...>::gemm2(...,
     int64_t const expanded_num_rows, int64_t const hidden_size, int64_t const orig_hidden_size,
     int64_t const inter_size, ...)
 {
+    TLLM_CHECK_WITH_INFO(orig_hidden_size <= hidden_size,
+        "orig_hidden_size (%ld) must be <= hidden_size (%ld)", (long) orig_hidden_size, (long) hidden_size);
+    // Common alignment sanity (exact vector width is type-dependent and enforced in kernel)
+    TLLM_CHECK_WITH_INFO((hidden_size % 4) == 0 && (orig_hidden_size % 4) == 0,
+        "Finalize expects 4-element alignment for hidden sizes");
```
🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
Lines 74-81: Return type inconsistency: List[int] vs range. get_valid_tactics is annotated to return List[int] but returns a range. Minor, but can confuse type checkers and callers.

```diff
-        return range(self.fused_moe_runner.get_tactic_num())
+        return list(range(self.fused_moe_runner.get_tactic_num()))
```
Lines 632-644: Default mismatch between op and fake (out_dtype). The fp8_batched_gemm_trtllmgen op defaults out_dtype to torch.half, but the fake defaults to None. Keep defaults identical to avoid tracing/fake inconsistencies.

```diff
-    out_dtype: Optional[torch.dtype] = None
+    out_dtype: Optional[torch.dtype] = torch.half
```
Lines 119-151: Reminder: router scales must be non-null when finalize fusion is enabled (CUTLASS epilogue contract). This op requires token_final_scales (router scales). Ensure callers never pass None here when the finalize fusion is active, since the downstream epilogue expects a valid pointer.

If you want, I can scan the Python call sites to ensure token_final_scales is always provided when using the fused finalize path.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
Lines 1-15: Nit: Copyright year. This header still says 2020-2023. Other files are already at 2020-2025. Consider updating for consistency.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
Lines 1976-2010: Launcher: host-side guards would improve failure modes. You already assert inside kernels; adding host-side checks makes errors fail fast (before launch) and with clearer messages. Apply:

```diff
 void finalizeMoeRoutingKernelLauncher(..., int64_t const num_rows, int64_t const padded_cols,
     int64_t const orig_cols, int64_t const experts_per_token, int64_t const num_experts_per_node, ...)
 {
+    TLLM_CHECK_WITH_INFO(orig_cols <= padded_cols, "orig_cols (%ld) must be <= padded_cols (%ld)",
+        (long) orig_cols, (long) padded_cols);
+    // The kernel vectorization width depends on OutputType/GemmOutputType; we can't compute it here.
+    // Still, cheaply assert common alignments.
+    TLLM_CHECK_WITH_INFO((padded_cols % 4) == 0 && (orig_cols % 4) == 0,
+        "Expect 4-element alignment for finalize paths");
```
Lines 2032-2034: Nit: unify parameter naming (actual_cols vs orig_cols). The explicit instantiation uses `actual_cols` while the function uses `orig_cols`. Consider renaming for consistency.

```diff
-    int64_t const actual_cols, int64_t const experts_per_token, int64_t const num_experts_per_node,
+    int64_t const orig_cols, int64_t const experts_per_token, int64_t const num_experts_per_node,
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (21 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
- cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
- cpp/tensorrt_llm/thop/moeUtilOp.cpp
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- cpp/tensorrt_llm/thop/moeOp.cpp
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧠 Learnings (5)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
📚 Learning: 2025-08-09T20:57:04.067Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.067Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T05:10:38.893Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.893Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🔇 Additional comments (9)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

Lines 231-233: Native runner ABI matches updated Python calls. Verified that both FusedMoeRunner::runMoe (lines 233-243) and FusedMoeRunner::runMoeMinLantency (lines 414-424) in cpp/tensorrt_llm/thop/moeOp.cpp include the trailing `unpadded_hidden_size` parameter. The Python calls in torch_custom_ops.py now align with the C++ ABI. Approved.

cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)
Lines 69-71: All `finalizeMoeRoutingKernelLauncher` calls now include both padded_cols and orig_cols in the correct order. I've checked every invocation in:

- cpp/tensorrt_llm/thop/moeUtilOp.cpp (runMoEFinalizeScaleOp)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (device-side and fused-epilogue paths)

Each passes `..., num_rows, hidden_size (padded_cols), orig_hidden_size (orig_cols), experts_per_token, num_experts_per_node, ...`. No stale or mis-ordered calls remain.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
Line 24: Good addition of include. Needed for the newly threaded int64_t params across interfaces.

Lines 455-460: runMoe interface: new orig_hidden_size placement is consistent. Placing orig_hidden_size immediately after hidden_size is clear and minimizes churn downstream.

Lines 836-839: BlockScaleFC2 signature sync looks good. Passing both hidden_size and orig_hidden_size downstream to finalization is necessary for padding-aware writes.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
Lines 1842-1849: Bias stride assumes bias is padded; verify allocation/layout. `bias_ptr = bias_v + expert_id * num_elems_in_padded_col;` strides by padded_cols, not orig_cols. This is correct only if fc2 biases are materialized per-expert at the padded width. If biases remain unpadded (orig_cols), this will read OOB. Confirm bias tensors are allocated/laid out with padded_cols. If not, stride using num_elems_in_orig_col and gate accesses accordingly.

Would you like me to scan plugin/runner allocation paths in the repo to confirm the fc2 bias buffers are padded to hidden_size?
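For illustration, the two indexing conventions at stake (hypothetical buffers and helpers, not the repository's code):

```cpp
// A per-expert bias laid out at the padded width must be addressed with a padded
// stride, while an unpadded layout needs the original stride; mixing the two
// conventions reads out of bounds.
#include <cstdint>

float const* biasForExpertPadded(float const* bias, int64_t expert_id, int64_t padded_cols)
{
    return bias + expert_id * padded_cols; // valid only if bias rows are padded
}

float const* biasForExpertUnpadded(float const* bias, int64_t expert_id, int64_t orig_cols)
{
    return bias + expert_id * orig_cols; // correct stride for an unpadded layout
}
```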
Lines 1927-1931: LGTM: bounds guard prevents writes beyond orig_cols. Looping over padded elements and early-continue for indices >= orig_cols is a safe pattern for the no-filling path.
Lines 2985-3003: BlockScaleFC2 → finalize: threading (hidden_size, orig_hidden_size) is correct. This correctly passes padded and original widths to the launcher. No issues spotted.
Lines 3999-4011: Fused finalize epilogue: router scales are correctly non-null; validate output shape parameters. Good: setFinalizeFusionParams receives permuted_token_final_scales_ (non-null when fusion enabled), which satisfies CUTLASS fusion's implicit contract.

Minor: please re-check the ordering of (hidden_size, num_rows). Given the transpose in the fused epilogue, rows map to hidden_size and columns to num_output_tokens; num_rows here should be num_output_tokens.

If helpful, I can cross-check the TMA setFinalizeFusionParams signature in the CUTLASS extension and confirm the expected argument order.
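For reference, an index-mapping sketch of the swap+transpose convention mentioned here and in the learning note at the bottom of this page; this is my own illustration, not CUTLASS code:

```cpp
// The fused epilogue treats the final output as (hidden_size x num_output_tokens),
// so its "rows" count hidden elements and its "columns" count tokens, even though
// the tensor handed back to the caller is logically [num_output_tokens, hidden_size].
#include <cstdint>

int64_t finalOutputOffset(int64_t token, int64_t hidden_idx, int64_t hidden_size)
{
    // Row-major [num_output_tokens, hidden_size] storage: the epilogue's transposed
    // (hidden, token) coordinate lands at this same linear address.
    return token * hidden_size + hidden_idx;
}
```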
Actionable comments posted: 2
♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)

Lines 959-963: Initialize mExpertOrigHiddenSize to prevent undefined behavior. The `mExpertOrigHiddenSize` member is declared but never initialized in the `init()` method, which will cause the profiler's gemm2 to receive 0 as `orig_hidden_size`, breaking finalization when no columns are written.

Apply this diff to initialize the member properly:

```diff
 void init(..., int64_t hidden_size, int64_t inter_size, int64_t group_size, ...)
 {
     ...
     mExpertHiddenSize = hidden_size;
+    mExpertOrigHiddenSize = hidden_size; // default: no padding unless caller overrides later
     ...
 }
```

Additionally, consider adding a public setter method if the original hidden size needs to be configured differently:

```diff
+void setOrigHiddenSize(int64_t orig_hidden_size)
+{
+    mExpertOrigHiddenSize = orig_hidden_size;
+}
```

Lines 484-490: Add runtime validation for orig_hidden_size. Consider adding a runtime check to ensure `orig_hidden_size <= hidden_size` to prevent potential buffer overrun issues when the original size exceeds the padded size.

Apply this diff to add validation in the implementation:

```diff
 virtual void gemm2(..., int64_t const hidden_size, int64_t const orig_hidden_size,
     int64_t const inter_size, int const num_experts_per_node, ...) = 0;

+// In the implementation (moe_kernels.cu), add:
+TLLM_CHECK_WITH_INFO(orig_hidden_size <= hidden_size,
+    "orig_hidden_size (%ld) must be <= hidden_size (%ld)", (long) orig_hidden_size, (long) hidden_size);
```

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)

Lines 4800-4801: Profiler: ensure mExpertOrigHiddenSize is initialized before use. You now pass mExpertOrigHiddenSize to gemm2. Confirm it's set in GemmProfilerBackend::init (and/or constructor) alongside mExpertHiddenSize to prevent UB.

Search for initialization:

```bash
#!/bin/bash
rg -n "mExpertOrigHiddenSize"
rg -n "GemmProfilerBackend::init"
```
🧹 Nitpick comments (5)
cpp/tensorrt_llm/thop/moeOp.cpp (2)

Lines 246-248: API naming consistency: prefer orig_hidden_size over unpadded_hidden_size. Everywhere else in the stack (runners/kernels/profiler) refers to this concept as orig_hidden_size. Consider renaming the new argument from unpadded_hidden_size to orig_hidden_size for consistency across the public surface.

```diff
-    bool const enable_alltoall, bool min_latency_mode, torch::optional<c10::ArrayRef<int64_t>> const& profile_ids,
-    torch::optional<int64_t> const& unpadded_hidden_size)
+    bool const enable_alltoall, bool min_latency_mode, torch::optional<c10::ArrayRef<int64_t>> const& profile_ids,
+    torch::optional<int64_t> const& orig_hidden_size)
```

And similarly for runMoeMinLantency. Also applies to: 427-429.

Line 2: Update copyright year. Coding guidelines require the current year. This file shows 2022-2024; please update to 2025.

```diff
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
```

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)

Lines 1883-1889: NoFilling finalize: avoid per-iteration bounds check for better throughput. Minor micro-optimization: iterate only up to num_elems_in_orig_col to drop the inner branch.

```diff
-    for (int elem_index = start_offset; elem_index < num_elems_in_padded_col; elem_index += stride)
-    {
-        if (elem_index >= num_elems_in_orig_col)
-            continue; // Skip writing beyond original columns
+    int64_t const loop_end = num_elems_in_orig_col;
+    for (int elem_index = start_offset; elem_index < loop_end; elem_index += stride)
+    {
         ...
     }
```

Also applies to: 1929-1931, 1941-1944, 1962-1971.

Lines 1886-1889: Mirror device asserts with host-side checks. Kernels assert padded_cols % 4 == 0 and orig_cols % 4 == 0. Add matching host checks where orig_hidden_size is computed (moeOp.cpp) to fail fast and aid debugging (see suggested patch in moeOp.cpp). Also applies to: 1806-1809.

Lines 50-55: Header year is up-to-date; keep consistent across files. This file already shows 2020-2025. Please align moeOp.cpp accordingly.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (22 hunks)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
- cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
- cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
cpp/tensorrt_llm/thop/moeUtilOp.cpp
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
🧠 Learnings (6)
📓 Common learnings
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
cpp/tensorrt_llm/thop/moeUtilOp.cpp
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (20)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
24-24: LGTM! The addition of the `<cstdint>` header is appropriate for using standard fixed-width integer types like `int64_t`.
455-455: LGTM! The addition of the `orig_hidden_size` parameter after `hidden_size` is consistent with the PR's objective of threading original dimensions through the MoE operator stack for fused slicing operations.
612-612: LGTM! The propagation of `orig_hidden_size` through the runMoe and gemm2 method signatures in CutlassMoeFCRunner is consistent with the interface changes and maintains proper parameter threading. Also applies to: 647-700
840-842: LGTM! The addition of the `orig_hidden_size` parameter to BlockScaleFC2 is properly integrated and aligns with the overall MoE operator changes.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1)
576-577: LGTM! The passing of both the padded `hidden_size` and the original `unpadded_hidden_size` to the finalize operation correctly enables proper output sizing within the kernel.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2)
466-466: LGTM! The addition of the `orig_hidden_size: torch.SymInt` parameter to the fake operator signature is properly positioned and typed.
475-477: LGTM! The output tensor allocation now correctly uses `orig_hidden_size` for the second dimension, ensuring the output matches the original unpadded dimensions rather than the potentially padded hidden size.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
90-91: Consider using a more descriptive name for clarity. The variable name `unpadded_hidden_size` clearly indicates its purpose. This is better than `original_hidden_size`, which could be ambiguous in the context of tensor parallelism, where "original" might refer to the pre-TP-split size.
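To make the distinction concrete, here is a small standalone sketch of the three widths that can coexist once TP splitting and GEMM padding are both in play. This is not code from this PR; the sizes and the alignment value are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: "unpadded" names the per-rank true width unambiguously,
// while "original" could also be read as the pre-TP-split width.
constexpr int64_t ceilDiv(int64_t a, int64_t b)
{
    return (a + b - 1) / b;
}

int main()
{
    int64_t const preSplitHiddenSize = 2880; // full width before the TP split
    int64_t const tpSize = 2;
    int64_t const alignment = 128;           // assumed GEMM alignment requirement

    // Per-rank true data width after the TP split.
    int64_t const unpaddedHiddenSize = preSplitHiddenSize / tpSize; // 1440
    // Per-rank compute width after padding up for the GEMM.
    int64_t const paddedHiddenSize
        = ceilDiv(unpaddedHiddenSize, alignment) * alignment;       // 1536

    assert(unpaddedHiddenSize <= paddedHiddenSize);
    return 0;
}
```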
425-425: LGTM! The addition of the `unpadded_hidden_size` parameter to the fused_moe operator correctly propagates the original dimensions to the kernel for proper output sizing.
406-406: Confirm swizzled_input_sf behavior in fused_moe_cutlass. I traced the defaults and usage across layers:
• In torch_custom_ops.py, `swizzled_input_sf: bool = True`
• In fused_moe_wide_ep.py, the default passed is `False`
• In fused_moe_cutlass.py, the default passed is `False`
• In the CUTLASS kernel signature, the default is `swizzled_input_sf = true`, and the code path under `if (swizzled_input_sf)` applies a "swizzled" offset layout, while the `else` path treats `input_sf` as a flat, contiguous array.
By flipping to `False` in fused_moe_cutlass.py, you'll hit the contiguous-layout branch. Please verify:
- That the `input_sf` tensor you pass in this path is laid out contiguously (per-expert scales in row-major order), not in the swizzled pattern.
- That downstream code (and any existing quantization tests) covers this non-swizzled branch to avoid silent scaling errors.
Consider either unifying the default across the Python wrappers or adding a comment/parameter doc to explain when to use each mode, and include a unit test for the non-swizzled case.
cpp/tensorrt_llm/thop/moeUtilOp.cpp (4)
237-237: LGTM! The addition of the `orig_hidden_size` parameter to runMoEFinalizeScaleOp and its propagation through the kernel launcher is correctly implemented. Also applies to: 244-245, 252-253, 258-258
282-282: LGTM! The output tensor allocation now correctly uses `orig_hidden_size` for the second dimension, ensuring proper output sizing for unpadded data.
296-296: LGTM! The propagation of `orig_hidden_size` through all three data type paths (float32, bfloat16, half) is consistent and correct. Also applies to: 308-308, 319-319
343-344: LGTM! The updated TORCH_LIBRARY_FRAGMENT registration correctly includes the new `SymInt orig_hidden_size` parameter in the function signature.
cpp/tensorrt_llm/thop/moeOp.cpp (2)
374-376: Output tensor shape switch to orig_hidden_size looks good. Switching the output shape to {num_rows, orig_hidden_size} and {num_rows * num_experts_on_rank, orig_hidden_size} aligns with the slicing objective. Also applies to: 524-526
395-413: All `runMoe` and `gemm2` invocations now include `orig_hidden_size`:
• In moeOp.cpp, `runMoe(...)` calls at lines 386, 400, 545, and 559 pass `orig_hidden_size`.
• In mixtureOfExpertsPlugin.cpp, `runMoe(...)` calls at lines 961 and 974 pass `mExpertHiddenSize` for both padded and unpadded sizes.
• In cutlass_kernels/include/moe_kernels.h, the `gemm2` override signature and its call to `Self::gemm2(…, hidden_size, orig_hidden_size, inter_size, …)` have been updated accordingly.
No further call sites are missing the new parameter.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
1799-1805: Finalize kernel: padded/original column handling is correct.
- Using padded_cols for strides and orig_cols for iteration bounds prevents out-of-bounds writes while keeping coalesced reads (see the sketch below).
- Bias pointer stride based on padded_cols is consistent.
Also applies to: 1818-1822, 1836-1836, 1855-1862
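As a cross-check of that contract, here is a minimal CPU model of the indexing: rows are read with the padded stride (matching the GEMM buffer layout) while only the unpadded prefix of each row is written. This is an illustrative sketch, not the kernel code; it omits scales, bias, and routing, and all names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

void finalizeRows(std::vector<float> const& gemmOut, std::vector<float>& finalOut,
    int64_t numRows, int64_t paddedCols, int64_t unpaddedCols)
{
    assert(unpaddedCols <= paddedCols);
    for (int64_t row = 0; row < numRows; ++row)
    {
        float const* src = gemmOut.data() + row * paddedCols; // padded stride for reads
        float* dst = finalOut.data() + row * unpaddedCols;    // compact stride for writes
        for (int64_t col = 0; col < unpaddedCols; ++col)      // iterate only real columns
        {
            dst[col] = src[col];
        }
    }
}

int main()
{
    int64_t const rows = 2, padded = 8, unpadded = 6;
    std::vector<float> gemmOut(rows * padded, 1.0f);
    std::vector<float> finalOut(rows * unpadded, 0.0f);
    finalizeRows(gemmOut, finalOut, rows, padded, unpadded);
    assert(finalOut.back() == 1.0f); // last real element was copied
    return 0;
}
```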
1989-1992: Launcher: dual-dimension plumb-through is correct. The launcher now supplies both padded_cols and orig_cols to the kernels in both A2A and non-A2A paths. Looks good.
Also applies to: 2021-2023, 2035-2037
2998-3003: BlockScaleFC2: finalize with padded vs. original columns wired correctly. Forwarding both hidden_size and orig_hidden_size to the finalize launcher ensures sliced outputs with padded GEMM strides.
Also applies to: 3011-3016
2050-2055: Template instantiations updated for new launcher signature. Macro instantiation reflects the new parameter order (padded vs. orig). Good.
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #15166 [ run ] triggered by Bot
Actionable comments posted: 2
🔭 Outside diff range comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)
65-71: Add sanity checks for padded vs. unpadded columns in finalizeMoeRoutingKernelLauncher.
We verified that the signature is consistent across the header, CU implementation, and all call sites (no legacy single-cols calls remain). To guard against misconfiguration when fusing slicing into the final kernel, insert explicit checks at the top of the implementation:
• File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu, in the body of `template<…> void finalizeMoeRoutingKernelLauncher(...)` (around line 1985)
• Add:
```cpp
// Sanity checks for new padded/unpadded API
TORCH_CHECK(padded_cols >= unpadded_cols,
    "finalizeMoeRoutingKernelLauncher: padded_cols (", padded_cols,
    ") must be >= unpadded_cols (", unpadded_cols, ")");
TORCH_CHECK(unpadded_cols > 0, "finalizeMoeRoutingKernelLauncher: unpadded_cols must be > 0");
// Optional: enforce vector-load alignment, e.g. for float4 loads:
// TORCH_CHECK(unpadded_cols % 4 == 0,
//     "unpadded_cols must be a multiple of 4 for vectorized loads");
```
This prevents silent errors when callers pass mismatched dimensions.
cpp/tensorrt_llm/thop/moeOp.cpp (2)
374-399: Guard: fused finalize requires non-null router scales (token_final_scales) to avoid kernel faults. When fused finalize is enabled in the CUTLASS path, router scales must be non-null. Enforce this precondition on the host path to avoid undefined behavior.
Apply this diff:
```diff
 // setRunnerProfiles(profile_ids);
 auto stream = at::cuda::getCurrentCUDAStream(input.get_device());
+if (mUseFusedFinalize) {
+    TORCH_CHECK(token_final_scales.has_value(),
+        "token_final_scales must be provided when fused finalize is enabled.");
+}
```
544-558: Min-latency path: enforce non-null router scales if fused finalize is active. Same rationale as the regular path; prevent device-side faults caused by missing scales when finalization is fused.
Apply this diff:
```diff
 kernels::MoeMinLatencyParams min_latency_params{};
 min_latency_params.num_active_experts_per_node
     = static_cast<int*>(num_active_experts_per_node.data_ptr());
 ...
+if (mUseFusedFinalize) {
+    TORCH_CHECK(token_final_scales.has_value(),
+        "token_final_scales must be provided when fused finalize is enabled.");
+}
```
♻️ Duplicate comments (5)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
319-323: Add bounds/alignment checks for unpadded_hidden_size before use (runMoe). Fail fast if the host passes an invalid unpadded size; this prevents device-side faults and misaligned vectorization in finalize.
Apply this diff:
```diff
 int64_t hidden_size = fc2_expert_weights.sizes()[1];
 int64_t unpadded_hidden_size_val = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+TORCH_CHECK(unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+    "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+    unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
+// Kernels vectorize on at least 128 bits per thread. Enforce minimal alignment used by finalize (multiple of 4).
+TORCH_CHECK(unpadded_hidden_size_val % 4 == 0,
+    "unpadded_hidden_size (", unpadded_hidden_size_val, ") must be divisible by 4 to satisfy vectorized finalize.");
```
485-489: Add bounds/alignment checks for unpadded_hidden_size before use (runMoeMinLantency). Mirror the validation in the min-latency path to keep behavior consistent.
Apply this diff:
```diff
 int64_t hidden_size = fc2_expert_weights.sizes()[1];
 int64_t unpadded_hidden_size_val = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+TORCH_CHECK(unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+    "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+    unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
```
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
908-944: Bug: mExpertOrigHiddenSize is never initialized in the profiler; GEMM2 profiling sees 0. Initialize mExpertOrigHiddenSize alongside mExpertHiddenSize to a sane default (hidden_size) to avoid zero-width finalization during profiling.
Apply this diff:
```diff
 @@ void init(CutlassMoeFCRunnerInterface& runner, GemmToProfile gemm_to_profile, nvinfer1::DataType dtype,
-    mExpertHiddenSize = hidden_size;
+    mExpertHiddenSize = hidden_size;
+    // Default to padded hidden size unless caller overrides with actual unpadded/original.
+    mExpertOrigHiddenSize = hidden_size;
     mExpertInterSize = inter_size; // Already divided by tp_size
```
Also applies to: 963-964
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
3277-3279: Potential fused finalize epilogue memory safety issue remains. Based on the past review comment, there's still a concern about the fused finalize epilogue potentially writing beyond `orig_hidden_size` when `setFinalizeFusionParams` receives `hidden_size`. The comment suggests this needs to be addressed by either:
- Extending the API to accept both padded and original column counts
- Using a temporary buffer approach (see the sketch below)
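For reference, a minimal sketch of the second option: let the fused epilogue write a temporary buffer with the padded width, then slice to the unpadded width with a strided device-to-device copy. This assumes only the standard CUDA runtime API; the surrounding MoE plumbing, names, and error handling are illustrative.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// cudaMemcpy2DAsync copies `width` bytes per row while stepping the source
// by the padded pitch and the destination by the compact pitch, which is
// exactly the "slice padded rows down to unpadded rows" operation.
template <typename T>
cudaError_t sliceToUnpadded(T* dst, T const* paddedTmp, int64_t numRows,
    int64_t paddedCols, int64_t unpaddedCols, cudaStream_t stream)
{
    return cudaMemcpy2DAsync(dst,
        /*dpitch=*/static_cast<size_t>(unpaddedCols) * sizeof(T), paddedTmp,
        /*spitch=*/static_cast<size_t>(paddedCols) * sizeof(T),
        /*width=*/static_cast<size_t>(unpaddedCols) * sizeof(T),
        /*height=*/static_cast<size_t>(numRows), cudaMemcpyDeviceToDevice, stream);
}
```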
4799-4802: Profiler: gemm2 now passes mExpertOrigHiddenSize; ensure it's initialized. This depends on mExpertOrigHiddenSize being set in GemmProfilerBackend::init. See my fix suggested in moe_kernels.h.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
647-701: Consider passing MOEParallelismConfig by const reference in virtuals/overrides. Large trivially-copyable structs can still incur copies; passing by const& reduces churn across this hot path. This is optional and can be deferred.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (9 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (22 hunks)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (15)
cpp/tensorrt_llm/thop/moeUtilOp.cpp (3)
241-246: Forwarding order to launcher looks correct; keep padded/unpadded semantics consistent. Passing (hidden_size, unpadded_hidden_size) aligns with the new launcher's (padded_cols, unpadded_cols) expectation. No changes needed here after adding the early validation above.
If you want additional certainty, run the repository-wide grep (previous comment) to confirm no stale call sites remain.
341-347
: All Python wrappers updated with newunpadded_hidden_size
parameterBoth
fused_moe_deepgemm.py
andfused_moe_cute_dsl.py
now call torch.ops.trtllm.moe_finalize_scale_op with 17 arguments (including the inserted unpadded_hidden_size in the 11th position), matching the C++ schema. No legacy 12-arg invocations remain.
282-282
: Confirm downstream consumers handle unpadded widthPlease verify that all downstream users of
moe_finalize_scale_op
—including the two Python wrappers—are updated to expect the new output shape{num_rows, unpadded_hidden_size}
instead of the padded width.Key callsites to review:
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (lines 564–573)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (lines 227–236)
Check any tests or TorchScript modules that consume
final_hidden_states
and adjust shape assertions or downstream tensor operations accordingly.cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2)
24-24
: Include is appropriate for the new int64_t API surfaceGood addition to make 64-bit usage explicit and portable.
455-460
: No mismatches found:unpadded_hidden_size
is consistently declared and passedAll public/virtual/static methods now include
unpadded_hidden_size
in their signatures, and every callsite passes it immediately afterhidden_size
. No parameter order or naming inconsistencies were detected across:
- C++ headers (
moe_kernels.h
variants)- Kernel implementations (
moe_kernels.cu
,moeUtilOp.cpp
,moeOp.cpp
)- Python/C++ custom-op bindings and tests
No changes required.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (10)
1803-1804
: Pass both padded and original column counts to enable proper memory bounds checking.The function signature now correctly accepts both
padded_cols
(for reading from padded GEMM output) andunpadded_cols
(for writing to the original output size). This addresses the slicing optimization mentioned in the PR title.
1820-1822
: Correct loop bounds calculation using original columns.The loop now correctly uses
num_elems_in_orig_col
derived fromunpadded_cols
for the write loop bounds, preventing out-of-bounds writes. This is the core fix for the slicing optimization.
1855-1855
: Ensure bias indexing matches padded column layout.The bias pointer calculation uses
num_elems_in_padded_col
which is correct since bias tensors are typically allocated with the same padding as the GEMM output tensors.Also applies to: 1861-1861
1883-1888
: Same padding-aware parameter additions to NoFilling variant.The NoFilling kernel variant correctly receives the same
padded_cols
andunpadded_cols
parameters, maintaining consistency across both finalization paths.
1940-1943
: Guard against out-of-bounds writes in padded dimension iteration.The conditional check
if (elem_index >= num_elems_in_orig_col) continue;
correctly prevents writes beyond the original column boundaries when iterating over the padded dimension. This is essential for memory safety.
1962-1962
: Verify bias indexing consistency in NoFilling kernel.Similar to the regular finalize kernel, bias access uses
num_elems_in_padded_col
which should be consistent with how bias tensors are allocated and accessed throughout the MOE pipeline.Also applies to: 1969-1969
1989-1991
: Update all kernel launcher call sites with new padding parameters.All call sites to
finalizeMoeRoutingKernelLauncher
now correctly pass bothpadded_cols
andunpadded_cols
parameters. The parameter order and types appear consistent.Also applies to: 2021-2022, 2035-2036
2998-3000
: DeepSeek block scale functions updated with original hidden size parameter.The
BlockScaleFC2
function signature now includesunpadded_hidden_size
parameter and correctly passes it through to the finalization kernels. This maintains consistency with the slicing optimization.Also applies to: 3014-3015
3554-3559
: runMoe function signature properly extended with unpadded_hidden_size parameter.The main MOE entry point now correctly accepts
unpadded_hidden_size
as a separate parameter afterhidden_size
, enabling the slicing optimization throughout the MOE pipeline.
3720-3724
: All gemm2 calls consistently pass unpadded_hidden_size parameter.Both the min_latency_mode and regular execution paths correctly thread the
unpadded_hidden_size
parameter through to the gemm2 function calls, ensuring consistent slicing behavior.Also applies to: 3827-3830
/bot kill
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #15171 [ run ] triggered by Bot
PR_Github #15166 [ run ] completed with state
PR_Github #15172 [ kill ] triggered by Bot
PR_Github #15171 [ run ] completed with state
PR_Github #15172 [ kill ] completed with state
/bot run --disable-fail-fast --add-multi-gpu-test
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Sergey Klevtsov <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
Signed-off-by: Bo Li <[email protected]>
b2a9b93 to 69164a4
/bot run --reuse-test
PR_Github #16147 [ run ] triggered by Bot
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
998-1009: Non-OSS runMoe arity mismatch (extra hidden_size and enable_alltoall arg). Internal Cutlass runMoe takes only one hidden_size and does not accept enable_alltoall; the current call will fail to compile on the non-OSS path.
Apply this diff:
```diff
-    mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens, mHiddenSize,
-    mHiddenSize, mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
-    mFinalOutput + mFinalOutputSize * mBufferIndex,
-    mSourceToExpandedMap + mSourceToExpandedMapSize * mBufferIndex, parallelism_config,
-    /*enable_alltoall=*/false, mUseLora, mLoraParams[mBufferIndex],
+    mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens,
+    /*hiddenSize=*/mHiddenSize, mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
+    mFinalOutput + mFinalOutputSize * mBufferIndex,
+    mSourceToExpandedMap + mSourceToExpandedMapSize * mBufferIndex, parallelism_config,
+    mUseLora, mLoraParams[mBufferIndex],
```
♻️ Duplicate comments (9)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
986-997: OSS runMoe: disambiguate padded vs. unpadded hidden sizes. The call passes mHiddenSize twice. Annotate which is padded vs. unpadded for readability and future padding support in this fixture.
```diff
-    mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens, mHiddenSize,
-    mHiddenSize, mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
+    mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens,
+    /*paddedHiddenSize=*/mHiddenSize, /*unpaddedHiddenSize=*/mHiddenSize,
+    mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
```
708-718: Non-OSS GemmProfilerBackend::init arity mismatch; OSS path should label padded vs. unpadded hidden sizes.
- Non-OSS: internal init does not take two hidden_size args nor enable_alltoall; the current call will not compile.
- OSS: passing mHiddenSize twice is ambiguous; annotate which is padded vs. unpadded.
Apply this diff:
```diff
 #ifdef USING_OSS_CUTLASS_MOE_GEMM
-    mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
-        typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK, mHiddenSize, mHiddenSize,
-        mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
-        /*need_weights=*/false, parallelism_config, /*enable_alltoall=*/false);
+    mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
+        typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK,
+        /*paddedHiddenSize=*/mHiddenSize, /*unpaddedHiddenSize=*/mHiddenSize,
+        mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
+        /*need_weights=*/false, parallelism_config, /*enable_alltoall=*/false);
 #else
-    mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
-        typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK, mHiddenSize, mHiddenSize,
-        mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
-        /*need_weights=*/false, parallelism_config);
+    mGemmProfilerBackend.init(mMoERunner, GemmProfilerBackend::GemmToProfile::Undefined, typeToDtypeID<DataType>(),
+        typeToDtypeID<WeightType>(), typeToDtypeID<OutputType>(), mNumExperts, mK,
+        /*hiddenSize=*/mHiddenSize,
+        mInterSize, mGroupSize, mActType, mUseBias, mUseLora, /*min_latency_mode=*/false,
+        /*need_weights=*/false, parallelism_config);
 #endif
```
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
12-12: Fix the ceil_div import to prevent an ImportError at runtime. ceil_div is defined under quantization/utils/fp8_utils.py, not in ...utils.
```diff
-from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor, ceil_div
+from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor
+from ...quantization.utils.fp8_utils import ceil_div
```
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
2065-2072: Gate the GptOss120b test to the OSS path to avoid false failures on non-OSS builds. This scenario relies on the OSS epilogue honoring the unpadded stride; skip when USING_OSS_CUTLASS_MOE_GEMM is not defined.
```diff
 TYPED_TEST(MixtureOfExpertsTest, ParallelismType##GptOss120b)                      \
 {                                                                                  \
+    /* Only meaningful with the OSS Cutlass MoE path (unpadded stride support) */  \
+    #ifndef USING_OSS_CUTLASS_MOE_GEMM                                             \
+    GTEST_SKIP() << "Skipping GptOss120b: requires USING_OSS_CUTLASS_MOE_GEMM";    \
+    return;                                                                        \
+    #endif                                                                         \
```
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
459-468: API: runMoe now threads unpadded_hidden_size; verify all call sites are updated. The signature change looks consistent (unpadded_hidden_size after hidden_size, plus enable_alltoall). Please re-verify that all C++/Py bindings (plugins, THOPs, unit tests, micro-benchmarks) call this overload on OSS builds.
Run to spot any stale call sites still using the old arity:
```bash
#!/bin/bash
set -euo pipefail
echo "Searching runMoe(...) call sites"
rg -nP --type=cpp --type=cu --type=cuh -C3 '\brunMoe\s*\('
```
cpp/tensorrt_llm/thop/moeUtilOp.cpp (1)
252-259: Validate unpadded_hidden_size early (bounds). Add fast-fail host-side checks to prevent device-side asserts when unpadded_hidden_size is invalid.
```diff
 int64_t num_rows = num_rows_param.guard_int(__FILE__, __LINE__);
 int64_t hidden_size = hidden_size_param.guard_int(__FILE__, __LINE__);
 int64_t unpadded_hidden_size = unpadded_hidden_size_param.guard_int(__FILE__, __LINE__);
+TORCH_CHECK(
+    unpadded_hidden_size > 0 && unpadded_hidden_size <= hidden_size,
+    "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+    unpadded_hidden_size, " vs hidden_size ", hidden_size, ".");
```
cpp/tensorrt_llm/thop/moeOp.cpp (2)
330-333: Validate unpadded_hidden_size_val in runMoe. Add host-side bounds checks to catch bad inputs before launching kernels.
```diff
-    int64_t unpadded_hidden_size_val
-        = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+    int64_t unpadded_hidden_size_val
+        = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+    TORCH_CHECK(
+        unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+        "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+        unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
+    // Optional (if you want to enforce minimal vectorization constraints used by finalize kernels):
+    // TORCH_CHECK(unpadded_hidden_size_val % 4 == 0, "unpadded_hidden_size (",
+    //     unpadded_hidden_size_val, ") must be divisible by 4.");
```
505-507: Validate unpadded_hidden_size_val in runMoeMinLantency. Mirror the same bounds checks here to keep both paths consistent.
```diff
-    int64_t unpadded_hidden_size_val
-        = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+    int64_t unpadded_hidden_size_val
+        = unpadded_hidden_size.has_value() ? unpadded_hidden_size.value() : hidden_size;
+    TORCH_CHECK(
+        unpadded_hidden_size_val > 0 && unpadded_hidden_size_val <= hidden_size,
+        "unpadded_hidden_size must be in (0, hidden_size]. Got ",
+        unpadded_hidden_size_val, " vs hidden_size ", hidden_size, ".");
```
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
1807-1809: Replace the hard-coded alignment "4" with FINALIZE_ELEM_PER_THREAD. The finalize kernels still assert 4-element alignment, which fails for 16-bit data types (half/bf16) that load 8 elements per 128-bit lane. You already compute FINALIZE_ELEM_PER_THREAD; use it in the asserts for correctness across types.
Apply:
```diff
-    assert(padded_cols % 4 == 0);
-    assert(unpadded_cols % 4 == 0);
+    assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+    assert(unpadded_cols % FINALIZE_ELEM_PER_THREAD == 0);
```
Do the same in finalizeMoeRoutingNoFillingKernel.
Also applies to: 1887-1889
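The dtype dependence is easy to see in isolation. The following standalone sketch mirrors the 128-bit-per-thread loads described above rather than quoting the kernel's actual definition; the helper name is made up.

```cpp
#include <cstdint>

// Each thread moves one 128-bit vector, so the per-thread element count
// depends on the element width: 4 for 32-bit float, 8 for half/bfloat16.
template <typename T>
constexpr int finalizeElemsPerThread()
{
    return 128 / (static_cast<int>(sizeof(T)) * 8);
}

static_assert(finalizeElemsPerThread<float>() == 4, "float packs 4 per 128-bit lane");
static_assert(finalizeElemsPerThread<uint16_t>() == 8, "16-bit types pack 8 per lane");

int main()
{
    // Column counts must be multiples of the per-thread element count,
    // which is why a hard-coded "% 4" assert is wrong for 16-bit outputs.
    return 0;
}
```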
🧹 Nitpick comments (11)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
345-346: Wrap the long comment to satisfy Ruff E501 (line > 120 chars). Minor style fix to keep CI linting green.
```diff
-    is_sf_swizzled = True  # In case of post-quant communication, scaling factors will not be swizzled before communication, and swizzling after communication is merged into MoE.
+    # In case of post-quant communication, scaling factors will not be swizzled before communication,
+    # and swizzling after communication is merged into MoE.
+    is_sf_swizzled = True
```
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4)
484-498: Minor type inconsistency: experts_per_token int vs. int64_t across interfaces. Here experts_per_token is int64_t, while most other public methods use int. Prefer one type across the surface (int is sufficient and avoids accidental sign/width drift).
Minimal change (only if you choose to standardize to int here; remember to mirror it in definitions/overrides):
```diff
-    int const num_experts_per_node, int64_t const experts_per_token, float const** alpha_scale_ptr_array,
+    int const num_experts_per_node, int const experts_per_token, float const** alpha_scale_ptr_array,
```
759-768: Setup path takes unpadded_hidden_size; consider brief param docs. Since both hidden_size and unpadded_hidden_size are carried, a short comment clarifying "hidden_size = padded cols used for GEMM; unpadded_hidden_size = true output cols" would prevent future misuse.
918-956: Profiler API extended: init stores mExpertUnpaddedHiddenSize. The added field and parameter are wired correctly. One suggestion: if the caller ever passes 0 (legacy paths), default to hidden_size inside init to keep the profiler robust.
Example:
```diff
-    mExpertUnpaddedHiddenSize = unpadded_hidden_size;
+    mExpertUnpaddedHiddenSize = (unpadded_hidden_size > 0) ? unpadded_hidden_size : hidden_size;
```
Also applies to: 966-975
1-15: Header copyright year. Coding guidelines request the current year; this file still shows 2020-2023. Consider updating the range to include 2025.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1)
65-72
: FinalizeMoeRoutingKernelLauncher: call‐site parameter order verified; suggest adding documentation
- Verified that the declaration in
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
and the definition in
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
both take(padded_cols, unpadded_cols)
immediately afternum_rows
.- Inspected all call sites—including
cpp/tensorrt_llm/thop/moeUtilOp.cpp
(around line 241) and
multiple invocations inmoe_kernels.cu
(around lines 3007, 3331, 3339) —and confirmed they each pass the full column count (e.g.hidden_size
) first, then the unpadded count, thenexperts_per_token
. No mismatches were found.- Nit: Please add a brief doc‐line above the function declaration in
moe_util_kernels.h
explaining the distinction between “padded_cols” (compute buffer width) vs. “unpadded_cols” (actual data width) to guard against accidental swaps.cpp/tensorrt_llm/thop/moeUtilOp.cpp (1)
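One possible shape for that doc-line, using the project's Doxygen convention (`//!`); this is an illustrative suggestion, not final wording:

```cpp
//! \param padded_cols   Row stride of the GEMM2 output buffer (compute width,
//!                      possibly padded up for alignment).
//! \param unpadded_cols Number of columns actually written to final_output
//!                      (the true data width); callers must ensure
//!                      unpadded_cols <= padded_cols.
```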
cpp/tensorrt_llm/thop/moeUtilOp.cpp (1)
237-246: Caller passes (hidden_size, unpadded_hidden_size) to the launcher; order looks correct. This matches the new "padded_cols, unpadded_cols" contract. Consider adding one invariant check to assert gemm2_output's second dim equals hidden_size to fail fast on shape drift.
```diff
 @@ void runMoEFinalizeScaleOp(...)
+    // Sanity: GEMM2 output should be [expanded_rows, hidden_size] (padded).
+    // Note: dtype-agnostic sizes, just check the logical shape.
+    // Caller ensures gemm2_output is a 2D tensor with shape [expanded_rows, hidden_size].
+    TORCH_CHECK(gemm2_output && hidden_size > 0, "Invalid GEMM2 output or hidden_size.");
     cutlass_kernels::finalizeMoeRoutingKernelLauncher<OutputType, UnfusedGemmOutputType>(
         static_cast<UnfusedGemmOutputType const*>(gemm2_output), final_output, biases,
         unpermuted_final_scales, unpermuted_row_to_permuted_row, permuted_row_to_unpermuted_row,
         token_selected_experts, expert_first_token_offset, num_rows, hidden_size, unpadded_hidden_size,
         experts_per_token, num_experts_per_node, parallelism_config, enable_alltoall, stream);
```
269-299: Duplicate bias validation block; remove the repetition. This block duplicates the checks immediately above (lines 269-283). Drop one copy to reduce the maintenance burden.
```diff
-    if (fc1_expert_biases.has_value() || fc2_expert_biases.has_value())
-    {
-        CHECK_INPUT(fc1_expert_biases.value(), mOutputDtype);
-        CHECK_INPUT(fc2_expert_biases.value(), mOutputDtype);
-        TORCH_CHECK(fc1_expert_biases.value().dim() == 2, "fc1_expert_biases must be 2D.");
-        TORCH_CHECK(fc2_expert_biases.value().dim() == 2, "fc2_expert_biases must be 2D.");
-        TORCH_CHECK(fc1_expert_weights.sizes()[0] == fc1_expert_biases.value().sizes()[0],
-            "fc1_expert_weights and fc1_expert_biases must have the same number of experts.");
-        TORCH_CHECK(fc2_expert_weights.sizes()[0] == fc2_expert_biases.value().sizes()[0],
-            "fc2_expert_weights and fc2_expert_biases must have the same number of experts.");
-        TORCH_CHECK(fc1_expert_biases.value().sizes()[1] == fc1_expert_weights.sizes()[1],
-            "fc1_expert_biases should match fc1_expert_weights output shape.");
-        TORCH_CHECK(fc2_expert_biases.value().sizes()[1] == fc2_expert_weights.sizes()[1],
-            "fc2_expert_biases should match fc2_expert_weights output shape.");
-    }
```
646-661: Optional: add a one-liner assert before mProfiler->run to catch a missing prepare. If do_preparation was never true, mProfileWorkspace may be null. Consider asserting it's set when calling runProfiler to improve error messages.
```diff
 @@
     // Profile specific tactic. Assuming at least one preparation phase has been executed already.
+    TORCH_CHECK(mProfileWorkspace != nullptr, "runProfiler called before prepare().");
     mProfiler->runProfiler(num_rows, profile, mProfileWorkspace, expert_weights_ptr, stream);
```
Also applies to: 672-681
1930-1932
: Avoid per-iteration bounds branch in NoFilling finalize.Looping to num_elems_in_padded_col and continuing when beyond original width adds divergence. Iterate directly to num_elems_in_orig_col as in the other finalize kernel.
@@ - int64_t const num_elems_in_padded_col = padded_cols / FINALIZE_ELEM_PER_THREAD; - int64_t const num_elems_in_orig_col = unpadded_cols / FINALIZE_ELEM_PER_THREAD; + int64_t const num_elems_in_padded_col = padded_cols / FINALIZE_ELEM_PER_THREAD; + int64_t const num_elems_in_orig_col = unpadded_cols / FINALIZE_ELEM_PER_THREAD; @@ - for (int elem_index = start_offset; elem_index < num_elems_in_padded_col; elem_index += stride) + for (int elem_index = start_offset; elem_index < num_elems_in_orig_col; elem_index += stride) { - if (elem_index >= num_elems_in_orig_col) - continue; // Skip writing beyond original columnsAlso applies to: 1941-1945
2046-2048
: Unify parameter naming: “actual_cols” → “unpadded_cols”.Elsewhere you use “unpadded_cols”. The explicit instantiation macro still says “actual_cols”, which hurts grep-ability and violates our consistency guideline.
- int64_t const actual_cols, int64_t const experts_per_token, int64_t const num_experts_per_node, \ + int64_t const unpadded_cols, int64_t const experts_per_token, int64_t const num_experts_per_node, \
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (3 hunks)
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp (6 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (12 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (27 hunks)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (2 hunks)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp (2 hunks)
cpp/tensorrt_llm/thop/moeOp.cpp (10 hunks)
cpp/tensorrt_llm/thop/moeUtilOp.cpp (6 hunks)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (8 hunks)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (2 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (7 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/launchers/moe_gemm_tma_ws_launcher.inl
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp
- tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp
🧠 Learnings (13)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/thop/moeUtilOp.cpp
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/thop/moeUtilOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h
cpp/tensorrt_llm/thop/moeOp.cpp
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-21T21:48:35.105Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.105Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h
📚 Learning: 2025-08-21T02:39:11.984Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:11.984Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_util_kernels.h
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧬 Code graph analysis (8)
cpp/tensorrt_llm/thop/moeUtilOp.cpp (2)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (5): parallelism_config (983-1090), parallelism_config (983-983), parallelism_config (1168-1256), parallelism_config (1168-1168), stream (831-843)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2): finalizeMoeRoutingKernelLauncher (1986-2039), finalizeMoeRoutingKernelLauncher (1986-1992)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (7): mTotalTokens (540-540), mHiddenSize (378-378), mInterSize (539-539), mNumExperts (379-379), mK (381-381), mWorkspace (515-515), mFinalOutput (537-537)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (3)
- tensorrt_llm/_torch/utils.py (1): Fp4QuantizedTensor (97-104)
- tensorrt_llm/quantization/utils/fp8_utils.py (1): ceil_div (10-21)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): hidden_size (216-216)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (13): parallelism_config (983-1090), parallelism_config (983-983), parallelism_config (1168-1256), parallelism_config (1168-1168), stream (831-843), k (1513-1514), k (1518-1529), k (1518-1519), k (1533-1541), k (1533-1534), k (1543-1555), k (1543-1544), k (1557-1558)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (4)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (6): typeToDtypeID (2254-2289), typeToDtypeID (2254-2254), parallelism_config (983-1090), parallelism_config (983-983), parallelism_config (1168-1256), parallelism_config (1168-1168)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (4): mNumExperts (970-970), mK (972-972), mGroupSize (976-976), mUseLora (994-994)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (4): mNumExperts (922-922), mK (924-924), mGroupSize (927-927), mUseLora (944-944)
- cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (4): mNumExperts (174-174), mGroupSize (179-179), mUseBias (186-186), mUseLora (202-202)
cpp/tensorrt_llm/thop/moeOp.cpp (2)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1): enable_alltoall (193-200)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): hidden_size (216-216)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
- cpp/tensorrt_llm/thop/moeOp.cpp (2): num_rows (760-780), num_rows (760-762)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (13): parallelism_config (983-1090), parallelism_config (983-983), parallelism_config (1168-1256), parallelism_config (1168-1168), stream (831-843), k (1513-1514), k (1518-1529), k (1518-1519), k (1533-1541), k (1533-1534), k (1543-1555), k (1543-1544), k (1557-1558)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (2): mExpertUnpaddedHiddenSize (974-974), mK (972-972)
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (13): k (1513-1514), k (1518-1529), k (1518-1519), k (1533-1541), k (1533-1534), k (1543-1555), k (1543-1544), k (1557-1558), parallelism_config (983-1090), parallelism_config (983-983), parallelism_config (1168-1256), parallelism_config (1168-1168), stream (831-843)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
345-345: Line too long (182 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (23)
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)
985-1009
: The script above will print the full `runMoe` signatures in both the OSS and internal headers so we can confirm the arity differences before updating the call sites. Once we see the exact parameter lists, we can adjust the benchmark and test calls under each macro to match the corresponding header.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
91-93
: Good: capture unpadded hidden size before padding
Storing self.unpadded_hidden_size ahead of W4A16 padding is correct and avoids downstream slicing in Python.
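For readers tracing this, a minimal sketch of the capture-before-padding pattern; the `ALIGNMENT` value, the class name, and the local `ceil_div` helper are illustrative assumptions, not the module's exact code:

```python
ALIGNMENT = 128  # hypothetical W4A16 weight alignment


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return (a + b - 1) // b


class MoEModuleSketch:
    def __init__(self, hidden_size: int):
        # Capture the logical (unpadded) width first so the fused op can
        # slice its output back to it.
        self.unpadded_hidden_size = hidden_size
        # Backend padding happens afterwards; it never leaks into outputs.
        self.hidden_size = ceil_div(hidden_size, ALIGNMENT) * ALIGNMENT


m = MoEModuleSketch(2880)
assert (m.unpadded_hidden_size, m.hidden_size) == (2880, 2944)
```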
410-442
: Confirm C++ registration of fused_moe includes the new keyword args
I wasn't able to locate any `m.def("fused_moe(...)")` or corresponding schema in the C++ extension (e.g. in `moeOp.cpp` under `TORCH_LIBRARY(trtllm, m)`). Without those, passing `swizzled_input_sf` and `unpadded_hidden_size` will trigger a runtime TypeError.
Please:
- Verify that `fused_moe` is registered in C++ (look for `m.def("fused_moe(")` or a `TORCH_LIBRARY` block in `cpp/tensorrt_llm/thop/moeOp.cpp`).
- Ensure its schema signature explicitly lists `bool swizzled_input_sf` and `int64_t unpadded_hidden_size`.
- Confirm the `FusedMoeRunner::operator()` method matches those parameters.
If the schema is missing these kwargs, let me know; I can draft the C++ registration update to include them.
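If the registration does turn out to be missing, here is a hedged Python-side sketch of the contract being asked for; the abbreviated argument list and the `trtllm_sketch` namespace are assumptions for illustration (only `swizzled_input_sf` and `unpadded_hidden_size` come from this review):

```python
import torch

# Standalone schema mirroring what the C++ m.def would declare; the real
# fused_moe takes many more arguments, omitted here.
torch.library.define(
    "trtllm_sketch::fused_moe",
    "(Tensor input, Tensor fc2_expert_weights, "
    "bool swizzled_input_sf, int unpadded_hidden_size) -> Tensor[]",
)


@torch.library.impl("trtllm_sketch::fused_moe", "CompositeExplicitAutograd")
def _fused_moe_sketch(input, fc2_expert_weights, swizzled_input_sf,
                      unpadded_hidden_size):
    # Placeholder body; the real op dispatches into the CUTLASS runner and
    # sizes its output to the unpadded width when one is given.
    out_cols = unpadded_hidden_size or fc2_expert_weights.shape[1]
    return [input.new_zeros(input.shape[0], out_cols)]
```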
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (4)
195-196
: Add unpadded hidden size test knob
Introducing mUnpaddedHiddenSize (and resetting it in TearDown) is a clean way to validate unpadded output semantics.
1243-1246
: OSS runMoe: correctly forwards unpaddedHiddenSize when set
The ternary forwarding ensures backward compatibility when the unpadded size is unset.
1479-1480
: Validation uses unpadded size when provided
Good change; aligns the test checks with the kernel's stride semantics.
1504-1508
: Assertion indexes into the final tensor using the unpadded width
Correct and necessary when the hidden dimension is padded in the kernel but the logical width is smaller.
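As a toy model of what that assertion does (the sizes here are made up; the real test compares `mFinalOutput` against a reference computed at the padded width):

```python
import numpy as np

HIDDEN, UNPADDED, TOKENS = 8, 6, 3  # made-up sizes

# Kernel-side math happens at the padded width...
padded = np.arange(TOKENS * HIDDEN, dtype=np.float32).reshape(TOKENS, HIDDEN)
# ...but the tensor handed back to the caller is logically TOKENS x UNPADDED.
final = padded[:, :UNPADDED].copy()

# The assertion walks rows with the *unpadded* stride and only ever touches
# the first UNPADDED columns of each row.
for tok in range(TOKENS):
    for col in range(UNPADDED):
        assert final[tok, col] == padded[tok, col]
```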
cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (3)
25-25
: Include is appropriate
Required for fixed-width integer usage in the new signatures; good addition.
573-581
: AI summary inconsistent with code: runMoe did not add unpadded_hidden_size in the internal header
The summary claims runMoe includes unpadded_hidden_size and expanded_num_rows here, but this header's runMoe still takes (num_rows, hidden_size, inter_size, ...). Please correct either the code or the summary.
801-804
: All BlockScaleFC2 call sites have been updated
I've verified that the only invocation of BlockScaleFC2 in moe_kernels.cu now includes the new unpadded_hidden_size (and the additional enable_alltoall flag), matching the updated signature. No remaining mismatches were found.
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
617-626
: Override matches interface; good propagation of unpadded_hidden_size
The override order and constness align with the interface, reducing the chance of ODR/signature mismatches. LGTM.
646-661
: Static gemm2: consistent placement of unpadded_hidden_size
The static helper mirrors the interface ordering; this minimizes caller confusion and avoids mis-threading the K/N dims. LGTM.
849-858
: Confirmed unpadded_hidden_size usage in BlockScaleFC2
Verified that BlockScaleFC2 in moe_kernels.cu uses hidden_size solely for loads/compute and unpadded_hidden_size for final-write bounds. No changes required.
- In the kernel body (moe_kernels.cu:2993-3002), shape_n is set from hidden_size for the GEMM compute.
- finalizeMoeRoutingKernelLauncher is called with both parameters, using hidden_size for compute and unpadded_hidden_size for the write bounds (moe_kernels.cu:3013-3018).
- Any buffer zeroing (cudaMemsetAsync) and the TMA epilogue fusion both use unpadded_hidden_size for final_output sizing.
cpp/tensorrt_llm/thop/moeUtilOp.cpp (2)
282-283
: Output tensor shape switched to unpadded; good
Allocating {num_rows, unpadded_hidden_size} aligns with the sliced writeback semantics. LGTM.
332-347
: All Python wrappers updated with the new `unpadded_hidden_size` argument
A repository-wide search confirms that both fused-MoE modules pass the newly added `unpadded_hidden_size` into `torch.ops.trtllm.moe_finalize_scale_op`:
- In fused_moe_cute_dsl.py (line 229), `self.unpadded_hidden_size` is provided as the 11th argument.
- In fused_moe_deepgemm.py (line 566), `self.unpadded_hidden_size` is likewise passed.
No other Python callers invoke this op directly, and all existing tests exercise the high-level fused modules rather than calling the op schema. No further changes are required.
cpp/tensorrt_llm/thop/moeOp.cpp (3)
392-397
: Output shape switched to {num_rows, unpadded_hidden_size_val}; good
Matches the sliced finalization. Ensure downstream consumers read only the unpadded columns.
543-545
: Min-latency output shape uses unpadded columns; good
The sliced width matches the new semantics. LGTM.
667-674
: Profiler init: unpadded parameter passed; keep legacy defaulting in callers
The preparation path already maps 0 → hidden_size when invoking init. That keeps compatibility. LGTM.
Also applies to: 671-681
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (5)
4038-4040
: Finalize fusion uses num_rows (tokens) as the row dimension; correct.
Passing num_tokens for the row dimension in setFinalizeFusionParams aligns with the swapped/transposed layout of the epilogue; good catch and consistent with prior guidance.
Also applies to: 4656-4656
3013-3018
: End-to-end threading of unpadded_hidden_size into the finalize launchers looks correct.
Calls to finalizeMoeRoutingKernelLauncher now pass both the padded (hidden_size) and unpadded hidden sizes; this aligns with the slicing objective and prevents out-of-bounds writes. LGTM.
Also applies to: 3331-3336, 3340-3343
3279-3281
: Zeroing final_output for fused finalize is sized to the unpadded width; good.
cudaMemsetAsync uses num_rows * unpadded_hidden_size, matching the epilogue's write extents. This avoids touching padded columns and reduces unnecessary bandwidth.
2021-2024
: NoFilling path plumbs expert_first_token_offset and uses (padded, unpadded) correctly.
The launcher passes expert_first_token_offset and both column counts into finalizeMoeRoutingNoFillingKernel; indexing and guards then operate on the unpadded width while using padded strides. Solid.
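To make the padded-read/unpadded-write contract concrete, here is a minimal NumPy reference of the finalize-routing reduction; the names and the flat row-to-token map are simplifying assumptions, not the kernel's actual permutation or scale layout:

```python
import numpy as np


def finalize_routing_ref(expanded, scales, row_to_token, num_tokens,
                         padded_cols, unpadded_cols):
    """Rows of `expanded` are read with the padded stride, but only the
    first `unpadded_cols` columns are accumulated into the output."""
    assert expanded.shape[1] == padded_cols and unpadded_cols <= padded_cols
    # Output is allocated (and zeroed) at the unpadded width, matching the
    # cudaMemsetAsync extents discussed above.
    out = np.zeros((num_tokens, unpadded_cols), dtype=expanded.dtype)
    for r in range(expanded.shape[0]):
        out[row_to_token[r]] += scales[r] * expanded[r, :unpadded_cols]
    return out


rng = np.random.default_rng(0)
expanded = rng.standard_normal((4, 8)).astype(np.float32)  # padded width 8
out = finalize_routing_ref(expanded, np.ones(4, np.float32),
                           np.array([0, 0, 1, 1]), num_tokens=2,
                           padded_cols=8, unpadded_cols=6)
assert out.shape == (2, 6)
```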
4821-4822
: Initialization of mExpertUnpaddedHiddenSize confirmed in `init`
- In cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h, within `GemmProfilerBackend::init(...)` (lines ~914-950), there is an explicit assignment `mExpertUnpaddedHiddenSize = unpadded_hidden_size;`, ensuring the field is initialized before any call to `runProfiler`.
No further changes required.
PR_Github #16147 [ run ] completed with state
/bot run --reuse-test
PR_Github #16258 [ run ] triggered by Bot
PR_Github #16258 [ run ] completed with state
Signed-off-by: Bo Li <[email protected]>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
276-301
: Fake fused_moe should return the unpadded width when requested
The fake currently shapes outputs with the padded hidden size only. With the upstream change removing post-op slicing, graph building will expect the unpadded width when unpadded_hidden_size is provided. Update the fake to reflect that to keep compile-time shapes consistent.
Apply this diff:
```diff
 @torch.library.register_fake("trtllm::fused_moe")
 def _(
@@
-    tuner_num_tokens: Optional[int] = None,
-    tuner_top_k: Optional[int] = None,
-    unpadded_hidden_size: Optional[int] = None,
+    tuner_num_tokens: Optional[int] = None,
+    tuner_top_k: Optional[int] = None,
+    unpadded_hidden_size: Optional[int] = None,
 ):
     seq_len = input.shape[0]
     if use_int8_woq_per_channel:
@@
-        hidden_size = fc2_expert_weights.shape[2]
+        hidden_size = fc2_expert_weights.shape[2]
     else:
         hidden_size = fc2_expert_weights.shape[1]
+    # Determine output hidden dimension: prefer unpadded when provided (>0).
+    out_hidden_size = (hidden_size if (unpadded_hidden_size is None
+                                       or int(unpadded_hidden_size) <= 0)
+                       else int(unpadded_hidden_size))
+
     if min_latency_mode:
         num_experts_on_rank = fc2_expert_weights.shape[0]
-        output_shape = [seq_len * num_experts_on_rank, hidden_size]
+        output_shape = [seq_len * num_experts_on_rank, out_hidden_size]
         experts_to_token_score_shape = [num_experts_on_rank, seq_len]
         active_expert_global_ids_shape = [num_experts_on_rank]
         return [
             input.new_empty(output_shape, dtype=output_dtype),
             input.new_empty([1], dtype=torch.int32),
             input.new_empty(experts_to_token_score_shape, dtype=torch.float32),
             input.new_empty(active_expert_global_ids_shape, dtype=torch.int32),
         ]
     else:
-        return [input.new_empty([seq_len, hidden_size], dtype=output_dtype)]
+        return [input.new_empty([seq_len, out_hidden_size], dtype=output_dtype)]
```
♻️ Duplicate comments (2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
122-171
: Add docstring + input validation for unpadded_hidden_size
We should document semantics and guard invalid values (<=0, > padded hidden size) at the fused_moe entrypoint. This mirrors the kernel's expectations and reduces silent shape surprises.
Apply this diff:
```diff
 @torch.library.custom_op("trtllm::fused_moe", mutates_args=())
 def fused_moe(
@@
-    unpadded_hidden_size: Optional[int] = None,
+    unpadded_hidden_size: Optional[int] = None,
 ) -> List[torch.Tensor]:
+    """
+    Fused MoE custom op. If unpadded_hidden_size is provided (> 0), outputs are
+    sized to that hidden dimension; otherwise the padded hidden size is used.
+
+    Args:
+        unpadded_hidden_size: Original (unpadded) hidden size to slice output to.
+            Must satisfy 0 < unpadded_hidden_size <= padded hidden size inferred
+            from fc2_expert_weights.
+    """
@@
-    # Only the non-alltoall case is considered for profiling in the warmup phase.
+    # Only the non-alltoall case is considered for profiling in the warmup phase.
@@
     else:
         assert tuner_num_tokens is None
         assert tuner_top_k is None
         tuner_input = input
         tuner_top_k = token_selected_experts.size(1)
+
+    # Sanity-check unpadded_hidden_size if provided.
+    padded_hidden_size = (fc2_expert_weights.shape[2]
+                          if use_int8_woq_per_channel
+                          else fc2_expert_weights.shape[1])
+    if unpadded_hidden_size is not None:
+        if not (0 < int(unpadded_hidden_size) <= int(padded_hidden_size)):
+            raise ValueError(
+                f"unpadded_hidden_size ({unpadded_hidden_size}) "
+                f"must be in (0, {int(padded_hidden_size)}]"
+            )
```
11-11
: Fix ceil_div import to the correct module
ceil_div is defined under quantization/utils/fp8_utils.py, not in _torch/utils.py. Importing from ...utils will raise ImportError.
Apply this diff:
```diff
-from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor, ceil_div
+from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor
+from ...quantization.utils.fp8_utils import ceil_div
```
🧹 Nitpick comments (2)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
418-439
: Finalize-scale fake op: honor unpadded_hidden_size with a sane fallback and validation
Right now the fake returns shape (num_rows, int(unpadded_hidden_size)) unconditionally. If callers pass a 0/None-equivalent to indicate "no slicing," this will produce a zero-width tensor. Add a fallback to hidden_size when unpadded_hidden_size <= 0 and validate that it never exceeds hidden_size.
Apply this diff:
```diff
 @torch.library.register_fake("trtllm::moe_finalize_scale_op")
 def _(
@@
     num_rows: torch.SymInt,
     hidden_size: torch.SymInt,
-    unpadded_hidden_size: torch.SymInt,
+    unpadded_hidden_size: torch.SymInt,
     experts_per_token: int,
@@
 ):
-    num_rows_val = int(num_rows)
-    unpadded_hidden_size_val = int(unpadded_hidden_size)
-    return gemm2_output.new_empty((num_rows_val, unpadded_hidden_size_val),
-                                  dtype=gemm2_output.dtype)
+    num_rows_val = int(num_rows)
+    hidden_size_val = int(hidden_size)
+    target_size = int(unpadded_hidden_size)
+    # Allow 0 to mean "no slicing" (use padded hidden size).
+    if target_size <= 0:
+        target_size = hidden_size_val
+    else:
+        assert target_size <= hidden_size_val, (
+            "unpadded_hidden_size must be <= hidden_size"
+        )
+    return gemm2_output.new_empty((num_rows_val, target_size),
+                                  dtype=gemm2_output.dtype)
```
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1)
339-386
: is_sf_swizzled logic looks correct; shorten the long inline comment
The flag is set to False whenever post-quant communication occurs (alltoall or allgather), which matches the deferred-swizzle design. One nit: the comment at Line 342 exceeds the 120-char limit flagged by Ruff.
Apply this diff:
```diff
-    is_sf_swizzled = True  # In case of post-quant communication, scaling factors will not be swizzled before communication, and swizzling after communication is merged into MoE.
+    # Default: scaling factors are already swizzled.
+    # For post-quant communication, we defer swizzling into the MoE op.
+    is_sf_swizzled = True
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
(2 hunks)tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
(7 hunks)tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
- tensorrt_llm/_torch/utils.py (1): Fp4QuantizedTensor (97-104)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1): hidden_size (216-216)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
342-342: Line too long (182 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (3)
53-74
: Threading unpadded_hidden_size through MoERunner is sound
Using 0 as a sentinel in self.unpadded_hidden_size keeps the Python <-> C++ interface simple and backwards-compatible. Looks good.
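A minimal sketch of that sentinel convention (the function and names are illustrative; the real value is carried into the C++ runner):

```python
from typing import Optional


def resolve_output_width(padded: int, unpadded: Optional[int]) -> int:
    """0 or None means 'no slicing': fall back to the padded width."""
    return unpadded if unpadded else padded


assert resolve_output_width(2944, None) == 2944   # legacy callers
assert resolve_output_width(2944, 0) == 2944      # explicit sentinel
assert resolve_output_width(2944, 2880) == 2880   # sliced output
```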
189-190
: Forward unpadded_hidden_size into MoERunner construction
Good; this ensures the profile and run paths see the same target hidden size.
240-241
: Propagate unpadded_hidden_size into run_moe/run_moe_min_latency
Correct propagation to the execution path. No concerns.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (2)
90-92
: Good: capture unpadded_hidden_size before any padding
Storing the original hidden size early ensures the op can request unpadded outputs regardless of backend padding. This aligns with the fused op's new behavior.
393-425
: Fused op call: swizzled_input_sf + unpadded_hidden_size propagation LGTM
Passing swizzled_input_sf=is_sf_swizzled and unpadded_hidden_size=self.unpadded_hidden_size matches the new kernel contracts and removes Python-side slicing. Nicely done.
Signed-off-by: Bo Li <[email protected]>
/bot run
Actionable comments posted: 1
♻️ Duplicate comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
2073-2095
: Guard GPT-OSS-120B test for the non-OSS path or neutralize the unpadded override
Non-OSS kernels don't honor an unpadded output stride; combined with the current validation, this can still cause false negatives in non-OSS builds. You have two viable options:
- Preferable for fidelity: compile-time skip when OSS path is unavailable (keeps the test’s intent intact).
- Alternative: only set mUnpaddedHiddenSize when OSS is enabled; otherwise leave it as 0 so the test runs with padded stride.
Below shows the skip approach; if you’d rather keep it running everywhere, I can provide the “neutralize override” variant.
```diff
 TYPED_TEST(MixtureOfExpertsTest, ParallelismType##GptOss120b)                     \
 {                                                                                 \
+    /* Only meaningful with the OSS Cutlass MoE path (unpadded stride support) */ \
+#ifndef USING_OSS_CUTLASS_MOE_GEMM                                                \
+    GTEST_SKIP() << "Skipping GptOss120b: requires USING_OSS_CUTLASS_MOE_GEMM";   \
+    return;                                                                       \
+#endif                                                                            \
     this->mIsLongTest = true;                                                     \
     this->mUseBias = true;                                                        \
     this->mActType = ActivationType::Swiglu;                                      \
@@
     float inter_size_fraction = float(inter_size) / hidden_size;                  \
-    this->mUnpaddedHiddenSize = 2880;                                             \
+    this->mUnpaddedHiddenSize = 2880;                                             \
@@
     if (!this->checkSufficientTestMemory(75, hidden_size, 128, 4, true))          \
```
Note: Once the compareFinal out_stride fix lands, this test will not mis-index; the guard keeps the intent of validating unpadded-stride output specifically on the OSS path.
🧹 Nitpick comments (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
195-196
: Add basic sanity checks for mUnpaddedHiddenSize
Good addition. Add a lightweight precondition to ensure the unpadded dimension is never larger than the padded one and is non-negative. This helps catch misconfigurations early in tests.
Example (place right before invoking runMoe in runMoEPermute):
```diff
@@
-    QuantParams quant_params;
+    QuantParams quant_params;
+    if (mUnpaddedHiddenSize > 0)
+    {
+        ASSERT_LE(mUnpaddedHiddenSize, mHiddenSize) << "Unpadded hidden size must be <= hidden size.";
+    }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
(8 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧠 Learnings (2)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu
🧬 Code graph analysis (1)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (7): mTotalTokens (540-540), mHiddenSize (378-378), mInterSize (539-539), mNumExperts (379-379), mK (381-381), mWorkspace (515-515), mFinalOutput (537-537)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (5)
257-258
: Reset of mUnpaddedHiddenSize in TearDown is correct
Resetting the test-scoped override avoids bleed-over between tests. LGTM.
2065-2072
: DeepSeekV3 param threading looks good
The inter_size_fraction calculation and parallel test invocation are consistent with the earlier memory guard.
2351-2353
: OSS profiler distribution dims updated correctly
Using 1024, 1024, 4096 for (hidden, unpadded_hidden, inter) is consistent. LGTM.
2301-2302
: I've pulled the `GemmProfilerBackend` definitions from both the internal and public Cutlass headers around their struct declarations so we can inspect the `init(...)` signature (showing hidden vs. unpadded_hidden arguments). Once we see the exact parameter list, we can confirm the tuple ordering and DEFAULT_HIDDEN_SIZE usage.
1241-1247
: runMoe parameters match the updated signature; no changes required.
- In cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (around lines 460-465), the pure-virtual runMoe is declared as
  `…, QuantParams quant_params, int64_t num_rows, int64_t hidden_size, int64_t unpadded_hidden_size, int64_t inter_size, …`
- In the unit test cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (lines 1241-1247), the call passes
  `mTotalTokens` (num_rows), `mHiddenSize` (hidden_size), `(mUnpaddedHiddenSize > 0 ? mUnpaddedHiddenSize : mHiddenSize)` (unpadded_hidden_size), and `mInterSize / parallelism_config.tp_size` (inter_size), exactly in that order.
All three size parameters line up with the header declaration.
PR_Github #16416 [ run ] triggered by Bot
PR_Github #16416 [ run ] completed with state
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.