[https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend #7286
Conversation
📝 Walkthrough
Adds per-expert pre-quantization: device utility, per-expert CUDA kernels and host launchers, extended applyPrequantScale signature and GEMM2 call sites, PyTorch MoE quantization switched to per-expert shapes, and a unit test updated to use randomized pre-quant scales.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    actor GEMM2 as GEMM2 Path
    participant Runner as CutlassMoeFCRunner
    participant Launcher as Per-Expert Launcher
    participant Kernel as CUDA Kernel
    Note over Runner,Launcher: Decision: use_awq && expert mapping provided
    GEMM2->>Runner: applyPrequantScale(..., stream, expert_offsets, num_experts)
    alt per-expert path
        Runner->>Launcher: apply_per_channel_scale_per_expert(..., expert_offsets, num_experts, stream)
        Launcher->>Kernel: launch<<<grid,block>>>(..., expert_offsets, num_experts)
        Kernel-->>Launcher: writes smoothed_act (per-expert scaled)
        Launcher-->>Runner: returns smoothed_act pointer
        Runner-->>GEMM2: proceed with smoothed_act as gemm input
    else fallback path
        Runner-->>GEMM2: proceed with permuted_data (existing path)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (14)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (6)
955-966: Per-expert shapes look good; consider clarifying the comment.
fc31_act_scale and fc2_act_scale now carry per-expert factors. The trailing singleton dim on fc2_act_scale is fine; please tweak the comment to explicitly say shapes are [E, K] and [E, I, 1].
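To make the shapes concrete, here is a minimal, hypothetical sketch of how such per-expert parameters could be registered; the sizes (E, K, I) and the use of nn.Parameter are illustrative assumptions, and only the attribute names come from the module.
```python
import torch
import torch.nn as nn

# Hypothetical sketch of the per-expert activation-scale shapes discussed above.
# E = experts per partition, K = hidden size, I = intermediate size (example values).
E, K, I = 8, 128, 256
fc31_act_scale = nn.Parameter(torch.ones(E, K), requires_grad=False)    # [E, K]
fc2_act_scale = nn.Parameter(torch.ones(E, I, 1), requires_grad=False)  # [E, I, 1]
# Row e of each tensor now holds the pre-quant scales for expert e; previously a
# single per-layer vector was shared by all experts.
```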
1187-1203: Avoid extra stacks/permutes; compute in fp32 and use torch.maximum.
This reduces temporary allocations, improves numeric stability, and matches the target [E, K] layout directly.
```diff
- all_w3_w1_pre_quant_scales_greater = torch.max(
-     torch.stack([torch.stack(all_w3_pre_quant_scales),
-                  torch.stack(all_w1_pre_quant_scales)]).to(module.dtype),
-     dim=0,).values.permute(1, 0)
-
- all_w3_w1_input_scales_greater = torch.max(
-     torch.stack([torch.stack(all_w3_input_scales),
-                  torch.stack(all_w1_input_scales)]).to(module.dtype),
-     dim=0,).values
-
- all_w3_w1_pre_quant_scales_div_input_scales = (
-     all_w3_w1_pre_quant_scales_greater *
-     (1 / all_w3_w1_input_scales_greater.reshape(1, module.expert_size_per_partition).float())
- )
-
- module.fc31_act_scale.data.copy_(all_w3_w1_pre_quant_scales_div_input_scales.permute(1, 0))
+ # [E, K] tensors; keep math in fp32 for stability
+ w3_pre = torch.stack(all_w3_pre_quant_scales).to(torch.float32)
+ w1_pre = torch.stack(all_w1_pre_quant_scales).to(torch.float32)
+ pre_max = torch.maximum(w3_pre, w1_pre)  # [E, K]
+
+ in_scales = torch.maximum(
+     torch.stack(all_w3_input_scales).to(torch.float32),
+     torch.stack(all_w1_input_scales).to(torch.float32),
+ )  # [E]
+
+ fc31_act = pre_max / in_scales.unsqueeze(1)  # [E, K]
+ module.fc31_act_scale.data.copy_(fc31_act.to(module.dtype))
```
Also applies to: 1192-1196
1214-1224: Same simplification for weight_scale_2 and alpha; reuse in_scales.
Saves memory and keeps compute in fp32.
```diff
- all_w3_w1_weight_scale_2 = torch.stack(
-     [torch.stack(all_w3_weight_scale_2), torch.stack(all_w1_weight_scale_2)]).to(
-         module.dtype)
- all_w3_w1_weight_scale_2_greater = torch.max(all_w3_w1_weight_scale_2, dim=0).values
-
- all_w3_w1_weight_scale_2_mul_input_scales = (
-     all_w3_w1_weight_scale_2_greater.reshape(module.expert_size_per_partition, 1).float() *
-     all_w3_w1_input_scales_greater.reshape(module.expert_size_per_partition, 1).float()
- )
- module.fc31_alpha.data.copy_(all_w3_w1_weight_scale_2_mul_input_scales.reshape(module.expert_size_per_partition, 1).float())
+ w3_ws2 = torch.stack(all_w3_weight_scale_2).to(torch.float32).squeeze(-1)  # [E]
+ w1_ws2 = torch.stack(all_w1_weight_scale_2).to(torch.float32).squeeze(-1)  # [E]
+ all_w3_w1_weight_scale_2_greater = torch.maximum(w3_ws2, w1_ws2)  # [E]
+
+ module.fc31_alpha.data.copy_((all_w3_w1_weight_scale_2_greater * in_scales).unsqueeze(1))
```
1251-1254: Normalize per-group scales without permutes.
Broadcast over the expert axis; cheaper and clearer.
```diff
- w3_w1_scales = w3_w1_scales.permute(1, 2, 0)
- w3_w1_scales /= all_w3_w1_weight_scale_2_greater.reshape(module.expert_size_per_partition).float()
- w3_w1_scales = w3_w1_scales.permute(2, 0, 1)
+ w3_w1_scales = w3_w1_scales / all_w3_w1_weight_scale_2_greater.view(-1, 1, 1).float()
```
1294-1313: Simplify fc2 per-expert scales; avoid permutes and keep fp32 math.
Mirrors the fc31 refactor for consistency and stability.
```diff
- all_w2_pre_quant_scales = torch.stack(all_w2_pre_quant_scales).to(module.dtype)
- all_w2_input_scales = torch.stack(all_w2_input_scales).to(module.dtype)
- all_w2_pre_quant_scales_div_input_scales = (
-     all_w2_pre_quant_scales.permute(1, 0) *
-     (1 / (all_w2_input_scales.reshape(module.expert_size_per_partition).float()))
- ).permute(1, 0)
- module.fc2_act_scale.data.copy_(all_w2_pre_quant_scales_div_input_scales.reshape(module.fc2_act_scale.shape))
+ fc2_act = (torch.stack(all_w2_pre_quant_scales).to(torch.float32)
+            / torch.stack(all_w2_input_scales).to(torch.float32).unsqueeze(1))  # [E, I]
+ module.fc2_act_scale.data.copy_(fc2_act.unsqueeze(-1).to(module.dtype))
```
```diff
- all_w2_weight_scale_2 = torch.stack(all_w2_weight_scale_2).to(module.dtype)
- all_w2_weight_scale_2_mul_input_scales = (
-     all_w2_weight_scale_2.reshape(module.expert_size_per_partition, 1) *
-     all_w2_input_scales.reshape(module.expert_size_per_partition, 1)
- )
- module.fc2_alpha.data.copy_(all_w2_weight_scale_2_mul_input_scales)
+ all_w2_weight_scale_2 = torch.stack(all_w2_weight_scale_2).to(torch.float32)  # [E]
+ module.fc2_alpha.data.copy_(
+     (all_w2_weight_scale_2 * torch.stack(all_w2_input_scales).to(torch.float32)).unsqueeze(1))
```
1331-1335: Same broadcast trick here; drop permutes.
```diff
- w2_scales = w2_scales.permute(1, 2, 0)
- all_w2_weight_scale_2 = all_w2_weight_scale_2.reshape(module.expert_size_per_partition)
- w2_scales /= (all_w2_weight_scale_2.float())
- w2_scales = w2_scales.permute(2, 0, 1)
+ w2_scales = w2_scales / all_w2_weight_scale_2.view(-1, 1, 1).float()
```
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1)
42-46: Make expert offsets const and match defaults with the non-expert launcher; add brief API doc.
The kernel does not modify offsets; aligning defaults simplifies callsites and mirrors the existing launcher.
```diff
-template <typename T_in, typename T_out = T_in>
-void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
-    int rows, int cols, int64_t* expert_first_token_offset, int const num_experts_per_node,
-    int64_t const* num_valid_tokens_ptr, cudaStream_t stream);
+//! Applies per-channel pre-quant scales per expert to permuted activations.
+template <typename T_in, typename T_out = T_in>
+void apply_per_channel_scale_per_expert_kernel_launcher(
+    T_out* smoothed_act,
+    T_in const* act,
+    T_in const* per_channel_scale,
+    int rows,
+    int cols,
+    int64_t const* expert_first_token_offset,
+    int const num_experts_per_node,
+    int64_t const* num_valid_tokens_ptr = nullptr,
+    cudaStream_t stream = 0);
```
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
863-866: Const-correct expert offsets; add a short doc to the helper.
Keeps signature consistent with the kernel launcher and clarifies behavior.
```diff
-    T const* applyPrequantScalePerExpert(void* smoothed_act, void const* permuted_data, void const* prequant_scales,
-        int64_t const* num_valid_tokens_ptr, int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq,
-        cudaStream_t stream, int64_t* expert_first_token_offset, int const num_experts_per_node);
+    //! Applies pre-quant scales per expert and returns the GEMM input pointer
+    T const* applyPrequantScalePerExpert(void* smoothed_act, void const* permuted_data, void const* prequant_scales,
+        int64_t const* num_valid_tokens_ptr, int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq,
+        cudaStream_t stream, int64_t const* expert_first_token_offset, int const num_experts_per_node);
```
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (1)
3049-3076: Make expert_first_token_offset const; it’s read-only.
Minor const-correctness + clarity for the new API.
Apply this diff:
```diff
-template <class T, class WeightType, class OutputType, class InputType, class ScaleBiasType, class Enable>
-T const* CutlassMoeFCRunner<T, WeightType, OutputType, InputType, ScaleBiasType, Enable>::applyPrequantScalePerExpert(
-    void* smoothed_act, void const* permuted_data, void const* prequant_scales, int64_t const* num_valid_tokens_ptr,
-    int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq, cudaStream_t stream,
-    int64_t* expert_first_token_offset, int const num_experts_per_node)
+template <class T, class WeightType, class OutputType, class InputType, class ScaleBiasType, class Enable>
+T const* CutlassMoeFCRunner<T, WeightType, OutputType, InputType, ScaleBiasType, Enable>::applyPrequantScalePerExpert(
+    void* smoothed_act, void const* permuted_data, void const* prequant_scales, int64_t const* num_valid_tokens_ptr,
+    int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq, cudaStream_t stream,
+    int64_t const* expert_first_token_offset, int const num_experts_per_node)
 {
@@
-        tensorrt_llm::kernels::apply_per_channel_scale_per_expert_kernel_launcher<UnfusedGemmOutputType, T>(
+        tensorrt_llm::kernels::apply_per_channel_scale_per_expert_kernel_launcher<UnfusedGemmOutputType, T>(
             reinterpret_cast<T*>(smoothed_act), reinterpret_cast<UnfusedGemmOutputType const*>(permuted_data),
             reinterpret_cast<UnfusedGemmOutputType const*>(prequant_scales), expanded_num_rows, seq_len,
-            expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
+            const_cast<int64_t*>(expert_first_token_offset), num_experts_per_node, num_valid_tokens_ptr, stream);
```
Note: header declaration and the kernel launcher signature should be updated to accept int64_t const* as well (see companion comments).
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (5)
99-118: Avoid duplicating findTotalEltsLessThanTarget across compilation units.
This device helper exists in moe_kernels.cu too. Consider moving it to a shared header (e.g., moe_util_kernels.h) and including it here to prevent divergence.
120-178: Const-correctness and type width in per-expert kernel.
- expert_first_token_offset is read-only; take it as int64_t const*.
- Avoid narrowing: store expert as int64_t.
Apply this diff:
```diff
-template <typename T_in, typename T_out, int kProcessRows, typename AccessType>
-__global__ void apply_per_channel_scale_per_expert(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
-    int cols, int64_t* expert_first_token_offset, int const num_experts_per_node, int64_t const* num_valid_tokens_ptr)
+template <typename T_in, typename T_out, int kProcessRows, typename AccessType>
+__global__ void apply_per_channel_scale_per_expert(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
+    int cols, int64_t const* expert_first_token_offset, int const num_experts_per_node, int64_t const* num_valid_tokens_ptr)
 {
@@
-    for (int i = 0; i < kProcessRows; ++i)
+    for (int i = 0; i < kProcessRows; ++i)
     {
-        int expert = findTotalEltsLessThanTarget(expert_first_token_offset, num_experts_per_node, (int64_t) row_offset * kProcessRows + i + 1) - 1;
+        int64_t expert = findTotalEltsLessThanTarget(
+            expert_first_token_offset, num_experts_per_node, (int64_t) row_offset * kProcessRows + i + 1) - 1;
         *reinterpret_cast<AccessType*>(scale)
             = reinterpret_cast<AccessType const*>(per_channel_scale)[expert * cols / kElems + col_offset];
```
190-202: Launcher: pass expert_first_token_offset as const.
Keep API consistent with the kernel.
Apply this diff:
```diff
-template <typename T_in, typename T_out, int kProcessRows, typename AccessType = float4>
-void apply_per_channel_scale_per_expert_kernel_launcher_(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
-    int rows, int cols, int64_t* expert_first_token_offset, int const num_experts_per_node,
-    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
+template <typename T_in, typename T_out, int kProcessRows, typename AccessType = float4>
+void apply_per_channel_scale_per_expert_kernel_launcher_(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
+    int rows, int cols, int64_t const* expert_first_token_offset, int const num_experts_per_node,
+    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
 {
@@
-    apply_per_channel_scale_per_expert<T_in, T_out, kProcessRows, AccessType>
+    apply_per_channel_scale_per_expert<T_in, T_out, kProcessRows, AccessType>
         <<<grid, block, 0, stream>>>(smoothed_act, act, per_channel_scale, rows, cols,
-            expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr);
+            expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr);
```
230-256: Top-level launcher: const-correct expert_first_token_offset.
Mirror the change through the public launcher.
Apply this diff:
```diff
-template <typename T_in, typename T_out>
-void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
-    int rows, int cols, int64_t* expert_first_token_offset, int const num_experts_per_node,
-    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
+template <typename T_in, typename T_out>
+void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale,
+    int rows, int cols, int64_t const* expert_first_token_offset, int const num_experts_per_node,
+    int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
 {
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 1, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 1, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 4, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 4, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 8, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 8, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
@@
-        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 16, float4>(
+        apply_per_channel_scale_per_expert_kernel_launcher_<T_in, T_out, 16, float4>(
             smoothed_act, act, per_channel_scale, rows, cols, expert_first_token_offset, num_experts_per_node, num_valid_tokens_ptr, stream);
```
274-290: Instantiate const-correct launcher prototypes.
If you adopt const in the launcher signatures, these instantiations remain unchanged; ensure the corresponding header declarations are updated.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hh,hpp,hxx,cuh}
: Closing braces of C++ namespaces in headers must include a trailing comment naming the namespace
Use Allman brace style and always use braces for control statements in headers as well
C++ header filenames must be lowerCamelCase and case-insensitively unique within a compilation target
Document public C++ interfaces with Doxygen using //! and //!<; C-style comments are not allowed except inline special cases; single-line comments should use // and be properly capitalized and punctuated if full sentences
Avoid assignment in subexpressions within header inline/template code as well
All class/function templates and their members should be instantiated at least once; if a class is not POD, its data members should be private
Use header include guards; name as TRTLLM__H (all caps of filename only, no dirs), no leading underscore and no trailing underscore
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧠 Learnings (3)
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
🧬 Code graph analysis (5)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4)
  - void (43-96)
  - void (121-176)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-256)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-233)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
  - applyPrequantScalePerExpert (3050-3076)
  - applyPrequantScalePerExpert (3050-3053)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (1)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
  - findTotalEltsLessThanTarget (902-920)
  - findTotalEltsLessThanTarget (902-902)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4)
  - findTotalEltsLessThanTarget (100-118)
  - findTotalEltsLessThanTarget (100-100)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-256)
  - apply_per_channel_scale_per_expert_kernel_launcher (231-233)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (3)
  - smoothed_act_ (896-896)
  - fc1_result_ (886-886)
  - expert_first_token_offset_ (882-882)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
- tensorrt_llm/module.py (1)
  - register_parameter (186-190)
- tensorrt_llm/_torch/modules/linear.py (1)
  - load_weight_shard (58-102)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/fused_moe/quantization.py
1223-1223: Line too long (136 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/_torch/modules/test_fused_moe.py (1)
1331-1340: Good: randomized pre-quant scales exercise per-expert paths.
Seeds are set earlier, and the 0.95–1.05 range avoids degenerate cases while keeping stability. LGTM.
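For illustration, a hedged sketch of the kind of randomization this refers to; the tensor name and shape are assumptions rather than the actual test code.
```python
import torch

torch.manual_seed(0)  # seeds are set earlier in the real test
num_experts, hidden_size = 8, 256
# Draw per-expert pre-quant scales uniformly from [0.95, 1.05] so the
# per-expert path is exercised without destabilizing the reference check.
pre_quant_scale = 0.95 + 0.1 * torch.rand(num_experts, hidden_size, dtype=torch.float32)
```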
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (2)
1613-1625: Per-expert prequant scaling path looks correct; verify scale layout/alignments.
- Row→expert mapping via expert_first_token_offset and the binary search is sound.
- Indexing into prequant_scales assumes a contiguous [num_experts_per_node, hidden_size] layout and hidden_size divisible by ELEM_PER_THREAD/Vec width. Please confirm these invariants at call sites and in tests.
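To make the two points above concrete, here is a small PyTorch emulation of the row→expert mapping and the per-expert scale lookup; it is an illustrative reference under the stated layout assumptions, not the CUDA kernel itself.
```python
import torch

def scale_rows_per_expert(permuted_act, prequant_scales, expert_first_token_offset):
    """Emulate per-expert pre-quant scaling on permuted activations.

    permuted_act:              [num_rows, hidden_size]
    prequant_scales:           [num_experts, hidden_size], contiguous
    expert_first_token_offset: [num_experts + 1] prefix offsets, offset[0] == 0
    """
    rows = torch.arange(permuted_act.shape[0], device=permuted_act.device)
    # Row r belongs to expert e iff offset[e] <= r < offset[e + 1]; this mirrors
    # the "count of offsets <= r" binary search the device helper performs.
    expert_ids = torch.searchsorted(expert_first_token_offset, rows, right=True) - 1
    return permuted_act * prequant_scales[expert_ids]

# Example: 2 experts, rows 0-2 -> expert 0, rows 3-4 -> expert 1.
act = torch.ones(5, 4)
scales = torch.tensor([[2.0] * 4, [3.0] * 4])
offsets = torch.tensor([0, 3, 5])
print(scale_rows_per_expert(act, scales, offsets))
```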
3854-3855: Check per-expert scale tensor shape and dtype at call-site.
Ensure quant_params.groupwise.fc2.act_scales is laid out [num_experts_per_node, inter_size] and matches UnfusedGemmOutputType; otherwise scaling or vectorized loads can misbehave.
eb370a5 to e417e05 (Compare)
e417e05 to a062f81 (Compare)
/bot run
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (1)
51-54: Fix vector tail handling and grid-y computation in preQuantScaleKernel
The current grid-y setup uses integer division (cols / kElems), allowing threads that begin a partial vector to pass the col_offset * kElems >= cols guard and then read/write past the end when cols % kElems != 0. To prevent out-of-bounds accesses:
• Compute the number of vector columns with a ceiling:
```cpp
const int vecCols = (cols + kElems - 1) / kElems; // ceil(cols/kElems)
dim3 grid(
    (rows + kProcessRows - 1) / kProcessRows,
    (vecCols + block.x - 1) / block.x);
```
(Replace the existing grid definition at line 115.)
• Tighten the in-kernel guard so any thread whose vector would overflow returns early:
```diff
- if (col_offset * kElems >= cols || row_offset * kProcessRows >= rows)
+ if ((col_offset + 1) * kElems > cols || row_offset * kProcessRows >= rows)
      return;
```
(Update the check at line 51, and similarly in the overload at lines 108–118.)
• Add a masked “tail” path for the last partial vector when (col_offset + 1) * kElems > cols, loading and storing elements one by one to handle the remainder safely.
These changes eliminate any partial-vector overreads or overwrites when cols is not an exact multiple of kElems.
🧹 Nitpick comments (11)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
466-472: Gate W4A8_CUSTOM by algo as well, not only by “int4 per-group” mode.
layer_quant_mode.is_int4_weight_only_per_group() may be true for non‑AWQ quant paths (e.g., other int4 per‑group modes). To avoid routing unrelated configs into the “custom W4A8” loader, also check the algo is W4A8_AWQ.
Apply:
```diff
- weight_loading_mode=(
-     MoEWeightLoadingMode.W4A8_CUSTOM
-     if self._get_experts_quant_config(
-         model_config,
-         layer_idx).layer_quant_mode.is_int4_weight_only_per_group()
-     else MoEWeightLoadingMode.VANILLA),
+ weight_loading_mode=(
+     MoEWeightLoadingMode.W4A8_CUSTOM
+     if (lambda qc: qc.quant_algo == QuantAlgo.W4A8_AWQ
+         and qc.layer_quant_mode.is_int4_weight_only_per_group()
+         )(self._get_experts_quant_config(model_config, layer_idx))
+     else MoEWeightLoadingMode.VANILLA),
```
537-543: Document and type-annotate helper for maintainability.
Add a brief docstring and annotate the argument to clarify expected keys and fallback behavior.
```diff
 @staticmethod
-def _get_experts_quant_config(model_config, layer_idx: int) -> QuantConfig:
-    if getattr(model_config, "quant_config_dict", None) is None:
-        return model_config.quant_config
-    return model_config.quant_config_dict.get(
-        f"model.layers.{layer_idx}.mlp.experts", model_config.quant_config)
+def _get_experts_quant_config(model_config: ModelConfig, layer_idx: int) -> QuantConfig:
+    """Return experts' QuantConfig for a given layer, falling back to the global config."""
+    if getattr(model_config, "quant_config_dict", None) is None:
+        return model_config.quant_config
+    return model_config.quant_config_dict.get(
+        f"model.layers.{layer_idx}.mlp.experts", model_config.quant_config)
```
654-656: Nit: prefer “!=” for enum comparison.
QuantAlgo is an enum; != reads clearer than identity checks.
```diff
- assert (
-     quant_config.quant_algo
-     is not QuantAlgo.MIXED_PRECISION), "MIXED_PRECISION is ambiguous"
+ assert quant_config.quant_algo != QuantAlgo.MIXED_PRECISION, "MIXED_PRECISION is ambiguous"
```
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
1593-1596: Per-expert AWQ prequant scale indexing: check shape/stride.
Indexing assumes prequant_scales is laid out as [num_experts_per_node, hidden_size] contiguous. Verify producer matches this layout and dtype matches InputActivationsType to avoid silent precision loss.
If scales are produced as float regardless of activation dtype, consider casting once at load to avoid repeated implicit conversions inside the loop.
Also applies to: 1603-1605
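If the producer does emit float32 scales, a one-time cast at load time would avoid repeated conversions; a hedged sketch with assumed tensor names and sizes:
```python
import torch

# Hypothetical load-time preparation of per-expert pre-quant scales.
num_experts, hidden_size = 8, 1024
scales_fp32 = torch.rand(num_experts, hidden_size, dtype=torch.float32)
activation_dtype = torch.bfloat16  # assumed activation dtype for this layer
# Cast once and keep the [num_experts, hidden_size] layout contiguous so the
# kernel's vectorized loads see the expected dtype and stride.
prequant_scales = scales_fp32.to(activation_dtype).contiguous()
```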
3005-3009: API extension: add defaults carefully and const-correctness.
The new parameters have defaults, preserving ABI at call sites. Since expert_first_token_offset is read-only, prefer a const int64_t* to better express intent.
```diff
-T const* CutlassMoeFCRunner<...>::applyPrequantScale(
+T const* CutlassMoeFCRunner<...>::applyPrequantScale(
     void* smoothed_act, void const* permuted_data, void const* prequant_scales, int64_t const* num_valid_tokens_ptr,
     int64_t const expanded_num_rows, int64_t const seq_len, bool const use_awq, cudaStream_t stream,
-    int64_t* expert_first_token_offset, int const num_experts_per_node)
+    int64_t const* expert_first_token_offset, int const num_experts_per_node)
```
Note: this change ripples to declarations in the public headers.
: Plumbed per-expert offsets into GEMM2 prequant path.This wires up the new capability on the main path. For consistency, consider passing the offsets in the min-latency branch too (it’s currently FP4-only, so this is a non-functional nit).
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (5)
44-46: Const-qualify expert_first_token_offset through the kernel and launchers.
The pointer is read-only. Making it const avoids accidental writes and qualifier drops at call sites.
```diff
-__global__ void apply_per_channel_scale(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
-    int cols, int64_t const* num_valid_tokens_ptr, int64_t* expert_first_token_offset, int const num_experts_per_node)
+__global__ void apply_per_channel_scale(T_out* smoothed_act, T_in const* act, T_in const* per_channel_scale, int rows,
+    int cols, int64_t const* num_valid_tokens_ptr, int64_t const* expert_first_token_offset, int const num_experts_per_node)
```
Apply the same change to the launcher signatures/usages below (Lines 111, 149, 193).
68-70
: Per-channel scale indexing: avoid hidden assumptions and integer-division pitfalls.
- Make numVecsPerExpert explicit to document the assumption and help static analyzers.
- If cols is not a multiple of kElems, current vector load/store paths can read beyond the row boundary.
```diff
- *reinterpret_cast<AccessType*>(scale)
-     = reinterpret_cast<AccessType const*>(per_channel_scale)[expert * cols / kElems + col_offset];
+ const int numVecsPerExpert = cols / kElems; // assumes cols % kElems == 0
+ *reinterpret_cast<AccessType*>(scale)
+     = reinterpret_cast<AccessType const*>(per_channel_scale)[expert * numVecsPerExpert + col_offset];
```
If cols may not be divisible by kElems in any path, add a masked tail for the last vector or assert the divisibility precondition.
110-112: Prefer nullptr for stream default and const-qualify expert offsets.
Minor API hygiene.
```diff
-    int rows, int cols, int64_t const* num_valid_tokens_ptr = nullptr, cudaStream_t stream = 0,
-    int64_t* expert_first_token_offset = nullptr, int const num_experts_per_node = 0)
+    int rows, int cols, int64_t const* num_valid_tokens_ptr = nullptr, cudaStream_t stream = nullptr,
+    int64_t const* expert_first_token_offset = nullptr, int const num_experts_per_node = 0)
```
147-174: Public per-expert launcher: mirror the const and stream defaults; otherwise LGTM.
API shape and dispatch thresholds match the non-expert path.
```diff
-void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act,
-    T_in const* per_channel_scale, int rows, int cols, int64_t* expert_first_token_offset,
-    int const num_experts_per_node, int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
+void apply_per_channel_scale_per_expert_kernel_launcher(T_out* smoothed_act, T_in const* act,
+    T_in const* per_channel_scale, int rows, int cols, int64_t const* expert_first_token_offset,
+    int const num_experts_per_node, int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
```
191-207: Template instantiations: keep in sync with const changes.
Update the declaration used in the macro to reflect the const pointer type; otherwise these are fine.
```diff
-    template void apply_per_channel_scale_per_expert_kernel_launcher<T_in, T_out>(T_out * smoothed_act,            \
-        const T_in* act, const T_in* per_channel_scale, int rows, int cols, int64_t* expert_first_token_offset,    \
+    template void apply_per_channel_scale_per_expert_kernel_launcher<T_in, T_out>(T_out * smoothed_act,            \
+        const T_in* act, const T_in* per_channel_scale, int rows, int cols, const int64_t* expert_first_token_offset, \
         int const num_experts_per_node, int64_t const* num_valid_tokens_ptr, cudaStream_t stream)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (5 hunks)
- cpp/tensorrt_llm/kernels/moe_utils.h (1 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (6 hunks)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.h (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/kernels/moe_utils.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.h
- tests/unittest/_torch/modules/test_fused_moe.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{c,cc,cpp,cxx,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu}
: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}
: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
PR: NVIDIA/TensorRT-LLM#7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧬 Code graph analysis (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (4)
- cpp/tensorrt_llm/kernels/moe_utils.h (1)
  - findTotalEltsLessThanTarget (27-45)
- cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (4)
  - apply_per_channel_scale_per_expert_kernel_launcher (148-173)
  - apply_per_channel_scale_per_expert_kernel_launcher (148-150)
  - apply_per_channel_scale_kernel_launcher (121-145)
  - apply_per_channel_scale_kernel_launcher (121-122)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
  - expert_first_token_offset_ (878-878)
- cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h (1)
  - expert_first_token_offset_ (825-825)
cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (1)
- cpp/tensorrt_llm/kernels/moe_utils.h (1)
  - findTotalEltsLessThanTarget (27-45)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)
- tensorrt_llm/quantization/mode.py (1)
  - QuantAlgo (23-46)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  - MoEWeightLoadingMode (13-19)
- tensorrt_llm/models/modeling_utils.py (4)
  - layer_quant_mode (166-170)
  - layer_quant_mode (306-312)
  - QuantConfig (128-268)
  - quant_algo (547-548)
- tensorrt_llm/llmapi/llm_args.py (2)
  - quant_config (2316-2319)
  - quant_config (2322-2323)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
48-48
: Import of QuantAlgo is correct and scoped.
Needed for the new runtime guard; no issues.
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3)
3018-3031
: Good: per-expert path selects expert-aware kernel.
Conditional dispatch to per-expert launcher when offsets are provided is clear and keeps the legacy path intact.
55-55
: Confirmed: findTotalEltsLessThanTarget
Is Device-Callable
I’ve verified that in
cpp/tensorrt_llm/kernels/moe_utils.h
the function is declared as:
__device__ inline int64_t findTotalEltsLessThanTarget(…)
This
__device__
annotation ensures it can be called from your CUDA kernels, so no link-time failures will occur when used in device code.
No further changes are needed here.
3005-3009
: No staleapplyPrequantScale
calls found—signature usage is consistentAll call sites in
moe_kernels.cu
either supply both new parameters or correctly rely on the header’s defaults:
- Line 3712: called with 8 args, using defaults for
expert_first_token_offset
andnum_experts_per_node
- Line 3800: called with 8 args, again using defaults
- Line 3819: called with all 10 args, explicitly passing
expert_first_token_offset_
andnum_experts_per_node
Since the header in
cutlass_kernels/include/moe_kernels.h
provides default values for the two new parameters, existing 8-argument calls remain valid and no updates are required.cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (3)
17-17
: Include of moe_utils.h looks correct.
Needed for findTotalEltsLessThanTarget; no issues.
64-66
: Device-qualification of findTotalEltsLessThanTarget.
Ensure findTotalEltsLessThanTarget in moe_utils.h is marked __device__ (or __host__ __device__). Otherwise calling it from a __global__ kernel will fail or cause host/device linkage issues.
Would you like me to scan the repo for its definition and open a follow-up if it’s not device-qualified?
61-67: Refine expert‐index calculation for safety and correctness
- In cpp/tensorrt_llm/kernels/preQuantScaleKernel.cu (around lines 61–67), the current code computes
```cpp
expert = findTotalEltsLessThanTarget(
    expert_first_token_offset, num_experts_per_node, tokenIdx + 1) - 1;
```
Since expert_first_token_offset[0] is always initialized to 0, underflow to −1 cannot occur at runtime. Instead of clamping here (which may hide upstream bugs), add a sanity check that the offset array is well-formed:
```cpp
int64_t arr_len = static_cast<int64_t>(num_experts_per_node) + 1; // include sentinel
int64_t raw = findTotalEltsLessThanTarget(
    expert_first_token_offset, arr_len, tokenIdx + 1) - 1;
TORCH_CHECK(expert_first_token_offset[0] == 0,
    "expert_first_token_offset[0] must be zero");
TORCH_CHECK(raw >= 0 && raw < num_experts_per_node,
    "computed expert index out of range");
expert = static_cast<int>(raw);
```
- By passing arr_len = num_experts_per_node + 1, you include the sentinel at the end of the offsets array, ensuring full coverage of all token ranges.
- The TORCH_CHECK macros guarantee that any corruption or misalignment in the offset array is caught early, rather than silently clamped.
- If you still prefer a fail-safe clamp for release builds, wrap it in #ifndef NDEBUG or a function-level flag, but avoid hiding data-corruption issues in debug/development modes.
PR_Github #16770 [ run ] triggered by Bot |
PR_Github #16770 [ run ] completed with state |
Looks good to me.
/bot run |
d0a4c05 to 6dc6f97 (Compare)
/bot run |
1 similar comment
/bot run |
6dc6f97 to 2c8af28 (Compare)
/bot run |
/bot kill |
/bot run |
1 similar comment
/bot run |
PR_Github #17659 [ run ] triggered by Bot |
PR_Github #17659 [ run ] completed with state |
2c8af28 to 616d5f2 (Compare)
PR_Github #18146 [ run ] triggered by Bot |
PR_Github #18146 [ run ] completed with state |
/bot run |
PR_Github #18280 [ run ] triggered by Bot |
/bot run |
2 similar comments
/bot run |
/bot run |
PR_Github #19520 [ run ] triggered by Bot |
cd4d8de to 2d5b9d1 (Compare)
/bot run |
PR_Github #19524 [ run ] triggered by Bot |
PR_Github #19520 [ run ] completed with state |
PR_Github #19524 [ run ] completed with state |
/bot run |
PR_Github #19676 [ run ] triggered by Bot |
PR_Github #19676 [ run ] completed with state |
Do you think any model-level e2e accuracy tests should be added?
Signed-off-by: Min Yu <[email protected]>
Signed-off-by: Min Yu <[email protected]>
Signed-off-by: Min Yu <[email protected]>
2d5b9d1 to 8ac2332 (Compare)
I will add this checkpoint for the e2e accuracy test according to @Barry-Delaney's suggestion: https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8/tree/main
/bot run |
PR_Github #19876 [ run ] triggered by Bot |
PR_Github #19876 [ run ] completed with state |
Description
This PR depends on #6200 to support ModelOpt W4A8 checkpoint.
Current W4A8 AWQ only accepts
per-layer
pre-quant scale factors (for both FC1 and FC2). However, ModelOpt generates pre-quant scale factors for each expert.
This PR makes W4A8 AWQ accept
per-expert
pre-quant scale factors.
Note:
The signature of
tensorrt_llm::kernels::apply_per_channel_scale_kernel_launcher
could not be modified, because it must be aligned with the signature in
. Otherwise, undefined reference is reported in link stage.As a result, I have implemented
tensorrt_llm::kernels::apply_per_channel_scale_per_expert_kernel_launcher
as a workaround.
Cherry-pick #7123 to this PR to run CI together.
Test Coverage
pytest tests/unittest/_torch/modules/test_fused_moe.py -k w4afp8 -s
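As a rough illustration of what the per-expert change means at the tensor level (independent of the actual module code; names and shapes here are assumptions):
```python
import torch

tokens, num_experts, hidden_size = 16, 4, 64
act = torch.randn(tokens, hidden_size)
expert_of_token = torch.randint(0, num_experts, (tokens,))

# Old behaviour: one pre-quant scale vector shared by every expert (per-layer).
per_layer_scale = torch.rand(hidden_size)                 # [K]
smoothed_old = act * per_layer_scale

# New behaviour: ModelOpt emits one scale vector per expert (per-expert).
per_expert_scale = torch.rand(num_experts, hidden_size)   # [E, K]
smoothed_new = act * per_expert_scale[expert_of_token]
```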
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.