[None][perf] Accelerate global scale calculations for deepEP fp4 combine #7126

yilin-void · 2025-08-21T10:14:33Z

Summary by CodeRabbit

New Features
- GPU-accelerated per-token FP4 global-scale calculation added; supports FP16 and BF16 inputs, optional tokens-per-batch masking, and a PyTorch operator used in low-precision MoE combine flow.
Performance
- Native operator replaces Python-side reduction, improving throughput for FP4 quantization and enabling vectorized reductions for larger hidden sizes.
Tests
- New unit tests for correctness across shapes/dtypes, masked tokens, performance timing, and input validation.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-08-21T10:14:39Z

📝 Walkthrough

Walkthrough

Adds an additive FP4 per-token global-scale computation: a vectorized templated CUDA kernel + host launcher (FP16/BF16), header declaration, C++/PyTorch binding (trtllm.calculate_nvfp4_global_scale), Python integration in fused_moe_wide_ep, and unit tests and a Python custom-op stub.

Changes

Cohort / File(s)	Summary
CUDA kernel & header `cpp/tensorrt_llm/kernels/quantization.cu`, `cpp/tensorrt_llm/kernels/quantization.h`	Adds vectorized utilities, `computePerTokenGlobalScaleForFP4QuantizationKernel<T>` and host launcher `computePerTokenGlobalScaleForFP4Quantization<T>`; computes per-token max-abs and writes `globalScale[token] = 2688.0f / maxAbs`. Adds alignment checks and explicit instantiations for `half` and `__nv_bfloat16`, and a template declaration in the header.
C++ PyTorch extension `cpp/tensorrt_llm/thop/fp4Quantize.cpp`, `cpp/tensorrt_llm/thop/fp4Quantize.h`	Adds `calculate_nvfp4_global_scale(at::Tensor, optional tokensPerBatch) -> at::Tensor`; dispatches by dtype (FP16/BF16 with guard), queries SM count, calls the CUDA host launcher, and registers the op via `TORCH_LIBRARY`/`TORCH_LIBRARY_IMPL`.
Python integration `tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py`	Replaces manual (448*6/max_abs) Python per-token scaling with `torch.ops.trtllm.calculate_nvfp4_global_scale(final_hidden_states, recv_expert_count)` under `use_low_precision_combine`; downstream usage unchanged.
Unit tests `tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py`	Adds tests comparing operator output to reference (per-token max-abs then `(448*6)/maxAbs`), masked tokens_per_batch handling, performance timing, and invalid-input checks across shapes and dtypes (FP16/BF16).
Python custom-op stub `tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py`	Adds a fake Python op `trtllm::calculate_nvfp4_global_scale` returning a float32 tensor with last dim = 1 for non-CUDA/backstop usage.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Python as fused_moe_wide_ep
  participant Op as trtllm.calculate_nvfp4_global_scale
  participant Cpp as torch_ext (C++ binding)
  participant Host as HostWrapper computePerTokenGlobalScaleForFP4Quantization<T>
  participant Kernel as CUDA kernel
  participant GS as globalScale buffer

  Python->>Op: call(input, tokensPerBatch?)
  Op->>Cpp: dispatch by dtype, validate shapes
  Cpp->>Host: launch(b,m,n,input,tokensPerBatch,globalScale,SMcount,stream)
  Host->>Kernel: <<<grid,block>>> computePerTokenGlobalScaleForFP4QuantizationKernel<T>
  Kernel->>GS: reduce per-token maxAbs → write scale = 2688.0f / maxAbs
  Kernel-->>Host: done
  Host-->>Cpp: return globalScale tensor
  Cpp-->>Op: return at::Tensor
  Op-->>Python: use globalScale in low_latency_combine_fp4

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

[None][feat] DeepEP LL combine FP4 #6822 — Strongly related: also touches FP4 per-token global-scale computation and fused_moe integration.
[None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc #6811 — Related: changes to fused_moe_wide_ep forward_chunk and token/DP scaling logic.
DeepEP LL dispatch FP4 #6296 — Related: FP4 combine/forward_chunk flow and low-latency FP4 dispatch/scale changes.

Suggested reviewers

yuxianq
yuantailing
litaotju
chenfeiz0326

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8197039 and aecf242.

📒 Files selected for processing (5)

cpp/tensorrt_llm/kernels/quantization.cu (3 hunks)
cpp/tensorrt_llm/kernels/quantization.h (1 hunks)
cpp/tensorrt_llm/thop/fp4Quantize.cpp (2 hunks)
cpp/tensorrt_llm/thop/fp4Quantize.h (1 hunks)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

cpp/tensorrt_llm/kernels/quantization.h
cpp/tensorrt_llm/thop/fp4Quantize.h
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/thop/fp4Quantize.cpp

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{c,cc,cpp,cxx,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu}: Closing braces of C++ namespaces must include a trailing comment naming the namespace (e.g., } // namespace foo)
Use Allman brace style; empty for/while loop semicolon on its own line; always use braces for control statements
C++ filenames must be lowerCamelCase (e.g., thisIsAFilename.cpp) and be case-insensitively unique within a compilation target
Use smart pointers; prefer unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases; do not use deprecated smart pointers
In implementation, prefer C++ comments (//); use inline C comments only for annotating parameters in calls (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) or chained x = y = z)

Files:

cpp/tensorrt_llm/kernels/quantization.cu

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{c,cc,cpp,cxx,cu,h,hh,hpp,hxx,cuh}: Prefer const or constexpr variables over #define for constants; variables not modified after initialization must be declared const
Avoid using literals (except 0, nullptr, true, false) outside of initialization; prefer named constexpr constants
Type names (classes, structs, enums, typedefs) must be UpperCamelCase
Local variables, methods, and namespaces must be lowerCamelCase
Non-magic-number global variables that are non-static/not in anonymous namespace must be prefixed with g (e.g., gDontUseGlobalFoos)
Non-magic-number globals that are static or in an anonymous namespace must be prefixed with s (e.g., sMutableStaticGlobal)
Locally visible static variables should be lowerCamelCase prefixed with s (e.g., static std::once_flag sFlag)
Member variables should be lowerCamelCase prefixed with m (e.g., mNbFooValues); public members may omit but prefix is encouraged for clarity
Constants (enums, globals, static constants, and function-scope magic numbers) should be UPPER_SNAKE_CASE with k prefix (e.g., kDIGIT_NUM)
Avoid Hungarian notation except limited 'apps Hungarian' like nb for counts; literal suffixes should be uppercase (e.g., 1234L)
Use spaces only; indent with 4 spaces (no tabs)
Format C++ code with clang-format (LLVM style) and limit lines to 120 characters; exceptions must be bracketed with // clang-format off/on
Disable code with #if/#endif (prefer mnemonic conditions) or macros that noop in release; do not comment out code; avoid dead code
Use the least forceful cast necessary; avoid removing const/volatile; avoid C-style and functional casts (except explicit constructors); cast void* to T* with static_cast; use reinterpret_cast only as last resort; avoid dynamic_cast
Switch on enum should cover all values and omit default when possible; switch statements must be well-structured with no fall-through except between adjacent empty cases; each case must end with break or throw; returns at end of case are not allowed; if ...

Files:

cpp/tensorrt_llm/kernels/quantization.cu

**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

cpp/tensorrt_llm/kernels/quantization.cu

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (2)

cpp/tensorrt_llm/kernels/quantization.cu (2)

406-408: Half specialization instantiation looks good

Explicit instantiation for half ensures the new path is linked even when only used from other TUs. No issues.

417-419: BF16 specialization instantiation is correctly gated

Instantiation under ENABLE_BF16 is consistent with the rest of the file. Once the VecTypeImpl guard is added (above), this remains coherent.

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

yilin-void · 2025-08-21T10:17:57Z

/bot run

tensorrt-cicd · 2025-08-21T10:23:16Z

PR_Github #16046 [ run ] triggered by Bot

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (9)

cpp/tensorrt_llm/kernels/quantization.h (2)

91-94: Add API contract (shape, dtype, semantics) to the new declaration.

Please document constraints clearly (rank 2/3, hidden_size multiple of 16, supported input dtypes, tokensPerBatch semantics and dtype). This avoids mismatches between callers and the .cu implementation.

Proposed header doc right above the declaration:
+//! Compute per-token NVFP4 global scales.
+//! Input:
+//!   - b: batch size (1 for [token_num, hidden_size], or batch for [batch, token_num, hidden_size]).
+//!   - m: token_num per batch (uniform upper bound when tokensPerBatch is provided).
+//!   - n: hidden_size, must be divisible by 16.
+//!   - input: device pointer to {half|__nv_bfloat16}, contiguous.
+//!   - tokensPerBatch: optional device pointer to int32 of length b; tokensPerBatch[i] in [0, m].
+//! Output:
+//!   - globalScale: device pointer to float32 with shape [b, m, 1] (or [m, 1] when b == 1).
+//! Stream:
+//!   - stream on which the kernel is launched.
 template <typename T>
 void computePerTokenGlobalScaleForFP4Quantization(int b, int m, int n, T const* input, int const* tokensPerBatch,
     float* globalScale, int multiProcessorCount, cudaStream_t stream = 0);
91-94: Consider int32_t for dimension parameters for cross-platform consistency.

Most kernels take int32-sized dims; using int32_t for b/m/n may reduce implicit casts at call sites and inside kernels.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1)

692-694: Great swap-in of the CUDA op; add a lightweight guard for dtype and operator availability.

The CUDA op supports fp16/bf16 only. If final_hidden_states ever becomes fp32 here, the call will throw. Also, when the extension isn’t loaded (e.g., custom builds), we can fall back to the reference formula.

Patch:

-                    global_scales = torch.ops.trtllm.calculate_nvfp4_global_scale(
-                        final_hidden_states, recv_expert_count)
+                    # Ensure supported dtype (operator expects fp16/bf16)
+                    assert final_hidden_states.dtype in (torch.float16, torch.bfloat16), \
+                        f"calculate_nvfp4_global_scale expects fp16/bf16, got {final_hidden_states.dtype}"
+                    try:
+                        global_scales = torch.ops.trtllm.calculate_nvfp4_global_scale(
+                            final_hidden_states, recv_expert_count)
+                    except (RuntimeError, AttributeError) as e:
+                        # Fallback to Python reference if the op is unavailable
+                        max_abs = final_hidden_states.abs().amax(dim=-1, keepdim=True).to(torch.float32)
+                        global_scales = (448 * 6) / max_abs

cpp/tensorrt_llm/thop/fp4Quantize.h (1)

30-31: Document the new public API signature and expectations.

Please add a short Doxygen-style comment to clarify accepted ranks, dtype, shape of the returned tensor, and tokensPerBatch dtype/shape. This improves discoverability and prevents misuse.

Suggested header comment:
- at::Tensor calculate_nvfp4_global_scale(at::Tensor const& input, std::optional<at::Tensor> const& tokensPerBatch);
+//! Calculate NVFP4 per-token global scales.
+//! input: [token_num, hidden] or [batch, token_num, hidden], dtype: fp16/bf16, contiguous, hidden % 16 == 0
+//! tokensPerBatch (optional): 1D int32 tensor of length batch; tokensPerBatch[i] in [0, token_num]
+//! Returns: same leading dims as input with last dim = 1, dtype float32
+at::Tensor calculate_nvfp4_global_scale(at::Tensor const& input, std::optional<at::Tensor> const& tokensPerBatch);

tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py (3)

25-29: Match reference formula with in-code constant; consider naming a constant.

The repeated literal (448 * 6) would be clearer as a named constant to convey meaning (e.g., kNVFP4_SCALE_NUMERATOR).

-def reference_calculate_global_scale(input_tensor):
+NVFP4_SCALE_NUMERATOR = 448 * 6
+
+def reference_calculate_global_scale(input_tensor):
     max_abs_values = input_tensor.abs().max(dim=-1, keepdim=True).values.to(
         torch.float32)
-    global_scales = (448 * 6) / max_abs_values
+    global_scales = NVFP4_SCALE_NUMERATOR / max_abs_values
     return global_scales

93-100: Fix long line flagged by Ruff and avoid dumping massive tensors on failure.

The assert message exceeds 120 chars and printing full tensors harms readability/perf.

-        torch.testing.assert_close(
-            custom_result,
-            reference_result,
-            atol=1e-3,
-            rtol=1e-3,
-            msg=
-            f"Shape: {input_shape}, dtype: {dtype}, custom_result: {custom_result}, reference_result: {reference_result}"
-        )
+        torch.testing.assert_close(
+            custom_result,
+            reference_result,
+            atol=1e-3,
+            rtol=1e-3,
+            msg=f"Mismatch for shape={input_shape}, dtype={dtype}"
+        )

171-219: Add negative tests for tokensPerBatch dtype, device mismatch, and out-of-range values.

The CUDA entry point currently doesn’t validate tokensPerBatch dtype/device/range; add tests to pin the expected behavior.

Proposed additions inside test_calculate_nvfp4_global_scale_invalid_inputs:

@@
     def test_calculate_nvfp4_global_scale_invalid_inputs(self):
@@
         with self.assertRaises(Exception):
             torch.ops.trtllm.calculate_nvfp4_global_scale(
                 input_tensor, tokens_per_batch)
 
+        # Wrong dtype for tokensPerBatch (int64 instead of int32)
+        input_tensor = torch.randn((4, 32, 4096), dtype=torch.float16, device='cuda')
+        tokens_per_batch = torch.randint(1, 33, (4,), device='cuda', dtype=torch.int64)
+        with self.assertRaises(Exception):
+            torch.ops.trtllm.calculate_nvfp4_global_scale(input_tensor, tokens_per_batch)
+
+        # tokensPerBatch on a different device (if multiple GPUs are available)
+        if torch.cuda.device_count() >= 2:
+            input_tensor = input_tensor.to('cuda:0')
+            tokens_per_batch = torch.randint(1, 33, (4,), device='cuda:1', dtype=torch.int32)
+            with self.assertRaises(Exception):
+                torch.ops.trtllm.calculate_nvfp4_global_scale(input_tensor, tokens_per_batch)
+
+        # tokensPerBatch values out of range
+        input_tensor = torch.randn((4, 32, 4096), dtype=torch.float16, device='cuda')
+        tokens_per_batch = torch.tensor([0, 32, 65, -1], device='cuda', dtype=torch.int32)  # includes > max and negative
+        with self.assertRaises(Exception):
+            torch.ops.trtllm.calculate_nvfp4_global_scale(input_tensor, tokens_per_batch)

I can follow up with matching validation in the CUDA binding.

cpp/tensorrt_llm/thop/fp4Quantize.cpp (1)

238-251: Bindings look correct. Consider registering a CPU stub that throws a clearer error.

When called on CPU tensors, the operator currently fails the CUDA checks. A CPU backend that throws a NotImplementedError with a short message can improve UX.

Example:
 TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
 {
     m.impl("fp4_quantize", TORCH_FN(torch_ext::fp4_quantize));
     m.impl("calculate_nvfp4_global_scale", TORCH_FN(torch_ext::calculate_nvfp4_global_scale));
 }
+
+TORCH_LIBRARY_IMPL(trtllm, CPU, m)
+{
+    m.impl("calculate_nvfp4_global_scale", [](at::Tensor const&, c10::optional<at::Tensor> const&) -> at::Tensor {
+        C10_THROW_ERROR(NotImplementedError, "calculate_nvfp4_global_scale is CUDA-only");
+    });
+}

cpp/tensorrt_llm/kernels/quantization.cu (1)

2-2: Update copyright year to 2025

The copyright header should reflect the current year (2025) for new code additions.
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2019-2025, NVIDIA CORPORATION.  All rights reserved.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 90bfc8c and 0f0f0e5.

📒 Files selected for processing (6)

cpp/tensorrt_llm/kernels/quantization.cu (3 hunks)
cpp/tensorrt_llm/kernels/quantization.h (1 hunks)
cpp/tensorrt_llm/thop/fp4Quantize.cpp (2 hunks)
cpp/tensorrt_llm/thop/fp4Quantize.h (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py (1 hunks)
tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (6)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

cpp/tensorrt_llm/thop/fp4Quantize.h
cpp/tensorrt_llm/kernels/quantization.h
cpp/tensorrt_llm/thop/fp4Quantize.cpp
cpp/tensorrt_llm/kernels/quantization.cu

**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

cpp/tensorrt_llm/thop/fp4Quantize.h
cpp/tensorrt_llm/kernels/quantization.h
cpp/tensorrt_llm/thop/fp4Quantize.cpp
cpp/tensorrt_llm/kernels/quantization.cu

**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

cpp/tensorrt_llm/thop/fp4Quantize.h
cpp/tensorrt_llm/kernels/quantization.h

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

cpp/tensorrt_llm/thop/fp4Quantize.h
cpp/tensorrt_llm/kernels/quantization.h
tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py
cpp/tensorrt_llm/thop/fp4Quantize.cpp
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
cpp/tensorrt_llm/kernels/quantization.cu

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

cpp/tensorrt_llm/thop/fp4Quantize.cpp
cpp/tensorrt_llm/kernels/quantization.cu

🧠 Learnings (1)

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py

🪛 Ruff (0.12.2)

tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py

99-99: Line too long (121 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (5)

cpp/tensorrt_llm/kernels/quantization.h (1)

91-94: computePerTokenGlobalScaleForFP4Quantization parameter names are consistent
The declaration in cpp/tensorrt_llm/kernels/quantization.h (line 92) and the definition in cpp/tensorrt_llm/kernels/quantization.cu (line 374) use the same parameter names and order (b, m, n, input, tokensPerBatch, globalScale, multiProcessorCount, stream). No changes needed.
cpp/tensorrt_llm/kernels/quantization.cu (4)
305-324: VecType implementation looks correct

The template specializations for half and __nv_bfloat16 properly map to their vectorized types (half2 and __nv_bfloat162), which is appropriate for vectorized max-abs reduction.

406-407: Consolidate duplicate template instantiations

The template instantiations for computePerTokenGlobalScaleForFP4Quantization appear to be duplicated - once for half (lines 406-407) and once for __nv_bfloat16 (lines 417-419). This is correct and matches the pattern for other functions in the file.

Also applies to: 417-419

343-344: Synchronization in blockReduceMaxV2 is correct

The blockReduceMaxV2 implementation internally handles thread coordination:

After each warp computes its partial maximum via warpReduceMaxV2, threads with lane == 0 write their results into the shared array.

A full-block barrier (__syncthreads()) follows, ensuring all writes complete before any thread reads from shared memory.

Finally, each thread loads either its warp’s shared result or a sentinel value and calls warpReduceMaxV2 again to compute the block-wide maximum.

No additional barriers are required, as warp‐level shuffles used by warpReduceMaxV2 are implicitly synchronous within a warp. You can safely call blockReduceMaxV2 without adding extra synchronization.

379-379: Quantization kernel alignment requirement needs clarification

The assertion
TLLM_CHECK(n % (ElemsPerVec * 32) == 0 and b > 0);
ensures that the number of elements per row (n) is a multiple of ElemsPerVec * 32 (i.e. vector-width × warp size), not just ElemsPerVec. As a result, for FP16 (ElemsPerVec = 8) you require n % 256 == 0, whereas the bare vectorized load only needs n % 8 == 0.

• If the extra factor of 32 is intentional (to guarantee full-warp coalesced loads for performance), please document this rationale in a comment above the check.
• Otherwise, consider relaxing the check to
TLLM_CHECK(n % ElemsPerVec == 0 and b > 0);
so that smaller, non-warp-aligned row sizes are supported.

cpp/tensorrt_llm/kernels/quantization.cu

cpp/tensorrt_llm/thop/fp4Quantize.cpp

tensorrt-cicd · 2025-08-21T15:14:43Z

PR_Github #16046 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12067 completed with status: 'FAILURE'

yilin-void · 2025-08-22T01:43:07Z

/bot run

tensorrt-cicd · 2025-08-22T01:48:24Z

PR_Github #16095 [ run ] triggered by Bot

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
1-1: Add NVIDIA copyright header to cpp_custom_ops.py

This source file is missing the required NVIDIA copyright header mandated by CODING_GUIDELINES.md for all non-test Python files. Please prepend the following at the very top of tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 from typing import List, Optional
• File to update:
– tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py

• After inserting, verify the exact punctuation, capitalization, and any surrounding blank lines match the project’s canonical header style (you can mirror an existing header from another source file, e.g., a .cpp in the repo).

🧹 Nitpick comments (1)

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
182-187: Add an optional default for tokens_per_batch in the meta stub and early‐error checks

You can safely make the second argument optional in the fake (meta) kernel without impacting the real CUDA dispatch or existing Python‐op signature. The C++ registration already declares
m.def("calculate_nvfp4_global_scale(Tensor input, Tensor? tokensPerBatch) -> Tensor");
so downstream calls (including all existing call sites and tests on CUDA) will continue to require you to pass a second argument when executing on CUDA. The Python‐only default (=None) and shape assertions live entirely in the fake kernel used during tracing/compile.

– Modify the fake registration in tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py:
 @torch.library.register_fake("trtllm::calculate_nvfp4_global_scale")
-def _(input: torch.Tensor, tokens_per_batch: Optional[torch.Tensor]):
+def _(input: torch.Tensor, tokens_per_batch: Optional[torch.Tensor] = None):
     shape = list(input.shape)
     shape[-1] = 1
+    # Meta‐time sanity checks
+    if tokens_per_batch is not None:
+        assert tokens_per_batch.ndim == 1, "tokens_per_batch must be 1D [batch]"
+        assert tokens_per_batch.shape[0] == input.shape[0], \
+            "tokens_per_batch length must match input.batch size"
     return input.new_empty(shape, dtype=torch.float32)
– No changes are needed to any C++ files—the schema in
cpp/tensorrt_llm/thop/fp4Quantize.cpp already matches (Tensor, Tensor?) -> Tensor.
– All existing Python call sites (e.g. in fused_moe_wide_ep.py) still pass two arguments.
– The real‐device tests in
tests/unittest/_torch/thop/test_fp4_calculate_global_scale.py
will continue to run unchanged, since they invoke the CUDA kernel, not the fake.

This adds lightweight meta‐time validation and improves trace‐time ergonomics without affecting production behavior.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0f0f0e5 and 8197039.

📒 Files selected for processing (1)

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

hyukn

Some nits. LGTM.

cpp/tensorrt_llm/kernels/quantization.cu

tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py

cpp/tensorrt_llm/thop/fp4Quantize.cpp

tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py

tensorrt-cicd · 2025-08-22T11:59:28Z

PR_Github #16095 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12104 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Signed-off-by: Yilin Zhang <[email protected]>

yilin-void · 2025-08-26T02:55:19Z

/bot run

tensorrt-cicd · 2025-08-26T03:03:50Z

PR_Github #16509 [ run ] triggered by Bot

tensorrt-cicd · 2025-08-26T16:05:39Z

PR_Github #16509 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12400 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

yilin-void requested a review from a team as a code owner August 21, 2025 10:14

yilin-void requested a review from yuxianq August 21, 2025 10:14

yilin-void force-pushed the dev/fp4_global_scale branch from 7f80217 to 0f0f0e5 Compare August 21, 2025 10:17

yilin-void requested a review from hyukn August 21, 2025 10:19

coderabbitai bot reviewed Aug 21, 2025

View reviewed changes

cpp/tensorrt_llm/kernels/quantization.cu Show resolved Hide resolved

cpp/tensorrt_llm/kernels/quantization.cu Show resolved Hide resolved

cpp/tensorrt_llm/kernels/quantization.cu Show resolved Hide resolved

cpp/tensorrt_llm/thop/fp4Quantize.cpp Show resolved Hide resolved

yilin-void requested a review from a team as a code owner August 22, 2025 01:43

coderabbitai bot reviewed Aug 22, 2025

View reviewed changes

hyukn approved these changes Aug 22, 2025

View reviewed changes

custom kernel for calculate nvfp4 global scale

aecf242

Signed-off-by: Yilin Zhang <[email protected]>

yilin-void force-pushed the dev/fp4_global_scale branch from 8197039 to aecf242 Compare August 26, 2025 02:54

yuxianq approved these changes Aug 26, 2025

View reviewed changes

yilin-void merged commit 040f4c7 into NVIDIA:main Aug 26, 2025
5 checks passed

coderabbitai bot mentioned this pull request Aug 27, 2025

[TRTLLM-6876][feat] Add low precision all2all for mnnvl #7155

Merged

yilin-void deleted the dev/fp4_global_scale branch September 28, 2025 03:27

[None][perf] Accelerate global scale calculations for deepEP fp4 combine #7126

[None][perf] Accelerate global scale calculations for deepEP fp4 combine #7126

Conversation

yilin-void commented Aug 21, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

GitHub Bot Help

kill

skip

reuse-pipeline

Uh oh!

coderabbitai bot commented Aug 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

yilin-void commented Aug 21, 2025

Uh oh!

tensorrt-cicd commented Aug 21, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Aug 21, 2025

Uh oh!

yilin-void commented Aug 22, 2025

Uh oh!

tensorrt-cicd commented Aug 22, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

hyukn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Aug 22, 2025

Uh oh!

yilin-void commented Aug 26, 2025

Uh oh!

tensorrt-cicd commented Aug 26, 2025

Uh oh!

tensorrt-cicd commented Aug 26, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yilin-void commented Aug 21, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 21, 2025 •

edited

Loading