Skip to content

Conversation

sunnyqgg
Copy link
Collaborator

@sunnyqgg sunnyqgg commented Aug 18, 2025

Summary by CodeRabbit

  • New Features
    • MTP Eagle decoding now supports both one-model and two-model modes.
    • Configurable Eagle capture layers and relaxed acceptance controls (top-k, delta).
    • MTP draft decoding enabled with automatic routing; draft prompts drop first token in Eagle/MTP Eagle modes.
  • Refactor
    • Weight-loading modularized for more reliable loading, dequantization, and tensor-parallel support.
    • Dynamic capture-layer handling to reduce memory and scale with configuration.
  • Chores
    • Updated decoding-mode and config wiring to enable new variants.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Copy link
Contributor

coderabbitai bot commented Aug 18, 2025

📝 Walkthrough

Walkthrough

Adds MTP Eagle variants and one-model vs two-model speculative decoding, introduces MTP draft models and a dedicated DeepseekV3 weight loader, makes Eagle3 capture layers dynamic, extends decoding configs and speculative-mode predicates, and updates auto-model routing and executor draft overrides.

Changes

Cohort / File(s) Summary
Example setup (Quickstart)
examples/llm-api/quickstart_advanced.py
Adjusts MTP/Eagle branching: configures EagleDecodingConfig for two-model flow, augments MTPDecodingConfig fields for one-model flow, and updates printed messaging.
Decoding config / CLI API
tensorrt_llm/llmapi/llm_args.py
Adds eagle3_layers_to_capture, is_mtp_eagle, num_capture_layers to EagleDecodingConfig; adds mtp_eagle_one_model to MTPDecodingConfig; updates from_dict and spec_dec_mode logic.
Speculative interface & mode predicates
tensorrt_llm/_torch/speculative/interface.py
Adds MTP_EAGLE_ONE_MODEL; replaces is_mtp with is_mtp_one_model; adds is_mtp_eagle_one_model and SpecMetadata.is_layer_capture; updates helper predicates.
Speculative utilities & resource wiring
tensorrt_llm/_torch/speculative/utils.py
Refactors mode checks for one-model vs two-model MTP/Eagle, propagates layers_to_capture/is_mtp_eagle, expands Eagle3ResourceManager signature, and updates resource/worker selection and extra-KV logic.
Eagle3 metadata & resource manager
tensorrt_llm/_torch/speculative/eagle3.py
Makes capture-layer count dynamic (lazy layers_to_capture), removes fixed num_capture_layers, adds is_mtp_eagle, and sizes hidden-state buffers based on capture count.
MTP metadata & flow
tensorrt_llm/_torch/speculative/mtp.py
Switches checks to one-model predicates, adjusts token-count adjustments and cross-rank token handling for MTP Eagle one-model vs two-model.
Speculative drafter
tensorrt_llm/_torch/speculative/model_drafter.py
Drops first token in draft prompt for Eagle3 and MTP Eagle modes.
New draft models & speculative modeling
tensorrt_llm/_torch/models/modeling_speculative.py
Adds MTPDraftModel and MTPDraftModelForCausalLM; updates MTP/Eagle layer-count checks; conditions Eagle3DraftModel FC init; wires draft-model streams and load paths.
Auto-model routing
tensorrt_llm/_torch/models/modeling_auto.py
Remaps arch to MTPDraftModelForCausalLM when DeepseekV3ForCausalLM and max_draft_len == 0 before class lookup.
DeepseekV3 weight loading
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Replaces DeepseekV3MTPHead with DeepseekV3WeightLoader; moves weight-loading, TP-splitting, KV projection reconstruction and dequant into load_weights invoked from model load path.
Executor draft override
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
When draft model arch is DeepseekV3ForCausalLM, sets num_hidden_layers=1 on draft model pretrained_config before KV cache/weight wiring.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Quickstart
  participant Config
  participant AutoModel
  participant DraftModel
  participant TargetModel
  Note over Quickstart,Config: build decoding config (Eagle/MTP + one-model flags)

  User->>Quickstart: setup_llm(args)
  Quickstart->>Config: construct Eagle/MTP config (is_mtp_eagle, mtp_eagle_one_model, layers_to_capture)
  Quickstart->>AutoModel: create draft model (spec mode)
  AutoModel->>AutoModel: remap arch (DeepseekV3 & max_draft_len==0 -> MTPDraftModelForCausalLM)
  AutoModel-->>DraftModel: instantiate draft model
  Quickstart->>TargetModel: instantiate target model
  DraftModel->>DraftModel: load_weights (DeepseekV3WeightLoader.load_weights)
  DraftModel->>TargetModel: optionally wire shared embeddings / lm_head
  User->>TargetModel: generate(request with spec_metadata)
  TargetModel->>DraftModel: request draft logits/hidden states
  DraftModel-->>TargetModel: draft outputs
  TargetModel-->>User: final tokens
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • syuoni
  • nv-guomingz
  • hlu1
  • 2ez4bz

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@sunnyqgg sunnyqgg changed the title TRTLLM-6746]Enable two-model spec dec for MTP Eagle [TRTLLM-6746]Enable two-model spec dec for MTP Eagle Aug 18, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🔭 Outside diff range comments (8)
tensorrt_llm/_torch/speculative/mtp.py (3)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

Apply at top of file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

70-71: Scalar copy_ misuse — use zero_/fill_ instead.

copy_ expects a Tensor, not a Python scalar. This will error at runtime on some PyTorch versions.

Use zero_ or fill_ for in-place scalar writes:

-                    self.mtp_relaxed_delta_pool[slot_id].copy_(
-                        0, non_blocking=True)
+                    self.mtp_relaxed_delta_pool[slot_id].zero_()
-            self.mtp_relaxed_delta_pool[free_slot_id].copy_(0,
-                                                            non_blocking=True)
+            self.mtp_relaxed_delta_pool[free_slot_id].zero_()

Also applies to: 79-81


1168-1196: Draft loop calls first MTP layer repeatedly (likely bug).

Inside the loop over self.mtp_num_modules, you always invoke draft_model.mtp_layers[0] rather than indexing by i. This will ignore all but the first layer and produce incorrect tokens/hidden states when num_nextn_predict_layers > 1.

Use the ith layer:

-            if i == 0:
-                hidden_states = draft_model.mtp_layers[0](
+            if i == 0:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
                     all_rank_max_num_tokens=spec_metadata.
                     all_rank_max_num_tokens,
                     **inputs)
...
-            else:
-                hidden_states = draft_model.mtp_layers[0](
+            else:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.
                     subseq_all_rank_num_tokens,
                     all_rank_max_num_tokens=max(
                         spec_metadata.subseq_all_rank_num_tokens)
                     if spec_metadata.subseq_all_rank_num_tokens is not None else
                     None,
                     **inputs)
examples/llm-api/quickstart_advanced.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/speculative/eagle3.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/speculative/utils.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/llmapi/llm_args.py (2)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

1676-1691: EagleDecodingConfig validate: reinforces need for draft_model_dir in quickstart.

Given speculative_model_dir is required for Eagle-based paths, the quickstart must pass --draft_model_dir in two-model MTP Eagle.

See related comment in quickstart_advanced.py with a concrete diff to fix this wiring mistake.

🧹 Nitpick comments (15)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

264-266: Avoid post-construction mutation of num_hidden_layers; guard against MTPDraftModel remap

Mutating pretrained_config.num_hidden_layers after the draft model is already instantiated risks mismatched assumptions (e.g., KV cache sizing, layer-indexed logic) and will also trigger even when AutoModel remapped to MTPDraftModelForCausalLM (architectures remains "DeepseekV3ForCausalLM"). Add guards so this only applies to an actual DeepseekV3ForCausalLM draft engine. Also handle missing/empty architectures gracefully.

Apply this minimal guard:

-            if draft_model_engine.model.model_config.pretrained_config.architectures[
-                    0] == "DeepseekV3ForCausalLM":
-                draft_model_engine.model.model_config.pretrained_config.num_hidden_layers = 1
+            archs = getattr(
+                draft_model_engine.model.model_config.pretrained_config,
+                "architectures", None)
+            model_arch = archs[0] if archs and len(archs) > 0 else None
+            # Only force to 1 layer when the instantiated class is DeepseekV3ForCausalLM
+            if (model_arch == "DeepseekV3ForCausalLM"
+                    and draft_model_engine.model.__class__.__name__
+                    == "DeepseekV3ForCausalLM"):
+                draft_model_engine.model.model_config.pretrained_config.num_hidden_layers = 1

If the intent was to influence KV cache allocation only, consider moving this into a draft-engine construction-time knob (or pass through PyTorchModelEngine is_draft_model logic) rather than mutating config after instantiation.

Also, per repository guidelines, prepend the NVIDIA copyright header to this file.

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/speculative/interface.py (1)

99-109: Verify SM gating in extend_ctx; consider >=100 instead of ==100

extend_ctx disables capture for TrtllmAttention only when get_sm_version() == 100. If future Blackwell revisions report >100, this may degrade unexpectedly. Consider using >= 100 unless there’s a known incompatibility.

Would you like me to adjust the predicate to >= 100?

tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)

132-141: New DeepseekV3WeightLoader improves modularity; add a brief docstring

The separation of weight-loading concerns into a dedicated class is a solid move. Add a short docstring to clarify scope and expected inputs per guidelines.

Example:

class DeepseekV3WeightLoader:
    """TP-aware weight loader for DeepseekV3 variants.

    Args:
        model: The target model instance (e.g., DeepseekV3ForCausalLM or draft variant).
        is_draft_model: Whether weights are being loaded for the draft path.
    """

Also, add the NVIDIA header to this file in addition to preserving upstream license:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

514-535: Wrap long line; comply with Ruff E501 (<=120 chars)

predicted_tokens_per_seq assignment exceeds 120 chars.

-        predicted_tokens_per_seq = model_config.spec_config.max_draft_len + 1 if model_config.spec_config is not None else 1
+        spec_cfg = model_config.spec_config
+        predicted_tokens_per_seq = (spec_cfg.max_draft_len + 1) if spec_cfg is not None else 1

566-574: Prefer torch.sigmoid over torch.nn.functional.sigmoid (deprecation)

torch.nn.functional.sigmoid is deprecated; use torch.sigmoid for clarity and future-proofing.

-        scores = F.sigmoid(logits)
+        scores = torch.sigmoid(logits)

659-667: Ruff E501: wrap long call to DeepseekV3Gate constructor

A few of the argument lines exceed 120 chars. Wrap to satisfy E501.

For example, break routed_scaling_factor line:

-            routed_scaling_factor=config.routed_scaling_factor,
+            routed_scaling_factor=config.routed_scaling_factor,

and ensure each argument is on its own line with hanging indentation (PEP8).

If you want, I can run through the file and submit a style-only patch for the flagged lines.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

32-35: Add NVIDIA copyright header to file

Per repo guidelines, prepend the header.

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/speculative/mtp.py (2)

495-497: None check order in boolean guard.

Accessing attn_metadata.is_cuda_graph before verifying attn_metadata is not None risks AttributeError. Reverse the condition order.

-        if attn_metadata.is_cuda_graph and attn_metadata is not None:
+        if attn_metadata is not None and attn_metadata.is_cuda_graph:

292-296: Minor performance nit: avoid clone() when not needed.

Shifting input buffers uses clone() unnecessarily. copy_ achieves the same semantics with less overhead.

-            input_ids[:-1] = input_ids[1:].clone()
+            input_ids[:-1].copy_(input_ids[1:])
...
-            draft_hidden_states[:-1] = draft_hidden_states[1:].clone()
+            draft_hidden_states[:-1].copy_(draft_hidden_states[1:])
tensorrt_llm/_torch/speculative/eagle3.py (1)

95-105: Default layer-capture pattern: “last-4” could be fragile.

Using (1, mid-1, last-4) is a notable change. For shallow models (<=5 layers) you error out; for 6–7 layers, mid-1 and last-4 may collide or yield very early layers. Please confirm this is intentional and measured. Consider clamping and deduping, or fall back to capturing the last layer when spacing becomes too tight.

If intended, add a short inline comment explaining the rationale for last-4 vs last-1.

tensorrt_llm/_torch/models/modeling_speculative.py (5)

382-385: Use LongTensor for token indices in type hints.

Token IDs and position IDs are conventionally torch.LongTensor (int64). The current hints use torch.IntTensor in some places.

Apply this diff:

-        input_ids: torch.IntTensor,
-        position_ids: torch.IntTensor,
+        input_ids: torch.LongTensor,
+        position_ids: torch.LongTensor,

And:

-                input_ids: torch.IntTensor = None,
-                position_ids: torch.IntTensor = None,
+                input_ids: torch.LongTensor = None,
+                position_ids: torch.LongTensor = None,

Also applies to: 437-439


370-375: Align Embedding instantiation with TP mapping (optional).

Elsewhere embeddings are usually constructed with mapping/tensor-parallel settings and gathered outputs. Consider matching that here to avoid surprises in TP runs.

Apply this diff:

-        self.embed_tokens = Embedding(
-            self.config.vocab_size,
-            self.config.hidden_size,
-            dtype=self.config.torch_dtype,
-        )
+        self.embed_tokens = Embedding(
+            self.config.vocab_size,
+            self.config.hidden_size,
+            dtype=self.config.torch_dtype,
+            mapping=model_config.mapping,
+            tensor_parallel_mode=TensorParallelMode.COLUMN,
+            gather_output=True,
+        )

431-433: Embedding sharing likely never happens; decide whether to always share or always own.

self.model.embed_tokens is created in MTPDraftModel.__init__, so this if ... is None branch will never run. If the intent is to share embeddings with the target model to save memory and keep weights in sync, set it unconditionally; otherwise, remove the dead branch.

Proposed change to share:

-        if self.model.embed_tokens is None:
-            self.model.embed_tokens = target_model.model.embed_tokens
+        # Share embeddings with target model to save memory and keep weights in sync.
+        self.model.embed_tokens = target_model.model.embed_tokens

Please confirm whether the draft model should own separate embeddings (loaded via weight loader) or share with the target model.


353-359: Optional: Add concise docstring to new public class.

Add a short, Google-style docstring to clarify the per-layer draft behavior and expected inputs.

Example:

 class MTPDraftModel(nn.Module):
 
     def __init__(self, model_config: ModelConfig[PretrainedConfig],
                  layer_idx: int, aux_stream_dict: Dict[AuxStreamType,
                                                        torch.cuda.Stream]):
         super().__init__()
+        """
+        Per-layer MTP draft block used in the two-model MTP-Eagle flow.
+        
+        Args:
+            model_config: Global model configuration.
+            layer_idx: Index of the draft layer relative to the target stack.
+            aux_stream_dict: CUDA aux streams for attention/MoE overlap.
+        """

403-411: Optional: Add docstring to the public MTPDraftModelForCausalLM wrapper.

Clarify the two-stream setup and how weights are loaded.

Example:

 @register_auto_model("MTPDraftModelForCausalLM")
 class MTPDraftModelForCausalLM(DecoderModelForCausalLM[MTPDraftModel,
                                                        PretrainedConfig]):
 
     def __init__(self, model_config: ModelConfig[PretrainedConfig]):
-        self.model_config = model_config
+        """
+        Causal LM wrapper around a per-layer MTP draft model for two-model
+        MTP-Eagle speculative decoding.
+
+        Sets up CUDA aux streams and delegates weight loading to the DeepseekV3 loader.
+        """
+        self.model_config = model_config
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1ce2354 and 9ae0936.

📒 Files selected for processing (11)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (9 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (5 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (6 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (1 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (3 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py

514-514: Line too long (124 > 120)

(E501)


659-659: Line too long (122 > 120)

(E501)


797-797: Line too long (125 > 120)

(E501)


803-803: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_speculative.py

463-463: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tensorrt_llm/_torch/speculative/interface.py (3)

24-29: Helper predicates for one-model MTP look good

is_mtp_one_model and is_mtp_eagle_one_model are clear and simplify downstream checks. No issues.


39-41: use_one_engine correctly covers MTP one-model

The updated predicate now includes MTP one-model alongside EAGLE3 one-model, which is consistent with the intended architecture split.


196-202: Default SpecMetadata.is_layer_capture stub is fine

Providing an overridable hook with a safe default is appropriate.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1496-1499: Weight loader integration looks correct

Switching to DeepseekV3WeightLoader.load_weights simplifies the class and centralizes logic. LGTM.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

32-35: Drop-first-token logic extended to MTP_EAGLE (two-model) — LGTM

Aligns with EAGLE3 behavior and the broader MTP Eagle flow. Safe for short prompts (returns empty tensor when len==1).

tensorrt_llm/_torch/speculative/mtp.py (3)

154-156: One-model Eagle gating: confirm behavior.

Setting subseq_all_rank_num_tokens only for is_mtp_eagle_one_model looks intentional for the MTPEagle one-model path. Please confirm that two-model Eagle does not rely on subseq_all_rank_num_tokens here (it should be using Eagle3 metadata instead). If not, add the appropriate branch.


171-173: Double-check num_tokens adjustment gating.

MTP vanilla uses total max_draft_len tokens, while MTP Eagle’s first step uses max_draft_len+1 and subsequent steps 1. The change to gate on not is_mtp_eagle_one_model() seems correct for the one-model Eagle path. Please verify that the two-model Eagle path won’t traverse this code (i.e., uses Eagle3 metadata exclusively).


181-202: Vanilla-only hidden-state updates gating looks correct.

Updating hidden-state pointers only under is_mtp_one_model() aligns with excluding the MTP Eagle one-model path, which has its own flow.

examples/llm-api/quickstart_advanced.py (1)

186-193: One-model MTP path wiring looks consistent.

Using MTPDecodingConfig with num_nextn_predict_layers and mtp_eagle_one_model mirrors the new spec_dec_mode predicates.

tensorrt_llm/_torch/speculative/eagle3.py (2)

39-41: Capacity of hidden_states now depends on config.num_capture_layers — verify alignment.

Resource allocation uses EagleDecodingConfig.num_capture_layers, while metadata can derive layers_to_capture dynamically. Ensure config.num_capture_layers and metadata.layers_to_capture length always match, or document the assumption. Mismatch can cause OOB writes during capture.


187-219: Eagle3OneModelSpecMetadata mirrors capture logic — good parity.

The mirrored lazy init and allocation for one-model Eagle3 aligns with the two-model path.

tensorrt_llm/_torch/speculative/utils.py (6)

22-30: One-model MTP metadata unification looks correct.

Routing both vanilla MTP and MTP Eagle one-model through MTPSpecMetadata via is_mtp_one_model() simplifies the stack.


75-85: Resource manager for MTP Eagle one-model only when relaxed acceptance is enabled.

This conditional avoids unnecessary buffers. Please confirm MTPEagleWorker never expects mtp_* pools when use_relaxed_acceptance_for_thinking=False. Otherwise, allocate conditionally per usage site.


92-101: Two-model Eagle and MTP Eagle share Eagle3ResourceManager — assert message updated.

The assertion covers both modes. Good consistency.


111-118: Decoder selection via is_mtp_one_model() is fine if it includes Eagle one-model.

Ensure SpeculativeDecodingMode.is_mtp_one_model() returns True for both vanilla MTP and MTP Eagle one-model. If not, add an explicit is_mtp_eagle_one_model() branch to return MTPSampler.


162-170: Worker selection split is clear and correct.

MTP Vanilla → MTPWorker; MTP Eagle one-model → MTPEagleWorker; Eagle3 one-model → Eagle3OneModelWorker.


186-191: MTP one-model max_draft_len reconciliation is necessary — good.

Ensures low-level APIs expecting max_draft_len receive a consistent value derived from num_nextn_predict_layers.

tensorrt_llm/llmapi/llm_args.py (4)

445-452: num_capture_layers logic: confirm alignment with eagle3.py defaults.

If eagle3_layers_to_capture is unset, you return 1 for MTP Eagle and 3 otherwise. This should match eagle3.py’s computed layers_to_capture length. Please verify consistency to avoid allocation/usage mismatches.


457-461: Spec mode mapping for MTP Eagle is correct.

EAGLE3_ONE_MODEL takes priority; otherwise MTP_EAGLE if flagged.


542-547: from_dict sets max_draft_len — good for downstream APIs.

Keeps legacy code that reads max_draft_len working.


557-561: MTP_EAGLE vs MTP_EAGLE_ONE_MODEL selection looks right.

Using num_nextn_predict_layers_from_model_config and mtp_eagle_one_model disambiguates both modes cleanly.

tensorrt_llm/_torch/models/modeling_speculative.py (3)

22-22: LGTM: AuxStreamType import is appropriate.

Import looks correct and aligns with the new aux-stream wiring.


340-343: LGTM: MTP layer count computation is consistent with the new predicates.

Using is_mtp_one_model() and is_mtp_eagle_one_model() reads clearly, and the fallback to num_nextn_predict_layers makes sense.


468-474: LGTM: New MTP-Eagle branch in get_draft_model.

Adding the is_mtp_eagle() path to return MTPDraftModelForCausalLM is consistent with enabling the two-model spec-dec for MTP Eagle.

Comment on lines 173 to 184
if not args.use_one_model:
print(
"MTP only supports one model style spec decode; ignoring default use_one_model=False"
print("Running MTP eagle with two model style.")
spec_config = EagleDecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.model_dir,
eagle3_one_model=args.use_one_model,
is_mtp_eagle=True,
use_relaxed_acceptance_for_thinking=args.
use_relaxed_acceptance_for_thinking,
relaxed_topk=args.relaxed_topk,
relaxed_delta=args.relaxed_delta,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Two-model MTP Eagle should use draft_model_dir, not model_dir.

EagleDecodingConfig.validate requires a draft model path. Using args.model_dir here is likely incorrect and will fail if EAGLE weights aren’t colocated with the main model.

Enforce draft_model_dir and wire it:

-        if not args.use_one_model:
-            print("Running MTP eagle with two model style.")
-            spec_config = EagleDecodingConfig(
-                max_draft_len=args.spec_decode_max_draft_len,
-                speculative_model_dir=args.model_dir,
+        if not args.use_one_model:
+            print("Running MTP eagle with two model style.")
+            if args.draft_model_dir is None:
+                raise ValueError("For two-model MTP Eagle, --draft_model_dir must be provided.")
+            spec_config = EagleDecodingConfig(
+                max_draft_len=args.spec_decode_max_draft_len,
+                speculative_model_dir=args.draft_model_dir,
                 eagle3_one_model=args.use_one_model,
                 is_mtp_eagle=True,
                 use_relaxed_acceptance_for_thinking=args.
                 use_relaxed_acceptance_for_thinking,
                 relaxed_topk=args.relaxed_topk,
                 relaxed_delta=args.relaxed_delta,
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not args.use_one_model:
print(
"MTP only supports one model style spec decode; ignoring default use_one_model=False"
print("Running MTP eagle with two model style.")
spec_config = EagleDecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.model_dir,
eagle3_one_model=args.use_one_model,
is_mtp_eagle=True,
use_relaxed_acceptance_for_thinking=args.
use_relaxed_acceptance_for_thinking,
relaxed_topk=args.relaxed_topk,
relaxed_delta=args.relaxed_delta,
)
if not args.use_one_model:
print("Running MTP eagle with two model style.")
if args.draft_model_dir is None:
raise ValueError("For two-model MTP Eagle, --draft_model_dir must be provided.")
spec_config = EagleDecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.draft_model_dir,
eagle3_one_model=args.use_one_model,
is_mtp_eagle=True,
use_relaxed_acceptance_for_thinking=args.
use_relaxed_acceptance_for_thinking,
relaxed_topk=args.relaxed_topk,
relaxed_delta=args.relaxed_delta,
)
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_advanced.py around lines 173 to 184, the config
for two-model MTP Eagle incorrectly passes args.model_dir as the
draft/speculative model path; change it to use the draft model path
(args.draft_model_dir) and/or the proper EagleDecodingConfig parameter name
(draft_model_dir or speculative_model_dir as defined in the config API) so the
spec/draft weights are read from the draft model directory instead of the main
model_dir; update the argument passed to EagleDecodingConfig accordingly and
ensure the CLI/argparser exposes args.draft_model_dir.

Comment on lines 23 to 24
if model_arch == "DeepseekV3ForCausalLM" and config.spec_config.max_draft_len == 0:
model_arch = "MTPDraftModelForCausalLM"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix: guard config.spec_config None to avoid AttributeError

Accessing config.spec_config.max_draft_len without a None-check will raise if spec_config is None. Add a guard.

-        if model_arch == "DeepseekV3ForCausalLM" and config.spec_config.max_draft_len == 0:
+        if (model_arch == "DeepseekV3ForCausalLM"
+                and config.spec_config is not None
+                and config.spec_config.max_draft_len == 0):
             model_arch = "MTPDraftModelForCausalLM"

Also, per repository guidelines, prepend the NVIDIA copyright header:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_auto.py around lines 23-24, accessing
config.spec_config.max_draft_len can raise AttributeError when spec_config is
None; add a None-check so you only read max_draft_len if config.spec_config is
not None (e.g., guard with if config.spec_config and
config.spec_config.max_draft_len == 0) and update the branch accordingly; also
prepend the NVIDIA copyright header line "# Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved." at the top of the file.

Comment on lines +773 to +809
def _compute_shared_expert_tp_size(self, intermediate_size: int,
block_size: int) -> int:
"""
In the case of Deepseek-R1, the TP size of MLP is capped by intermediate_size // block_size.
For example, when the intermediate_size is 2048 and block scaling size is 128,
TP sizes are limited to {1, 2, 4, 8, 16} because of 2048/128 = 16.

Args:
intermediate_size (int): MLP intermediate size.
block_size (int): The quantization block scale size. In the case of Deepseek FP8 recipe,
it's 128. For NVFP4, it's 16.

Returns:
int: The computed tp_size.
"""

assert intermediate_size % block_size == 0, "intermediate_size must be divisible by block_size."

shared_output_scale = None
# The block scale size is 128, which requires shared_expert_intermediate_size to be divisible by 128.
if self.use_dp:
# If using attention DP, the shared experts also use DP instead of TP.
shared_tp_size = 1
else:
# Due to the restriction of block scale size (i.e., 128), the supported TP sizes only include 1, 2, 4, 8, and 16.
# The math.gcd operation ensures that shared_tp_size falls in the supported TP sizes.
shared_tp_size = math.gcd(
intermediate_size // block_size,
self.mapping.tp_size,
)
# If shared_tp_size has been overridden, the output of shared experts needs to be scaled down accordingly before all-reduce.
if shared_tp_size != self.mapping.tp_size:
shared_output_scale = shared_tp_size / self.mapping.tp_size

return shared_tp_size, shared_output_scale

def compute_routed_output(self, hidden_states, hidden_states_fp4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix type annotation and docstring: function returns tuple, not int

_compute_shared_expert_tp_size returns a tuple (shared_tp_size, shared_output_scale), but the annotation and docstring say int. Align both.

-    def _compute_shared_expert_tp_size(self, intermediate_size: int,
-                                       block_size: int) -> int:
+    def _compute_shared_expert_tp_size(
+        self, intermediate_size: int, block_size: int
+    ) -> Tuple[int, Optional[float]]:
@@
-        Returns:
-            int: The computed tp_size.
+        Returns:
+            Tuple[int, Optional[float]]: (tp_size, shared_output_scale)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _compute_shared_expert_tp_size(self, intermediate_size: int,
block_size: int) -> int:
"""
In the case of Deepseek-R1, the TP size of MLP is capped by intermediate_size // block_size.
For example, when the intermediate_size is 2048 and block scaling size is 128,
TP sizes are limited to {1, 2, 4, 8, 16} because of 2048/128 = 16.
Args:
intermediate_size (int): MLP intermediate size.
block_size (int): The quantization block scale size. In the case of Deepseek FP8 recipe,
it's 128. For NVFP4, it's 16.
Returns:
int: The computed tp_size.
"""
assert intermediate_size % block_size == 0, "intermediate_size must be divisible by block_size."
shared_output_scale = None
# The block scale size is 128, which requires shared_expert_intermediate_size to be divisible by 128.
if self.use_dp:
# If using attention DP, the shared experts also use DP instead of TP.
shared_tp_size = 1
else:
# Due to the restriction of block scale size (i.e., 128), the supported TP sizes only include 1, 2, 4, 8, and 16.
# The math.gcd operation ensures that shared_tp_size falls in the supported TP sizes.
shared_tp_size = math.gcd(
intermediate_size // block_size,
self.mapping.tp_size,
)
# If shared_tp_size has been overridden, the output of shared experts needs to be scaled down accordingly before all-reduce.
if shared_tp_size != self.mapping.tp_size:
shared_output_scale = shared_tp_size / self.mapping.tp_size
return shared_tp_size, shared_output_scale
def compute_routed_output(self, hidden_states, hidden_states_fp4,
def _compute_shared_expert_tp_size(
self, intermediate_size: int, block_size: int
) -> Tuple[int, Optional[float]]:
"""
In the case of Deepseek-R1, the TP size of MLP is capped by intermediate_size // block_size.
For example, when the intermediate_size is 2048 and block scaling size is 128,
TP sizes are limited to {1, 2, 4, 8, 16} because of 2048/128 = 16.
Args:
intermediate_size (int): MLP intermediate size.
block_size (int): The quantization block scale size. In the case of Deepseek FP8 recipe,
it's 128. For NVFP4, it's 16.
Returns:
Tuple[int, Optional[float]]: (tp_size, shared_output_scale)
"""
assert intermediate_size % block_size == 0, "intermediate_size must be divisible by block_size."
shared_output_scale = None
# The block scale size is 128, which requires shared_expert_intermediate_size to be divisible by 128.
if self.use_dp:
# If using attention DP, the shared experts also use DP instead of TP.
shared_tp_size = 1
else:
# Due to the restriction of block scale size (i.e., 128), the supported TP sizes only include 1, 2, 4, 8, and 16.
# The math.gcd operation ensures that shared_tp_size falls in the supported TP sizes.
shared_tp_size = math.gcd(
intermediate_size // block_size,
self.mapping.tp_size,
)
# If shared_tp_size has been overridden, the output of shared experts needs to be scaled down accordingly before all-reduce.
if shared_tp_size != self.mapping.tp_size:
shared_output_scale = shared_tp_size / self.mapping.tp_size
return shared_tp_size, shared_output_scale
🧰 Tools
🪛 Ruff (0.12.2)

797-797: Line too long (125 > 120)

(E501)


803-803: Line too long (136 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 773 to 809,
the _compute_shared_expert_tp_size function actually returns a tuple
(shared_tp_size, shared_output_scale) but is annotated and documented as
returning an int; update the function signature to return the correct typing
(e.g., Tuple[int, Optional[float]]), add necessary imports from typing (Tuple,
Optional) if not present, and modify the docstring Returns section to describe
the two-tuple (shared_tp_size: int, shared_output_scale: Optional[float]) and
their meanings.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Aug 18, 2025
@mikeiovine mikeiovine self-requested a review August 18, 2025 14:38
@@ -20,6 +20,8 @@ def from_config(
"") # Strip the appended EAGLE3
if hasattr(config.pretrained_config, "draft_vocab_size"):
model_arch = "EAGLE3" + model_arch
if model_arch == "DeepseekV3ForCausalLM" and config.spec_config.max_draft_len == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could config.spec_config be None?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 157 to +158
else:
hidden_states = hidden_states[-1].unsqueeze(0)
return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is not necessary.

Suggested change
else:
hidden_states = hidden_states[-1].unsqueeze(0)
return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()
return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()


if not do_finalize:
return [shared_output, *routed_output]
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is not necessary.

Comment on lines +102 to +106
if get_sm_version() == 100 and issubclass(attention_backend,
TrtllmAttention):
return False
else:
return True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be simplified to return not (get_sm_version() == ....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi ziyixiong, I'll continue to modify it when #6210 is merged

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revert this to the way it is currently on trunk

Copy link
Collaborator

@mikeiovine mikeiovine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have not reviewed fully. Will come back to it once #6210 is merged; looks like this one has a lot of overlap with it.

print("Running MTP eagle with two model style.")
spec_config = EagleDecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.model_dir,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the weights for MTP included in the target model checkpoint?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's included in the target model checkpoint but target model doesn't contain them anymore as the target model does't contain MTP layers anymore. There're 2 methods we can deal with it:

  1. Load it twice like what we did now.
  2. Save the weights as target.weights and load it in function load_weights_from_target_model() with weight_loader.load_weights(target.weights)
    The second method is much better, do you think so?

print(
"MTP only supports one model style spec decode; ignoring default use_one_model=False"
print("Running MTP eagle with two model style.")
spec_config = EagleDecodingConfig(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it have to be EagleDecodingConfig? On the UX side, it's not great that you have to use different config objects for MTP_EAGLE 2-model vs 1-model.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pipeline of MTP_EAGLE 2-model is using Eagle3 pipeline, which is different from 1-model style, and I'm also using Eagle3ResourceManager and Eagle3SpecMetadata, so is this more reasonable to use EagleDecodingConfig?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use MTPDecodingConfig for MTP even if it's using eagle stuff behind the scenes. It will be too confusing for end users otherwise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will change it

@sunnyqgg sunnyqgg changed the title [TRTLLM-6746]Enable two-model spec dec for MTP Eagle [TRTLLM-6746][feat]Enable two-model spec dec for MTP Eagle Aug 20, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/llm-api/quickstart_advanced.py (1)

173-179: Two-model MTP Eagle must use draft_model_dir (not model_dir) and guard when missing

EagleDecodingConfig.validate requires a draft model path. Passing model_dir here will resolve to the target model path and likely break draft loading for two-model MTP Eagle.

Apply:

-        if not args.use_one_model:
-            print("Running MTP eagle with two model style.")
-            spec_config = EagleDecodingConfig(
-                max_draft_len=args.spec_decode_max_draft_len,
-                speculative_model_dir=args.model_dir,
-                eagle3_one_model=args.use_one_model,
-                is_mtp_eagle=True)
+        if not args.use_one_model:
+            print("Running MTP eagle with two model style.")
+            if args.draft_model_dir is None:
+                raise ValueError("For two-model MTP Eagle, --draft_model_dir must be provided.")
+            spec_config = EagleDecodingConfig(
+                max_draft_len=args.spec_decode_max_draft_len,
+                speculative_model_dir=args.draft_model_dir,
+                eagle3_one_model=args.use_one_model,
+                is_mtp_eagle=True)
🧹 Nitpick comments (7)
tensorrt_llm/_torch/models/modeling_auto.py (2)

23-24: Good guard on spec_config; consider intent for EAGLE3 and wrap long line

  • The None-check on spec_config prevents AttributeError. Looks good.
  • Verify intent: this remap won’t trigger when draft_vocab_size is present (EAGLE3 path) since model_arch is prefixed to "EAGLE3...". If you also want the remap for EAGLE3 checkpoints, include "EAGLE3DeepseekV3ForCausalLM" in the condition or move the remap above the EAGLE3 prefix step.

Also fix Ruff E501 (126 > 120). Example:

-        if model_arch == "DeepseekV3ForCausalLM" and config.spec_config is not None and config.spec_config.max_draft_len == 0:
+        if (model_arch == "DeepseekV3ForCausalLM"
+                and config.spec_config is not None
+                and config.spec_config.max_draft_len == 0):
             model_arch = "MTPDraftModelForCausalLM"

1-1: Prepend NVIDIA copyright header

Per repo guidelines, prepend the 2025 NVIDIA header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from typing import Generic
examples/llm-api/quickstart_advanced.py (2)

174-174: Prefer logger over print in library/examples

Minor: use the project logger to keep output consistent and suppressible.

-            print("Running MTP eagle with two model style.")
+            from tensorrt_llm.logger import logger
+            logger.info("Running MTP eagle with two model style.")

1-1: Prepend NVIDIA copyright header

Per repo guidelines, add the 2025 header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import argparse
tensorrt_llm/llmapi/llm_args.py (3)

428-447: Eagle capture layer controls and derived num_capture_layers are reasonable

  • Tuple[int, ...] for eagle3_layers_to_capture is fine.
  • num_capture_layers fallback to 1 for MTP Eagle and 3 otherwise makes sense.

Optional: add a validator to ensure elements are non-negative and strictly increasing when provided.


540-542: from_dict sets max_draft_len: OK; minor duplication elsewhere

Setting max_draft_len = num_nextn_predict_layers here is fine. Note it’s also set again in BaseLlmArgs.validate_speculative_config (Lines 1701-1704), so you could deduplicate in one place later.


1-1: Prepend NVIDIA copyright header

Per repo guidelines, prepend the 2025 NVIDIA header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import copy
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae0936 and 830a4a2.

📒 Files selected for processing (3)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/models/modeling_auto.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/models/modeling_auto.py
🧬 Code Graph Analysis (3)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/_torch/speculative/interface.py (2)
  • SpeculativeDecodingMode (12-120)
  • is_mtp_eagle (33-34)
examples/llm-api/quickstart_advanced.py (2)
tensorrt_llm/llmapi/llm_args.py (5)
  • EagleDecodingConfig (411-449)
  • speculative_model_dir (1347-1348)
  • model_dir (1104-1106)
  • model_dir (1109-1113)
  • MTPDecodingConfig (510-548)
tensorrt_llm/_torch/speculative/interface.py (1)
  • is_mtp_eagle (33-34)
tensorrt_llm/_torch/models/modeling_auto.py (1)
tensorrt_llm/llmapi/llm_utils.py (1)
  • model_arch (92-94)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_auto.py

23-23: Line too long (126 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/llmapi/llm_args.py (4)

12-13: Typing import updates look correct

Importing Tuple/Type/TypeAlias is appropriate for new annotations.


454-456: Extended spec_dec_mode mapping for MTP Eagle

The new branch returning MTP_EAGLE when is_mtp_eagle is set (and not one-model) aligns with the new SpeculativeDecodingMode predicates.


525-525: New flag for one-model MTP Eagle

mtp_eagle_one_model adds the necessary distinction for one-model vs two-model handling. No concerns.


553-556: MTP spec_dec_mode branching aligns with new modes

Branching on num_nextn_predict_layers_from_model_config and mtp_eagle_one_model to return MTP_EAGLE_ONE_MODEL / MTP_EAGLE looks consistent with interface predicates.

@sunnyqgg
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15924 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15924 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11968 completed with status: 'FAILURE'

Comment on lines +102 to +106
if get_sm_version() == 100 and issubclass(attention_backend,
TrtllmAttention):
return False
else:
return True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revert this to the way it is currently on trunk

print(
"MTP only supports one model style spec decode; ignoring default use_one_model=False"
print("Running MTP eagle with two model style.")
spec_config = EagleDecodingConfig(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use MTPDecodingConfig for MTP even if it's using eagle stuff behind the scenes. It will be too confusing for end users otherwise.

@@ -435,12 +438,22 @@ def validate(self) -> None:
if self.speculative_model_dir is None:
raise ValueError("Draft model must be provided for EAGLE")

@functools.cached_property
def num_capture_layers(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a return type annotation for this function? It would improve readability for both humans and IDEs. The same applies to the other places as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community want to contribute PRs initiated from Community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants