Conversation

@IzzyPutterman
Collaborator

@IzzyPutterman commented Aug 26, 2025

Summary by CodeRabbit

  • New Features

    • Multi-layer draft decoding support with a configurable number of draft layers in one-model mode.
    • Draft models now default to capturing the last layer when appropriate.
  • Behavior

    • No breaking changes to public interfaces; runtime behavior adapts to the configured or default layer counts.
  • Tests

    • Added end-to-end tests exercising multi-layer / multi-model draft decoding paths.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
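
For example, typical invocations look like this (the pipeline ID below is illustrative; the stage and GPU names are taken from the examples above):

/bot run --stage-list "A10-PyTorch-1"
/bot run --reuse-test 12345 --gpu-type "A30, H100_PCIe"
/bot kill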

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Aug 26, 2025

📝 Walkthrough

Adds multi-layer support to Eagle3DraftModel (new public num_layers; midlayer becomes nn.ModuleList when >1) and updates its forward to iterate layers. get_num_spec_layers now honors spec_config.num_eagle_layers in one-model mode. Eagle3SpecMetadata defaults layers_to_capture to the last layer for draft models.
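
A minimal sketch of the multi-layer wiring described above, assuming the names from this summary (Eagle3DecoderLayer, midlayer, num_layers, start_layer_idx); the actual signatures in modeling_speculative.py differ:

import torch
from torch import nn


class DecoderLayerSketch(nn.Module):
    """Stand-in for Eagle3DecoderLayer; a residual projection for illustration."""

    def __init__(self, hidden_size: int, layer_idx: int):
        super().__init__()
        self.layer_idx = layer_idx
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.proj(hidden_states)


class Eagle3DraftModelSketch(nn.Module):
    """Single midlayer when num_layers == 1, nn.ModuleList otherwise."""

    def __init__(self, num_hidden_layers: int, hidden_size: int, start_layer_idx: int = 0):
        super().__init__()
        self.num_layers = num_hidden_layers
        if self.num_layers > 1:
            # Distinct layer_idx values (offset by start_layer_idx) avoid
            # colliding with the target model's layer indices.
            self.midlayer = nn.ModuleList([
                DecoderLayerSketch(hidden_size, start_layer_idx + i)
                for i in range(self.num_layers)
            ])
        else:
            self.midlayer = DecoderLayerSketch(hidden_size, start_layer_idx)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.num_layers > 1:
            for layer in self.midlayer:
                hidden_states = layer(hidden_states)
        else:
            hidden_states = self.midlayer(hidden_states)
        return hidden_states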

Changes

Cohort / File(s) Summary
Eagle3 Draft Model multi-layer support
tensorrt_llm/_torch/models/modeling_speculative.py
Adds public num_layers (from model_config.pretrained_config.num_hidden_layers). midlayer becomes nn.ModuleList when num_layers > 1, otherwise a single Eagle3DecoderLayer. forward iterates through midlayer when multi-layer; retains single-layer behavior when num_layers == 1.
Speculative utils layer count configurability
tensorrt_llm/_torch/speculative/utils.py
get_num_spec_layers for eagle3_one_model now returns spec_config.num_eagle_layers when set, otherwise falls back to 1. Other branches unchanged.
Eagle3 metadata capture defaults
tensorrt_llm/_torch/speculative/eagle3.py
Eagle3SpecMetadata.__post_init__ changes condition so draft models (is_draft_model == True) default layers_to_capture to only the last layer when layers_to_capture is None. Existing multi-layer non-draft logic unchanged.
Tests: multi Eagle3 scenarios
tests/unittest/_torch/speculative/test_eagle3.py
Adds test_multi_eagle3 (parameterized on use_one_model) to exercise speculative generation with multi-/one-model Eagle3 configs; orchestrates temp model creation, config writing, LLM instantiation, and streaming generate_async run. (Note: the patch includes two identical copies of this test.)
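
The layer-count and capture-default changes summarized above amount to roughly the following (a sketch based on this summary and the diffs later in this thread; the real code in speculative/utils.py and speculative/eagle3.py may differ):

def get_num_spec_layers_sketch(spec_config) -> int:
    # eagle3_one_model: honor an explicit num_eagle_layers, otherwise fall back to 1.
    num_eagle_layers = getattr(spec_config, "num_eagle_layers", None)
    return num_eagle_layers if num_eagle_layers is not None else 1


def default_layers_to_capture_sketch(is_draft_model: bool, num_layers: int):
    # Draft models (and single-layer targets) capture only the last layer;
    # deeper target models keep the existing three-layer selection.
    if is_draft_model or num_layers == 1:
        return (num_layers - 1, )
    if num_layers <= 5:
        raise ValueError("Not enough hidden layers for default EAGLE3 capture")
    return (1, num_layers // 2 - 1, num_layers - 4)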

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Eagle3 as Eagle3DraftModel
  participant Layer as Eagle3DecoderLayer*
  participant Output

  Caller->>Eagle3: forward(hidden_states, ...)
  alt num_layers > 1
    Note over Eagle3,Layer: midlayer = ModuleList (layer0..layerN-1)
    loop per layer
      Eagle3->>Layer: layer_i.forward(h, ...)
      Layer-->>Eagle3: h = h'
    end
    Eagle3-->>Output: final outputs
  else num_layers == 1
    Note over Eagle3,Layer: midlayer = single Eagle3DecoderLayer
    Eagle3->>Layer: midlayer.forward(h, ...)
    Layer-->>Eagle3: outputs
    Eagle3-->>Output: outputs
  end
sequenceDiagram
  autonumber
  participant SpecCfg as spec_config
  participant Utils as get_num_spec_layers
  participant Meta as Eagle3SpecMetadata

  SpecCfg->>Utils: query for eagle3_one_model
  alt spec_config.num_eagle_layers set
    Utils-->>SpecCfg: return num_eagle_layers
  else
    Utils-->>SpecCfg: return 1
  end

  SpecCfg->>Meta: init with is_draft_model, num_layers, layers_to_capture=None
  alt is_draft_model or num_layers == 1
    Meta-->>SpecCfg: set layers_to_capture=(num_layers-1,)
  else
    Meta-->>SpecCfg: apply existing multi-layer selection logic
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

Community want to contribute, Speculative Decoding, CI

Suggested reviewers

  • syuoni
  • Wanli-Jiang
  • HuiGao-NV
  • mikeiovine

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/utils.py (1)

154-156: Guard against invalid num_eagle_layers; default to 1

Protects downstream logic if the value is unset or accidentally set to <1.

Apply this diff:

-        num_eagle_layers = spec_config.num_eagle_layers
-        return num_eagle_layers if num_eagle_layers is not None else 1
+        num_eagle_layers = spec_config.num_eagle_layers
+        return 1 if (num_eagle_layers is None or num_eagle_layers < 1) else num_eagle_layers
tensorrt_llm/_torch/models/modeling_speculative.py (1)

222-235: Type hints out-of-sync with returned values; tighten annotations

Both Eagle3DraftModel.forward and Eagle3DecoderLayer.forward return (hidden_states, residual) tuples, but their annotations declare torch.Tensor. This can confuse tooling and readers.

Apply these changes (outside the selected lines) to align annotations:

@@
-class Eagle3DecoderLayer(DecoderLayer):
+class Eagle3DecoderLayer(DecoderLayer):
@@
-    def __init__(
+    def __init__(
         self,
         model_config: LlamaConfig,
         layer_idx: int = 0,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
+    ) -> None:
         super().__init__()
@@
-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
         residual = hidden_states
@@
-class Eagle3DraftModel(DecoderModel):
+class Eagle3DraftModel(DecoderModel):
@@
-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
         assert self.embed_tokens is not None

Note: Tuple is already imported at the top of the file.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9df15b2 and 9e57506.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/llmapi/llm_args.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
  • update_for_draft_init (394-397)
  • update_for_draft_init (464-472)
tensorrt_llm/_torch/models/modeling_speculative.py (2)
tensorrt_llm/module.py (1)
  • ModuleList (229-288)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • attn_metadata (68-69)
  • spec_metadata (60-61)
tensorrt_llm/llmapi/llm_args.py (2)
tensorrt_llm/runtime/model_runner_cpp.py (1)
  • num_layers (505-509)
tensorrt_llm/runtime/model_runner.py (1)
  • num_layers (801-802)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/llmapi/llm_args.py (2)

394-398: Good addition: explicit draft-initialization hook on DecodingBaseConfig

A no-op default hook is a clean extension point and keeps subclasses opt-in.


453-463: Switching from cached_property to property avoids stale reads

Given eagle3_layers_to_capture can be mutated during draft init, making num_capture_layers a plain property is the right choice to keep it consistent at runtime.
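
For illustration, the difference amounts to roughly this (a schematic sketch, not the actual EagleDecodingConfig in llm_args.py; the default of 3 follows the discussion later in this thread):

from typing import Optional, Tuple


class EagleDecodingConfigSketch:
    """Schematic only: a plain property tracks later mutations of the field."""

    def __init__(self, eagle3_layers_to_capture: Optional[Tuple[int, ...]] = None):
        self.eagle3_layers_to_capture = eagle3_layers_to_capture

    @property
    def num_capture_layers(self) -> int:
        # Re-evaluated on every access, so a draft-init mutation of
        # eagle3_layers_to_capture is reflected immediately; functools.cached_property
        # would keep returning whatever was computed on first access.
        if self.eagle3_layers_to_capture is None:
            return 3  # default capture count when no explicit layers are set
        return len(self.eagle3_layers_to_capture)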

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

255-256: Right placement: update draft spec before constructing the draft engine

Calling draft_spec_config.update_for_draft_init() after copying and before model construction ensures draft-only knobs (e.g., layers_to_capture) are populated without mutating the original config.

tensorrt_llm/_torch/models/modeling_speculative.py (2)

154-154: Confirm layer-count source matches configuration

Eagle3DraftModel.num_layers is taken from the draft model’s pretrained_config.num_hidden_layers. If users also set spec_config.num_eagle_layers (used elsewhere for planning), ensure these two remain consistent (either by convention or validation) to avoid surprises.

Would you like me to add a validator that raises if spec_config.num_eagle_layers is set and disagrees with draft_config.pretrained_config.num_hidden_layers?
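
A minimal sketch of such a validator, assuming the attribute names used in this comment (spec_config.num_eagle_layers and pretrained_config.num_hidden_layers):

def validate_eagle_layer_counts(spec_config, draft_pretrained_config) -> None:
    """Raise if an explicit num_eagle_layers disagrees with the draft checkpoint depth."""
    num_eagle_layers = getattr(spec_config, "num_eagle_layers", None)
    num_hidden_layers = draft_pretrained_config.num_hidden_layers
    if num_eagle_layers is not None and num_eagle_layers != num_hidden_layers:
        raise ValueError(
            f"spec_config.num_eagle_layers ({num_eagle_layers}) does not match the "
            f"draft model's num_hidden_layers ({num_hidden_layers}).")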


167-174: Multi-layer wiring looks correct; uses distinct layer_idx with offset

ModuleList over Eagle3DecoderLayer with start_layer_idx offset prevents index collisions with target layers during metadata capture.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
tensorrt_llm/_torch/speculative/eagle3.py (3)

82-88: Make layers_to_capture type consistent and ordered (use Tuple[int, ...]).

layers_to_capture is annotated as Optional[Set[int]] but is populated with tuples/lists, and its order is relied upon to pack channels in hidden_states (the enumerate index i drives [i * hidden_size:(i + 1) * hidden_size]). Sets are unordered, which is misleading and can cause accidental misuse. Make it Tuple[int, ...] and normalize to a sorted tuple in __post_init__.

Apply:

diff --git a/tensorrt_llm/_torch/speculative/eagle3.py b/tensorrt_llm/_torch/speculative/eagle3.py
@@
-from typing import List, Optional, Set
+from typing import List, Optional, Set, Tuple
@@
 class Eagle3SpecMetadata(SpecMetadata):
     hidden_states: List[torch.Tensor] = field(default_factory=list)
-    layers_to_capture: Optional[Set[int]] = None
+    layers_to_capture: Optional[Tuple[int, ...]] = None
@@
     def __post_init__(self):
-        if self.layers_to_capture is None:
+        if self.layers_to_capture is None:
+            if self.num_layers < 1:
+                raise ValueError("num_layers must be >= 1 for EAGLE3 capture")
             if self.is_draft_model or self.num_layers == 1:
-                self.layers_to_capture = (self.num_layers - 1, )
+                self.layers_to_capture = (self.num_layers - 1,)
             else:
                 if self.num_layers <= 5:
                     raise ValueError(
                         "Not enough hidden layers for default EAGLE3 capture")
 
-                self.layers_to_capture = (1, self.num_layers // 2 - 1,
-                                          self.num_layers - 4)
+                self.layers_to_capture = (
+                    1,
+                    self.num_layers // 2 - 1,
+                    self.num_layers - 4,
+                )
         else:
-            self.layers_to_capture = sorted(list(self.layers_to_capture))
+            # Normalize to a deterministic, immutable, ascending order
+            self.layers_to_capture = tuple(sorted(self.layers_to_capture))
         self.num_capture_layers = len(self.layers_to_capture)

Optional follow-up (same consistency) for Eagle3OneModelSpecMetadata:

diff --git a/tensorrt_llm/_torch/speculative/eagle3.py b/tensorrt_llm/_torch/speculative/eagle3.py
@@
-class Eagle3OneModelSpecMetadata(SpecMetadata):
+class Eagle3OneModelSpecMetadata(SpecMetadata):
@@
-    layers_to_capture: Optional[Set[int]] = None
+    layers_to_capture: Optional[Tuple[int, ...]] = None
@@
-        if self.layers_to_capture is None:
+        if self.layers_to_capture is None:
             if self.num_layers == 1:
-                self.layers_to_capture = (self.num_layers - 1, )
+                self.layers_to_capture = (self.num_layers - 1,)
             else:
                 if self.num_layers <= 5:
                     raise ValueError(
                         "Not enough hidden layers for default EAGLE3 capture")
 
-                self.layers_to_capture = (1, self.num_layers // 2 - 1,
-                                          self.num_layers - 4)
+                self.layers_to_capture = (
+                    1,
+                    self.num_layers // 2 - 1,
+                    self.num_layers - 4,
+                )
         else:
-            self.layers_to_capture = sorted(list(self.layers_to_capture))
+            self.layers_to_capture = tuple(sorted(self.layers_to_capture))

Also applies to: 92-106


1-1: Missing NVIDIA copyright header (2025).

Per repo guidelines, prepend the current-year NVIDIA header and SPDX. Example:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0

Please align the SPDX/license string with the repository’s standard if different.


94-96: Align hidden_states allocation with actual capture layers for draft models

The current implementation in Eagle3ResourceManager allocates the hidden_states tensor using config.num_capture_layers (which defaults to 3) regardless of whether we're running in draft mode. Meanwhile, Eagle3SpecMetadata.__post_init__ overrides layers_to_capture to a single layer when is_draft_model=True (len(layers_to_capture)=1). Although the indexing in maybe_capture_hidden_states never goes out of bounds, this discrepancy:

  • Wastes GPU memory by allocating two extra hidden‐state slots that are never written or read
  • Risks confusion or future errors if additional logic ever assumes these values must match

Action items:

  • In Eagle3ResourceManager.__init__, when is_draft_model=True, use metadata.num_capture_layers instead of config.num_capture_layers to size the second dimension of hidden_states
  • Or, override the config’s num_capture_layers to len(layers_to_capture) before allocating resources in draft runs
  • Add an assertion or validation to guarantee config.num_capture_layers == len(metadata.layers_to_capture) whenever is_draft_model=True

Implementing one of these changes will ensure that the allocated tensor shape always aligns with the actual number of layers being captured in draft mode.
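
A rough illustration of the first and third action items combined (all names and shapes here are placeholders, not the actual Eagle3ResourceManager code):

import torch


def allocate_capture_buffer_sketch(metadata, config, max_tokens: int,
                                   hidden_size: int, is_draft_model: bool) -> torch.Tensor:
    # Size the capture dimension from what will actually be written: draft runs
    # capture a single layer, while target runs keep the config-driven count.
    num_capture_layers = (metadata.num_capture_layers
                          if is_draft_model else config.num_capture_layers)
    assert num_capture_layers == len(metadata.layers_to_capture)
    # Captured layers are packed along the hidden dimension, matching the
    # [i * hidden_size : (i + 1) * hidden_size] slicing used when reading back.
    return torch.empty(max_tokens, num_capture_layers * hidden_size,
                       dtype=torch.bfloat16, device="cuda")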

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9e57506 and ee12005.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/speculative/eagle3.py (1)

181-186: get_hidden_states slicing remains correct under the new draft default.

With draft models now capturing only the last layer by default, returning the first hidden_size slice on non-first draft steps still yields the single captured layer’s features, preserving shape/semantics.

@IzzyPutterman changed the title from "[None][feat] Draft: MultiLayer Eagle" to "[None][feat] MultiLayer Eagle" on Aug 26, 2025
@IzzyPutterman force-pushed the iputterman/multi-layer-eagle branch from 37e41dd to e693c30 on August 26, 2025 23:01
@IzzyPutterman
Collaborator Author

/bot run

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)

1-1: Add NVIDIA copyright header (2025) at top of file.

Per coding guidelines, prepend the NVIDIA copyright header to all .py files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 import json
🧹 Nitpick comments (2)
tests/unittest/_torch/speculative/test_eagle3.py (2)

312-319: Comment is misleading; Llama 3 path is parameterized to enable/disable one-model.

The inline comment states “Llama 3 does not support one model eagle,” yet the test intentionally toggles eagle3_one_model=use_one_model. This is confusing to readers and future maintainers.

Apply this diff to clarify intent:

-            # Llama 3 does not support one model eagle.
+            # Toggle one-model EAGLE3 path for coverage; some target models may not support it.

245-285: Optional: reduce duplication by extracting a helper for Eagle3 test config.

eagle_config construction largely duplicates patterns in this file. Consider a small factory to keep tests concise and consistent.

Example helper (outside this hunk):

def build_eagle3_config(hidden_size: int, num_layers: int, max_pos: int, vocab: int) -> dict:
    return {
        'architectures': ['LlamaForCausalLMEagle3'],
        'attention_bias': False,
        'attention_dropout': 0.0,
        'bos_token_id': 128000,
        'eos_token_id': [128001, 128008, 128009],
        'eagle_config': {
            'use_aux_hidden_state': False,
            'use_input_layernorm_in_first_layer': True,
            'use_last_layernorm': True,
            'use_mtp_layernorm': False
        },
        'head_dim': 128,
        'hidden_act': 'silu',
        'hidden_size': hidden_size,
        'initializer_range': 0.02,
        'intermediate_size': 16384,
        'max_position_embeddings': max_pos,
        'mlp_bias': False,
        'model_type': 'llama',
        'num_attention_heads': 32,
        'num_eagle_features': 1,
        'num_hidden_layers': num_layers,
        'num_key_value_heads': 8,
        'pretraining_tp': 1,
        'rms_norm_eps': 1e-05,
        'rope_scaling': {
            'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0,
            'original_max_position_embeddings': 8192, 'rope_type': 'llama3'
        },
        'rope_theta': 500000.0,
        'tie_word_embeddings': False,
        'torch_dtype': 'bfloat16',
        'transformers_version': '4.52.4',
        'use_cache': True,
        'vocab_size': vocab,
        'draft_vocab_size': vocab
    }

Then call eagle_config = build_eagle3_config(4096, 2, 131072, 128256).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ee12005 and e693c30.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
  • tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tests/unittest/_torch/speculative/test_eagle3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/unittest/_torch/speculative/test_eagle3.py
🧬 Code graph analysis (1)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
  • KvCacheConfig (946-1077)
  • CudaGraphConfig (108-165)
  • EagleDecodingConfig (421-459)
  • speculative_model_dir (1400-1401)
tensorrt_llm/llmapi/llm.py (4)
  • LLM (1011-1027)
  • tokenizer (691-695)
  • tokenizer (698-699)
  • generate_async (317-450)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #16605 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16605 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12466 completed with status: 'FAILURE'

@IzzyPutterman force-pushed the iputterman/multi-layer-eagle branch from e693c30 to 5868935 on August 27, 2025 18:56
@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16740 [ run ] triggered by Bot

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)

317-326: Ensure executor shutdown; wrap in try/finally.

Not shutting down llm_spec can leak GPU memory and destabilize CI.

Apply:

-        llm_spec = LLM(**llm_common_config, speculative_config=spec_config)
-
-        tok_ids = llm_spec.tokenizer.encode("The future of AI is")
-
-        sampling_params = SamplingParams(max_tokens=32, temperature=0)
-        for output in llm_spec.generate_async(tok_ids,
-                                              sampling_params,
-                                              streaming=True):
-            pass
+        llm_spec = LLM(**llm_common_config, speculative_config=spec_config)
+        try:
+            tok_ids = llm_spec.tokenizer.encode("The future of AI is")
+            sampling_params = SamplingParams(max_tokens=32, temperature=0)
+            for _output in llm_spec.generate_async(
+                tok_ids, sampling_params, streaming=True
+            ):
+                pass
+        finally:
+            llm_spec.shutdown()
🧹 Nitpick comments (5)
tensorrt_llm/_torch/speculative/eagle3.py (2)

92-106: layers_to_capture type is Set[int] but assigned tuples/lists; align the annotation or normalize representation.

To reduce confusion and mypy noise, either annotate as Sequence[int] or always coerce to a tuple/list in both branches. Also dedupe when a Set is provided.

Outside this hunk, consider:

# at line 83
layers_to_capture: Optional[Sequence[int]] = None

# normalize at the end of __post_init__
self.layers_to_capture = tuple(sorted(set(self.layers_to_capture)))
self.num_capture_layers = len(self.layers_to_capture)

92-106: Ensure EAGLE3 capture-layer defaults are aligned to avoid over-allocation

  • The property EagleDecodingConfig.num_capture_layers always returns 3 when eagle3_layers_to_capture is unset, but in Eagle3SpecMetadata.__post_init__ draft models (or single-layer models) default to capturing only one layer, so self.num_capture_layers == 1 while config.num_capture_layers == 3. This discrepancy causes Eagle3ResourceManager to allocate a hidden-states tensor sized for three layers even when only one is written, wasting GPU memory and risking shape mismatches downstream (see llmapi/llm_args.py:num_capture_layers and eagle3.py:ResourceManager.__init__).
  • Short-term workaround: always pass eagle3_layers_to_capture={num_layers-1} to your EagleDecodingConfig so that config.num_capture_layers == len(layers_to_capture).
  • Longer-term recommendations:
    • Derive config.num_capture_layers directly from the metadata’s layers_to_capture set instead of using a hard-coded default of 3.
    • Or change the default in EagleDecodingConfig.num_capture_layers to 1 when eagle3_layers_to_capture is unset for EAGLE3, matching the metadata logic.

Files to update:

  • tensorrt_llm/llmapi/llm_args.py – adjust num_capture_layers default
  • tensorrt_llm/_torch/speculative/eagle3.py – use metadata’s layer count for resource allocation
tests/unittest/_torch/speculative/test_eagle3.py (3)

230-236: Skip unsupported one-model path for Llama 3 to avoid spurious failures.

Comment states Llama 3 doesn’t support one-model EAGLE3. Guard the test when use_one_model is True.

Apply:

 def test_multi_eagle3(use_one_model: bool):
+    if use_one_model:
+        pytest.skip("Llama 3 does not support one-model EAGLE3.")

321-325: Nit: avoid unused variable warning.

Rename output to _output.

(Handled in the diff above.)


228-236: Add NVIDIA copyright header (2025).

Per repo guidelines, prepend the NVIDIA copyright header.

Outside this hunk, at the very top of the file:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e693c30 and 5868935.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
  • tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/_torch/speculative/test_eagle3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/_torch/speculative/test_eagle3.py
🧬 Code graph analysis (1)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
  • KvCacheConfig (946-1077)
  • CudaGraphConfig (108-165)
  • EagleDecodingConfig (421-459)
  • speculative_model_dir (1400-1401)
tensorrt_llm/llmapi/llm.py (3)
  • tokenizer (691-695)
  • tokenizer (698-699)
  • generate_async (317-450)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)

228-229: LGTM: parametrize now passes booleans.

This fixes the earlier issue where lists were passed instead of bools.

@tensorrt-cicd
Collaborator

PR_Github #16740 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12564 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16746 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16746 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12568 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16915 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16915 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12709 completed with status: 'FAILURE'

@IzzyPutterman force-pushed the iputterman/multi-layer-eagle branch from 98221f0 to 15fad81 on September 2, 2025 20:28
@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17406 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17406 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13083 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17412 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17412 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13086 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17457 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17457 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13118 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17470 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17470 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13126 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17477 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17477 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13132 completed with status: 'FAILURE'

@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17488 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17488 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13143 completed with status: 'FAILURE'

Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
@IzzyPutterman force-pushed the iputterman/multi-layer-eagle branch from 463efa3 to ebc77ce on September 3, 2025 14:51
@IzzyPutterman
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17545 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17545 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13190 completed with status: 'SUCCESS'

@mikeiovine merged commit 26b133f into NVIDIA:main on Sep 4, 2025
5 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025