[None][feat] MultiLayer Eagle #7234
Conversation
📝 Walkthrough
Adds multi-layer support to Eagle3DraftModel (new public …).
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Eagle3 as Eagle3DraftModel
    participant Layer as Eagle3DecoderLayer*
    participant Output
    Caller->>Eagle3: forward(hidden_states, ...)
    alt num_layers > 1
        Note over Eagle3,Layer: midlayer = ModuleList (layer0..layerN-1)
        loop per layer
            Eagle3->>Layer: layer_i.forward(h, ...)
            Layer-->>Eagle3: h = h'
        end
        Eagle3-->>Output: final outputs
    else num_layers == 1
        Note over Eagle3,Layer: midlayer = single Eagle3DecoderLayer
        Eagle3->>Layer: midlayer.forward(h, ...)
        Layer-->>Eagle3: outputs
        Eagle3-->>Output: outputs
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant SpecCfg as spec_config
    participant Utils as get_num_spec_layers
    participant Meta as Eagle3SpecMetadata
    SpecCfg->>Utils: query for eagle3_one_model
    alt spec_config.num_eagle_layers set
        Utils-->>SpecCfg: return num_eagle_layers
    else
        Utils-->>SpecCfg: return 1
    end
    SpecCfg->>Meta: init with is_draft_model, num_layers, layers_to_capture=None
    alt is_draft_model or num_layers == 1
        Meta-->>SpecCfg: set layers_to_capture=(num_layers-1,)
    else
        Meta-->>SpecCfg: apply existing multi-layer selection logic
    end
```
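The second diagram boils down to two small decisions; the sketch below captures them as plain functions. The fallback of 1 and the (1, num_layers // 2 - 1, num_layers - 4) default are quoted from this thread, while the function names and signatures are simplifications of the real spec_config/metadata code.

```python
from typing import Optional, Tuple


def get_num_spec_layers_sketch(num_eagle_layers: Optional[int]) -> int:
    # Fall back to a single Eagle layer when the config does not set a count.
    return num_eagle_layers if num_eagle_layers is not None else 1


def default_layers_to_capture(is_draft_model: bool, num_layers: int) -> Tuple[int, ...]:
    if is_draft_model or num_layers == 1:
        # Draft models capture only their last layer.
        return (num_layers - 1,)
    if num_layers <= 5:
        raise ValueError("Not enough hidden layers for default EAGLE3 capture")
    # Target models capture an early, a middle, and a late layer.
    return (1, num_layers // 2 - 1, num_layers - 4)


assert get_num_spec_layers_sketch(None) == 1
assert default_layers_to_capture(True, 2) == (1,)
assert default_layers_to_capture(False, 32) == (1, 15, 28)
```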
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/utils.py (1)
154-156: Guard against invalid num_eagle_layers; default to 1. Protects downstream logic if the value is unset or accidentally set to <1.
Apply this diff:
```diff
-    num_eagle_layers = spec_config.num_eagle_layers
-    return num_eagle_layers if num_eagle_layers is not None else 1
+    num_eagle_layers = spec_config.num_eagle_layers
+    return 1 if (num_eagle_layers is None or num_eagle_layers < 1) else num_eagle_layers
```
tensorrt_llm/_torch/models/modeling_speculative.py (1)
222-235: Type hints out-of-sync with returned values; tighten annotations. Both Eagle3DraftModel.forward and Eagle3DecoderLayer.forward return (hidden_states, residual) tuples, but their annotations declare torch.Tensor. This can confuse tooling and readers.
Apply these changes (outside the selected lines) to align annotations:
```diff
@@
-class Eagle3DecoderLayer(DecoderLayer):
+class Eagle3DecoderLayer(DecoderLayer):
@@
-    def __init__(
+    def __init__(
         self,
         model_config: LlamaConfig,
         layer_idx: int = 0,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
+    ) -> None:
         super().__init__()
@@
-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
         residual = hidden_states
@@
-class Eagle3DraftModel(DecoderModel):
+class Eagle3DraftModel(DecoderModel):
@@
-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
         assert self.embed_tokens is not None
```
Note: Tuple is already imported at the top of the file.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
- tensorrt_llm/_torch/speculative/utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/speculative/utils.py
- tensorrt_llm/_torch/models/modeling_speculative.py
- tensorrt_llm/llmapi/llm_args.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/speculative/utils.py
- tensorrt_llm/_torch/models/modeling_speculative.py
- tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
- update_for_draft_init (394-397)
- update_for_draft_init (464-472)
tensorrt_llm/_torch/models/modeling_speculative.py (2)
tensorrt_llm/module.py (1)
- ModuleList (229-288)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
- attn_metadata (68-69)
- spec_metadata (60-61)
tensorrt_llm/llmapi/llm_args.py (2)
tensorrt_llm/runtime/model_runner_cpp.py (1)
- num_layers (505-509)
tensorrt_llm/runtime/model_runner.py (1)
- num_layers (801-802)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/llmapi/llm_args.py (2)
394-398: Good addition: explicit draft-initialization hook on DecodingBaseConfig. A no-op default hook is a clean extension point and keeps subclasses opt-in.
453-463: Switching from cached_property to property avoids stale reads. Given eagle3_layers_to_capture can be mutated during draft init, making num_capture_layers a plain property is the right choice to keep it consistent at runtime.
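A generic illustration of the staleness point, not the actual llm_args.py code: with cached_property the first computed value sticks even after a later hook mutates eagle3_layers_to_capture, whereas a plain property tracks the mutation.

```python
from functools import cached_property


class CachedConfig:
    def __init__(self):
        self.eagle3_layers_to_capture = None

    @cached_property
    def num_capture_layers(self) -> int:
        return len(self.eagle3_layers_to_capture) if self.eagle3_layers_to_capture else 3


class PlainConfig:
    def __init__(self):
        self.eagle3_layers_to_capture = None

    @property
    def num_capture_layers(self) -> int:
        return len(self.eagle3_layers_to_capture) if self.eagle3_layers_to_capture else 3


cached, plain = CachedConfig(), PlainConfig()
_ = cached.num_capture_layers            # caches 3 before any draft-init mutation
cached.eagle3_layers_to_capture = {0}    # simulate a later draft-init mutation
plain.eagle3_layers_to_capture = {0}
print(cached.num_capture_layers, plain.num_capture_layers)  # 3 (stale) vs 1 (current)
```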
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
255-256: Right placement: update draft spec before constructing the draft engine. Calling draft_spec_config.update_for_draft_init() after copying and before model construction ensures draft-only knobs (e.g., layers_to_capture) are populated without mutating the original config.
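For intuition, a hedged sketch of that ordering with a toy config class; the real DecodingBaseConfig and py_executor_creator code are more involved, and what update_for_draft_init populates here is an assumption for illustration.

```python
import copy
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SpecConfigSketch:
    num_eagle_layers: int = 2
    eagle3_layers_to_capture: Optional[Tuple[int, ...]] = None

    def update_for_draft_init(self) -> None:
        # Illustrative draft-only knob: capture the last draft layer by default.
        if self.eagle3_layers_to_capture is None:
            self.eagle3_layers_to_capture = (self.num_eagle_layers - 1,)


target_cfg = SpecConfigSketch()
draft_cfg = copy.copy(target_cfg)      # copy first...
draft_cfg.update_for_draft_init()      # ...then populate draft-only knobs
assert target_cfg.eagle3_layers_to_capture is None   # original config untouched
assert draft_cfg.eagle3_layers_to_capture == (1,)
```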
tensorrt_llm/_torch/models/modeling_speculative.py (2)
154-154: Confirm layer-count source matches configuration. Eagle3DraftModel.num_layers is taken from the draft model’s pretrained_config.num_hidden_layers. If users also set spec_config.num_eagle_layers (used elsewhere for planning), ensure these two remain consistent (either by convention or validation) to avoid surprises.
Would you like me to add a validator that raises if spec_config.num_eagle_layers is set and disagrees with draft_config.pretrained_config.num_hidden_layers?
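One possible shape for that validator, sketched against the attribute paths named in this comment; the real config objects may expose these fields differently.

```python
def validate_eagle_layer_count(spec_config, draft_config) -> None:
    """Raise if an explicit num_eagle_layers disagrees with the draft checkpoint depth."""
    num_eagle_layers = getattr(spec_config, "num_eagle_layers", None)
    if num_eagle_layers is None:
        return  # nothing to cross-check
    num_hidden_layers = draft_config.pretrained_config.num_hidden_layers
    if num_eagle_layers != num_hidden_layers:
        raise ValueError(
            f"spec_config.num_eagle_layers ({num_eagle_layers}) does not match "
            f"draft pretrained_config.num_hidden_layers ({num_hidden_layers})")
```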
167-174: Multi-layer wiring looks correct; uses distinct layer_idx with offset. ModuleList over Eagle3DecoderLayer with start_layer_idx offset prevents index collisions with target layers during metadata capture.
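A compact sketch of the offset wiring being described, assuming an Eagle3DecoderLayer-style constructor that accepts (config, layer_idx); this is not the verbatim modeling_speculative.py code.

```python
from torch import nn


def build_midlayers(config, num_layers: int, start_layer_idx: int, layer_cls) -> nn.Module:
    if num_layers > 1:
        # Each draft layer gets a distinct index offset past the target model's layers,
        # so metadata capture keyed on layer_idx cannot collide with target layers.
        return nn.ModuleList(
            [layer_cls(config, layer_idx=start_layer_idx + i) for i in range(num_layers)])
    return layer_cls(config, layer_idx=start_layer_idx)
```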
Actionable comments posted: 0
🧹 Nitpick comments (3)
tensorrt_llm/_torch/speculative/eagle3.py (3)
82-88: Make layers_to_capture type consistent and ordered (use Tuple[int, ...]). layers_to_capture is annotated as Optional[Set[int]] but is populated with tuples/lists and its order is relied upon to pack channels in hidden_states (enumerate(i) drives [i * hidden_size:(i + 1) * hidden_size]). Sets are unordered, which is misleading and can cause accidental misuse. Make it Tuple[int, ...] and normalize to a sorted tuple in __post_init__.
Apply:
```diff
diff --git a/tensorrt_llm/_torch/speculative/eagle3.py b/tensorrt_llm/_torch/speculative/eagle3.py
@@
-from typing import List, Optional, Set
+from typing import List, Optional, Set, Tuple
@@ class Eagle3SpecMetadata(SpecMetadata):
     hidden_states: List[torch.Tensor] = field(default_factory=list)
-    layers_to_capture: Optional[Set[int]] = None
+    layers_to_capture: Optional[Tuple[int, ...]] = None
@@ def __post_init__(self):
-        if self.layers_to_capture is None:
+        if self.layers_to_capture is None:
+            if self.num_layers < 1:
+                raise ValueError("num_layers must be >= 1 for EAGLE3 capture")
             if self.is_draft_model or self.num_layers == 1:
-                self.layers_to_capture = (self.num_layers - 1, )
+                self.layers_to_capture = (self.num_layers - 1,)
             else:
                 if self.num_layers <= 5:
                     raise ValueError(
                         "Not enough hidden layers for default EAGLE3 capture")
-                self.layers_to_capture = (1, self.num_layers // 2 - 1,
-                                          self.num_layers - 4)
+                self.layers_to_capture = (
+                    1,
+                    self.num_layers // 2 - 1,
+                    self.num_layers - 4,
+                )
         else:
-            self.layers_to_capture = sorted(list(self.layers_to_capture))
+            # Normalize to a deterministic, immutable, ascending order
+            self.layers_to_capture = tuple(sorted(self.layers_to_capture))
         self.num_capture_layers = len(self.layers_to_capture)
```
Optional follow-up (same consistency) for Eagle3OneModelSpecMetadata:
```diff
diff --git a/tensorrt_llm/_torch/speculative/eagle3.py b/tensorrt_llm/_torch/speculative/eagle3.py
@@
-class Eagle3OneModelSpecMetadata(SpecMetadata):
+class Eagle3OneModelSpecMetadata(SpecMetadata):
@@
-    layers_to_capture: Optional[Set[int]] = None
+    layers_to_capture: Optional[Tuple[int, ...]] = None
@@
-        if self.layers_to_capture is None:
+        if self.layers_to_capture is None:
             if self.num_layers == 1:
-                self.layers_to_capture = (self.num_layers - 1, )
+                self.layers_to_capture = (self.num_layers - 1,)
             else:
                 if self.num_layers <= 5:
                     raise ValueError(
                         "Not enough hidden layers for default EAGLE3 capture")
-                self.layers_to_capture = (1, self.num_layers // 2 - 1,
-                                          self.num_layers - 4)
+                self.layers_to_capture = (
+                    1,
+                    self.num_layers // 2 - 1,
+                    self.num_layers - 4,
+                )
         else:
-            self.layers_to_capture = sorted(list(self.layers_to_capture))
+            self.layers_to_capture = tuple(sorted(self.layers_to_capture))
```
Also applies to: 92-106
1-1: Missing NVIDIA copyright header (2025). Per repo guidelines, prepend the current-year NVIDIA header and SPDX. Example:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
```
Please align the SPDX/license string with the repository’s standard if different.
94-96: Align hidden_states allocation with actual capture layers for draft models. The current implementation in Eagle3ResourceManager allocates the hidden_states tensor using config.num_capture_layers (which defaults to 3) regardless of whether we’re running in draft mode. Meanwhile, Eagle3SpecMetadata.__post_init__ overrides layers_to_capture to a single layer when is_draft_model=True (len(layers_to_capture)=1). Although the indexing in maybe_capture_hidden_states never goes out of bounds, this discrepancy:
- Wastes GPU memory by allocating two extra hidden‐state slots that are never written or read
- Risks confusion or future errors if additional logic ever assumes these values must match
Action items:
- In Eagle3ResourceManager.__init__, when is_draft_model=True, use metadata.num_capture_layers instead of config.num_capture_layers to size the second dimension of hidden_states
- Or, override the config’s num_capture_layers to len(layers_to_capture) before allocating resources in draft runs
- Add an assertion or validation to guarantee config.num_capture_layers == len(metadata.layers_to_capture) whenever is_draft_model=True
Implementing one of these changes will ensure that the allocated tensor shape always aligns with the actual number of layers being captured in draft mode.
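Following the first action item, a minimal sketch of sizing the capture buffer from the metadata's count rather than the config default; tensor layout, dtype, and names are illustrative only.

```python
import torch


def allocate_hidden_states(num_tokens: int, hidden_size: int, num_capture_layers: int,
                           dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # One hidden_size-wide slot per captured layer, packed along the last dimension.
    return torch.empty(num_tokens, num_capture_layers * hidden_size, dtype=dtype)


# Draft run: metadata captures a single layer, so only one slot is allocated
# instead of the config-side default of three.
buffer = allocate_hidden_states(num_tokens=16, hidden_size=8, num_capture_layers=1)
assert buffer.shape == (16, 8)
```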
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/speculative/eagle3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/speculative/eagle3.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/speculative/eagle3.py (1)
181-186: get_hidden_states slicing remains correct under the new draft default. With draft models now capturing only the last layer by default, returning the first hidden_size slice on non-first draft steps still yields the single captured layer’s features, preserving shape/semantics.
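A toy demonstration of why that slice stays correct: with a single captured layer the packed tensor has exactly hidden_size channels, so the first hidden_size columns are that layer. Shapes here are made up.

```python
import torch

hidden_size, num_tokens = 8, 4
captured = [torch.randn(num_tokens, hidden_size)]   # one captured layer (draft default)
packed = torch.cat(captured, dim=-1)                 # layout: [tokens, num_capture * hidden_size]
first_slice = packed[:, :hidden_size]
assert torch.equal(first_slice, captured[0])
```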
37e41dd to e693c30 (Compare)
/bot run
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)
1-1: Add NVIDIA copyright header (2025) at top of file. Per coding guidelines, prepend the NVIDIA copyright header to all .py files.
Apply this diff at the top of the file:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 import json
```
🧹 Nitpick comments (2)
tests/unittest/_torch/speculative/test_eagle3.py (2)
312-319: Comment is misleading; Llama 3 path is parameterized to enable/disable one-model. The inline comment states “Llama 3 does not support one model eagle,” yet the test intentionally toggles eagle3_one_model=use_one_model. This is confusing to readers and future maintainers. Apply this diff to clarify intent:
```diff
-    # Llama 3 does not support one model eagle.
+    # Toggle one-model EAGLE3 path for coverage; some target models may not support it.
```
245-285: Optional: reduce duplication by extracting a helper for Eagle3 test config.
eagle_config construction largely duplicates patterns in this file. Consider a small factory to keep tests concise and consistent. Example helper (outside this hunk):
```python
def build_eagle3_config(hidden_size: int, num_layers: int, max_pos: int, vocab: int) -> dict:
    return {
        'architectures': ['LlamaForCausalLMEagle3'],
        'attention_bias': False,
        'attention_dropout': 0.0,
        'bos_token_id': 128000,
        'eos_token_id': [128001, 128008, 128009],
        'eagle_config': {
            'use_aux_hidden_state': False,
            'use_input_layernorm_in_first_layer': True,
            'use_last_layernorm': True,
            'use_mtp_layernorm': False
        },
        'head_dim': 128,
        'hidden_act': 'silu',
        'hidden_size': hidden_size,
        'initializer_range': 0.02,
        'intermediate_size': 16384,
        'max_position_embeddings': max_pos,
        'mlp_bias': False,
        'model_type': 'llama',
        'num_attention_heads': 32,
        'num_eagle_features': 1,
        'num_hidden_layers': num_layers,
        'num_key_value_heads': 8,
        'pretraining_tp': 1,
        'rms_norm_eps': 1e-05,
        'rope_scaling': {
            'factor': 8.0,
            'high_freq_factor': 4.0,
            'low_freq_factor': 1.0,
            'original_max_position_embeddings': 8192,
            'rope_type': 'llama3'
        },
        'rope_theta': 500000.0,
        'tie_word_embeddings': False,
        'torch_dtype': 'bfloat16',
        'transformers_version': '4.52.4',
        'use_cache': True,
        'vocab_size': vocab,
        'draft_vocab_size': vocab
    }
```
Then call eagle_config = build_eagle3_config(4096, 2, 131072, 128256).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
- tensorrt_llm/_torch/speculative/utils.py (1 hunks)
- tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/speculative/eagle3.py
- tensorrt_llm/_torch/speculative/utils.py
- tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/_torch/speculative/test_eagle3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/unittest/_torch/speculative/test_eagle3.py
🧬 Code graph analysis (1)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
- KvCacheConfig (946-1077)
- CudaGraphConfig (108-165)
- EagleDecodingConfig (421-459)
- speculative_model_dir (1400-1401)
tensorrt_llm/llmapi/llm.py (4)
- LLM (1011-1027)
- tokenizer (691-695)
- tokenizer (698-699)
- generate_async (317-450)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #16605 [ run ] triggered by Bot
PR_Github #16605 [ run ] completed with state
e693c30 to 5868935 (Compare)
/bot run
PR_Github #16740 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)
317-326: Ensure executor shutdown; wrap in try/finally. Not shutting down llm_spec can leak GPU memory and destabilize CI.
Apply:
```diff
-    llm_spec = LLM(**llm_common_config, speculative_config=spec_config)
-
-    tok_ids = llm_spec.tokenizer.encode("The future of AI is")
-
-    sampling_params = SamplingParams(max_tokens=32, temperature=0)
-    for output in llm_spec.generate_async(tok_ids,
-                                          sampling_params,
-                                          streaming=True):
-        pass
+    llm_spec = LLM(**llm_common_config, speculative_config=spec_config)
+    try:
+        tok_ids = llm_spec.tokenizer.encode("The future of AI is")
+        sampling_params = SamplingParams(max_tokens=32, temperature=0)
+        for _output in llm_spec.generate_async(
+            tok_ids, sampling_params, streaming=True
+        ):
+            pass
+    finally:
+        llm_spec.shutdown()
```
🧹 Nitpick comments (5)
tensorrt_llm/_torch/speculative/eagle3.py (2)
92-106: layers_to_capture type is Set[int] but assigned tuples/lists; align the annotation or normalize representation. To reduce confusion and mypy noise, either annotate as Sequence[int] or always coerce to a tuple/list in both branches. Also dedupe when a Set is provided.
Outside this hunk, consider:
```python
# at line 83
layers_to_capture: Optional[Sequence[int]] = None

# normalize at the end of __post_init__
self.layers_to_capture = tuple(sorted(set(self.layers_to_capture)))
self.num_capture_layers = len(self.layers_to_capture)
```
92-106: Ensure EAGLE3 capture-layer defaults are aligned to avoid over-allocation
- The property EagleDecodingConfig.num_capture_layers always returns 3 when eagle3_layers_to_capture is unset, but in Eagle3SpecMetadata.__post_init__ draft models (or single-layer models) default to capturing only one layer, so self.num_capture_layers == 1 while config.num_capture_layers == 3. This discrepancy causes Eagle3ResourceManager to allocate a hidden-states tensor sized for three layers even when only one is written, wasting GPU memory and risking shape mismatches downstream (see llmapi/llm_args.py: num_capture_layers and eagle3.py: Eagle3ResourceManager.__init__).
- Short-term workaround: always pass eagle3_layers_to_capture={num_layers-1} to your EagleDecodingConfig so that config.num_capture_layers == len(layers_to_capture).
- Longer-term recommendations:
  - Derive config.num_capture_layers directly from the metadata's layers_to_capture set instead of using a hard-coded default of 3.
  - Or change the default in EagleDecodingConfig.num_capture_layers to 1 when eagle3_layers_to_capture is unset for EAGLE3, matching the metadata logic.

Files to update:
- tensorrt_llm/llmapi/llm_args.py – adjust the num_capture_layers default
- tensorrt_llm/_torch/speculative/eagle3.py – use the metadata's layer count for resource allocation (a sketch of an alignment guard follows below)
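One possible shape for such a guard, sketched from the attribute names used in this thread (num_capture_layers, layers_to_capture, is_draft_model) rather than the real classes.

```python
def assert_capture_alignment(config, spec_metadata) -> None:
    """Fail fast when the config-side capture count diverges from the draft metadata."""
    if getattr(spec_metadata, "is_draft_model", False):
        expected = len(spec_metadata.layers_to_capture)
        assert config.num_capture_layers == expected, (
            f"config.num_capture_layers={config.num_capture_layers} but the draft metadata "
            f"captures {expected} layer(s); align them before allocating resources")
```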
tests/unittest/_torch/speculative/test_eagle3.py (3)
230-236: Skip unsupported one-model path for Llama 3 to avoid spurious failures. Comment states Llama 3 doesn’t support one-model EAGLE3. Guard the test when use_one_model is True.
Apply:
```diff
 def test_multi_eagle3(use_one_model: bool):
+    if use_one_model:
+        pytest.skip("Llama 3 does not support one-model EAGLE3.")
```
321-325: Nit: avoid unused variable warning. Rename output to _output.
(Handled in the diff above.)
228-236: Add NVIDIA copyright header (2025). Per repo guidelines, prepend the NVIDIA copyright header.
Outside this hunk, at the very top of the file:
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
- tensorrt_llm/_torch/speculative/utils.py (1 hunks)
- tests/unittest/_torch/speculative/test_eagle3.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/models/modeling_speculative.py
- tensorrt_llm/_torch/speculative/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tests/unittest/_torch/speculative/test_eagle3.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tests/unittest/_torch/speculative/test_eagle3.py
🧬 Code graph analysis (1)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tensorrt_llm/llmapi/llm_args.py (4)
- KvCacheConfig (946-1077)
- CudaGraphConfig (108-165)
- EagleDecodingConfig (421-459)
- speculative_model_dir (1400-1401)
tensorrt_llm/llmapi/llm.py (3)
- tokenizer (691-695)
- tokenizer (698-699)
- generate_async (317-450)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)
228-229: LGTM: parametrize now passes booleans. This fixes the earlier issue where lists were passed instead of bools.
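For reference, a minimal illustration of the approved pattern (an invented test body, not the repo's actual test):

```python
import pytest


@pytest.mark.parametrize("use_one_model", [True, False])
def test_multi_eagle3_sketch(use_one_model: bool):
    # Each parametrized case receives a bool, not a list.
    assert isinstance(use_one_model, bool)
```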
PR_Github #16740 [ run ] completed with state
/bot run
PR_Github #16746 [ run ] triggered by Bot
PR_Github #16746 [ run ] completed with state
/bot run
PR_Github #16915 [ run ] triggered by Bot
PR_Github #16915 [ run ] completed with state
98221f0 to 15fad81 (Compare)
/bot run
PR_Github #17406 [ run ] triggered by Bot
PR_Github #17406 [ run ] completed with state
/bot run
PR_Github #17412 [ run ] triggered by Bot
PR_Github #17412 [ run ] completed with state
/bot run
PR_Github #17457 [ run ] triggered by Bot
PR_Github #17457 [ run ] completed with state
/bot run
PR_Github #17470 [ run ] triggered by Bot
PR_Github #17470 [ run ] completed with state
/bot run
PR_Github #17477 [ run ] triggered by Bot
PR_Github #17477 [ run ] completed with state
/bot run
PR_Github #17488 [ run ] triggered by Bot
PR_Github #17488 [ run ] completed with state
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
Signed-off-by: Izzy Putterman <[email protected]>
463efa3 to ebc77ce (Compare)
/bot run
PR_Github #17545 [ run ] triggered by Bot
PR_Github #17545 [ run ] completed with state
Signed-off-by: Izzy Putterman <[email protected]>
Summary by CodeRabbit
New Features
Behavior
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
Kill all running builds associated with pull request.

skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.