
Commit e277766

chores: merge examples for v1.0 doc (#5736)
Signed-off-by: Erin Ho <[email protected]>
1 parent 5ab1cf5 commit e277766

30 files changed: +114 additions, -211 deletions

README.md

Lines changed: 12 additions & 13 deletions
@@ -61,21 +61,22 @@ TensorRT-LLM

  * [02/12] 🌟 How Scaling Laws Drive Smarter, More Powerful AI
    [➡️ link](https://blogs.nvidia.com/blog/ai-scaling-laws/?ncid=so-link-889273&linkId=100000338837832)

- * [01/25] Nvidia moves AI focus to inference cost, efficiency [➡️ link](https://www.fierceelectronics.com/ai/nvidia-moves-ai-focus-inference-cost-efficiency?linkId=100000332985606)

- * [01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions [➡️ link](https://developer.nvidia.com/blog/optimize-ai-inference-performance-with-nvidia-full-stack-solutions/?ncid=so-twit-400810&linkId=100000332621049)
+ <details close>
+ <summary>Previous News</summary>

- * [01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI [➡️ link](https://blogs.nvidia.com/blog/ai-inference-platform/?ncid=so-twit-693236-vt04&linkId=100000332307804)
+ * [2025/01/25] Nvidia moves AI focus to inference cost, efficiency [➡️ link](https://www.fierceelectronics.com/ai/nvidia-moves-ai-focus-inference-cost-efficiency?linkId=100000332985606)

- * [01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM [➡️ link](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/?ncid=so-twit-363876&linkId=100000330323229)
+ * [2025/01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions [➡️ link](https://developer.nvidia.com/blog/optimize-ai-inference-performance-with-nvidia-full-stack-solutions/?ncid=so-twit-400810&linkId=100000332621049)

- * [01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM [➡️ link](https://blogs.bing.com/search-quality-insights/December-2024/Bing-s-Transition-to-LLM-SLM-Models-Optimizing-Search-with-TensorRT-LLM)
+ * [2025/01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI [➡️ link](https://blogs.nvidia.com/blog/ai-inference-platform/?ncid=so-twit-693236-vt04&linkId=100000332307804)

- * [01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding
-   [➡️ link](https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/)
+ * [2025/01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM [➡️ link](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/?ncid=so-twit-363876&linkId=100000330323229)

- <details close>
- <summary>Previous News</summary>
+ * [2025/01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM [➡️ link](https://blogs.bing.com/search-quality-insights/December-2024/Bing-s-Transition-to-LLM-SLM-Models-Optimizing-Search-with-TensorRT-LLM)
+
+ * [2025/01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding
+   [➡️ link](https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding/)

  * [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview
    [➡️ link](https://build.nvidia.com/meta/llama-3_3-70b-instruct)
@@ -204,11 +205,9 @@ Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co

  TensorRT-LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.

- Recently [re-architected with a **PyTorch backend**](https://nvidia.github.io/TensorRT-LLM/torch.html), TensorRT-LLM now combines peak performance with a more flexible and developer-friendly workflow. The original [TensorRT](https://developer.nvidia.com/tensorrt)-based backend remains supported and continues to provide an ahead-of-time compilation path for building highly optimized "[Engines](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#ecosystem)" for deployment. The PyTorch backend complements this by enabling faster development iteration and rapid experimentation.
-
- TensorRT-LLM provides a flexible [**LLM API**](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) to simplify model setup and inference across both PyTorch and TensorRT backends. It supports a wide range of inference use cases from a single GPU to multiple nodes with multiple GPUs using [Tensor Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#tensor-parallelism) and/or [Pipeline Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#pipeline-parallelism). It also includes a [backend](https://github.com/triton-inference-server/tensorrtllm_backend) for integration with the [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).
+ [Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT-LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).

- Several popular models are pre-defined and can be easily customized or extended using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py) (for the PyTorch backend) or a [PyTorch-style Python API](./tensorrt_llm/models/llama/model.py) (for the TensorRT backend).
+ TensorRT-LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.

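The new paragraph above highlights that the LLM API scales from a single GPU to multi-GPU and multi-node deployments. As a rough illustration of what that looks like from Python (a sketch, not part of this commit; the checkpoint name and parallel size are placeholder assumptions), tensor parallelism is requested directly on the `LLM` constructor:

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Placeholder checkpoint; shard the model across 2 GPUs with tensor parallelism.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=2,
    )
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(["The future of AI is"], sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```

Larger deployments swap in bigger parallel sizes without changing the generation code.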

  ## Getting Started

docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md

Lines changed: 4 additions & 4 deletions
@@ -110,10 +110,10 @@ The MTP module follows the design in DeepSeek-V3. The embedding layer and output
  Attention is also a very important component in supporting MTP inference. The changes are mainly in the attention kernels for the generation phase. For the normal request, there will be only one input token in the generation phase, but for MTP, there will be $K+1$ input tokens. Since MTP sequentially predicts additional tokens, the predicted draft tokens are chained. Though we have an MTP Eagle path, currently, we only have the chain-based support for MTP Eagle. So, a causal mask is enough for the attention kernel to support MTP. In our implementation, TensorRT-LLM will use the fp8 flashMLA generation kernel on Hopper GPU, while using TRTLLM customized attention kernels on Blackwell for better performance.

  ### How to run DeepSeek models with MTP
- Run DeepSeek-V3/R1 models with MTP, use [examples/pytorch/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_advanced.py) with additional options:
+ Run DeepSeek-V3/R1 models with MTP, use [examples/llm-api/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/quickstart_advanced.py) with additional options:

  ```bash
- cd examples/pytorch
+ cd examples/llm-api
  python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
  ```
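For intuition, the chain-based acceptance described above can be sketched in plain Python (an illustrative simplification, not TensorRT-LLM's kernel-level implementation): the $K$ chained draft tokens are compared in order against the tokens the target model produces, and acceptance stops at the first mismatch.

```python
def accept_chained_drafts(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Chain-based acceptance: a draft token is kept only if every earlier draft
    token in the chain was kept and it matches the target model's token at the
    same position; the first mismatch ends acceptance."""
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted.append(draft)
    return accepted


# Example: drafts [5, 9, 2] checked against target outputs [5, 9, 7] accept [5, 9].
assert accept_chained_drafts([5, 9, 2], [5, 9, 7]) == [5, 9]
```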

@@ -165,10 +165,10 @@ Note that the Relaxed Acceptance will only be used during the thinking phase, wh

  ### How to run the DeepSeek-R1 model with Relaxed Acceptance

- Run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/pytorch/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_advanced.py) with additional options:
+ Run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/llm-api/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/quickstart_advanced.py) with additional options:

  ```bash
- cd examples/pytorch
+ cd examples/llm-api
  python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
  ```

docs/source/helper.py

Lines changed: 4 additions & 1 deletion
@@ -59,7 +59,10 @@ def extract_meta_info(filename: str) -> Optional[DocMeta]:

  def generate_examples():
      root_dir = Path(__file__).parent.parent.parent.resolve()
-     ignore_list = {'__init__.py', 'quickstart_example.py'}
+     ignore_list = {
+         '__init__.py', 'quickstart_example.py', 'quickstart_advanced.py',
+         'quickstart_multimodal.py', 'star_attention.py'
+     }
      doc_dir = root_dir / "docs/source/examples"

      def collect_script_paths(examples_subdir: str) -> list[Path]:
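The `ignore_list` above is consumed when example scripts are collected into generated documentation pages, so the newly listed quickstart and star-attention scripts are skipped rather than rendered as standalone example pages. A filter along these lines conveys the idea; this is a sketch under that assumption, not the repository's exact implementation:

```python
from pathlib import Path


def collect_script_paths(examples_subdir: str, root_dir: Path,
                         ignore_list: set[str]) -> list[Path]:
    # Gather the Python example scripts under examples/<subdir>, skipping any
    # filename in ignore_list so it does not become a generated doc page.
    scripts = (root_dir / "examples" / examples_subdir).glob("*.py")
    return sorted(path for path in scripts if path.name not in ignore_list)
```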

docs/source/llm-api/index.md

Lines changed: 3 additions & 20 deletions
@@ -2,36 +2,20 @@

  The LLM API is a high-level Python API designed to streamline LLM inference workflows.

- It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+ It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo).

  While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.

- ## Supported Models
-
- * DeepSeek variants
- * Llama (including variants Mistral, Mixtral, InternLM)
- * GPT (including variants Starcoder-1/2, Santacoder)
- * Gemma-1/2/3
- * Phi-1/2/3/4
- * ChatGLM (including variants glm-10b, chatglm, chatglm2, chatglm3, glm4)
- * QWen-1/1.5/2/3
- * Falcon
- * Baichuan-1/2
- * GPT-J
- * Mamba-1/2
-
- > **Note:** For the most up-to-date list of supported models, you may refer to the [TensorRT-LLM model definitions](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/models).

  ## Quick Start Example
  A simple inference example with TinyLlama using the LLM API:

  ```{literalinclude} ../../examples/llm-api/quickstart_example.py
  :language: python
  :linenos:
  ```
- More examples can be found [here]().
+
+ For more advanced usage including distributed inference, multimodal, and speculative decoding, please refer to this [README](../../../examples/llm-api/README.md).
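The `literalinclude` directive above pulls in `examples/llm-api/quickstart_example.py` rather than showing it inline. That example is roughly of the following shape (a sketch of a typical LLM API quickstart; the exact prompts and sampling settings live in the included file):

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Download TinyLlama from the Hugging Face Hub and run batched generation.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```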

  ## Model Input

@@ -65,7 +49,6 @@ llm = LLM(model=<local_path_to_model>)
  > **Note:** Some models require accepting specific [license agreements]((https://ai.meta.com/resources/models-and-libraries/llama-downloads/)). Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.


-
  ## Tips and Troubleshooting

  The following tips typically assist new LLM API users who are familiar with other APIs that are part of TensorRT-LLM:

docs/source/torch/adding_new_model.md

Lines changed: 2 additions & 2 deletions
@@ -196,8 +196,8 @@ if __name__ == '__main__':
      main()
  ```

- We provide an out-of-tree modeling example in `examples/pytorch/out_of_tree_example`. The model is implemented in `modeling_opt.py` and you can run the example by:
+ We provide an out-of-tree modeling example in `examples/llm-api/out_of_tree_example`. The model is implemented in `modeling_opt.py` and you can run the example by:

  ```bash
- python examples/pytorch/out_of_tree_example/main.py
+ python examples/llm-api/out_of_tree_example/main.py
  ```
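The relocated out-of-tree example registers a custom OPT implementation that lives outside the TensorRT-LLM source tree. A minimal `main.py` for such an example plausibly looks like the sketch below, assuming the custom module registers its model class with TensorRT-LLM when imported; the checkpoint name is illustrative:

```python
# Hypothetical main.py for an out-of-tree model; importing the local module is
# assumed to register the custom model class with TensorRT-LLM.
import modeling_opt  # noqa: F401  (the example's custom OPT implementation)

from tensorrt_llm import LLM, SamplingParams


def main():
    # Illustrative checkpoint served by the out-of-tree model implementation.
    llm = LLM(model="facebook/opt-125m")
    for output in llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32)):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```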

examples/llm-api/README.md

Lines changed: 55 additions & 1 deletion
@@ -1,3 +1,57 @@
  # LLM API Examples

- Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/latest/examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.
+ Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/) including [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.
+
+ ## Run the advanced usage example script:
+
+ ```bash
+ # FP8 + TP=2
+ python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8 --tp_size 2
+
+ # FP8 (e4m3) kvcache
+ python3 quickstart_advanced.py --model_dir nvidia/Llama-3.1-8B-Instruct-FP8 --kv_cache_dtype fp8
+
+ # BF16 + TP=8
+ python3 quickstart_advanced.py --model_dir nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 --tp_size 8
+
+ # Nemotron-H requires disabling cache reuse in kv cache
+ python3 quickstart_advanced.py --model_dir nvidia/Nemotron-H-8B-Base-8K --disable_kv_cache_reuse --max_batch_size 8
+ ```
+
+ ## Run the multimodal example script:
+
+ ```bash
+ # default inputs
+ python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality image [--use_cuda_graph]
+
+ # user inputs
+ # supported modes:
+ # (1) N prompt, N media (N requests are in-flight batched)
+ # (2) 1 prompt, N media
+ # Note: media should be either image or video. Mixing image and video is not supported.
+ python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality video --prompt "Tell me what you see in the video briefly." "Describe the scene in the video briefly." --media "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4" "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/world.mp4" --max_tokens 128 [--use_cuda_graph]
+ ```
+
+ ## Run the speculative decoding script:
+
+ ```bash
+ # NGram drafter
+ python3 quickstart_advanced.py \
+     --model_dir meta-llama/Llama-3.1-8B-Instruct \
+     --spec_decode_algo NGRAM \
+     --max_matching_ngram_size=2 \
+     --spec_decode_nextn=4 \
+     --disable_overlap_scheduler
+ ```
+
+ ```bash
+ # Draft Target
+ python3 quickstart_advanced.py \
+     --model_dir meta-llama/Llama-3.1-8B-Instruct \
+     --spec_decode_algo draft_target \
+     --spec_decode_nextn 5 \
+     --draft_model_dir meta-llama/Llama-3.2-1B-Instruct \
+     --disable_overlap_scheduler \
+     --disable_kv_cache_reuse
+ ```

examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py

Lines changed: 0 additions & 1 deletion
@@ -9,7 +9,6 @@ def main():
      # Sample prompts.
      prompts = [
          "Hello, my name is",
-         "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]

examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py

Lines changed: 0 additions & 1 deletion
@@ -9,7 +9,6 @@ def main():
      # Sample prompts.
      prompts = [
          "Hello, my name is",
-         "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]

examples/llm-api/_tensorrt_engine/llm_inference_customize.py

Lines changed: 0 additions & 2 deletions
@@ -30,7 +30,6 @@ def main():
      # Sample prompts.
      prompts = [
          "Hello, my name is",
-         "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]

@@ -48,7 +47,6 @@ def main():

      # Got output like
      # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
-     # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
      # Prompt: 'The capital of France is', Generated text: 'Paris.'
      # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'

examples/llm-api/_tensorrt_engine/llm_medusa_decoding.py

Lines changed: 0 additions & 1 deletion
@@ -11,7 +11,6 @@ def run_medusa_decoding(use_modelopt_ckpt=False, model_dir=None):
      # Sample prompts.
      prompts = [
          "Hello, my name is",
-         "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]
