* [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview
TensorRT-LLM is an open-source library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
- Recently [re-architected with a **PyTorch backend**](https://nvidia.github.io/TensorRT-LLM/torch.html), TensorRT-LLM now combines peak performance with a more flexible and developer-friendly workflow. The original [TensorRT](https://developer.nvidia.com/tensorrt)-based backend remains supported and continues to provide an ahead-of-time compilation path for building highly optimized "[Engines](https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#ecosystem)" for deployment. The PyTorch backend complements this by enabling faster development iteration and rapid experimentation.
- TensorRT-LLM provides a flexible [**LLM API**](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) to simplify model setup and inference across both PyTorch and TensorRT backends. It supports a wide range of inference use cases from a single GPU to multiple nodes with multiple GPUs using [Tensor Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#tensor-parallelism) and/or [Pipeline Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#pipeline-parallelism). It also includes a [backend](https://github.com/triton-inference-server/tensorrtllm_backend) for integration with the [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).
+ [Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT-LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
- Several popular models are pre-defined and can be easily customized or extended using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py) (for the PyTorch backend) or a [PyTorch-style Python API](./tensorrt_llm/models/llama/model.py) (for the TensorRT backend).
+ TensorRT-LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.
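As a concrete illustration of the LLM API and parallelism support described above, the following is a minimal sketch (not taken from the README itself): the model id and `tensor_parallel_size=2` are example values chosen for a two-GPU node.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # Example model id; any supported Hugging Face checkpoint can be used.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
              tensor_parallel_size=2)  # shard weights across two GPUs

    prompts = ["The future of AI is"]
    sampling_params = SamplingParams(max_tokens=32)

    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```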
docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (4 additions & 4 deletions)
@@ -110,10 +110,10 @@ The MTP module follows the design in DeepSeek-V3. The embedding layer and output
Attention is also a very important component in supporting MTP inference. The changes are mainly in the attention kernels for the generation phase. For a normal request, there is only one input token in the generation phase, but for MTP there are $K+1$ input tokens. Since MTP sequentially predicts additional tokens, the predicted draft tokens are chained. Although there is an MTP Eagle path, it currently only supports chain-based decoding, so a causal mask is enough for the attention kernel to support MTP. In our implementation, TensorRT-LLM uses the FP8 flashMLA generation kernel on Hopper GPUs, while using TensorRT-LLM customized attention kernels on Blackwell for better performance.
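As an illustration only (a sketch, not the TensorRT-LLM kernel code), the mask implied above can be written in a few lines of PyTorch: each of the $K+1$ generation-phase tokens attends to every cached context token and causally to the earlier tokens of the chained draft, which is why a plain causal mask is sufficient.

```python
import torch

def mtp_generation_mask(num_cached_tokens: int, k_plus_1: int) -> torch.Tensor:
    # Every new token may attend to the full KV cache of previously processed tokens.
    cache_part = torch.ones(k_plus_1, num_cached_tokens, dtype=torch.bool)
    # Within the K+1 chained draft tokens, attention is strictly causal.
    draft_part = torch.tril(torch.ones(k_plus_1, k_plus_1, dtype=torch.bool))
    return torch.cat([cache_part, draft_part], dim=1)

# Example: 8 cached tokens, MTP with K=3 -> 4 new input tokens per step.
print(mtp_generation_mask(8, 4).int())
```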
### How to run DeepSeek models with MTP
- To run DeepSeek-V3/R1 models with MTP, use [examples/pytorch/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_advanced.py) with additional options:
+ To run DeepSeek-V3/R1 models with MTP, use [examples/llm-api/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/quickstart_advanced.py) with additional options:
```bash
- cd examples/pytorch
+ cd examples/llm-api
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
```
@@ -165,10 +165,10 @@ Note that the Relaxed Acceptance will only be used during the thinking phase, wh
### How to run the DeepSeek-R1 model with Relaxed Acceptance
- To run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/pytorch/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/pytorch/quickstart_advanced.py) with additional options:
+ To run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/llm-api/quickstart_advanced.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/quickstart_advanced.py) with additional options:
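The command block for this section is not reproduced in the excerpt above. A sketch of what the invocation could look like, assuming the Relaxed Acceptance options of `quickstart_advanced.py` follow the pattern of the MTP example (the flag names `--use_relaxed_acceptance_for_thinking`, `--relaxed_topk`, and `--relaxed_delta` are assumptions here; check the script's `--help` for the exact names):

```bash
cd examples/llm-api
# Flag names below are assumed, not confirmed by this excerpt.
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> \
    --spec_decode_algo MTP --spec_decode_nextn N \
    --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
```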
docs/source/llm-api/index.md (3 additions & 20 deletions)
@@ -2,36 +2,20 @@
The LLM API is a high-level Python API designed to streamline LLM inference workflows.
- It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+ It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo).
While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.
> **Note:** For the most up-to-date list of supported models, you may refer to the [TensorRT-LLM model definitions](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/models).
## Quick Start Example
A simple inference example with TinyLlama using the LLM API:
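(The code block itself is collapsed in this view; the following is a minimal sketch of such an example, with illustrative sampling values.)

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
```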
For more advanced usage including distributed inference, multimodal, and speculative decoding, please refer to this [README](../../../examples/llm-api/README.md).
> **Note:** Some models require accepting specific [license agreements](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.
## Tips and Troubleshooting
The following tips typically assist new LLM API users who are familiar with other APIs that are part of TensorRT-LLM:
docs/source/torch/adding_new_model.md (2 additions & 2 deletions)
@@ -196,8 +196,8 @@ if __name__ == '__main__':
main()
```
- We provide an out-of-tree modeling example in `examples/pytorch/out_of_tree_example`. The model is implemented in `modeling_opt.py` and you can run the example by:
+ We provide an out-of-tree modeling example in `examples/llm-api/out_of_tree_example`. The model is implemented in `modeling_opt.py` and you can run the example by:
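The run command itself is not shown in this hunk. Assuming the example directory exposes a `main.py` entry point (a hypothetical name, not confirmed by this excerpt), it would be invoked along these lines:

```bash
cd examples/llm-api/out_of_tree_example
python main.py  # entry-point name is assumed
```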
- Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), [examples](https://nvidia.github.io/TensorRT-LLM/latest/examples/llm_api_examples.html) and [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html) for detailed information and usage guidelines regarding the LLM API.
+ Please refer to the [official documentation](https://nvidia.github.io/TensorRT-LLM/llm-api/), including [customization](https://nvidia.github.io/TensorRT-LLM/examples/customization.html), for detailed information and usage guidelines regarding the LLM API.
# (1) N prompt, N media (N requests are in-flight batched)
# (2) 1 prompt, N media
# Note: media should be either image or video. Mixing image and video is not supported.
python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --modality video --prompt "Tell me what you see in the video briefly." "Describe the scene in the video briefly." --media "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4" "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/world.mp4" --max_tokens 128 [--use_cuda_graph]
examples/llm-api/_tensorrt_engine/llm_inference_customize.py (0 additions & 2 deletions)
@@ -30,7 +30,6 @@ def main():
# Sample prompts.
prompts = [
"Hello, my name is",
- "The president of the United States is",
"The capital of France is",
"The future of AI is",
]
@@ -48,7 +47,6 @@ def main():
# Got output like
# Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
- # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
# Prompt: 'The capital of France is', Generated text: 'Paris.'
# Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'