Merged
Changes from all commits
33 commits
ee20f84
[TRTLLM-6975][test] Add multi-turn test cases for VLM models (#6749)
crazydemo Aug 13, 2025
b7a7977
[None][chore] waive GB300 known issues (#6812)
xinhe-nv Aug 13, 2025
76736a1
[None][fix] fix Llama3 eagle3 test case OOM (#6832)
crazydemo Aug 13, 2025
eea0ebd
[https://nvbugs/5375594][fix] fix oom issue on structural_tag test ca…
nv-guomingz Aug 13, 2025
b2c953f
[https://nvbugs/5401114][fix] Unwaive Gemma3 tests (#6870)
brb-nv Aug 14, 2025
ccd36f4
[TRTLLM-5252][feat] Add fp8 support for Mistral Small 3.1 (#6731)
2ez4bz Aug 14, 2025
60a944c
[None][infra] Setup the code review rule on the release branch (#6725)
yiqingy0 Aug 14, 2025
92e209c
[TRTLLM-6308][feat] Support Aggregate mode for phi4-mm (#6820)
Wanli-Jiang Aug 14, 2025
8928405
[None][fix] Fix batching bug in Mistral3 model (#6841)
2ez4bz Aug 14, 2025
771786f
[None][fix] Revert phi4-mm aggregate mode (#6907)
amukkara Aug 14, 2025
0a4f757
[None][fix] Complete the last missing allreduce op in Llama3/4. (#6850)
hyukn Aug 15, 2025
f519b8c
[None][chore] Add docs for Gemma3 VLMs (#6880)
brb-nv Aug 15, 2025
be7c94f
[None][doc] add legacy section for tensorrt engine (#6724)
Superjomn Aug 15, 2025
e223cdb
[TRTLLM-7048][feat] add benchmark TRT flow test for MIG (#6884)
xinhe-nv Aug 15, 2025
34feef8
[https://nvbugs/5451434][fix] Fix triton docker build (#6898)
Tabrizian Aug 15, 2025
7ecbcc2
[None][ci] unwaive test_ptp_star_attention_example (#6943)
Superjomn Aug 15, 2025
89ddff3
[https://nvbugs/5455836][fix] Fix llama 4 FP4 (#6911)
mikeiovine Aug 15, 2025
0b9c2ca
[None][infra] update CODEOWNERS for release (#6905)
venkywonka Aug 15, 2025
a0edae4
[https://nvbugs/5453667] [fix] reverting a breaking change: make trtl…
venkywonka Aug 16, 2025
c959a07
[https://nvbugs/5405041][fix] Update wide ep doc (#6950)
qiaoxj07 Aug 17, 2025
ebe78d8
[https://nvbugs/5412562][feat] Allocate MoE workspace only when neces…
nv-yilinf Aug 18, 2025
3b8c574
[TRTLLM-6835][fix] Fix potential hang caused by python multiprocessin…
lancelly Aug 18, 2025
f4378c2
[https://nvbugs/5448525][fix] Mistral Small 3.1 accuracy tests (#6909)
2ez4bz Aug 18, 2025
f64603e
[https://nvbugs/5375646][fix] update waives.txt for nvbug 5375646 (#6…
nv-guomingz Aug 18, 2025
ac36633
[None][fix] update skip config (#6891)
crazydemo Aug 18, 2025
caa1897
[https://nvbugs/5449218][fix] Fix KvCacheConfig error in test_perf (#…
peaceh-nv Aug 18, 2025
7e98138
[None][infra] Waive failed tests for release branch 0818 (#6993)
EmmaQiaoCh Aug 18, 2025
788adf2
[None][chore] Remove duplicate test waives (#6999)
yiqingy0 Aug 18, 2025
7c1529b
[None][infra] Cherry-pick #6836 from main branch and improve SSH conn…
chzblych Aug 18, 2025
e6b473a
[https://nvbugs/5449155][fix] Fix DeepSeek R1 weight loading for TP16…
achartier Aug 19, 2025
c5fc171
[https://nvbugs/5374016][fix] improve error message (#6893)
QiJune Aug 19, 2025
3e29505
[https://nvbugs/5474037][fix] Fix building tritonbuild/tritonrelease …
dbari Aug 22, 2025
7040a83
[None][fix] Fix build of tritonbuild/tritonrelease image (#7003)
dbari Aug 20, 2025
1 change: 1 addition & 0 deletions .dockerignore
@@ -9,5 +9,6 @@ examples/**/.git
examples/**/*.bin
examples/**/*.engine
examples/**/*.onnx
examples/**/*.safetensors
examples/**/c-model
examples/models/core/gpt/gpt*
11 changes: 6 additions & 5 deletions .github/CODEOWNERS
@@ -1,10 +1,5 @@
# This file defines code ownership rules for the repository.

# The following rule should only be uncommented on release branches (e.g., release/0.19).
# The rule below requires that any PR to release/**/* branches must be approved by at least one member
# of the NVIDIA/trt-llm-release-branch-approval team, regardless of who else approves the PR.
# Without approval from a member of this team, PRs cannot be merged to release branches.
# * @NVIDIA/trt-llm-release-branch-approval

## TensorRT-LLM Infra
### CI
@@ -160,3 +155,9 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
# from a member of this team, PRs affecting public APIs cannot be merged to main or release branches.
/tests/unittest/api_stability/ @NVIDIA/trt-llm-noncommitted-api-review-committee
/tests/unittest/api_stability/references_committed/ @NVIDIA/trt-llm-committed-api-review-committee

# The following rule should only be uncommented on release branches (e.g., release/0.19).
Collaborator
this should not be done.

Collaborator
Yes, we need to find a way to prevent this from being merged back into main.

Collaborator
curious how this has been handled all this while in previous MIs 🤔

# The rule below requires that any PR to release/**/* branches must be approved by at least one member
# of the NVIDIA/trt-llm-release-branch-approval team, regardless of who else approves the PR.
# Without approval from a member of this team, PRs cannot be merged to release branches.
# * @NVIDIA/trt-llm-release-branch-approval
Comment on lines +159 to +163
Contributor

🛠️ Refactor suggestion

Clarify comment: enabling the catch‑all on release branches will override ALL specific owners (last‑match‑wins).

If you need AND semantics (release approver in addition to path/API owners), CODEOWNERS can’t express it with a trailing *. Use a dedicated CODEOWNERS on release branches that appends @NVIDIA/trt-llm-release-branch-approval to each path (scripted), or enforce via GitHub Rulesets “Required reviewers.” Also modernize the branch example.

-# The following rule should only be uncommented on release branches (e.g., release/0.19).
-# The rule below requires that any PR to release/**/* branches must be approved by at least one member
-# of the NVIDIA/trt-llm-release-branch-approval team, regardless of who else approves the PR.
-# Without approval from a member of this team, PRs cannot be merged to release branches.
+# Release-branch only — DO NOT uncomment on main.
+# Important: CODEOWNERS is last-match-wins. Uncommenting the catch-all below on a release branch
+# will override all specific owners above (including API review committees). If you need AND semantics
+# (release approver in addition to per-path owners), maintain a separate CODEOWNERS on release/**
+# that appends @NVIDIA/trt-llm-release-branch-approval to each path, or enforce via Rulesets “Required reviewers”.
+# Example for release/* (keep commented here):
 # * @NVIDIA/trt-llm-release-branch-approval
🤖 Prompt for AI Agents
.github/CODEOWNERS around lines 159 to 163: the comment about uncommenting a
catch‑all owner on release branches is misleading because a trailing '*' is
last‑match‑wins and will override all specific owners; to fix, update the
comment to state that enabling the catch‑all will replace per‑path owners on
those branches and recommend two proper approaches: (1) create a branch‑specific
CODEOWNERS file on release branches that appends
@NVIDIA/trt-llm-release-branch-approval to each existing path (generate it via a
script during release branch creation), or (2) enforce the additional reviewer
requirement via GitHub Rulesets "Required reviewers" so you get AND semantics;
also update the branch example to a modern format like release/v0.19 or
release/0.x.
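
A scripted variant of the reviewer's suggestion (generate a release-branch CODEOWNERS that appends the release approval team to every existing rule, so release approval is required in addition to the per-path owners) could look roughly like the following Python sketch; the output path and helper name are illustrative assumptions, not part of this PR.

```python
# Hypothetical sketch: build a release-branch CODEOWNERS by appending the
# release approval team to every ownership rule while keeping comments and
# blank lines untouched. Because CODEOWNERS is last-match-wins, this keeps the
# per-path owners effective instead of overriding them with a trailing catch-all.
RELEASE_TEAM = "@NVIDIA/trt-llm-release-branch-approval"


def add_release_team(src=".github/CODEOWNERS", dst=".github/CODEOWNERS.release"):
    out = []
    with open(src) as f:
        for raw in f:
            line = raw.rstrip("\n")
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                out.append(line)  # keep comments and blank lines as-is
            elif RELEASE_TEAM in stripped:
                out.append(line)  # rule already lists the release team
            else:
                out.append(f"{line} {RELEASE_TEAM}")
    with open(dst, "w") as f:
        f.write("\n".join(out) + "\n")


if __name__ == "__main__":
    add_release_team()
```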

41 changes: 28 additions & 13 deletions cpp/tensorrt_llm/thop/moeOp.cpp
@@ -392,8 +392,8 @@ class FusedMoeRunner : public torch::CustomClassHolder
std::vector<int64_t> output_shape = {num_rows, unpadded_hidden_size_val};
auto output = torch::empty(output_shape, input.options().dtype(mOutputDtype));

WorkspaceInfo workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode);
WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode, stream);

auto const quant_params = getQuantParams(num_experts_on_rank, hidden_size, inter_size, quant_scales);
kernels::MoeMinLatencyParams min_latency_params{};
@@ -553,8 +553,8 @@ class FusedMoeRunner : public torch::CustomClassHolder
min_latency_params.experts_to_token_score = static_cast<float*>(experts_to_token_score.data_ptr());
min_latency_params.active_expert_global_ids = static_cast<int*>(active_expert_global_ids.data_ptr());

WorkspaceInfo workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode);
WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode, stream);

auto const quant_params = getQuantParams(num_experts_on_rank, hidden_size, inter_size, quant_scales);

@@ -709,6 +709,7 @@ class FusedMoeRunner : public torch::CustomClassHolder
// e.g. 16 nvfp4 elements are packed into a single int64 element
int64_t mInnerDimMultiplier;
char* mProfileWorkspace = nullptr;
WorkspaceInfo workspace_info;

bool mUseDeepSeekFP8BlockScaling = false;
bool mUseW4GroupScaling = false;
@@ -757,9 +758,9 @@ class FusedMoeRunner : public torch::CustomClassHolder
mKernelRunner->setTactic(best_gemm1_profile, best_gemm2_profile);
}

WorkspaceInfo getWorkspaceInfo(int64_t const num_rows, int64_t const hidden_size, int64_t const inter_size,
WorkspaceInfo const& getWorkspaceInfo(int64_t const num_rows, int64_t const hidden_size, int64_t const inter_size,
int num_experts, int experts_per_token, ActivationType activation_type,
kernels::MOEParallelismConfig const& parallelismConfig, bool min_latency_mode)
kernels::MOEParallelismConfig const& parallelismConfig, bool min_latency_mode, cudaStream_t stream)
{
size_t moe_workspace_size = mKernelRunner->getWorkspaceSize(num_rows, hidden_size, inter_size, num_experts,
experts_per_token, activation_type, parallelismConfig, /* use_lora */ false, mUseDeepSeekFP8BlockScaling,
@@ -768,15 +769,29 @@

std::vector<size_t> workspaces{moe_workspace_size, src_to_dest_map_size};

size_t total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
int64_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());

WorkspaceInfo info{};
info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
info.src_to_dest_map
= common::nextWorkspacePtr(static_cast<int8_t*>(info.workspace.data_ptr()), moe_workspace_size);
bool is_capturing = tensorrt_llm::common::isCapturing(stream);
// Always allocate workspace when capturing cuda graph to avoid illegal memory access during replay
if (is_capturing || workspace_info.workspace.numel() < total_workspace_size)
{
if (is_capturing)
{
TLLM_LOG_DEBUG(
"Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
}
else
{
TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
workspace_info.workspace.numel(), total_workspace_size);
}
workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
}
Comment on lines +772 to +790
Contributor

🛠️ Refactor suggestion

Use size_t for byte sizes and correct printf-specifiers; avoid narrowing to long.

  • calculateTotalWorkspaceSize returns size_t; keep it size_t.
  • Prefer %zu for size_t in logs, and cast int64 prints explicitly to long long (%lld).
  • Avoid static_cast<long>(...) for tensor shapes; Torch expects an int64_t dimension.

Apply this diff:

-        int64_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
+        size_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
@@
-                TLLM_LOG_DEBUG(
-                    "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
+                TLLM_LOG_DEBUG(
+                    "Allocating MoE workspace with %zu bytes size during cuda graph capture", total_workspace_size);
@@
-                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
-                    workspace_info.workspace.numel(), total_workspace_size);
+                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %lld bytes to %zu bytes",
+                    static_cast<long long>(mWorkspaceInfo.workspace.numel()), total_workspace_size);
@@
-            workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
+            mWorkspaceInfo.workspace = torch::empty({static_cast<int64_t>(total_workspace_size)},
                 torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/thop/moeOp.cpp around lines 765-783, change
total_workspace_size to size_t (it’s returned as size_t), update log format
specifiers to use %zu for size_t values, and when printing int64 values like
workspace_info.workspace.numel() cast explicitly to long long and use %lld;
finally avoid static_cast<long>(...) when building the torch::empty size — use
an int64_t cast (e.g., static_cast<int64_t>(total_workspace_size)) so the tensor
dimension type matches Torch’s expected int64_t.

workspace_info.src_to_dest_map
= common::nextWorkspacePtr(static_cast<int8_t*>(workspace_info.workspace.data_ptr()), moe_workspace_size);

return info;
return workspace_info;
}

kernels::QuantParams getQuantParams(int64_t const num_experts_on_rank, int64_t const hidden_size,
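
The allocation strategy in this hunk is: reuse a cached workspace tensor, grow it only when the requested size exceeds the cached capacity, and always allocate a fresh buffer while a CUDA graph is being captured so that replay never touches memory a later resize could free. A simplified PyTorch sketch of that pattern (illustration only, not the actual C++ implementation):

```python
# Simplified Python/PyTorch sketch of the cached-workspace pattern from moeOp.cpp.
import torch


class WorkspaceCache:
    def __init__(self):
        self.buf = torch.empty(0, dtype=torch.int8, device="cuda")

    def get(self, required_bytes: int) -> torch.Tensor:
        capturing = torch.cuda.is_current_stream_capturing()
        # Always allocate during CUDA graph capture so the captured graph owns its
        # workspace; otherwise grow only when the cached buffer is too small.
        if capturing or self.buf.numel() < required_bytes:
            self.buf = torch.empty(required_bytes, dtype=torch.int8, device="cuda")
        return self.buf
```
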
3 changes: 2 additions & 1 deletion docker/Dockerfile.multi
@@ -174,7 +174,8 @@ FROM wheel AS tritonbuild
WORKDIR /src/tensorrt_llm
RUN pip install /src/tensorrt_llm/build/tensorrt_llm*.whl
COPY ./triton_backend/ ./triton_backend/
RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh
ARG TRITON_BASE_TAG
RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh -s "r${TRITON_BASE_TAG%-py3}"

Comment on lines +177 to 179
Contributor

🛠️ Refactor suggestion

Pass TRITON_SHORT_TAG from TRITON_BASE_TAG: good; broaden suffix stripping.

The current pattern strips only “-py3”. If we ever move to tags like “-py3.11”, the suffix won’t be removed. Use “-py3*”.

Apply this diff:

-RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh -s "r${TRITON_BASE_TAG%-py3}"
+RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh -s "r${TRITON_BASE_TAG%-py3*}"
🤖 Prompt for AI Agents
In docker/Dockerfile.multi around lines 177-179, the shell parameter expansion
only strips the literal suffix "-py3" from TRITON_BASE_TAG; change it to strip
any "-py3" plus additional chars by using the pattern with a glob (e.g. use the
%-py3* form) so tags like "-py3.11" are handled; update the ARG/RUN usage to
reference ${TRITON_BASE_TAG%-py3*} (or assign that to TRITON_SHORT_TAG first)
and pass that into the build.sh invocation.


FROM release AS tritonrelease
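
For illustration, the short-tag derivation that the Dockerfile performs with shell parameter expansion (and that the reviewer suggests broadening to the -py3* pattern) is equivalent to the following Python logic; the tag values in the example are made up.

```python
# Hypothetical illustration of deriving the Triton "short tag" passed to build.sh -s.
import re


def triton_short_tag(base_tag: str) -> str:
    # Drop a trailing "-py3" and anything after it (covers "-py3" and "-py3.11"),
    # then add the "r" prefix that the build script expects.
    return "r" + re.sub(r"-py3.*$", "", base_tag)


print(triton_short_tag("25.05-py3"))     # r25.05
print(triton_short_tag("25.05-py3.11"))  # r25.05
```
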
2 changes: 1 addition & 1 deletion docker/Makefile
@@ -137,7 +137,7 @@ CODE_DIR ?= /code/tensorrt_llm
EXTRA_VOLUMES ?=
CCACHE_DIR ?= $(CODE_DIR)/cpp/.ccache
CONAN_DIR ?= $(CODE_DIR)/cpp/.conan
USER_CACHE_DIR ?= $(HOME_DIR)/.cache
USER_CACHE_DIR ?= $(shell readlink -f "${HOME_DIR}/.cache")
RUN_CMD ?=
CONTAINER_NAME ?= tensorrt_llm
WORK_DIR ?= $(CODE_DIR)
6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -160,6 +160,12 @@ Welcome to TensorRT-LLM's Documentation!
   blogs/XQA-kernel.md
   blogs/tech_blog/*

.. toctree::
   :maxdepth: 2
   :caption: Use TensorRT Engine
   :hidden:

   legacy/tensorrt_quickstart.md

Indices and tables
==================
9 changes: 9 additions & 0 deletions docs/source/legacy/tensorrt_quickstart.md
@@ -0,0 +1,9 @@
# LLM API with TensorRT Engine
A simple inference example with TinyLlama using the LLM API:

```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
    :language: python
    :linenos:
```
Comment on lines +4 to +7
Contributor

💡 Verification agent

🧩 Analysis chain

Fix literalinclude relative path (currently off by one level).

The code block includes the script via "../../examples/…", but this document lives under docs/source/legacy. To reach the repo-level examples directory, you need to go up three levels (not two). The link below (Line 9) already uses "../../../…", which further indicates the mismatch.

Apply this diff:

-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
     :language: python
     :linenos:


Build docs locally to catch include errors:

---

🏁 Script executed:

#!/bin/bash
# Strict docs build to fail on warnings
python -m pip install -r docs/requirements.txt
sphinx-build -nW docs/source docs/_build/html

Length of output: 11225


Let’s bypass the PATH issue and invoke Sphinx as a module:


🏁 Script executed:

#!/bin/bash
# Strict docs build to fail on warnings (invoke via module to avoid PATH problems)
python -m sphinx -nW docs/source docs/_build/html

Length of output: 2359


🏁 Script executed:

#!/bin/bash
# Verify existence of example file via both relative paths from the docs file
BASE="docs/source/legacy"
INCORRECT="../../examples/llm-api/_tensorrt_engine/quickstart_example.py"
CORRECT="../../../examples/llm-api/_tensorrt_engine/quickstart_example.py"

echo "Checking incorrect path (../../examples/...):"
if [[ -f "${BASE}/${INCORRECT}" ]]; then
  echo "FOUND: ${BASE}/${INCORRECT}"
else
  echo "MISSING: ${BASE}/${INCORRECT}"
fi

echo "Checking corrected path (../../../examples/...):"
if [[ -f "${BASE}/${CORRECT}" ]]; then
  echo "FOUND: ${BASE}/${CORRECT}"
else
  echo "MISSING: ${BASE}/${CORRECT}"
fi

Length of output: 949


Fix literalinclude path in docs/source/legacy/tensorrt_quickstart.md

The current {literalinclude} goes up only two levels and fails to locate the example; it must go up three levels to reach the repository’s examples directory (verified via shell check).

Apply this patch:

--- docs/source/legacy/tensorrt_quickstart.md
@@ lines 4-7
-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
     :language: python
     :linenos:

🤖 Prompt for AI Agents
In docs/source/legacy/tensorrt_quickstart.md around lines 4 to 7 the
{literalinclude} path incorrectly goes up two directories and cannot find the
example; change the include to go up three directories so it points to
../../../examples/llm-api/_tensorrt_engine/quickstart_example.py instead of
../../examples/llm-api/_tensorrt_engine/quickstart_example.py, keeping the
:language: python and :linenos: options intact.


For more advanced usage including distributed inference, multimodal, and speculative decoding, please refer to this [README](../../../examples/llm-api/README.md).
Contributor

💡 Verification agent

🧩 Analysis chain

Avoid linking to files outside the docs tree (may break on published site).

Relative link to ../../../examples/... likely won’t exist in the built site output. Prefer an absolute GitHub URL or move/duplicate the README content into the docs tree and link via :doc: or toctree.

To validate locally (fail on warnings and catch include/link issues):


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python -m pip install -r docs/requirements.txt
# Build HTML with nitpicky warnings as errors
python -m sphinx -nW docs/source docs/_build/html
# Spot-check the target file exists relative to this doc (sanity)
BASE="docs/source/legacy"
TARGET="../../../examples/llm-api/README.md"
if [[ -f "${BASE}/${TARGET}" ]]; then
  echo "Found file at ${BASE}/${TARGET} (but it still won't be copied into the site)."
else
  echo "File not found at ${BASE}/${TARGET} — link will be broken in source as well."
fi

Length of output: 13460


🏁 Script executed:

#!/bin/bash
# Verify existence of the linked README relative to the docs tree
if [[ -f docs/source/legacy/../../../examples/llm-api/README.md ]]; then
  echo "Found file at docs/source/legacy/../../../examples/llm-api/README.md"
else
  echo "File not found at docs/source/legacy/../../../examples/llm-api/README.md"
fi

Length of output: 212


Ensure documentation links remain within the built site

Linking directly to ../../../examples/llm-api/README.md will work in your local repo checkout but won’t be included in the Sphinx‐built HTML (files outside docs/source aren’t copied into _build/html). This will lead to a broken link on the published documentation site.

To resolve, choose one of the following approaches:

  • Move or duplicate the contents of examples/llm-api/README.md into the docs tree (e.g. docs/source/legacy/llm-api.md) and update the link to use a Sphinx doc reference (:doc: or via your toctree).
  • Use an absolute GitHub URL if you prefer not to duplicate content, for example:
    For more advanced usage … please refer to this `README on GitHub <https://github.com/your-org/your-repo/blob/main/examples/llm-api/README.md>`_.

Please update the link accordingly to prevent broken references in the published site.


🤖 Prompt for AI Agents
In docs/source/legacy/tensorrt_quickstart.md around line 9, the relative link
"../../../examples/llm-api/README.md" points outside the docs tree and will be
broken in the Sphinx-built site; fix by either moving/duplicating the README
content into the docs tree (e.g. docs/source/legacy/llm-api.md) and updating the
reference to a Sphinx doc link (:doc: or toctree), or replace the relative path
with an absolute GitHub URL to the README (using an rst external link) so the
published site resolves the target.

33 changes: 33 additions & 0 deletions examples/llm-api/_tensorrt_engine/quickstart_example.py
@@ -0,0 +1,33 @@
from tensorrt_llm import LLM, SamplingParams
Contributor

🛠️ Refactor suggestion

Add NVIDIA SPDX header per repo guidelines.

Examples are source files and should carry the NVIDIA SPDX header.

Apply this diff to prepend the header:

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
 from tensorrt_llm import LLM, SamplingParams
🤖 Prompt for AI Agents
In examples/llm-api/_tensorrt_engine/quickstart_example.py around lines 1 to 1,
the file is missing the required NVIDIA SPDX header; prepend the repository's
standard NVIDIA SPDX header block (e.g., SPDX-License-Identifier and
copyright/ownership lines as specified by project guidelines) at the very top of
the file before any imports so the example carries the required license
metadata.



def main():

# Model could accept HF model name, a path to local HF model,
# or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

Comment on lines +1 to +9
Contributor

🛠️ Refactor suggestion

Import the TRT engine LLM and accept --engine_dir.
Otherwise this “TRT engine” example runs the Torch LLM.

-from tensorrt_llm import LLM, SamplingParams
+import argparse
+from tensorrt_llm._tensorrt_engine import LLM, SamplingParams
@@
 def main():
-
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--engine_dir", type=str, default=None)
+    args = parser.parse_args()
@@
-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", engine_dir=args.engine_dir)

# Sample prompts.
prompts = [
"Hello, my name is",
"The capital of France is",
"The future of AI is",
]

# Create a sampling params.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
print(
f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
)

# Got output like
# Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
# Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
# Prompt: 'The capital of France is', Generated text: 'Paris.'
# Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == '__main__':
main()
5 changes: 2 additions & 3 deletions examples/llm-api/llm_runtime.py
@@ -29,8 +29,7 @@ def example_cuda_graph_config():
cuda_graph_config=cuda_graph_config, # Enable CUDA graphs
max_batch_size=4,
max_seq_len=512,
kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8,
enable_block_reuse=True))
kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5))

prompts = [
"Hello, my name is",
@@ -56,7 +55,7 @@ def example_kv_cache_config():
max_batch_size=8,
max_seq_len=1024,
kv_cache_config=KvCacheConfig(
free_gpu_memory_fraction=0.85,
free_gpu_memory_fraction=0.5,
enable_block_reuse=True))

prompts = [
10 changes: 8 additions & 2 deletions examples/llm-api/quickstart_example.py
@@ -1,11 +1,17 @@
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm import BuildConfig, SamplingParams
Contributor

🛠️ Refactor suggestion

Add NVIDIA SPDX header per repo guidelines.

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
 from tensorrt_llm import BuildConfig, SamplingParams
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_example.py around line 1, the file is missing the
required NVIDIA SPDX header; add the standard NVIDIA SPDX header block at the
very top of the file (before any imports) as a Python comment: include the
SPDX-FileCopyrightText line with the correct year and owner per repo guidelines
and the SPDX-License-Identifier line used by the project, then leave a blank
line and keep the existing import statement.

from tensorrt_llm._tensorrt_engine import LLM # NOTE the change
Comment on lines +1 to +2
Contributor

🛠️ Refactor suggestion

⚠️ Potential issue

Use public API import for LLM; avoid private module path

End-user examples should import LLM from the public package namespace. Importing from _tensorrt_engine is an internal detail and may change without notice.

Apply this diff:

-from tensorrt_llm import BuildConfig, SamplingParams
-from tensorrt_llm._tensorrt_engine import LLM  # NOTE the change
+from tensorrt_llm import BuildConfig, SamplingParams, LLM
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_example.py around lines 1 to 2, the code imports
LLM from a private module path (tensorrt_llm._tensorrt_engine) instead of the
public package API; change the import to pull LLM from the public package (e.g.,
from tensorrt_llm import LLM alongside BuildConfig and SamplingParams) so
examples rely only on the stable public surface area and avoid referencing
internal modules.



def main():

build_config = BuildConfig()
build_config.max_batch_size = 256
build_config.max_num_tokens = 1024

# Model could accept HF model name, a path to local HF model,
# or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
build_config=build_config)

# Sample prompts.
prompts = [
83 changes: 83 additions & 0 deletions examples/llm-api/quickstart_multimodal.py
@@ -122,6 +122,15 @@ def add_multimodal_args(parser):
" ├── __init__.py"
" ├── <model_name>.py"
" └── <sub_dirs>"))
# Add multiturn conversation related parameters
parser.add_argument("--multiturn",
action="store_true",
help="Enable multi-turn conversation mode.")
parser.add_argument(
"--conversation_turns",
type=int,
default=2,
help="Number of conversation turns for automated testing.")
return parser


@@ -188,6 +197,80 @@ def main():
f"Unsupported model_type: {model_type} found!\n" \
f"Supported types: {MULTIMODAL_PLACEHOLDER_REGISTRY.get_registered_model_types()}"

# If multiturn mode is enabled
if args.multiturn:
# Run predefined multiturn conversation examples
assert args.prompt is not None, "Please provide a prompt for multiturn conversation."
assert args.media is not None, "Please provide media for multiturn conversation."
# Determine how many turns to run
max_turns = min(args.conversation_turns, len(args.prompt))
generated_outputs = [] # Store generated outputs for return

# Initialize conversation history with the first prompt
conversation_history = args.prompt[0] if args.prompt else ""

for i in range(max_turns):
print(f"\n--- Turn {i+1} ---")

try:
# Use multimodal input loader to process input with conversation context
# Use accumulated conversation history instead of just the current prompt
cur_prompt = conversation_history
inputs = default_multimodal_input_loader(
tokenizer=llm.tokenizer,
model_dir=llm._hf_model_dir,
model_type=model_type,
modality=args.modality,
prompts=[cur_prompt],
media=args.media,
image_data_format="pt",
num_frames=8,
device="cpu")

Comment on lines +219 to +229
Contributor

⚠️ Potential issue

Fix: nested-media modalities will assert in loader; also honor CLI image_format/num_frames/device.

default_multimodal_input_loader asserts when len(prompts)=1 and media is a list-of-lists (e.g., modality="image_audio"). Your current call passes prompts=[cur_prompt] and media=args.media unchanged, which will trip the assert for nested media. Additionally, the code ignores user CLI values and hardcodes "pt"/8/"cpu".

Apply this refactor to (1) select a single sample for nested-media modalities, (2) pass user-specified format/frames/device, and (3) keep model_dir type consistent:

-                inputs = default_multimodal_input_loader(
-                    tokenizer=llm.tokenizer,
-                    model_dir=llm._hf_model_dir,
-                    model_type=model_type,
-                    modality=args.modality,
-                    prompts=[cur_prompt],
-                    media=args.media,
-                    image_data_format="pt",
-                    num_frames=8,
-                    device="cpu")
+                # For nested-media (e.g., image_audio = [ [img,aud], [img,aud], ... ]),
+                # pick one sample to pair with a single-turn prompt. For flat media
+                # (image/video/audio), 1 prompt + N media is supported by the loader.
+                media_for_turn = args.media
+                if isinstance(media_for_turn, list) and media_for_turn and isinstance(media_for_turn[0], list):
+                    media_for_turn = [media_for_turn[0]]
+
+                inputs = default_multimodal_input_loader(
+                    tokenizer=llm.tokenizer,
+                    model_dir=str(llm._hf_model_dir),
+                    model_type=model_type,
+                    modality=args.modality,
+                    prompts=[cur_prompt],
+                    media=media_for_turn,
+                    image_data_format=image_format,
+                    num_frames=args.num_frames,
+                    device=args.device)

Follow-up: If you want to reuse the same nested media across turns, consider extracting the first sample once outside the loop and reusing it to avoid repeated conditionals.

🤖 Prompt for AI Agents
In examples/llm-api/quickstart_multimodal.py around lines 219 to 229, the call
to default_multimodal_input_loader will assert for nested-media modalities when
prompts=[cur_prompt] because media may be a list-of-lists; also
image_data_format/num_frames/device are hardcoded and model_dir type must remain
consistent. Fix by: if media is nested (e.g., a list and its first element is a
list) and you are sending a single prompt, select the corresponding single-media
sample (media = media[0]) before calling the loader; replace hardcoded "pt", 8
and "cpu" with args.image_format, args.num_frames, and args.device respectively;
and ensure the model_dir argument uses the same type the loader expects (use
llm._hf_model_dir as-is or cast consistently). Keep this selection logic minimal
and consider hoisting the single-sample extraction outside any loop if you want
to reuse the same nested media across turns.

lora_request = None
if args.load_lora:
if model_class is None:
raise ValueError(
"model_class must be provided when load_lora is True"
)
lora_request = model_class.lora_request(
len(inputs), args.modality, llm._hf_model_dir)

# Generate response
outputs = llm.generate(inputs,
sampling_params,
lora_request=lora_request)
assert outputs and len(
outputs) > 0 and outputs[0].outputs and len(
outputs[0].outputs) > 0
response = outputs[0].outputs[0].text.strip()

# Store generated output
generated_outputs.append({
"turn": i + 1,
"user_input": cur_prompt,
"assistant_response": response,
"media": args.media
})

conversation_history = conversation_history + "\n" + response
if i + 1 < len(args.prompt):
conversation_history = conversation_history + "\n" + args.prompt[
i + 1]

except Exception as e:
print(f"Error in turn {i+1}: {e}")
import traceback
traceback.print_exc()
continue

for i, output in enumerate(generated_outputs):
print(
f"[{i}] Prompt: {output['user_input']!r}, Generated text: {output['assistant_response']!r}"
)
return

# Original single-turn processing logic
# set prompts and media to example prompts and images if they are not provided
if args.prompt is None:
args.prompt = example_medias_and_prompts[args.modality]["prompt"]