TensorRT-LLM Release 0.19.0
Key Features and Enhancements
- The C++ runtime is now open-sourced.
- PyTorch workflow
  - Added DeepSeek V3/R1 support. Refer to `examples/deepseek_v3/README.md`, and also to the blog `docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md`.
  - Added Llava-Next support.
  - Added BERT support.
  - Added a C++-based decoder, which added support for:
    - TopK / TopP.
    - Bad words.
    - Stop words.
    - Embedding bias.
  - Added Autotuner for custom-op-compatible tuning process.
  - Added a Python-based Autotuner core framework for kernel tuning.
  - Applied the Autotuner to fused MoE and NVFP4 linear operators for concept and performance evaluations.
  - Added guided decoding support (XGrammar integration).
  - Added pipeline parallelism support for the overlap scheduler in `PyExecutor`.
  - Added Qwen2VL model support.
  - Added mixed precision quantization support.
  - Added pipeline parallelism with attention DP support.
  - Added no-cache attention support.
  - Added `PeftCacheManager` support.
  - Added Qwen2.5-VL support and refactored Qwen2-VL.
  - Added trtllm-gen FP4 GEMM support.
  - Added Qwen2 MoE support.
  - Applied `AutoTuner` to both Fused MoE and NVFP4 Linear operators.
  - Introduced a `UserBuffers` allocator.
  - Added DeepSeek eager mode AllReduce fusion support.
  - Added Multi-Token Prediction (MTP) support. Refer to the "Multi-Token Prediction (MTP)" section of `examples/deepseek_v3/README.md`.
  - Added FlashMLA support for SM90.
  - Added support for enabling MTP with CUDA graph padding.
  - Added initial EAGLE-3 implementation.
  - Added support for FP8 MLA on NVIDIA Hopper and Blackwell GPUs.
- AutoDeploy for PyTorch workflow
  - The AutoDeploy for PyTorch workflow is an experimental feature in `tensorrt_llm._torch.auto_deploy`.
  - AutoDeploy provides an automated path from off-the-shelf models to optimized deployment in the TensorRT-LLM runtime.
  - Check out `examples/auto_deploy/README.md` for more details.
- LLM API
  - [BREAKING CHANGE] Added dynamic logits processor support and deprecated the static logits processor.
  - Added batched logits processor support.
  - Added EAGLE support.
  - Added abort request support.
  - Added `get_stats` support (see the sketch at the end of this section).
  - Added multi-node support for Slurm-based clusters. Refer to `examples/llm-api/llm_mgmn_*.sh`.
- Added InternLM-XComposer2 support. Refer to the "InternLM-XComposer2" section in `examples/multimodal/README.md`.
- Added INT4-AWQ support for MoE models. Refer to the "AWQ Quantization" section in `examples/mixtral/README.md`.
- Added Qwen2-Audio support. Refer to `examples/qwen2audio/README.md`.
- Added Language-Adapter support. Refer to `examples/language_adapter/README.md`.
- Added STDiT for OpenSoRA text-to-video support. Refer to `examples/stdit/README.md`.
- Added vision encoders with tensor parallelism and context parallelism support. Refer to `examples/vit/README.md`.
- Added EXAONE-Deep support. Refer to `examples/exaone/README.md`.
- Added support for Phi-4-mini and Phi-4-MM.
- Added Gemma3 text-only model support. Refer to the "Run Gemma 3" section in `examples/gemma/README.md`.
- Added FP8 quantization support for Qwen2-VL.
- Added batched inference support for the LLM API MMLU example `examples/mmlu_llmapi.py`.
- Added FP4 quantization-layernorm fusion plugin support (Llama models only).
- Added Mamba-Hybrid support.
- Added NVILA video support, including 1-prompt/N-media and N-prompt/N-media batching modes.
- Added a `--quantize_lm_head` option to `examples/quantization/quantize.py` to support `lm_head` quantization.
- Added batched tensor FP4 quantization support.
- Added a `/metrics` endpoint for `trtllm-serve` to log iteration statistics (see the sketch at the end of this section).
- Added LoRA support for the Phi-2 model.
- Added returning context logits support for `trtllm-serve`.
- Added one-shot version for UserBuffer AllReduce-Normalization on FP16/BF16.
- Added request BW metric measurement for `disaggServerBenchmark`.
- Updated logits bitmask kernel to v3.
- Enabled CUDA graphs when attention DP was used and active requests on different GPUs were uneven.
- Added iteration log support for `trtllm-bench`.
- `fp8_blockscale_gemm` is now open-sourced.
- Added AWQ support for ModelOpt checkpoints.
- Added Linear block scale layout support in FP4 quantization.
- Added pre-quantized FP8 checkpoint support for Nemotron-mini-4b-instruct.
- Added Variable-Beam-Width-Search (VBWS) support (part 2).
- Added LoRA support for Gemma.
- Refactored scaffolding worker, added OpenAI API worker support.
- Optionally split MoE inputs into chunks to reduce GPU memory usage.
- Added UCX IP interface support.
- [BREAKING CHANGE] Added output of first token to additional generation outputs.
- Added FP8 support for SM120 architecture.
- Registered `ENABLE_MULTI_DEVICE` and `ENABLE_UCX` as CMake options.
- Made the scaffolding Controller more generic.
- [BREAKING CHANGE] Added individual gatherContext support for each additional output.
- Enabled the `PyExecutor` inference flow to estimate `max_num_tokens` for `kv_cache_manager`.
- Added `TLLM_OVERRIDE_LAYER_NUM` and `TLLM_TRACE_MODEL_FORWARD` environment variables for debugging.
- Supported aborting disconnected requests.
- Added an option to run disaggregated serving without context servers.
- Fixed and improved allreduce and fusion kernels.
- Enhanced the integrated robustness of scaffolding via `__init__.py`.
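
As a quick, hedged illustration of the new `get_stats` support in the LLM API, the sketch below generates a short completion and then prints per-iteration runtime statistics. The model ID is only an example, and the exact `get_stats` signature and return type may differ; consult the LLM API reference.

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal sketch of the LLM API with the newly added get_stats support.
# The model ID below is only an example; any supported checkpoint works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for output in outputs:
    print(output.outputs[0].text)

# Newly added in this release: retrieve runtime iteration statistics.
# The exact signature and payload may differ; see the LLM API reference.
for stats in llm.get_stats(timeout=2):
    print(stats)
```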
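Similarly, a minimal sketch of polling the new `/metrics` endpoint of `trtllm-serve`, assuming a server is already running locally on the default port 8000 (the port and response format are assumptions; check the `trtllm-serve` documentation):

```python
import requests

# Assumes `trtllm-serve <model>` is already running locally on port 8000.
# The /metrics endpoint added in this release reports iteration statistics.
response = requests.get("http://localhost:8000/metrics", timeout=5)
response.raise_for_status()
print(response.text)
```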
API Changes
- Exposed `kv_cache_retention_config` from the C++ `executor` API to the LLM API.
- Moved `BuildConfig` arguments to `LlmArgs` (see the sketch at the end of this section).
- Removed speculative decoding parameters from stateful decoders.
- Exposed `DecoderState` via bindings and integrated it in the decoder.
- Refactored `LlmArgs` with Pydantic and migrated the remaining pybinding configurations to Python.
- Refactored disaggregated serving scripts.
- Added `numNodes` to `ParallelConfig`.
- Redesigned the multi-stream API for DeepSeek.
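
As a small, hedged illustration of the `BuildConfig`-to-`LlmArgs` move, the sketch below passes build-time limits directly as LLM keyword arguments instead of wrapping them in a `BuildConfig` object. The field names shown are assumptions and the accepted set may differ by version; verify them against the `LlmArgs` reference.

```python
from tensorrt_llm import LLM

# Sketch only: BuildConfig-style limits passed directly as LlmArgs keyword
# arguments. Field names such as max_batch_size / max_num_tokens / max_seq_len
# are assumptions here; verify them against the LlmArgs documentation.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_batch_size=8,
    max_num_tokens=4096,
    max_seq_len=2048,
)
```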
Fixed Issues
- Fixed the misused length argument of `PluginField`. This also fixes #2685.
- Fixed a Llama-3.2 SmoothQuant convert checkpoint issue. (#2677)
- Fixed a bug when loading an engine using LoRA through the LLM API. (#2782)
- Fixed incorrect batch slot usage in the `addCumLogProbs` kernel.
- Fixed incorrect output for Llama-3.2-11B-Vision-Instruct. (#2796)
- Removed the need for `--extra-index-url https://pypi.nvidia.com` when running `pip install tensorrt-llm`.
Infrastructure Changes
- The dependent NVIDIA ModelOpt version is updated to 0.27.
Known Issues
- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the PyTorch NGC Container for optimal support on SBSA platforms.