TensorRT-LLM Release 0.17.0 #2726
zeroepoch announced in Announcements
Hi,
We are very pleased to announce the 0.17.0 version of TensorRT-LLM. This update includes:
Model Support
- Refer to examples/multimodal/README.md.
Features
- Blackwell support for the LLM API and trtllm-bench command.
- PyTorch workflow in tensorrt_llm._torch. The following is a list of supported infrastructure, models, and features that can be used with the PyTorch workflow.
- Added FP8 context FMHA support for the W4A8 quantization workflow.
- Added ModelOpt quantized checkpoint support for the LLM API.
- Added support for min_p. Refer to https://arxiv.org/pdf/2407.01082.
- Added FP8 support for encoder-decoder models. Refer to the "FP8 Post-Training Quantization" section in examples/enc_dec/README.md.
- Added up and gate projection fusion support for LoRA modules.
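Among the features above, min_p sampling (per the linked paper) keeps only tokens whose probability is at least min_p times the probability of the most likely token, then renormalizes. The sketch below is an illustrative NumPy implementation of that filtering rule, not TensorRT-LLM's actual kernel; the function name `min_p_filter` is hypothetical.

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float) -> np.ndarray:
    """Illustrative min-p sampling filter (https://arxiv.org/pdf/2407.01082).

    Tokens with probability below min_p * max_prob are zeroed out,
    and the surviving probabilities are renormalized.
    """
    # Softmax with the usual max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Dynamic threshold: scale min_p by the top token's probability.
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()
```

For example, with token probabilities [0.5, 0.3, 0.15, 0.05] and min_p=0.5, the threshold is 0.25, so only the first two tokens survive and are renormalized.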
API
- paged_context_fmha and fp8_context_fmha are enabled by default.
- When paged_context_fmha is enabled, tokens_per_block is set to 32 by default.
- Added --concurrency support for the throughput subcommand of trtllm-bench.
Bug fixes
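With paged KV caching, tokens_per_block determines how many tokens each KV-cache block holds, so a sequence occupies ceil(seq_len / tokens_per_block) blocks. The helper below is purely illustrative of that arithmetic under the new default of 32; `kv_cache_blocks` is not a TensorRT-LLM API.

```python
import math

def kv_cache_blocks(seq_len: int, tokens_per_block: int = 32) -> int:
    """Hypothetical helper: number of paged KV-cache blocks a sequence
    of seq_len tokens occupies, given the block size (default 32,
    matching the new default)."""
    return math.ceil(seq_len / tokens_per_block)
```

For instance, a 100-token sequence needs 4 blocks at the default block size, since the final 4 tokens still occupy a whole block.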
- Fixed cluster_key for the auto parallelism feature. ([feature request] Can we add H200 in infer_cluster_key() method? #2552)