TensorRT-LLM 0.8.0 Release #1193

kaiyux (Maintainer) · 2024-02-29T09:54:50Z

Hi,

We are very pleased to announce the 0.8.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

Model Support
- Phi-1.5/2.0
- Mamba support (see examples/mamba/README.md)
  - The support is limited to beam width = 1 and single-node single-GPU
- Nougat support (see examples/multimodal/README.md#nougat)
- Qwen-VL support (see examples/qwenvl/README.md)
- RoBERTa support, thanks to the contribution from @erenup
- Skywork model support
- Add example for multimodal models (BLIP with OPT or T5, LLaVA)
Features
- Chunked context support (see docs/source/gpt_attention.md#chunked-context)
- LoRA support for C++ runtime (see docs/source/lora.md)
- Medusa decoding support (see examples/medusa/README.md)
  - The support is limited to the Python runtime on Ampere or newer GPUs with FP16 or BF16 precision, and the temperature parameter of the sampling configuration should be 0
- StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
- Support for batch manager to return logits from context and/or generation phases
  - Include support in the Triton backend
- Support AWQ and GPTQ for Qwen
- Support ReduceScatter plugin
- Support for combining repetition_penalty and presence_penalty (#274)
- Support for frequency_penalty (#275)
- OOTB functionality support:
  - Baichuan
  - InternLM
  - Qwen
  - BART
- LLaMA
  - Support enabling INT4-AWQ along with FP8 KV Cache
  - Support BF16 for weight-only plugin
- Baichuan
  - P-tuning support
  - INT4-AWQ and INT4-GPTQ support
- Decoder iteration-level profiling improvements
- Add masked_select and cumsum functions for modeling
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Add weight-only support to Whisper (#794), thanks to the contribution from @Eddie-Wang1120
- Support FP16 fMHA on NVIDIA V100 GPU
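To illustrate how the three sampling penalties listed above can combine, here is a minimal sketch in plain Python. It follows the commonly used definitions (a multiplicative repetition penalty, plus additive presence and frequency penalties) and is an assumption about the general technique, not a copy of the TensorRT-LLM kernels. The function name `apply_penalties` and the order in which the penalties are applied are hypothetical choices made for this example.

```python
from collections import Counter
from typing import List, Sequence


def apply_penalties(
    logits: Sequence[float],
    generated_ids: Sequence[int],
    repetition_penalty: float = 1.0,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
) -> List[float]:
    """Return a penalized copy of `logits` (one value per vocabulary id).

    Illustrative only: the application order is an assumption of this sketch.
    """
    out = list(logits)
    counts = Counter(generated_ids)
    for token_id, count in counts.items():
        value = out[token_id]
        # Multiplicative repetition penalty: shrink positive logits,
        # push negative logits further down.
        if value > 0:
            value /= repetition_penalty
        else:
            value *= repetition_penalty
        # Additive penalties: a flat cost for having appeared at all,
        # plus a cost proportional to the number of occurrences.
        value -= presence_penalty
        value -= frequency_penalty * count
        out[token_id] = value
    return out


logits = [2.0, -1.0, 0.5, 3.0]
history = [0, 0, 3]  # token 0 was generated twice, token 3 once
penalized = apply_penalties(
    logits,
    history,
    repetition_penalty=2.0,
    presence_penalty=0.5,
    frequency_penalty=0.25,
)
print(penalized)  # [0.0, -1.0, 0.5, 0.75]
```

Tokens that never appeared in the history (ids 1 and 2 here) keep their original logits. With the default arguments the function returns the logits unchanged.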
API
- Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
- [BREAKING CHANGES] Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
- [BREAKING CHANGES] Deprecate the LayerNorm and RMSNorm plugins and remove the corresponding build parameters
- [BREAKING CHANGES] Remove optional parameter maxNumSequences for GPT manager
Bug fixes
- Fix the abnormal first token when --gather_all_token_logits is enabled (#639)
- Fix LLaMA with LoRA enabled build failure (#673)
- Fix InternLM SmoothQuant build failure (#705)
- Fix Bloom int8_kv_cache functionality (#741)
- Fix crash in gptManagerBenchmark (#649)
- Fix Blip2 build error (#695)
- Add pickle support for InferenceRequest (#701)
- Fix Mixtral-8x7b build failure with custom_all_reduce (#825)
- Fix INT8 GEMM shape (#935)
- Minor bug fixes
Performance
- [BREAKING CHANGES] Increase default freeGpuMemoryFraction parameter from 0.85 to 0.9 for higher throughput
- [BREAKING CHANGES] Disable enable_trt_overlap argument for GPT manager by default
- Performance optimization of beam search kernel
- Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
- Custom AllReduce plugins performance optimization
- Top-P sampling performance optimization
- LoRA performance optimization
- Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
- Integrate XQA kernels for GPT-J (beamWidth=4)
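As a rough illustration of the freeGpuMemoryFraction change above, the arithmetic below assumes the parameter is the fraction of GPU memory still free after the model is loaded that gets reserved for the KV cache. The helper name `kv_cache_budget_gib` and the 60 GiB figure are hypothetical values chosen for the example, not measurements.

```python
def kv_cache_budget_gib(free_gib: float, fraction: float) -> float:
    """GiB reserved for the KV cache, given the memory free after model load."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return free_gib * fraction


free_after_load_gib = 60.0  # hypothetical: memory left once weights are loaded

old_budget = kv_cache_budget_gib(free_after_load_gib, 0.85)  # previous default
new_budget = kv_cache_budget_gib(free_after_load_gib, 0.90)  # 0.8.0 default

print(f"old default: {old_budget:.1f} GiB for KV cache")
print(f"new default: {new_budget:.1f} GiB for KV cache")
print(f"headroom left for other allocations: {free_after_load_gib - new_budget:.1f} GiB")
```

Under these assumptions the new default reserves about 3 GiB more for the KV cache and leaves correspondingly less headroom, so a deployment that was already close to running out of memory may want to set the fraction explicitly.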
Documentation
- Batch manager arguments documentation updates
- Add documentation on best practices for tuning the performance of TensorRT-LLM (see docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (see examples/falcon/README.md)
- Update the docs/source/new_workflow.md documentation
- Update AWQ INT4 weight only quantization documentation for GPT-J
- Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
- Refine TensorRT-LLM backend README structure (#133)
- Typo fix (#739)

Currently, there are two key branches in the project:

- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.

We update the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequency will depend on your feedback.

Thanks,

The TensorRT-LLM Engineering Team

This discussion was created from the release TensorRT-LLM 0.8.0 Release.
