Update on the development branch #1234

kaiyux · 2024-03-05T10:40:07Z

Hi,

The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) on March 5, 2024.

This update includes:

Model Support
- HuggingFace StarCoder2 support
Features
- Support importing and converting HuggingFace Gemma checkpoints. Thanks to @mfuntowicz for the contribution in "Make Gemma importable from transformers Gemma implementation" (#1147)
API
- [BREAKING CHANGE] Move the LLaMA checkpoint conversion script from the examples directory into the core library
- The LLM() API now accepts engines built by the trtllm-build command
Bug fixes
- Fix wrong link in examples/mixtral/README.md (see "Mixtral - no run.py file", #1181)
- Fix bad LLaMA2-7B results when INT8 KV cache and per-channel INT8 weight-only quantization are both enabled (see "llama2-7b bad results for int8-kv-cache + per-channel-int8-weight", #967)
- Fix wrong head_size when importing a Gemma model from the HuggingFace Hub. Thanks to @mfuntowicz for the contribution in "Specify the head_size from the config when importing Gemma from Hugging Face." (#1148)
Documentation
- Update performance numbers in docs/source/performance.md
- Add documentation for emulated static batching in benchmarks/cpp/README.md
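
For readers wondering why the Gemma head_size fix listed under Bug fixes matters, here is a minimal sketch. It is not the TensorRT-LLM code: `AttentionConfig` and `resolve_head_size` are hypothetical names, and the numbers are assumed to mirror a Gemma-7B-style config (check the checkpoint's own config.json). The point is that deriving the head size as hidden_size / num_attention_heads gives the wrong value when the checkpoint declares an explicit per-head dimension.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionConfig:
    """Hypothetical stand-in for the attention fields of a model config."""
    hidden_size: int
    num_attention_heads: int
    head_dim: Optional[int] = None  # explicit per-head size, if the checkpoint declares one


def resolve_head_size(cfg: AttentionConfig) -> int:
    """Prefer the explicit head_dim; fall back to the usual derived value."""
    if cfg.head_dim is not None:
        return cfg.head_dim
    return cfg.hidden_size // cfg.num_attention_heads


# Assumed Gemma-7B-style values: the explicit head_dim differs from the derived one.
gemma_like = AttentionConfig(hidden_size=3072, num_attention_heads=16, head_dim=256)
# LLaMA-style config with no explicit head_dim: the derived value is used.
llama_like = AttentionConfig(hidden_size=4096, num_attention_heads=32)

derived = gemma_like.hidden_size // gemma_like.num_attention_heads

print("derived :", derived)
print("resolved:", resolve_head_size(gemma_like))
print("fallback:", resolve_head_size(llama_like))
```

With these assumed values the derived size and the declared size disagree, which is the kind of mismatch that reading head_size from the config avoids.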

Thanks,
The TensorRT-LLM Engineering Team