Update on the development branch #1726
kaiyux announced in Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) this June 4, 2024.
This update includes:
- examples/grok/README.md
- examples/llama/README.md
- Fixed a qkv_bias shape issue for Qwen1.5-32B ("when converting the Qwen 110B GPTQ checkpoint, the shape of qkv_bias is not divisible by 3", #1589), thanks to the contribution from @Tlntin in "fix up qkv.bias error when use qwen1.5-32b-gptq-int4" #1637.
- Fixed the error of Ada traits for fpA_intB, thanks to the contribution from @JamesTheZ in "Fix the error of Ada traits for fpA_intB." #1583.
- Updated examples/qwenvl/requirements.txt, thanks to the contribution from @ngoanpv in "Update requirements.txt" #1248.
- Fixed rsLoRA scaling in lora_manager, thanks to the contribution from @TheCodeWrangler in "Fixed rslora scaling in lora_manager" #1669.
- Updated benchmarks/cpp/README.md; fixed "gptManagerBenchmark seems to go into a dead loop with GPU usage 0%" #1562 and "Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: LoRA task 0 not found in cache. Please send LoRA weights with request (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:182)" #1552.

Thanks,
The TensorRT-LLM Engineering Team
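
For readers curious about the qkv_bias fix: the underlying failure mode is that a fused qkv bias can only be split into three equal parts when the model uses plain multi-head attention. For grouped-query attention (GQA) models such as Qwen1.5-32B, the k/v projections are smaller than q, so the fused bias length is not divisible by 3 and the split must use the actual projection sizes. The sketch below is illustrative only (the function and head counts are hypothetical, not TensorRT-LLM's actual conversion code):

```python
import numpy as np

def split_qkv_bias(qkv_bias, num_heads, num_kv_heads, head_dim):
    """Split a fused qkv bias into its q, k, v parts.

    For MHA (num_heads == num_kv_heads) an even three-way split works;
    for GQA the k and v parts are smaller, so we slice by the actual
    projection sizes instead of assuming divisibility by 3.
    """
    q_size = num_heads * head_dim        # query projection width
    kv_size = num_kv_heads * head_dim    # key/value projection width
    assert qkv_bias.shape[0] == q_size + 2 * kv_size
    q = qkv_bias[:q_size]
    k = qkv_bias[q_size:q_size + kv_size]
    v = qkv_bias[q_size + kv_size:]
    return q, k, v

# Hypothetical GQA configuration: 40 query heads, 8 kv heads, head_dim 128.
# Fused length is 5120 + 2 * 1024 = 7168, which is not divisible by 3.
fused = np.zeros(40 * 128 + 2 * 8 * 128)
q, k, v = split_qkv_bias(fused, num_heads=40, num_kv_heads=8, head_dim=128)
```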
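
On the rsLoRA fix: rank-stabilized LoRA (rsLoRA) differs from standard LoRA only in the scaling factor applied to the low-rank update, using alpha / sqrt(rank) instead of alpha / rank so that the update magnitude stays stable as the rank grows. A minimal sketch of that distinction, with illustrative names rather than TensorRT-LLM's actual lora_manager API:

```python
import math

def lora_scale(alpha, rank, use_rslora=False):
    """Scaling applied to the low-rank delta (B @ A).

    Standard LoRA: alpha / rank.
    rsLoRA:        alpha / sqrt(rank).
    """
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank

# With alpha=16, rank=64: standard LoRA gives 0.25, rsLoRA gives 2.0.
standard = lora_scale(16, 64)
rslora = lora_scale(16, 64, use_rslora=True)
```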