Is it possible for StreamingLLM to support the Qwen2 model? #1872
lai-serena started this conversation in General
Replies: 0 comments
My machine runs Ubuntu with two NVIDIA A100 40GB GPUs. I use vLLM to run the Qwen2-7B-Instruct model, and it achieves an inference speed of about 34.74 tokens/s. I read in the paper that StreamingLLM can improve inference performance by 46%, so I would like to run StreamingLLM with Qwen2. After I ran this command:
trtllm-build --checkpoint_dir /Qwen2-7B-Instruct-checkpoint --output_dir /Qwen2-7B-Instruct-checkpoint-1gpu --gemm_plugin float16 --streamingllm enable
I got the result above. Another question: have you ever compared the inference performance of TensorRT-LLM with vLLM?
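For completeness, building the engine with --streamingllm enable is only half of the workflow: the streaming behaviour also has to be requested at run time. Below is a minimal sketch using TensorRT-LLM's examples/run.py; the paths are placeholders, and the sink/window flag names are assumptions that may differ between TensorRT-LLM releases, so check run.py --help for your version.
# Run the engine built above. --sink_token_length keeps the initial "attention sink"
# tokens, and --max_attention_window_size bounds the sliding KV-cache window.
python3 examples/run.py \
    --engine_dir /Qwen2-7B-Instruct-checkpoint-1gpu \
    --tokenizer_dir /Qwen2-7B-Instruct \
    --max_output_len 256 \
    --max_attention_window_size 2048 \
    --sink_token_length 4 \
    --input_text "Tell me about StreamingLLM."
Whether the Qwen2 attention path actually honours these options is exactly the question raised here, so treat this as the intended usage pattern rather than a confirmed recipe for Qwen2.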