[Performance] Why do compiled trtllm models have bad performance compared to torch.compile models? #2627
Hi, I use TensorRT-LLM to build, compile, and run my own LLM models. But when I compare the TRT-LLM engine with the same model in PyTorch (accelerated by torch.compile), the engine has higher latency.

Here is the nsys profile report. I think there are too many gaps between the CUDA kernels, which may be what makes TRT-LLM slow. How can I reduce these gaps? Would CUDA Graphs help? Or do I need different trtllm-build options?

trtllm-L40.nsys-rep.zip
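For context on the CUDA Graphs question: a CUDA graph records a fixed sequence of kernel launches once and replays it with a single launch, which removes most of the per-kernel launch gaps visible in a timeline like the one above. Below is a minimal capture/replay sketch in plain PyTorch (not TensorRT-LLM); the model, shapes, and warm-up count are placeholders for illustration.

```python
import torch

# Placeholder model and input; CUDA graph capture needs static shapes and buffers.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream so one-time initialization is not recorded in the graph.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture: every kernel launched inside this block is recorded into the graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then launch the whole
# recorded kernel sequence with a single call instead of one launch per kernel.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```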
Comments

@nv-guomingz The model structures are similar to Llama. I use this command to build the TRT-LLM engine:

trtllm-build --checkpoint_dir ./tllm_checkpoint/base_decoder/ --output_dir tllm_engine/base_decoder --max_beam_width 1 --context_fmha enable --max_batch_size 1 --gemm_plugin fp8

GPU: L40. And I run the engine like this:
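The snippet that followed was not captured above. As a rough illustration only, here is a minimal sketch of running such an engine through TensorRT-LLM's Python ModelRunner API; the tokenizer path, prompt, and generation arguments are assumptions, and exact argument names can differ between TensorRT-LLM releases.

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Hypothetical tokenizer directory for the custom model.
tokenizer = AutoTokenizer.from_pretrained("./hf_model")
runner = ModelRunner.from_dir(engine_dir="tllm_engine/base_decoder")

# One request, matching the engine built with --max_batch_size 1.
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids[0]
with torch.no_grad():
    output_ids = runner.generate(
        batch_input_ids=[input_ids],
        max_new_tokens=64,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
    )

# output_ids is [batch, beams, tokens]; decode the single greedy beam.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```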
I ran summarize.py to test Llama 7B FP16 on an L40: the TRT-LLM engine reaches 48 tokens/sec, while the HF model reaches 40 tokens/sec.
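For a comparable baseline number on the torch.compile side, one simple approach is to time generate() after a warm-up run so that compilation cost is excluded; the model id, prompt, and token counts below are assumptions for illustration.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.cuda().eval()
model.forward = torch.compile(model.forward)  # compile the per-step forward pass

inputs = tok("Summarize the following article:", return_tensors="pt").to("cuda")

# Warm-up so compilation and autotuning are not counted in the measurement.
model.generate(**inputs, max_new_tokens=16)
torch.cuda.synchronize()

new_tokens = 128
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{new_tokens / elapsed:.1f} generated tokens/sec")
```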