
[Performance] why do compiled trtllm models have bad performance compared to torch.compile models? #2627

Open
FPTMMC opened this issue Dec 25, 2024 · 4 comments

Comments


FPTMMC commented Dec 25, 2024

Hi guys. I use trtllm to build, compile, and run my own LLM models. But when I compare the trtllm model with a PyTorch model accelerated by torch.compile, the trtllm model has higher latency.
Here is the nsys profile report. I think there are too many gaps between the CUDA kernels, which may be what makes trtllm slow. How can I reduce these gaps? Will CUDA Graph help, or do I need other trtllm-build options?
trtllm-L40.nsys-rep.zip
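
At batch size 1, gaps like these in the timeline are typically host-side launch overhead between many short kernels, which is exactly what CUDA Graphs are meant to remove: a whole sequence of launches is captured once and replayed as a single launch. A minimal sketch of the concept only, in plain PyTorch rather than the TensorRT-LLM runtime, where model and static_input are hypothetical stand-ins for one fixed-shape decoder step (torch.compile's "reduce-overhead" mode relies on the same mechanism):

import torch

# Hypothetical fixed-shape step; CUDA Graphs need static shapes and buffers.
model = torch.nn.Linear(4096, 4096).half().cuda().eval()
static_input = torch.randn(1, 4096, dtype=torch.half, device="cuda")

# Warm up on a side stream so lazy initialization is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the sequence of kernel launches once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# ...then replay it with a single launch per step, so the per-kernel
# launch gaps visible in nsys collapse into one graph launch.
static_input.copy_(torch.randn_like(static_input))
g.replay()
torch.cuda.synchronize()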


FPTMMC commented Dec 25, 2024

@nv-guomingz
please help me.


FPTMMC commented Dec 25, 2024

The model structure is similar to LLaMA.
There are many transformer decoder blocks, which produce a large number of CUDA kernels such as GEMM and attention. Can TensorRT fuse these CUDA kernels to reduce kernel launch time or compute time?

@FPTMMC FPTMMC changed the title why do compiled trtllm models have bad performance compared to torch.compile models? [Performance] why do compiled trtllm models have bad performance compared to torch.compile models? Dec 25, 2024

FPTMMC commented Dec 25, 2024

I use this command to build the trtllm model:

trtllm-build --checkpoint_dir ./tllm_checkpoint/base_decoder/ --output_dir tllm_engine/base_decoder --max_beam_width 1 --context_fmha enable --max_batch_size 1 --gemm_plugin fp8
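
For reference, a hedged sketch of the same build with the fusion-related plugins spelled out explicitly: the attention plugin plus context FMHA give the fused attention kernels, and the GEMM plugin covers the matmuls. Exact flag names and defaults vary across TensorRT-LLM versions, so verify against trtllm-build --help for 0.15.0:

trtllm-build --checkpoint_dir ./tllm_checkpoint/base_decoder/ \
    --output_dir tllm_engine/base_decoder \
    --max_beam_width 1 \
    --max_batch_size 1 \
    --gemm_plugin fp8 \
    --gpt_attention_plugin auto \
    --context_fmha enable \
    --remove_input_padding enable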

GPU: L40
NVIDIA-SMI 535.161.07
Driver Version: 535.161.07
CUDA Version: 12.6
Python: 3.10
TensorRT-LLM: 0.15.0

And I run the engine like this:

# set the context input shapes and buffers, then execute
runtime._set_shape(context, input_shape)
runtime._set_buffer(context, input_buffer)
runtime._run(context)
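
For comparison, the private _set_shape / _set_buffer / _run path performs a single engine execution per call from Python, so every decoding step pays Python and launch overhead. A hedged sketch of the higher-level runner used in examples/run.py, which drives the whole generation loop; the engine path and tokenizer below are hypothetical placeholders:

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "tllm_engine/base_decoder"  # hypothetical engine path
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # hypothetical tokenizer

runner = ModelRunner.from_dir(engine_dir=engine_dir)
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids[0]],  # list of 1-D int token tensors
        max_new_tokens=64,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
    )

# outputs has shape [batch, num_beams, seq_len] and includes the prompt tokens.
print(tokenizer.decode(outputs[0][0].tolist(), skip_special_tokens=True))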


FPTMMC commented Dec 26, 2024

I ran summarize.py to test LLaMA 7B FP16 on an L40 and got 48 tokens/sec, while the HF LLaMA model gets 40 tokens/sec.
According to TensorRT-LLM's published benchmark data, it reaches at least 1000 tokens/sec on faster GPUs such as the L40S.
So I think that on GPUs with more ordinary performance, trtllm may not show a clear advantage over a model accelerated by torch.compile.
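
One thing worth checking when comparing numbers like these, not from the original report, is that both backends are timed the same way: same prompt, same number of new tokens, CUDA synchronization around the timed region, and warmup excluded. A rough sketch, where generate_fn is a hypothetical callable wrapping either the trtllm runner or the torch.compile model:

import time
import torch

def tokens_per_second(generate_fn, num_new_tokens, warmup=2, iters=5):
    # generate_fn() should emit exactly num_new_tokens tokens for a fixed prompt.
    for _ in range(warmup):
        generate_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return num_new_tokens * iters / elapsed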
