
[Performance] why do compiled trtllm models have bad performance compared to torch.compile models? #2627

Open
FPTMMC opened this issue Dec 25, 2024 · 4 comments

Comments


FPTMMC commented Dec 25, 2024

Hi guys. I use trtllm to build, compile, and run my own LLM models. But when I compare the trtllm model with a PyTorch model accelerated by torch.compile, the trtllm model has higher latency.
Here is the nsys profile report. I think there are too many gaps between the CUDA kernels, which may be what makes trtllm slow. How can I reduce these gaps? Will CUDA Graph help, or do I need other trtllm-build options?
trtllm-L40.nsys-rep.zip
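
At batch size 1, gaps like these in the timeline are typically host-side launch overhead between many short kernels, which is exactly what CUDA Graphs are meant to remove: a whole sequence of launches is captured once and replayed as a single launch. A minimal sketch of the concept only, in plain PyTorch rather than the TensorRT-LLM runtime, where model and static_input are hypothetical stand-ins for one fixed-shape decoder step (torch.compile's "reduce-overhead" mode relies on the same mechanism):

import torch

# Hypothetical fixed-shape step; CUDA Graphs need static shapes and buffers.
model = torch.nn.Linear(4096, 4096).half().cuda().eval()
static_input = torch.randn(1, 4096, dtype=torch.half, device="cuda")

# Warm up on a side stream so lazy initialization is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the sequence of kernel launches once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# ...then replay it with a single launch per step, so the per-kernel
# launch gaps visible in nsys collapse into one graph launch.
static_input.copy_(torch.randn_like(static_input))
g.replay()
torch.cuda.synchronize()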


FPTMMC commented Dec 25, 2024

@nv-guomingz
please help me.


FPTMMC commented Dec 25, 2024

The model structure is similar to LLaMA.
There are many transformer decoder blocks, which produce a large number of CUDA kernels such as GEMM and attention. Can TensorRT fuse these CUDA kernels to reduce kernel launch time or compute time?

@FPTMMC FPTMMC changed the title why do compiled trtllm models have bad performance compared to torch.compile models? [Performance] why do compiled trtllm models have bad performance compared to torch.compile models? Dec 25, 2024

FPTMMC commented Dec 25, 2024

I use this command to build the trtllm model:

trtllm-build --checkpoint_dir ./tllm_checkpoint/base_decoder/ --output_dir tllm_engine/base_decoder --max_beam_width 1 --context_fmha enable --max_batch_size 1 --gemm_plugin fp8
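
For reference, a hedged sketch of the same build with the fusion-related plugins spelled out explicitly: the attention plugin plus context FMHA give the fused attention kernels, and the GEMM plugin covers the matmuls. Exact flag names and defaults vary across TensorRT-LLM versions, so verify against trtllm-build --help for 0.15.0:

trtllm-build --checkpoint_dir ./tllm_checkpoint/base_decoder/ \
    --output_dir tllm_engine/base_decoder \
    --max_beam_width 1 \
    --max_batch_size 1 \
    --gemm_plugin fp8 \
    --gpt_attention_plugin auto \
    --context_fmha enable \
    --remove_input_padding enable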

GPU: L40
NVIDIA-SMI 535.161.07
Driver Version: 535.161.07
CUDA Version: 12.6
Python: 3.10
TensorRT-LLM: 0.15.0

And I run the engine like this:

# set the context input shapes and buffers, then execute
runtime._set_shape(context, input_shape)
runtime._set_buffer(context, input_buffer)
runtime._run(context)
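
For comparison, the private _set_shape / _set_buffer / _run path performs a single engine execution per call from Python, so every decoding step pays Python and launch overhead. A hedged sketch of the higher-level runner used in examples/run.py, which drives the whole generation loop; the engine path and tokenizer below are hypothetical placeholders:

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "tllm_engine/base_decoder"  # hypothetical engine path
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # hypothetical tokenizer

runner = ModelRunner.from_dir(engine_dir=engine_dir)
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids[0]],  # list of 1-D int token tensors
        max_new_tokens=64,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
    )

# outputs has shape [batch, num_beams, seq_len] and includes the prompt tokens.
print(tokenizer.decode(outputs[0][0].tolist(), skip_special_tokens=True))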


FPTMMC commented Dec 26, 2024

I ran summarize.py to test LLaMA 7B FP16 on an L40 and got 48 tokens/sec, while the HF LLaMA model gets 40 tokens/sec.
According to TensorRT-LLM's published benchmark data, it reaches at least 1000 tokens/sec on faster GPUs such as the L40S.
So I think that on GPUs with more ordinary performance, trtllm may not show a clear advantage over a model accelerated by torch.compile.
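
One thing worth checking when comparing numbers like these, not from the original report, is that both backends are timed the same way: same prompt, same number of new tokens, CUDA synchronization around the timed region, and warmup excluded. A rough sketch, where generate_fn is a hypothetical callable wrapping either the trtllm runner or the torch.compile model:

import time
import torch

def tokens_per_second(generate_fn, num_new_tokens, warmup=2, iters=5):
    # generate_fn() should emit exactly num_new_tokens tokens for a fixed prompt.
    for _ in range(warmup):
        generate_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        generate_fn()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return num_new_tokens * iters / elapsed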
