Attention did not use MHA kernel (multi-head attention) on Orin TRT 8.6.10 #3575

Closed
Nusselder9 opened this issue Dec 28, 2023 · 16 comments
Labels: triaged (Issue has been triaged by maintainers)

@Nusselder9 commented Dec 28, 2023

My model has an attention module like this:
[screenshot: ONNX graph of the attention module]

It did not use the MHA kernel on my Orin with TensorRT 8.6.10 (OS 6.0.7.0):
[screenshot: Orin engine layer profile without a fused MHA kernel]

However, on x86 with TensorRT 8.6.1, it can use the MHA kernel:
[screenshot: x86 engine layer profile showing the fused MHA kernel]

I would like to use MHA on Orin. What can I do? Thanks!

@Nusselder9 (Author)

To rule out other factors, I exported a toy attention with seq_length=128, batch_size=1, embed_dim=256 like this:
[screenshot: toy attention ONNX graph]
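For reference, a minimal sketch of what such a toy attention export might look like (the module, tensor names, and the use of torch.onnx.export are my assumptions, not the script behind the screenshot):

```python
# Hypothetical reconstruction of the toy attention described above
# (batch_size=1, seq_length=128, embed_dim=256); not the author's exact script.
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(embed_dim, embed_dim)
        self.v = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                                   # x: [B, S, H]
        B, S, H = x.shape
        q = self.q(x).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)  # [B, N, S, h]
        k = self.k(x).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.head_dim ** 0.5, dim=-1)  # [B, N, S, S]
        out = (attn @ v).transpose(1, 2).reshape(B, S, H)                              # [B, S, H]
        return out

x = torch.randn(1, 128, 256)  # batch_size=1, seq_length=128, embed_dim=256
torch.onnx.export(ToyAttention(), x, "toy_attention.onnx", opset_version=17)
```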

However, it still cannot use the MHA kernel on Orin TRT 8.6.10:
[screenshot: Orin layer profile, still without a fused MHA kernel]

Any suggestions would be appreciated. Thanks very much!

@zerollzeng (Collaborator)

@nvpohanh I have a vague memory that this is expected (I've seen an internal bug before) and that the fused MHA kernel isn't enabled in TRT 8.6. Am I correct?

@zerollzeng zerollzeng self-assigned this Dec 30, 2023
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Dec 30, 2023
@nvpohanh (Collaborator) commented Jan 1, 2024

@Nusselder9 (Author)

Sorry, but our business scenario won't allow upgrading TRT in the near future.

Is there a way to fix or work around the bug and enable fused MHA on TRT 8.6? @zerollzeng @nvpohanh

Thanks!

@nvpohanh (Collaborator) commented Jan 2, 2024

Could you try adding a LayerNorm into the network? That will encourage TRT 8.6 to trigger the Transformer-specific fusions.
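A minimal sketch of that suggestion, reusing the ToyAttention sketch from the earlier comment (the wrapper name and residual placement are my assumptions):

```python
# Hypothetical: wrap the attention with a residual add + LayerNorm so the
# exported ONNX contains a LayerNorm node next to the MHA pattern.
import torch
import torch.nn as nn

class AttentionWithLN(nn.Module):
    def __init__(self, attn, embed_dim=256):
        super().__init__()
        self.attn = attn
        self.ln = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # residual add followed by LayerNorm, as in a standard transformer block
        return self.ln(x + self.attn(x))

x = torch.randn(1, 128, 256)
torch.onnx.export(AttentionWithLN(ToyAttention()), x, "toy_attention_ln.onnx",
                  opset_version=17)
```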

@Nusselder9 (Author)

Thanks for your reply, @nvpohanh. I have tried adding LayerNorm after the attention, but it does not work.
[screenshot: attention + LayerNorm ONNX graph]

[screenshot: resulting layer profile, still without a fused MHA kernel]

@nvpohanh (Collaborator) commented Jan 2, 2024

@Nusselder9 could you share the ONNX with the LayerNorm? TRT 8.6 has quite restrictive MHA pattern-matching code and we need to find out why it didn't trigger the fusion. TRT 9.2 has much looser checks.

I would also try to make the MHA look like:

[B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h] -> MatMul -> [B, N, S, S] -> MatMul -> [B, N, S, h] -Transpose-> [B, S, N, h] -Reshape-> [B, S, H] -LayerNorm->...
[B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, h, S] ---^                           ^
[B, S, H] -MatMul-> [B, S, H] -Reshape-> [B, S, N, h] -Transpose-> [B, N, S, h] --------------------------------

where B=1, S=128, H=256, N=8, h=32
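A sketch of a PyTorch module whose exported graph should follow that layout, with B=1, S=128, H=256, N=8, h=32 (module and tensor names are mine; the scale and softmax between the two MatMuls are assumed, since a real MHA needs them even though the diagram omits them):

```python
# Hypothetical module laid out to export the pattern above:
# MatMul -> Reshape -> Transpose per projection, with K transposed to [B, N, h, S],
# two batched MatMuls, Transpose/Reshape back to [B, S, H], then LayerNorm.
import torch
import torch.nn as nn

B, S, H, N, h = 1, 128, 256, 8, 32

class PatternMHA(nn.Module):
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(H, H, bias=False)   # [B, S, H] -MatMul-> [B, S, H]
        self.wk = nn.Linear(H, H, bias=False)
        self.wv = nn.Linear(H, H, bias=False)
        self.ln = nn.LayerNorm(H)

    def forward(self, x):
        q = self.wq(x).reshape(B, S, N, h).permute(0, 2, 1, 3)  # [B, N, S, h]
        k = self.wk(x).reshape(B, S, N, h).permute(0, 2, 3, 1)  # [B, N, h, S]
        v = self.wv(x).reshape(B, S, N, h).permute(0, 2, 1, 3)  # [B, N, S, h]
        scores = torch.softmax((q @ k) / h ** 0.5, dim=-1)      # [B, N, S, S]
        ctx = (scores @ v).permute(0, 2, 1, 3).reshape(B, S, H) # [B, S, H]
        return self.ln(ctx)

torch.onnx.export(PatternMHA(), torch.randn(B, S, H), "pattern_mha.onnx",
                  opset_version=17)
```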

@Nusselder9 (Author)

Thanks for your kind reply. The attachment is my attention.
attention.zip
The zip file has two types of attention, and neither of them can use fMHA.

@nvpohanh (Collaborator) commented Jan 3, 2024

@Nusselder9 Could you share the ONNX files with the LayerNorm? Thanks!

@Nusselder9 (Author)

Here is the attention with LayerNorm. @nvpohanh
attention_ln.zip

@nvpohanh (Collaborator) commented Jan 3, 2024

Filed internal tracker 4438093. Will let you know if we have any findings. Thanks

@nvpohanh (Collaborator)

Internal investigation shows that TRT 8.6.10 did not have any MHA fusion support on Orin. Could you try TRT 8.6.11?
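If it helps, here is a sketch of how to check whether the fused MHA layer shows up after upgrading, using the TensorRT Python API and the engine inspector (the ONNX file name is a placeholder, and the exact name of the fused layer varies by TRT version, so look for something like "mha" or a fused Myelin region in the output):

```python
# Hypothetical check: build an engine from the attention ONNX and dump per-layer
# information to see whether a fused MHA layer appears.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("attention_ln.onnx", "rb") as f:           # placeholder file name
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)                # fused MHA generally needs FP16/INT8
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

serialized = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(serialized)
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```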

@Nusselder9 (Author)

Thanks, I will try.

@lix19937 commented Mar 25, 2024

@nvpohanh I think "MHA kernel" in the question is not an accurate description.

> Internal investigation shows that TRT 8.6.10 did not have any MHA fusion support on Orin. Could you try TRT 8.6.11?

If an ONNX model includes a standard transformer structure (like a ViT decoder), can TRT 8.6.11 enable MHA fusion?

In my opinion, CustomQKVToContextPluginDynamic can do some fusion, but it needs to match some conditions if the user goes the plugin route.
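For anyone considering the plugin route, a small sketch that lists which QKVToContext plugin creators are registered (just a registry query; the plugin's required input layout and fields still have to follow the TensorRT OSS bertQKVToContextPlugin documentation):

```python
# Hypothetical check: list registered QKVToContext plugin creators and versions.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")
registry = trt.get_plugin_registry()
for creator in registry.plugin_creator_list:
    if "QKVToContext" in creator.name:
        print(creator.name, "version", creator.plugin_version)
```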

@XuDeshengCat

Did TRT 8.6.13 have any MHA fusion support?

@Feynman1999

Does TRT 9.2 have any MHA fusion support? And how do we make it work?
