[question] Myelin: attention fusion and FlashAttention #3243
@nvpohanh:
For now, you can only check the Nsight Systems profiles: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#nvprof If the MHA is fused, there should be kernel names containing `_mha`. In the next TRT version, you will be able to get this info by using the IEngineInspector.
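As an illustration of the IEngineInspector route, here is a minimal Python sketch (the `model.plan` path is a placeholder; it assumes a TRT version that ships IEngineInspector and an engine built with detailed profiling verbosity):

```python
import tensorrt as trt

# Sketch: dump per-layer information from a built engine and look for fused
# MHA kernel names. Assumes the engine was built with
# config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("model.plan", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
layer_info = inspector.get_engine_information(trt.LayerInformationFormat.JSON)

# Filter the JSON for attention kernels, e.g. names containing "mha".
print([line for line in layer_info.splitlines() if "mha" in line.lower()])
```

The same kernel names are what show up on the CUDA kernel rows of an Nsight Systems timeline.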
TRT's MHA fusion does not support implicit quantization yet. Please use explicit quantization instead: add Q/DQ ops before the two batch GEMMs in the MHA and also add Q/DQ ops before the ResidualAdd.
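To make that placement concrete, here is a rough PyTorch sketch using the pytorch-quantization toolkit; the module and tensor names are illustrative only (not taken from this thread), and calibration plus ONNX export are omitted:

```python
import torch
import torch.nn.functional as F
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor


class QDQAttention(torch.nn.Module):
    """Illustrative MHA block with fake-quant (Q/DQ) nodes in front of the
    two batched GEMMs, mirroring the placement described above."""

    def __init__(self, head_dim: int):
        super().__init__()
        desc = QuantDescriptor(num_bits=8, calib_method="histogram")
        # One quantizer per GEMM input; after calibration these become
        # QuantizeLinear/DequantizeLinear pairs in the exported ONNX graph.
        self.q_quant = quant_nn.TensorQuantizer(desc)
        self.k_quant = quant_nn.TensorQuantizer(desc)
        self.softmax_quant = quant_nn.TensorQuantizer(desc)
        self.v_quant = quant_nn.TensorQuantizer(desc)
        self.scale = head_dim ** -0.5

    def forward(self, q, k, v):
        # Q/DQ before the first batch GEMM (Q @ K^T).
        scores = torch.matmul(self.q_quant(q),
                              self.k_quant(k).transpose(-2, -1)) * self.scale
        probs = F.softmax(scores, dim=-1)
        # Q/DQ before the second batch GEMM (softmax output @ V).
        return torch.matmul(self.softmax_quant(probs), self.v_quant(v))
```

After calibration, exporting to ONNX with fake-quant enabled should yield Q/DQ pairs at exactly these points, which is what the fusion looks for.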
Hey, a couple of questions to tack on:
The INT8 MHA fused kernels are already integrated in TRT 8.6. The only caveat is that SeqLen must be 512 or below. It does use flash attention if applicable.
For Transformers, it is recommended to add Q/DQs on both inputs of the ResidualAdd. This is because in ConvNets we fuse the ResidualAdd with the Conv, but for Transformers we fuse the ResidualAdd with the LayerNorm that comes right after it.
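Continuing the same illustrative sketch, the residual path with Q/DQ on both add inputs might look like this (again an assumption-laden example, not code from this thread):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor


class QDQResidualLayerNorm(torch.nn.Module):
    """Illustrative Transformer sub-block: Q/DQ on both inputs of the
    residual add, so TRT can fuse the add with the LayerNorm that follows."""

    def __init__(self, hidden_size: int):
        super().__init__()
        desc = QuantDescriptor(num_bits=8, calib_method="histogram")
        self.skip_quant = quant_nn.TensorQuantizer(desc)    # skip-connection input
        self.branch_quant = quant_nn.TensorQuantizer(desc)  # attention/FFN output
        self.norm = torch.nn.LayerNorm(hidden_size)

    def forward(self, skip, branch_out):
        return self.norm(self.skip_quant(skip) + self.branch_quant(branch_out))
```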
TRT should be able to fuse add_1 with norm2, so it should not cause any perf issue.
@nvpohanh How does the INT8 attention speed compare with FP16? Do you have more detailed documentation on how to apply explicit quantization to attention? Thanks!
Hi, I'm new to TRT and not familiar with TRT's docs. Is there somewhere we can view the features and constraints of all TRT-supported kernels/fusion patterns/plugins? For example, how to insert Q/DQ for MHA and which layouts it supports?
Hi @nvpohanh, I tried to convert this ONNX to a TensorRT engine, but there is no kernel name with `_mha`.
@Aktcob Could you share your trtexec command and the ONNX? Also, could you try the TRT 10.0.1.6 GA release and make sure you have enabled FP16?
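For reference, enabling FP16 in a Python-API build looks roughly like the sketch below (paths are placeholders; trtexec users would pass the equivalent `--fp16` flag):

```python
import tensorrt as trt

# Sketch of an FP16 build with detailed profiling verbosity, so that fused
# MHA kernels can be identified later via the engine inspector or Nsight.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

with open("model.plan", "wb") as f:  # placeholder path
    f.write(builder.build_serialized_network(network, config))
```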
@nvpohanh Thanks for the reply! I tried it on a Jetson Orin devkit, so I cannot try TRT 10.0.1.6 GA. ONNX file: https://wenshu.sankuai.com/file/share/download/35BB3D5239CBFD33D79A5FE4DA4F17BFF7B46221
@nvpohanh I tried it on another SelfAttention module without an AttentionMask, and there is a kernel named mha_v2 which fuses matmul + softmax + matmul.
Does Myelin fuse the attention at runtime when the engine runs, or ahead of time at engine build?
@nvpohanh Hi, I followed your example and tested my MHA module (the ONNX model is shown as attached). I tested it on a Jetson Orin device with TensorRT 8.6.0, and the TensorRT engine builds with the mha kernel; I used Nsys to confirm this. However, the INT8 mha kernel is much slower than the FP16 mha kernel (800 us vs 200 us). How can I solve this problem? Thanks for your reply.
Hi! When the attention op gets fused into a single op by Myelin, the trex tooltip does not say whether it is using FlashAttention / proper fusion or not (or whether it is using quantization under the hood, especially in implicit quantization mode). How can we know if it is using a fused attention implementation like FlashAttention? Thanks :)