I'm using ViT-H-14-quickgelu_dfn5b; does it natively support flash attention to speed up inference? Does anyone have experience with speeding up inference for a similar-scale OpenCLIP model? I noticed a prior discussion on flash attention (#317), but I'm not sure whether it still applies to models released after that. Thanks.
Any model that uses either the built-in OpenCLIP ViT / text transformer (what the DFN models use) or a timm ViT will use F.scaled_dot_product_attention (F.sdpa) by default. The OpenCLIP ViT does so via nn.MultiheadAttention (which calls F.sdpa internally).

F.sdpa dispatches to one of several fused attention kernels; a PyTorch implementation of flash attention is one of them. Most of the models here should dispatch to the flash kernel when it's available.

We don't currently have plans to support the flash-attention library directly; using it via the PyTorch implementation is the lowest maintenance burden for us.
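If you want to confirm which kernel is actually being picked on your setup, here is a minimal sketch (not from the maintainers). It assumes PyTorch >= 2.2 (for `torch.nn.attention.sdpa_kernel`), a CUDA GPU, and that `ViT-H-14-quickgelu` / `dfn5b` is the registered model name / pretrained tag for the checkpoint in the question. Restricting SDPA to the flash backend makes PyTorch raise an error instead of silently falling back, so it doubles as a check:

```python
import torch
import open_clip
from torch.nn.attention import sdpa_kernel, SDPBackend

# Assumed model name / pretrained tag for the DFN5B ViT-H-14 checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14-quickgelu", pretrained="dfn5b"
)
model = model.eval().cuda().half()  # flash kernels need fp16/bf16 tensors

# Dummy batch at this config's expected resolution (224).
images = torch.randn(8, 3, 224, 224, device="cuda", dtype=torch.half)

with torch.inference_mode():
    # Allow only the flash backend for F.scaled_dot_product_attention.
    # If the shapes/dtypes/hardware aren't supported, PyTorch raises
    # instead of quietly falling back to the math kernel.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        image_features = model.encode_image(images)

print(image_features.shape)
```

On older PyTorch versions the equivalent (now deprecated) context manager is `torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)`.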