I'm using ViT-H-14-quickgelu_dfn5b; does it natively support flash attention to speed up inference? Does anyone have experience with speeding up inference for a similar-scale OpenCLIP model? I noticed a prior discussion on flash attention (#317), but I'm not sure whether it still applies to models released after that. Thanks.
Any model that uses either the built-in OpenCLIP ViT / text transformer (what the DFN models use) or a timm ViT will use F.scaled_dot_product_attention (F.sdpa) by default. The OpenCLIP ViT does so via nn.MultiheadAttention (which calls F.sdpa internally).

F.sdpa dispatches to one of several fused attention kernels; a PyTorch implementation of flash attention is one of them. Most of the models here should dispatch to the flash kernel when it's available.

We don't currently have plans to support the flash-attention library directly; using it via the PyTorch implementation is the lowest maintenance burden for us.
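If you want to confirm which kernel is actually being picked on your setup, here is a minimal sketch (not from the maintainers). It assumes PyTorch >= 2.2 (for `torch.nn.attention.sdpa_kernel`), a CUDA GPU, and that `ViT-H-14-quickgelu` / `dfn5b` is the registered model name / pretrained tag for the checkpoint in the question. Restricting SDPA to the flash backend makes PyTorch raise an error instead of silently falling back, so it doubles as a check:

```python
import torch
import open_clip
from torch.nn.attention import sdpa_kernel, SDPBackend

# Assumed model name / pretrained tag for the DFN5B ViT-H-14 checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14-quickgelu", pretrained="dfn5b"
)
model = model.eval().cuda().half()  # flash kernels need fp16/bf16 tensors

# Dummy batch at this config's expected resolution (224).
images = torch.randn(8, 3, 224, 224, device="cuda", dtype=torch.half)

with torch.inference_mode():
    # Allow only the flash backend for F.scaled_dot_product_attention.
    # If the shapes/dtypes/hardware aren't supported, PyTorch raises
    # instead of quietly falling back to the math kernel.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        image_features = model.encode_image(images)

print(image_features.shape)
```

On older PyTorch versions the equivalent (now deprecated) context manager is `torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)`.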