Flash attention for Inference #969

Answered by rwightman
dhaivatd asked this question in Q&A

Any model that uses either the built-in OpenCLIP ViT / text transformer (which the DFN models use) or a timm ViT will use F.sdpa by default. The OpenCLIP ViT does so via nn.MultiheadAttention (which calls F.sdpa internally).

F.sdpa dispatches to one of several fused attention kernels; a PyTorch implementation of flash attention is one of them. Most of the models here should dispatch to the flash kernel when it's available.
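As a minimal sketch (assuming a recent PyTorch 2.x; in newer releases the `torch.backends.cuda.sdp_kernel` context manager lives at `torch.nn.attention.sdpa_kernel`), you can restrict F.sdpa to the flash backend to check whether it's usable for your GPU and dtype:

```python
import torch
import torch.nn.functional as F
from torch.backends.cuda import sdp_kernel

# Typical ViT-style attention shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 197, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Disable the math and mem-efficient kernels; if this call raises, the flash
# kernel isn't usable for this shape/dtype and F.sdpa would normally fall
# back to another backend.
with sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([1, 8, 197, 64])
```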

We don't currently have plans to support the flash-attention library directly; using it via the PyTorch implementation is the lowest maintenance burden for us.
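For an end-to-end check with an OpenCLIP model, a rough sketch like the following works (the model name and pretrained tag here are just examples):

```python
import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model = model.half().cuda().eval()

# Dummy fp16 image batch; in practice run real images through `preprocess`.
image = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)

# Restrict dispatch to the flash kernel; if the forward pass succeeds, the
# ViT attention layers went through the flash path.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
), torch.no_grad():
    feats = model.encode_image(image)

print(feats.shape)
```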
