Replace multi head attention in decoder #3

Hi,
May I know whether I can use SimA instead of multi-head attention in a decoder, to reduce complexity?
Thanks!
Hi! You can replace any self-attention module with SimA. It may require some parameter tuning for specific models. We tried it on CvT, ViT and XCiT. It also works with the DINO loss (self-supervised). I plan to try it with MAE in the future. When you said decoder, which model are you referring to (e.g., the decoder of DETR or MAE)? Thanks! Have a good day!
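As a rough illustration of that kind of swap (the model name is only an example, and the import assumes the repo's sima.py exposes a `SimA` class; a sketch of such a class appears later in this thread):

```python
import timm
# Assumption: the repo's sima.py provides a SimA attention class.
from sima import SimA

# Replace the self-attention of every block in a DeiT/ViT model with SimA.
# The model still needs to be (re-)trained or fine-tuned after the swap.
model = timm.create_model("deit_small_patch16_224", pretrained=False)
for blk in model.blocks:
    dim = blk.attn.qkv.in_features                       # embedding dimension of the block
    blk.attn = SimA(dim, num_heads=blk.attn.num_heads)   # drop-in replacement
```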
@soroush-abbasi Thanks for the reply. I meant the decoder of ConvTransformer, which incorporates convolutions into the transformer.
@soroush-abbasi Also, is it possible to share the code for ViT with SimA? Thank you in advance!
I guess it should work with the decoder of ConvTransformer. You can simply replace the self-attention with SimA attention (the SimA class below). To run with the ViT/DeiT architecture, please replace these classes in sima.py as below (removing the LPI layer and the class attention layer):
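The code block originally attached to this comment is not preserved here. As a rough sketch of what a SimA-style attention module can look like (the class name, defaults, and the exact normalization axis are assumptions; the real classes live in the repo's sima.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimA(nn.Module):
    """Sketch of a softmax-free attention module in the spirit of SimA.

    Assumption: q and k are L1-normalized across the token dimension,
    which is why masking (discussed later in this thread) has to happen
    before the normalization. The official sima.py may differ in details.
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape                                    # batch, tokens, channels
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)       # each: B x H x N x D
        q = F.normalize(q, p=1.0, dim=-2)                    # L1 norm over tokens, no softmax
        k = F.normalize(k, p=1.0, dim=-2)
        out = (q @ k.transpose(-2, -1)) @ v                  # B x H x N x D
        # For long sequences, q @ (k.transpose(-2, -1) @ v) gives the same result
        # with cost linear in the number of tokens.
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For a plain ViT/DeiT, a module like this would take the place of the usual multi-head self-attention inside each block; the LPI and class-attention layers mentioned above (from the XCiT-based sima.py) are simply removed.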
@soroush-abbasi Thank you! So, don't we need the SimABlock class for ConvTransformer? What is its purpose? Can you please explain?
SimABlock is a regular transformer block which has both the self-attention (SimA) and the MLP layer. As long as you replace the self-attention in your code with SimA, you should be fine, I guess. So you need to figure out which dimension of your input is the sequence (N) and which is the token/channel dimension (D). Or, if your features have already been split into multiple heads, you need to find the ordering of B (batch size), H (heads), D (dimension after splitting) and N (sequence length / number of tokens). Sometimes tokens are not flattened. For example, one can look at image feature maps as a set of tokens with a 2D shape: if you have a 512x16x16 feature map, you can flatten the last two dimensions to get 512x256 tokens (D=512, N=256). I guess the last two dimensions are the image feature map in your case, but I'm not sure. Unfortunately, I'm not familiar with ConvTransformer. Thanks!
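A minimal illustration of the flattening described above, using the example shapes from the comment:

```python
import torch

feat = torch.randn(1, 512, 16, 16)        # B x D x H x W feature map from a conv stage
tokens = feat.flatten(2).transpose(1, 2)  # B x N x D with N = 16 * 16 = 256, D = 512
# ... run SimA (or any token-based attention) on `tokens` ...
feat_back = tokens.transpose(1, 2).reshape(1, 512, 16, 16)  # back to a feature map
```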
Sure, thanks for the reply!!
@soroush-abbasi Can we use SimA for masked self-attention?
Hi, it's a little complicated. We normalize the tokens in the channel dimension before doing the QKV dot product, and because of that normalization each token has an effect on the other tokens. Therefore, if you want to mask tokens, you need to apply the masking before the L1-normalization. Please let me know if you have more questions. Thanks! Have a great day!
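For concreteness, a small sketch of the ordering described above (the mask layout and the normalization axis are assumptions; check them against the real sima.py):

```python
import torch
import torch.nn.functional as F

B, H, N, D = 2, 8, 16, 64
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N, D)

keep = torch.ones(B, 1, N, 1)       # 1 = visible token, 0 = masked token
keep[:, :, N // 2:, :] = 0          # e.g. mask out the second half of the sequence

# Mask first, then L1-normalize, so masked tokens do not contribute
# to the normalization of the visible tokens.
q = F.normalize(q * keep, p=1.0, dim=-2)
k = F.normalize(k * keep, p=1.0, dim=-2)
```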