Start with The Illustrated Transformer. It is the most approachable guide to start with, thanks to its visual aids.
For a deeper level of understanding:
- The Annotated Transformer, which walks through a commented code implementation.
- The Illustrated BERT, a very popular model that now serves as the base for many of the newer models coming out, and its paper (2018).
- The Hugging Face library, which encapsulates a lot of transformer models for you to choose from :).
- The original Vision Transformer and its GitHub repository. It is easier to use the Hugging Face ViT implementation, though (see the sketch after this list).
- The next-generation ViT that serves as a backbone for most vision tasks, the Swin Transformer, and its second version, Swin V2. The second version added some tweaks to make it more scalable and able to handle high-resolution images, and Swin V2 was also pre-trained in a self-supervised way. Both models are available on Hugging Face through the pages Swin V1 and Swin V2.
- Another model that seems to be excellent is BEiT, which is based on BERT. It is also available on Hugging Face.
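If you want to actually run one of these image models, the easiest entry point is the Hugging Face pipeline API. Below is a minimal sketch; the checkpoint ids are commonly published Hub names (my own choice, not taken from the links above), and the image path is just a placeholder.

```python
# Minimal sketch: loading ViT through the Hugging Face pipeline API.
# Swap the checkpoint for a Swin or BEiT one, e.g.
# "microsoft/swin-tiny-patch4-window7-224" or "microsoft/beit-base-patch16-224",
# to try the other backbones mentioned above.
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Any local path or URL to an image works here; "cat.jpg" is a placeholder.
predictions = classifier("cat.jpg")
for p in predictions:
    print(p["label"], round(p["score"], 3))
```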
Now the ultimate level, the one that will test your limits and your understanding of the previous models: the Video Transformer. I will confess that I am new to Video Transformers, but here are some suggestions. Personally, I have only read about the Video Swin Transformer.
- Video Swin Transformer. The code is in the same repository as Swin V1 and V2, and there is also an implementation of this model in the torchvision library (see the sketch below).
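Here is a minimal sketch of the torchvision implementation, assuming torchvision >= 0.15 (where the Swin3D models landed) and using a random tensor in place of a real clip:

```python
# Minimal sketch of the torchvision Video Swin Transformer (Swin3D),
# assuming torchvision >= 0.15. A random tensor stands in for a real clip.
import torch
from torchvision.models.video import swin3d_t, Swin3D_T_Weights

weights = Swin3D_T_Weights.KINETICS400_V1
model = swin3d_t(weights=weights).eval()

# Fake clip with shape (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 224, 224)

with torch.no_grad():
    logits = model(clip)

top = logits.softmax(-1).argmax(-1).item()
print(weights.meta["categories"][top])  # Kinetics-400 class name
```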
The ones that I do not know yet but want to learn more about:
- The Hugging Face implementation of TimeSformer (sketched at the end of this list).
- The torchvision implementation of MViT, another video transformer for classification and detection.
- Another interesting model is Swin2SR, which performs super-resolution on compressed images and videos.