
Commit 4674374

Merge branch 'transformers' of https://github.com/kota2004/Algorithms-Explanation into transformers
2 parents 2519ce5 + 7c9bcfb

File tree

1 file changed: +34 -0 lines changed

@@ -0,0 +1,34 @@
# Vision Transformer in Deep Learning
## "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
## Introduction:
The Vision Transformer (ViT) paper demonstrated that a pure transformer-based architecture, without any convolutional layers, could achieve state-of-the-art results on large-scale image classification benchmarks. While initially applied to tasks like image classification, ViT has since been adapted for a wide range of computer vision tasks, including object detection and segmentation.
## Note:
The explanation assumes a standard input image of size 224x224x3 (height x width x channels).
## Explanation:
1) Given an image of size 224x224x3, divide it into non-overlapping patches of size 16x16x3, which gives (224/16)x(224/16) = 14x14 = 196 patches.
2) Each patch is flattened (16x16x3 = 768 values) and mapped through a learnable linear projection, which gives a sequence of patch embeddings.
3) A learnable position embedding is added to each patch embedding to retain the spatial information of the image.
4) A learnable class token is prepended to the sequence of (patch + position) embeddings.
5) The sequence of class + (patch + position) embeddings is the input to the Transformer Encoder, whose self-attention layers learn the relationships among the patches.
6) Only the class token from the output of the Transformer Encoder is passed through an MLP (multi-layer perceptron), aka the classifier.
7) The output of the MLP is fed to a softmax layer, which gives a probability vector; the class label is the index with the highest probability. (A minimal code sketch of the whole pipeline follows the Why section below.)

## Why:

Example: for a dog image, the Transformer Encoder outputs updated embeddings for the class token and for every patch, but the features needed to classify the image as a dog are aggregated into the class token, so that single embedding suffices for classification.
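To make steps 1-7 concrete, here is a minimal PyTorch sketch of the ViT forward pass. It is an illustration, not the paper's official implementation: the 224x224x3 input, 16x16 patches, and embedding size 768 match the explanation above, but the encoder depth, head count, and class count are placeholder values, and PyTorch's stock `TransformerEncoder` stands in for the paper's exact encoder block.

```python
# Illustrative ViT forward pass (a sketch, not the authors' implementation).
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=2, num_heads=8, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2       # 14*14 = 196
        patch_dim = in_channels * patch_size * patch_size   # 16*16*3 = 768
        # Step 2: learnable linear projection of flattened patches.
        self.patch_proj = nn.Linear(patch_dim, embed_dim)
        # Step 3: learnable position embeddings (one per patch + class token).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Step 4: learnable class token, prepended to the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Step 5: Transformer Encoder (self-attention over the sequence).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 6: MLP head (the classifier), applied to the class token only.
        self.mlp_head = nn.Linear(embed_dim, num_classes)
        self.patch_size = patch_size

    def forward(self, x):                        # x: (B, 3, 224, 224)
        B, C, _, _ = x.shape
        p = self.patch_size
        # Step 1: cut into non-overlapping 16x16 patches and flatten each.
        x = x.unfold(2, p, p).unfold(3, p, p)    # (B, C, 14, 14, 16, 16)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # (B, 196, 768)
        x = self.patch_proj(x)                   # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(B, -1, -1)   # one class token per image
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # (B, 197, 768)
        x = self.encoder(x)                      # self-attention among patches
        logits = self.mlp_head(x[:, 0])          # Step 6: class token only
        return logits.softmax(dim=-1)            # Step 7: probability vector

probs = ViTSketch()(torch.randn(1, 3, 224, 224))
print(probs.shape, probs.argmax(dim=-1))         # (1, 1000) and predicted class
```

For reference, the actual ViT-Base configuration uses 12 encoder layers and 12 attention heads, and during training the softmax is usually folded into the cross-entropy loss rather than applied inside the model.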
## Reference:

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020)

For a reference on how the attention mechanism (the Transformer Encoder in ViT) works, see: Attention Is All You Need (Vaswani et al., 2017)
