
Vision Transformers

Steps:

  • Split the image into fixed-size patches.
  • Flatten each patch and embed it into a vector using a learnable linear projection layer.
  • Add positional encodings to the patch embeddings.
  • Prepend an additional learnable classification (CLS) token to the sequence; its output is used to make predictions.
  • The classification token is initialized with random values and trained along with the rest of the model.
  • Pass the embeddings (including the classification token) through a series of transformer encoder blocks.
  • The output of the final encoder block serves as the pooled features, i.e. a global representation of the image.
  • Pass the classification token's final embedding through a Multi-Layer Perceptron (MLP) head to make predictions (a PyTorch sketch of these steps follows below).

  • ViT is a simple vision transformer architecture that replaces the convolutional backbone of popular convolutional neural networks with a standard transformer encoder operating on image patches.
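
A minimal sketch of the steps above in PyTorch; the hyperparameters (patch size 16, embedding dimension 768, 12 heads and blocks, 1000 classes) are illustrative assumptions, not this repository's exact settings:

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch splitting + linear projection, implemented as one strided convolution.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable classification token, randomly initialized and trained with the model.
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        # Learnable positional encodings: one per patch plus one for the CLS token.
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        # Stack of transformer encoder blocks.
        block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        # Classification head applied to the CLS token's final embedding.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.patch_embed(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the classification token
        x = x + self.pos_embed                   # add positional encodings
        x = self.encoder(x)                      # transformer encoder blocks
        return self.head(x[:, 0])                # predict from the CLS token

logits = ViT()(torch.randn(2, 3, 224, 224))      # (2, 1000)
```

The strided convolution is just a compact way of splitting the image into non-overlapping patches and applying the same linear projection to each one.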

Abstract:

[Figure: ViT-1]

Proposed Model:

[Figure: ViT-2]

Pre-train then fine-tune ViT:

[Figure: ViT-3]
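
A hedged sketch of the fine-tuning step, using torchvision's pre-trained ViT-B/16 as an example backbone: load ImageNet-pretrained weights, swap the classification head for the downstream task, and train with a small learning rate. The number of classes, the stand-in batch, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

num_classes = 10                                         # assumed downstream label count
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)       # ImageNet pre-trained backbone
in_features = model.heads.head.in_features
model.heads.head = nn.Linear(in_features, num_classes)   # new task-specific head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a stand-in batch.
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, num_classes, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```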

Architecture:

[Figure: ViT-4]


Sample Prediction: