Original ViT paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Original Transformer paper: Attention Is All You Need
Dataset: FoodVision Mini
Welcome to the FoodMini Vision Transformer project! This repository documents my replication of the machine learning research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (the ViT paper) using PyTorch. The primary goal is to build a Vision Transformer (ViT) from scratch and achieve a test accuracy above 90% on the FoodVision Mini problem.
The Vision Transformer builds on the Transformer neural network architecture, originally introduced in the machine learning research paper "Attention Is All You Need." Initially designed for one-dimensional (1D) sequences of text, a Transformer is, broadly, any neural network that uses the attention mechanism as its primary learning layer, in the same way that a convolutional neural network (CNN) uses convolutions.
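To make the "attention as the primary learning layer" idea concrete, here is a minimal sketch (my own illustration, not code from this repository) of a single multi-head self-attention step in PyTorch using `torch.nn.MultiheadAttention`. The tensor shapes are illustrative assumptions chosen to mirror ViT-Base (196 patch tokens plus 1 class token, embedding dimension 768):

```python
import torch
import torch.nn as nn

# A batch of 8 token sequences, each with 197 tokens (196 image patches + 1 class token)
# and an embedding dimension of 768 -- shapes mirror ViT-Base, purely for illustration.
tokens = torch.randn(8, 197, 768)

# Multi-head self-attention: every token attends to every other token in the sequence.
attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

# Self-attention uses the same tensor as query, key, and value.
attn_output, attn_weights = attention(tokens, tokens, tokens)

print(attn_output.shape)   # torch.Size([8, 197, 768])
print(attn_weights.shape)  # torch.Size([8, 197, 197]) -- averaged over heads by default
```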
This project replicates the Vision Transformer architecture presented in the ViT paper by implementing it with PyTorch. ViT has emerged as a state-of-the-art approach for computer vision tasks, showing strong performance in image recognition at scale.
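The "16x16 words" in the paper's title refer to splitting an image into fixed-size patches and projecting each patch into an embedding vector, so the image becomes a 1D sequence the Transformer can process. A common way to do this in PyTorch is a strided `Conv2d`; the sketch below is my own illustration under ViT-Base assumptions (224x224 input, 16x16 patches, 768-dimensional embeddings), not necessarily how this repository implements it:

```python
import torch
import torch.nn as nn

patch_size = 16      # each patch is 16x16 pixels, as in the paper's title
embedding_dim = 768  # ViT-Base hidden size (illustrative assumption)

# A Conv2d with kernel_size == stride == patch_size cuts the image into
# non-overlapping patches and linearly projects each one in a single step.
patch_embed = nn.Conv2d(in_channels=3,
                        out_channels=embedding_dim,
                        kernel_size=patch_size,
                        stride=patch_size)

image = torch.randn(1, 3, 224, 224)         # one RGB image, 224x224
patches = patch_embed(image)                 # -> (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # -> (1, 196, 768): a sequence of patch embeddings

print(tokens.shape)  # torch.Size([1, 196, 768])
```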
To embark on this replication journey and run the Vision Transformer on the FoodVision Mini problem, follow these steps:
- Clone the repository:

```bash
git clone https://github.com/rkstu/FoodMini-Vision-Transformer.git
cd FoodMini-Vision-Transformer
```
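After cloning, a quick way to confirm that PyTorch and torchvision are installed correctly is to run a dummy forward pass through torchvision's reference ViT-Base/16. This uses the library's built-in model, not the from-scratch implementation in this repository, and is only an environment sanity check:

```python
import torch
import torchvision

# torchvision's reference ViT-Base/16; weights=None avoids downloading pretrained weights.
model = torchvision.models.vit_b_16(weights=None)
model.eval()

# A single random "image" at the 224x224 resolution the model expects.
dummy_image = torch.randn(1, 3, 224, 224)

with torch.inference_mode():
    logits = model(dummy_image)

print(logits.shape)  # torch.Size([1, 1000]) -- 1000 ImageNet classes by default
```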