Original ViT paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Original Transformer paper: Attention Is All You Need
Dataset: FoodVision Mini
Welcome to the FoodMini Vision Transformer project! This repository documents my replication of the machine learning research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (the ViT paper) using PyTorch. The primary goal is to build a Vision Transformer (ViT) from scratch and achieve a test accuracy above 90% on the FoodVision Mini problem.
The Vision Transformer builds on the Transformer neural network architecture, originally introduced in the machine learning research paper "Attention Is All You Need." Initially designed for one-dimensional (1D) sequences of text, a Transformer is, broadly, any neural network that uses the attention mechanism as its primary learning layer, in the same way that a convolutional neural network (CNN) uses convolutions.
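To make the "attention as the primary learning layer" idea concrete, here is a minimal sketch (my own illustration, not code from this repository) of a single multi-head self-attention step in PyTorch using `torch.nn.MultiheadAttention`. The tensor shapes are illustrative assumptions chosen to mirror ViT-Base (196 patch tokens plus 1 class token, embedding dimension 768):

```python
import torch
import torch.nn as nn

# A batch of 8 token sequences, each with 197 tokens (196 image patches + 1 class token)
# and an embedding dimension of 768 -- shapes mirror ViT-Base, purely for illustration.
tokens = torch.randn(8, 197, 768)

# Multi-head self-attention: every token attends to every other token in the sequence.
attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

# Self-attention uses the same tensor as query, key, and value.
attn_output, attn_weights = attention(tokens, tokens, tokens)

print(attn_output.shape)   # torch.Size([8, 197, 768])
print(attn_weights.shape)  # torch.Size([8, 197, 197]) -- averaged over heads by default
```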
This project replicates the Vision Transformer architecture presented in the ViT paper by implementing it with PyTorch. ViT has emerged as a state-of-the-art approach for computer vision tasks, showing strong performance in image recognition at scale.
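The "16x16 words" in the paper's title refer to splitting an image into fixed-size patches and projecting each patch into an embedding vector, so the image becomes a 1D sequence the Transformer can process. A common way to do this in PyTorch is a strided `Conv2d`; the sketch below is my own illustration under ViT-Base assumptions (224x224 input, 16x16 patches, 768-dimensional embeddings), not necessarily how this repository implements it:

```python
import torch
import torch.nn as nn

patch_size = 16      # each patch is 16x16 pixels, as in the paper's title
embedding_dim = 768  # ViT-Base hidden size (illustrative assumption)

# A Conv2d with kernel_size == stride == patch_size cuts the image into
# non-overlapping patches and linearly projects each one in a single step.
patch_embed = nn.Conv2d(in_channels=3,
                        out_channels=embedding_dim,
                        kernel_size=patch_size,
                        stride=patch_size)

image = torch.randn(1, 3, 224, 224)         # one RGB image, 224x224
patches = patch_embed(image)                 # -> (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # -> (1, 196, 768): a sequence of patch embeddings

print(tokens.shape)  # torch.Size([1, 196, 768])
```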
To embark on this replication journey and run the Vision Transformer on the FoodVision Mini problem, follow these steps:
- Clone the repository:

```bash
git clone https://github.com/rkstu/FoodMini-Vision-Transformer.git
cd FoodMini-Vision-Transformer
```
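After cloning, a quick way to confirm that PyTorch and torchvision are installed correctly is to run a dummy forward pass through torchvision's reference ViT-Base/16. This uses the library's built-in model, not the from-scratch implementation in this repository, and is only an environment sanity check:

```python
import torch
import torchvision

# torchvision's reference ViT-Base/16; weights=None avoids downloading pretrained weights.
model = torchvision.models.vit_b_16(weights=None)
model.eval()

# A single random "image" at the 224x224 resolution the model expects.
dummy_image = torch.randn(1, 3, 224, 224)

with torch.inference_mode():
    logits = model(dummy_image)

print(logits.shape)  # torch.Size([1, 1000]) -- 1000 ImageNet classes by default
```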