# FoodMini Vision Transformer (ViT) Project

- Original ViT paper: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
- Original Transformer paper: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- Dataset: FoodVision Mini dataset

## Overview

Welcome to the FoodMini Vision Transformer project! In this repository, I share my journey of replicating the machine learning research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (the ViT paper) using PyTorch. The primary goal is to build a Vision Transformer (ViT) from scratch and achieve above 90% test accuracy on the FoodVision Mini problem.

## Background

The Transformer neural network architecture was originally introduced in the machine learning research paper "Attention Is All You Need." Initially designed for one-dimensional (1D) sequences of text, it uses the attention mechanism as its primary learning layer, much as a convolutional neural network (CNN) uses convolutions. In general, a Transformer is any neural network whose primary learning layer is attention.
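To make "attention as the primary learning layer" concrete, here is a minimal self-attention sketch using PyTorch's built-in `nn.MultiheadAttention`. All shapes and hyperparameters below are illustrative assumptions, not values taken from this repository:

```python
import torch
import torch.nn as nn

# Self-attention over a sequence of token embeddings, the core layer of a Transformer.
# Shapes are illustrative: a batch of 2 sequences, 8 tokens each, 64-dim embeddings.
embed_dim, num_heads = 64, 4
attention = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads, batch_first=True)

tokens = torch.randn(2, 8, embed_dim)  # (batch, sequence_length, embed_dim)

# In self-attention, the same sequence serves as query, key, and value.
attn_output, attn_weights = attention(tokens, tokens, tokens)
print(attn_output.shape)   # torch.Size([2, 8, 64])
print(attn_weights.shape)  # torch.Size([2, 8, 8]) -- per-sequence attention map, averaged over heads
```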

## Replication

This repository replicates the results presented in the ViT paper by implementing the Vision Transformer architecture with PyTorch. ViT has emerged as a state-of-the-art architecture for computer vision tasks, showcasing remarkable performance in image recognition at scale.
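The paper's title points at the key idea: an image is split into fixed-size patches (e.g., 16x16 pixels), and each patch is linearly embedded and treated like a word token in a sequence. A common way to implement this patch embedding is a strided `nn.Conv2d`; the sketch below is an illustrative assumption, not necessarily identical to this repository's implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turns a 2D image into a 1D sequence of learnable patch embeddings."""

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A conv with kernel_size == stride == patch_size slices the image into
        # non-overlapping patches and linearly projects each one in a single op.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)          # (batch, embed_dim, H/16, W/16)
        x = x.flatten(2)          # (batch, embed_dim, num_patches)
        return x.transpose(1, 2)  # (batch, num_patches, embed_dim)

image = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
patches = PatchEmbedding()(image)
print(patches.shape)                 # torch.Size([1, 196, 768]) -- a sequence of 14x14 patch tokens
```

In the full ViT, a learnable class token and position embeddings are then added to this sequence before it is fed through standard Transformer encoder blocks.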

## Getting Started

To embark on this replication journey and run the Vision Transformer on the FoodVision Mini problem, follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/rkstu/FoodMini-Vision-Transformer.git
   cd FoodMini-Vision-Transformer
   ```