I replicated the Vision Transformer (ViT) from scratch in Python and PyTorch. The project builds the full architecture without relying on pre-built transformer modules, which gives a deeper understanding of how this state-of-the-art model works.
The Vision Transformer (ViT) divides an image into fixed-size patches, runs a transformer encoder over the resulting patch sequence, and predicts a class label. This implementation includes (sketches of each component follow the list):
- Patch Embedding: Splits images into patches and embeds them.
- Transformer Encoder: Applies multi-head self-attention and feed-forward layers.
- Classification Head: Final dense layers for classification tasks.
Original paper: https://arxiv.org/abs/2010.11929
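The sketch below shows one common way to implement patch embedding; it is illustrative rather than a copy of this repo's exact code, and the class name `PatchEmbedding` and the ViT-Base/16 defaults (224×224 images, 16×16 patches, 768-dim embeddings) are assumptions taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                  # x: (B, C, H, W)
        x = self.proj(x)                   # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return x
```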
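For the encoder, here is a minimal from-scratch sketch of a pre-norm transformer block with hand-rolled multi-head self-attention; hyperparameters again follow ViT-Base (12 heads, MLP ratio 4), and the class names are illustrative assumptions:

```python
class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention across several heads, implemented manually."""
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)   # joint Q, K, V projection
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                                  # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)  # concat heads
        return self.proj(out)

class EncoderBlock(nn.Module):
    """Pre-norm encoder block: attention and MLP sub-layers with residuals."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # self-attention sub-layer
        x = x + self.mlp(self.norm2(x))    # feed-forward sub-layer
        return x
```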
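Finally, a sketch of how the classification head ties everything together, reusing the classes above: a learnable [CLS] token is prepended, positional embeddings are added, and a linear layer maps the final [CLS] representation to class logits (the `num_classes=1000` default is an assumption, not necessarily what this repo trains on):

```python
class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches
        # Learnable [CLS] token and positional embeddings, as in the paper.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.blocks = nn.Sequential(
            *[EncoderBlock(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)    # classification head

    def forward(self, x):
        x = self.patch_embed(x)                          # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend CLS, add positions
        x = self.norm(self.blocks(x))
        return self.head(x[:, 0])                        # classify from the CLS token
```

For example, `ViT()(torch.randn(1, 3, 224, 224))` returns logits of shape `(1, 1000)`.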