The notebook contains three parts:
- Building a vision transformer from scratch
- Fine-tuning a pre-trained ViT
- Comparing ViTs and CNNs
In the first part, we train the from-scratch ViT for 20 epochs, using 4 attention heads, 4 transformer layers, and an embedding dimension of 64.
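Putting those numbers together, a compact PyTorch implementation might look like the sketch below. Only the 4 heads, 4 layers, and 64-dimensional embedding come from the notebook; the image size (32), patch size (4), and 10-way classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT: 4 layers, 4 heads, embedding dim 64 (trained for 20 epochs)."""
    def __init__(self, image_size=32, patch_size=4, num_classes=10,
                 embed_dim=64, num_heads=4, num_layers=4):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided conv cuts the image into patches and projects each to embed_dim.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)              # (B, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])            # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 32, 32))  # smoke test: shape (2, 10)
```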
There is also a built-in ViT that was pre-trained on ImageNet-21k at 224 × 224 resolution; here we fine-tune it for 3 epochs.
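As a sketch of that step, the loop below fine-tunes a Hugging Face ImageNet-21k checkpoint for 3 epochs. The checkpoint name (`google/vit-base-patch16-224-in21k`), the 10-class head, the optimizer, and the toy data are assumptions; only the pre-training corpus, resolution, and epoch count come from the notebook.

```python
import torch
from transformers import ViTForImageClassification

# Assumed checkpoint matching the description (ImageNet-21k, 224 x 224).
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10)                            # assumed number of classes
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy one-batch "loader" so the sketch runs; swap in the real DataLoader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,)))]

model.train()
for epoch in range(3):                        # 3 fine-tuning epochs, as stated
    for pixel_values, labels in loader:
        outputs = model(pixel_values=pixel_values, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```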
Finally, a ResNet18 is trained on the same dataset for 3 epochs to serve as the CNN baseline in the comparison.
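A minimal version of that baseline might look like the following. The 10-class head, optimizer settings, and toy data are assumptions, and the notebook does not say whether the ResNet18 starts from ImageNet weights, so `weights=None` (training from scratch) is also an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# ResNet18 with a replaced classification head; weights=None (from scratch)
# is an assumption, since the notebook does not specify initialization.
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 10)   # assumed 10 classes

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Toy one-batch "loader" so the sketch runs; substitute the real DataLoader.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,)))]

model.train()
for epoch in range(3):                           # 3 epochs, as stated
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```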