This folder contains the implementation of InternViT-6B for image classification.
The codebase for this part is derived from InternImage, with some code references to EVA and DINOv2. Thanks for their great work.
InternViT-6B follows the structure of vanilla ViT, and its hyperparameters are listed in the table below.
See INSTALLATION.md
Please prepare the dataset according to your needs.
- ImageNet-1K: We use the standard ImageNet dataset; you can download it from http://image-net.org/.
- ImageNet-A: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar.
- ImageNet-R: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar.
- ImageNetV2: Download it from https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz.
- ImageNet-Sketch: Download it using gdown.

# GDown is needed to download the dataset. Please install it via `pip install gdown`
gdown --id 1Mj0i5HBthqH1p_yeXzsg22gZduvgoNeA
First, please prepare the ImageNet-1K, ImageNet-A, ImageNet-R, ImageNetV2, and ImageNet-Sketch datasets following the directory structure outlined below.
$ tree data
data
├── imagenet-1k
│   ├── train
│   │   ├── n01498041
│   │   └── ...
│   └── val
│       ├── ILSVRC2012_val_00000001.JPEG
│       └── ...
├── imagenet-a
│   ├── n01498041
│   └── ...
├── imagenet-r
│   ├── n01443537
│   └── ...
├── imagenet-sketch
│   ├── n01440764
│   └── ...
└── imagenetv2
    └── ImageNetV2-matched-frequency
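Before launching a long job, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the codebase: the helper name `missing_dataset_dirs` is hypothetical, and the directory names are copied from the tree above under the assumption that the data root is `data/`.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the data root.
EXPECTED_DIRS = [
    "imagenet-1k/train",
    "imagenet-1k/val",
    "imagenet-a",
    "imagenet-r",
    "imagenet-sketch",
    "imagenetv2/ImageNetV2-matched-frequency",
]


def missing_dataset_dirs(root="data"):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("data")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

The check only looks at directory names; it does not validate the images or class folders inside them.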
Then, unzip train.txt.zip and val.txt.zip in meta_data/.
cd meta_data/
unzip train.txt.zip
unzip val.txt.zip
model name | type | download | size |
---|---|---|---|
InternViT-6B-224px | pytorch | 🤗 HF link | 12 GB |
InternViT-6B-224px-head | pytorch | 🤗 HF link | 25.7 MB |
Please download the above model weights and place them in the pretrained/ folder.
cd pretrained/
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth
The directory structure is:
pretrained
├── intern_vit_6b_224px_head.pth
└── intern_vit_6b_224px.pth
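A truncated download is easy to miss, so a quick check of the folder can be useful. This is a small sketch under the assumption that the weights live in `pretrained/` with the file names used in the wget commands above; `check_pretrained` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_WEIGHTS = ["intern_vit_6b_224px.pth", "intern_vit_6b_224px_head.pth"]


def check_pretrained(folder="pretrained"):
    """Map each expected weight file to its size in bytes, or None if absent."""
    folder = Path(folder)
    report = {}
    for name in EXPECTED_WEIGHTS:
        path = folder / name
        report[name] = path.stat().st_size if path.is_file() else None
    return report


if __name__ == "__main__":
    for name, size in check_pretrained().items():
        status = "missing" if size is None else f"{size / 1e6:.1f} MB"
        print(f"{name}: {status}")
```

The reported sizes can be compared by eye against the table above (roughly 12 GB for the backbone and 25.7 MB for the head).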
Note: please install apex before training (see the installation guide above for details).
To train a linear classifier for InternViT-6B on ImageNet with 8 GPUs, run:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --launcher slurm
model name | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | download |
---|---|---|---|---|---|---|---|
intern_vit_6b_1k_224.yaml | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 | ckpt, log |
Evaluate InternViT-6B on ImageNet-1K val with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 88.230 Acc@5 98.474
Accuracy of the network on the 50000 test images: 88.2%
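When comparing a run against the expected numbers, the summary line can be parsed rather than read by eye. The sketch below assumes the log line has the `* Acc@1 ... Acc@5 ...` shape shown above; `parse_summary` is a hypothetical helper, and the regular expression is an assumption about the format, not something taken from the codebase.

```python
import re

# Captures the two percentages from a line such as "* Acc@1 88.230 Acc@5 98.474".
# Any text before "Acc@1" or after the Acc@5 value is ignored.
SUMMARY_RE = re.compile(r"Acc@1\s+([\d.]+)\s+Acc@5\s+([\d.]+)")


def parse_summary(line):
    """Return (top1, top5) as floats, or None if the line is not a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    top1, top5 = parse_summary("* Acc@1 88.230 Acc@5 98.474")
    print(f"top-1 = {top1:.3f}, top-5 = {top5:.3f}")
```

Here Acc@1 is the share of images whose highest-scoring class is correct, and Acc@5 is the share whose correct class is among the five highest-scoring classes.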
Evaluate InternViT-6B on ImageNet-ReaL with 1 GPU:
Note: ImageNet-ReaL currently supports only single-GPU testing.
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=1 GPUS_PER_NODE=1 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605
ReaL Accuracy of the network on the 50000 test images: 90.4%
Evaluate InternViT-6B on ImageNetV2 with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 79.940 Acc@5 95.340
Accuracy of the network on the 10000 test images: 79.9%
Evaluate InternViT-6B on ImageNet-A with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 77.479 Acc@5 92.737
Accuracy of the network on the 7500 test images: 77.5%
Evaluate InternViT-6B on ImageNet-R with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 89.777 Acc@5 97.023
Accuracy of the network on the 30000 test images: 89.8%
Evaluate InternViT-6B on ImageNet-Sketch with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 69.117 Acc@5 88.341
Accuracy of the network on the 50889 test images: 69.1%
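The six evaluation commands differ only in the config file and the GPU count, so they can be generated from one table. The following is a sketch under that assumption; `EVAL_CONFIGS` and `build_eval_command` are hypothetical names, while the config paths, checkpoint path, and flags are copied from the commands in this section.

```python
# Config file and GPU count per benchmark, copied from the commands in this section.
EVAL_CONFIGS = {
    "ImageNet-1K": ("configs/intern_vit_6b_1k_224.yaml", 8),
    "ImageNet-ReaL": ("configs/intern_vit_6b_1k_224_test_imagenet_real.yaml", 1),
    "ImageNetV2": ("configs/intern_vit_6b_1k_224_test_imagenetv2.yaml", 8),
    "ImageNet-A": ("configs/intern_vit_6b_1k_224_test_imagenet_a.yaml", 8),
    "ImageNet-R": ("configs/intern_vit_6b_1k_224_test_imagenet_r.yaml", 8),
    "ImageNet-Sketch": ("configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml", 8),
}
HEAD_CKPT = "pretrained/intern_vit_6b_224px_head.pth"


def build_eval_command(benchmark, master_port=12345):
    """Return the evaluation command for `benchmark` as an argument list."""
    cfg, gpus = EVAL_CONFIGS[benchmark]
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(gpus),
        "--master_port", str(master_port),
        "main.py", "--eval",
        "--cfg", cfg,
        "--resume", HEAD_CKPT,
    ]


if __name__ == "__main__":
    # Only prints the commands; nothing is launched here.
    for name in EVAL_CONFIGS:
        print(f"{name}: {' '.join(build_eval_command(name))}")
```

To actually run a benchmark, the returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `main.py`.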