[Feature] add eva02 backbone (open-mmlab#1450)
* [CI] Add test mim CI. (open-mmlab#879)

* feat: add eva02 backbone

* update

* update ci

* rebase

* update readme and configs

* refactor eva02

* update readme and metafile

* rename eva02

* fix unit tests

* rename configs

---------

Co-authored-by: Ma Zerun <[email protected]>
Co-authored-by: Ezra-Yu <[email protected]>
3 people authored May 6, 2023
1 parent 7f4eccb commit 034919d
Showing 20 changed files with 1,317 additions and 2 deletions.
62 changes: 62 additions & 0 deletions configs/_base_/datasets/imagenet_bs16_eva_448.py
@@ -0,0 +1,62 @@
# dataset settings
dataset_type = 'ImageNet'
data_preprocessor = dict(
    num_classes=1000,
    # RGB format normalization parameters
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    # convert image from BGR to RGB
    to_rgb=True,
)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='RandomResizedCrop',
        scale=448,
        backend='pillow',
        interpolation='bicubic'),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    dict(type='PackInputs'),
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='ResizeEdge',
        scale=448,
        edge='short',
        backend='pillow',
        interpolation='bicubic'),
    dict(type='CenterCrop', crop_size=448),
    dict(type='PackInputs'),
]

train_dataloader = dict(
    batch_size=16,
    num_workers=5,
    dataset=dict(
        type=dataset_type,
        data_root='data/imagenet',
        ann_file='meta/train.txt',
        data_prefix='train',
        pipeline=train_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=True),
)

val_dataloader = dict(
    batch_size=8,
    num_workers=5,
    dataset=dict(
        type=dataset_type,
        data_root='data/imagenet',
        ann_file='meta/val.txt',
        data_prefix='val',
        pipeline=test_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=False),
)
val_evaluator = dict(type='Accuracy', topk=(1, 5))

# If you want standard test, please manually configure the test dataset
test_dataloader = val_dataloader
test_evaluator = val_evaluator
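The mean and std above are the OpenAI CLIP preprocessing statistics rescaled to the 0-255 pixel range that the data preprocessor operates on. A quick arithmetic check of the resulting values (plain Python, no mmpretrain APIs):

```python
# CLIP normalization statistics on the [0, 1] scale.
clip_mean = [0.48145466, 0.4578275, 0.40821073]
clip_std = [0.26862954, 0.26130258, 0.27577711]

# The config multiplies by 255 because the preprocessor receives raw 0-255 pixels.
print([round(m * 255, 3) for m in clip_mean])  # [122.771, 116.746, 104.094]
print([round(s * 255, 3) for s in clip_std])   # [68.501, 66.632, 70.323]
```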
109 changes: 109 additions & 0 deletions configs/eva02/README.md
@@ -0,0 +1,109 @@
# EVA-02

> [EVA-02: A Visual Representation for Neon Genesis](https://arxiv.org/abs/2303.11331)
<!-- [ALGORITHM] -->

## Abstract

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.

<div align=center>
<img src="https://user-images.githubusercontent.com/40905160/229037980-b83dceb5-41d6-406c-a20b-63b83c80136d.png" width="70%" alt="TrV builds upon the original plain ViT architecture and includes several enhancements: SwiGLU FFN, sub-LN, 2D RoPE, and JAX weight initialization. To keep the parameter & FLOPs consistent with the baseline, the FFN hidden dim of SwiGLU is 2/3× of the typical MLP counterpart."/>
</div>
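The figure above summarizes the TrV block: a plain ViT layer with a SwiGLU FFN, sub-LN, and 2D RoPE, where the SwiGLU hidden width is scaled to 2/3 of the usual 4x expansion to keep parameters comparable. A rough PyTorch sketch of that FFN idea (an illustrative module, not the ViTEVA02 implementation in mmpretrain):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Illustrative SwiGLU feed-forward block.

    The hidden width is 2/3 of the usual 4x expansion, so the gated variant
    keeps roughly the same parameter count as a plain MLP.
    """

    def __init__(self, dim: int):
        super().__init__()
        hidden = int(dim * 4 * 2 / 3)
        self.w12 = nn.Linear(dim, 2 * hidden)  # joint value/gate projection
        self.w3 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.w12(x).chunk(2, dim=-1)
        return self.w3(value * F.silu(gate))


x = torch.rand(1, 197, 768)
print(SwiGLUFFN(768)(x).shape)  # torch.Size([1, 197, 768])
```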

## How to use it?

<!-- [TABS-BEGIN] -->

**Predict image**

```python
from mmpretrain import inference_model

predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```

**Use the model**

```python
import torch
from mmpretrain import get_model

model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
inputs = torch.rand(1, 3, 336, 336)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```

**Train/Test Command**

Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).

Train:

```shell
python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py
```

Test:

```shell
python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth
```

<!-- [TABS-END] -->
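For multi-GPU runs, the usual OpenMMLab launcher applies. This sketch assumes `tools/dist_train.sh` is available in this repository, as it is in other OpenMMLab projects:

```shell
# Hypothetical multi-GPU launch; adjust the GPU count to your machine.
bash tools/dist_train.sh configs/eva02/eva02-tiny-p14_in1k.py 8
```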

## Models and results

### Pretrained models

| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------- | :--------: | :-------: | :-----------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
| `vit-tiny-p14_eva02-pre_in21k`\* | 5.50 | 1.70 | [config](eva02-tiny-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth) |
| `vit-small-p14_eva02-pre_in21k`\* | 21.62 | 6.14 | [config](eva02-small-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth) |
| `vit-base-p14_eva02-pre_in21k`\* | 85.77 | 23.22 | [config](eva02-base-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth) |
| `vit-large-p14_eva02-pre_in21k`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth) |
| `vit-large-p14_eva02-pre_m38m`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth) |

- The input size / patch size of MIM pre-trained EVA-02 is `224x224` / `14x14`.

*Models with * are converted from the [official repo](https://github.com/baaivision/EVA).*
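As a quick sanity check on the pre-training geometry stated above (plain arithmetic only):

```python
img_size, patch_size = 224, 14
grid = img_size // patch_size
print(grid, grid * grid)  # 16 patches per side, 256 patch tokens
```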

### Image Classification on ImageNet-1k

#### (*w/o* IN-21K intermediate fine-tuning)

| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | [config](./eva02-tiny-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth) |
| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | [config](./eva02-small-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth) |
| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth) |

*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*

#### (*w* IN-21K intermediate fine-tuning)

| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth) |
| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth) |
| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth) |

*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*

## Citation

```bibtex
@article{EVA-02,
title={EVA-02: A Visual Representation for Neon Genesis},
author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
journal={arXiv preprint arXiv:2303.11331},
year={2023}
}
```
21 changes: 21 additions & 0 deletions configs/eva02/eva02-base-p14_headless.py
@@ -0,0 +1,21 @@
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='b',
        img_size=224,
        patch_size=14,
        sub_ln=True,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=None,
)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    # convert image from BGR to RGB
    to_rgb=True,
)
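The headless configs keep only the backbone (`neck=None`, `head=None`) and are intended for feature extraction or downstream use. A minimal sketch of building one through the registry, assuming the usual mmengine/mmpretrain pattern (the exact output shape depends on `out_type`):

```python
import torch
from mmengine.config import Config
from mmpretrain.registry import MODELS

cfg = Config.fromfile('configs/eva02/eva02-base-p14_headless.py')
model = MODELS.build(cfg.model)
model.eval()

with torch.no_grad():
    feats = model.extract_feat(torch.rand(1, 3, 224, 224))
# With out_type='avg_featmap', the backbone returns pooled features; for
# arch='b' this is expected to be a tuple holding a (1, 768) tensor.
print(type(feats), feats[0].shape)
```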
32 changes: 32 additions & 0 deletions configs/eva02/eva02-base-p14_in1k.py
@@ -0,0 +1,32 @@
_base_ = [
    '../_base_/datasets/imagenet_bs16_eva_448.py',
    '../_base_/schedules/imagenet_bs2048_AdamW.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='b',
        img_size=448,
        patch_size=14,
        sub_ln=True,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
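To fine-tune from the MIM pre-trained weights listed in the README rather than train from scratch, one option is a small derived config that sets `load_from`. This is a sketch only: the file name is hypothetical, and checkpoint loading is typically non-strict, so only the matching backbone keys would be restored from the headless checkpoint.

```python
# Hypothetical config, e.g. configs/eva02/eva02-base-p14_in21k-pre_in1k.py
_base_ = ['./eva02-base-p14_in1k.py']

# MIM pre-trained EVA-02 base checkpoint from the README's pretrained table.
load_from = ('https://download.openmmlab.com/mmpretrain/v1.0/eva02/'
             'eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth')
```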
21 changes: 21 additions & 0 deletions configs/eva02/eva02-large-p14_headless.py
@@ -0,0 +1,21 @@
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='l',
        img_size=224,
        patch_size=14,
        sub_ln=True,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=None,
)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    # convert image from BGR to RGB
    to_rgb=True,
)
32 changes: 32 additions & 0 deletions configs/eva02/eva02-large-p14_in1k.py
@@ -0,0 +1,32 @@
_base_ = [
    '../_base_/datasets/imagenet_bs16_eva_448.py',
    '../_base_/schedules/imagenet_bs2048_AdamW.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='l',
        img_size=448,
        patch_size=14,
        sub_ln=True,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=1024,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
20 changes: 20 additions & 0 deletions configs/eva02/eva02-small-p14_headless.py
@@ -0,0 +1,20 @@
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='s',
        img_size=224,
        patch_size=14,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=None,
)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    # convert image from BGR to RGB
    to_rgb=True,
)
31 changes: 31 additions & 0 deletions configs/eva02/eva02-small-p14_in1k.py
@@ -0,0 +1,31 @@
_base_ = [
    '../_base_/datasets/imagenet_bs16_eva_336.py',
    '../_base_/schedules/imagenet_bs2048_AdamW.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='s',
        img_size=336,
        patch_size=14,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=384,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
20 changes: 20 additions & 0 deletions configs/eva02/eva02-tiny-p14_headless.py
@@ -0,0 +1,20 @@
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='t',
        img_size=224,
        patch_size=14,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=None,
)

data_preprocessor = dict(
    # RGB format normalization parameters
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    # convert image from BGR to RGB
    to_rgb=True,
)
31 changes: 31 additions & 0 deletions configs/eva02/eva02-tiny-p14_in1k.py
@@ -0,0 +1,31 @@
_base_ = [
    '../_base_/datasets/imagenet_bs16_eva_336.py',
    '../_base_/schedules/imagenet_bs2048_AdamW.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='ViTEVA02',
        arch='t',
        img_size=336,
        patch_size=14,
        final_norm=False,
        out_type='avg_featmap'),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=192,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
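Across the four `*_in1k.py` configs, the head's `in_channels` tracks the backbone arch letter. A small reference mapping inferred from these configs (the authoritative values live in the ViTEVA02 arch settings):

```python
# Embedding widths implied by LinearClsHead in_channels in the configs above.
EVA02_EMBED_DIMS = {
    't': 192,   # eva02-tiny-p14_in1k.py
    's': 384,   # eva02-small-p14_in1k.py
    'b': 768,   # eva02-base-p14_in1k.py
    'l': 1024,  # eva02-large-p14_in1k.py
}
```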
