metax, ViT model support (FlagOpen#518)
* add vit

* ch readme

---------

Co-authored-by: xiaofeng guo <[email protected]>
xfguo-ucas and xiaofeng guo authored Apr 18, 2024
1 parent 65a82fe commit bc0f310
Showing 8 changed files with 90 additions and 0 deletions.
43 changes: 43 additions & 0 deletions training/metax/vit-pytorch/README.md
@@ -0,0 +1,43 @@
### Test dataset download
[Test dataset download](../../benchmarks/vit/README.md#数据集)

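### MetaX (沐曦集成电路) C500 GPU configuration and run reference
#### Environment configuration
- ##### Hardware environment
- Machine and accelerator model: 曦云®C500 64G
- Multi-node network type and bandwidth: InfiniBand, 2x200 Gb/s
- ##### Software environment
- OS version: Ubuntu 20.04.6
- OS kernel version: 5.4.0-26-generic
- Accelerator driver version: 2.2.0
- Docker version: 24.0.7
- Training framework version: pytorch-2.0.0+mc2.18.0.8-cp38-cp38-linux_x86_64.whl
- Dependency versions: none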


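### Run results
* General metrics

| Metric | Value | Notes |
| -------------- | --------------------------------------------- | ------------------------------------------- |
| Task category | Image Classification | |
| Model | vit | |
| Dataset | Imagenet2012 1K | |
| Data precision | precision, see "Performance metrics" | One of fp32/amp/fp16/tf32 |
| Hyperparameter changes | fix_hp, see "Performance metrics" | Special hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware short name | MXC500 | |
| Device memory usage | mem, see "Performance metrics" | Commonly called "video memory"; unit is GiB |
| End-to-end time | e2e_time, see "Performance metrics" | Total time plus Perf initialization and similar overhead |
| Overall throughput | p_whole, see "Performance metrics" | Training samples actually processed divided by total time (performance_whole) |
| Training throughput | p_train, see "Performance metrics" | Excludes the evaluation time at the end of each epoch |
| **Compute throughput** | **p_core, see "Performance metrics"** | Excludes the time spent on data I/O (p3>p2>p1, i.e. p_core > p_train > p_whole) |
| Training result | final_acc1, see "Performance metrics" | Validation accuracy |
| Additional modifications | | |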

* Performance metrics

| Configuration | precision | fix_hp | e2e_time | p_whole | p_train | p_core | final_acc1 | mem |
| ----------------- | --------- | ------ | -------- | ------- | ------- | ------ | ---------- | --------- |
| C500 single node, 8 cards (1x8) | fp32 | bs=256 | / | | | | 79.822 | 33.3/64.0 |
| C500 single node, 1 card (1x1) | fp32 | bs=256 | / | | | | / | 30.3/64.0 |
| C500 two nodes, 8 cards each (2x8) | fp32 | bs=256 | / | | | | / | 33.1/64.0 |
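
The three throughput metrics in the README differ only in which time is excluded from the denominator. The sketch below illustrates that relationship. It is not the benchmark's own measurement code: the function name, its parameters, and the sample numbers are hypothetical, and it assumes that p_core excludes both evaluation time and data I/O time, which is what the ordering p_core > p_train > p_whole suggests.

```python
def throughputs(num_samples, total_time_s, eval_time_s, io_time_s):
    """Return (p_whole, p_train, p_core) in samples per second.

    Hypothetical helper: the argument names are illustrative and do not
    come from the benchmark code.
    """
    # p_train drops the evaluation time spent at the end of each epoch.
    train_time_s = total_time_s - eval_time_s
    # p_core additionally drops data I/O time (assumed to be nested).
    core_time_s = train_time_s - io_time_s
    if core_time_s <= 0:
        raise ValueError("excluded time must be smaller than total time")
    p_whole = num_samples / total_time_s
    p_train = num_samples / train_time_s
    p_core = num_samples / core_time_s
    return p_whole, p_train, p_core


# Made-up numbers: 1000 samples, 10 s total, 2 s evaluation, 3 s data I/O.
p_whole, p_train, p_core = throughputs(1000, 10.0, 2.0, 3.0)
print(p_whole, p_train, p_core)
```

Because each metric removes more time from the denominator than the previous one, the values can only increase from p_whole to p_core.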
6 changes: 6 additions & 0 deletions training/metax/vit-pytorch/config/config_C500x1x1.py
@@ -0,0 +1,6 @@
from config_common import *

train_batch_size = 256
eval_batch_size = 512
gradient_accumulation_steps = 2
epochs = 2
5 changes: 5 additions & 0 deletions training/metax/vit-pytorch/config/config_C500x1x8.py
@@ -0,0 +1,5 @@
from config_common import *

train_batch_size = 256
eval_batch_size = 512
gradient_accumulation_steps = 2
6 changes: 6 additions & 0 deletions training/metax/vit-pytorch/config/config_C500x2x8.py
@@ -0,0 +1,6 @@
from config_common import *

train_batch_size = 256
eval_batch_size = 512
gradient_accumulation_steps = 2
epochs = 28
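
Each per-configuration file star-imports `config_common` and then reassigns some names, so a later assignment shadows the imported default. The sketch below models that layering with plain dictionaries to show which `epochs` value each configuration ends up with. The `COMMON`, `OVERRIDES`, and `resolve` names are hypothetical, and only a few of the common settings are copied in.

```python
# Defaults copied from config_common.py (subset only).
COMMON = {"epochs": 300, "lr": 0.003, "amp": False}

# Values assigned in each per-configuration file after the star import.
OVERRIDES = {
    "C500x1x1": {"train_batch_size": 256, "eval_batch_size": 512,
                 "gradient_accumulation_steps": 2, "epochs": 2},
    "C500x1x8": {"train_batch_size": 256, "eval_batch_size": 512,
                 "gradient_accumulation_steps": 2},
    "C500x2x8": {"train_batch_size": 256, "eval_batch_size": 512,
                 "gradient_accumulation_steps": 2, "epochs": 28},
}


def resolve(name):
    """Merge the defaults with one configuration's overrides (overrides win)."""
    return {**COMMON, **OVERRIDES[name]}


for name in OVERRIDES:
    print(name, "epochs =", resolve(name)["epochs"])
```

Under this model only the 1x8 configuration keeps the full 300-epoch schedule from the common file; the 1x1 and 2x8 configurations shorten it to 2 and 28 epochs.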
18 changes: 18 additions & 0 deletions training/metax/vit-pytorch/config/config_common.py
@@ -0,0 +1,18 @@
vendor = "metax"
dist_backend = "nccl"

epochs = 300
opt = "adamw"
lr = 0.003
weight_decay = 0.3
lr_scheduler = "cosineannealinglr"
lr_warmup_method = "linear"
lr_warmup_epochs = 30
lr_warmup_decay = 0.033
amp = False
label_smoothing = 0.11
mixup_alpha = 0.2
auto_augment = "ra"
clip_grad_norm = 1
ra_sampler = True
cutmix_alpha = 1.0
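
The learning-rate settings above describe a linear warmup followed by cosine annealing. The sketch below computes the per-epoch learning rate they would produce. It assumes the benchmark follows the torchvision classification reference recipe, where `lr_warmup_decay` is the starting factor of the linear warmup and the cosine phase spans the remaining epochs with a minimum learning rate of 0. This commit does not confirm that assumption, and `lr_at_epoch` is a hypothetical helper.

```python
import math

# Values copied from config_common.py.
EPOCHS = 300
LR = 0.003
WARMUP_EPOCHS = 30
WARMUP_DECAY = 0.033


def lr_at_epoch(epoch):
    """Learning rate at the start of `epoch` (0-based), under the assumed recipe."""
    if epoch < WARMUP_EPOCHS:
        # Linear ramp of the scaling factor from WARMUP_DECAY up to 1.0.
        factor = WARMUP_DECAY + (1.0 - WARMUP_DECAY) * epoch / WARMUP_EPOCHS
        return LR * factor
    # Cosine annealing from LR down to 0 over the remaining epochs.
    t = epoch - WARMUP_EPOCHS
    t_max = EPOCHS - WARMUP_EPOCHS
    return LR * 0.5 * (1.0 + math.cos(math.pi * t / t_max))


for epoch in (0, 15, 30, 165, 300):
    print(epoch, lr_at_epoch(epoch))
```

With the short `epochs` overrides in the 1x1 and 2x8 configurations, a run under this recipe would end before or during the 30-epoch warmup, so those runs would never reach the full learning rate of 0.003.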
6 changes: 6 additions & 0 deletions training/metax/vit-pytorch/config/environment_variables.sh
@@ -0,0 +1,6 @@
# =================================================
# Export variables
# =================================================

export METAX_USE_TF32=1
export PYTORCH_USE_FLASHATTN=1
6 changes: 6 additions & 0 deletions training/metax/vit-pytorch/config/requirements.txt
@@ -0,0 +1,6 @@
http://repo.metax-tech.com/r/pypi/simple/torch-2.0.0+gite544b36-cp38-cp38-linux_x86_64.whl
numpy
tqdm
schedule
timm==0.4.12
pyyaml
Empty file.
