metax: ViT model support (FlagOpen#518)

* add vit
* ch readme

Co-authored-by: xiaofeng guo <[email protected]>
shh2000 · Apr 18, 2024 · bc0f310
1 parent 65a82fe
commit bc0f310

Showing 8 changed files with 90 additions and 0 deletions.
diff --git a/training/metax/vit-pytorch/README.md b/training/metax/vit-pytorch/README.md
@@ -0,0 +1,43 @@
+### Test dataset download
+[Download the test dataset](../../benchmarks/vit/README.md#数据集)
+
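+### MetaX C500 GPU configuration and run information reference
+#### Environment configuration
+- ##### Hardware environment
+    - Machine and accelerator model: 曦云®C500 64G
+    - Multi-node network type and bandwidth: InfiniBand, 2x200 Gb/s
+- ##### Software environment
+   - OS version: Ubuntu 20.04.6
+   - OS kernel version: 5.4.0-26-generic
+   - Accelerator driver version: 2.2.0
+   - Docker version: 24.0.7
+   - Training framework version: pytorch-2.0.0+mc2.18.0.8-cp38-cp38-linux_x86_64.whl
+   - Dependency versions: none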
+
+
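+### Run results
+* General metrics
+
+| Metric                     | Value                                   | Notes                                                                          |
+| -------------------------- | --------------------------------------- | ------------------------------------------------------------------------------ |
+| Task category              | Image Classification                    |                                                                                |
+| Model                      | vit                                     |                                                                                |
+| Dataset                    | Imagenet2012 1K                         |                                                                                |
+| Data precision             | precision, see "Performance metrics"    | One of fp32/amp/fp16/tf32                                                      |
+| Hyperparameter changes     | fix_hp, see "Performance metrics"       | Special hyperparameters needed to saturate the hardware when measuring throughput |
+| Hardware short name        | MXC500                                  |                                                                                |
+| Device memory usage        | mem, see "Performance metrics"          | Usually called "GPU memory"; unit is GiB                                       |
+| End-to-end time            | e2e_time, see "Performance metrics"     | Total time plus Perf initialization and similar overhead                       |
+| Overall throughput         | p_whole, see "Performance metrics"      | Actual number of training samples divided by total time (performance_whole)    |
+| Training throughput        | p_train, see "Performance metrics"      | Excludes the evaluation time at the end of each epoch                          |
+| **Compute throughput**     | **p_core, see "Performance metrics"**   | Excludes data I/O time (p3>p2>p1)                                              |
+| Training result            | final_acc1, see "Performance metrics"   | Validation accuracy                                                            |
+| Additional modifications   | None                                    |                                                                                |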
+
+* Performance metrics
+
+| Configuration                        | precision | fix_hp | e2e_time | p_whole | p_train | p_core | final_acc1 | mem       |
+| ------------------------------------ | --------- | ------ | -------- | ------- | ------- | ------ | ---------- | --------- |
+| C500, single node, 8 cards (1x8)     | fp32      | bs=256 | /        |         |         |        | 79.822     | 33.3/64.0 |
+| C500, single node, 1 card (1x1)      | fp32      | bs=256 | /        |         |         |        | /          | 30.3/64.0 |
+| C500, two nodes, 8 cards each (2x8)  | fp32      | bs=256 | /        |         |         |        | /          | 33.1/64.0 |
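
Note on the throughput columns: the README defines `p_whole`, `p_train`, and `p_core` only in words. The following is a minimal sketch of how the three could relate, assuming each one divides the same sample count by a progressively smaller time window and that `p_core` builds on the `p_train` window. The function name, its parameters, and the numbers are hypothetical and are not measurements from this commit.

```python
def throughputs(samples, total_s, eval_s, data_io_s):
    """Return (p_whole, p_train, p_core) in samples per second.

    Hypothetical helper: the timing breakdown is an assumption, not
    taken from the benchmark's own code.
    """
    train_s = total_s - eval_s      # drop the per-epoch evaluation time
    core_s = train_s - data_io_s    # additionally drop the data I/O time
    return samples / total_s, samples / train_s, samples / core_s


# Illustrative numbers only.
p_whole, p_train, p_core = throughputs(
    samples=1_000_000, total_s=1000.0, eval_s=200.0, data_io_s=300.0
)
print(p_whole, p_train, p_core)
```

Because each window is a subset of the previous one, this sketch always yields `p_core > p_train > p_whole` when the excluded times are positive, which is presumably what the table's `p3>p2>p1` note refers to.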
diff --git a/training/metax/vit-pytorch/config/config_C500x1x1.py b/training/metax/vit-pytorch/config/config_C500x1x1.py
@@ -0,0 +1,6 @@
+from config_common import *
+
+train_batch_size = 256
+eval_batch_size = 512
+gradient_accumulation_steps = 2
+epochs = 2
diff --git a/training/metax/vit-pytorch/config/config_C500x1x8.py b/training/metax/vit-pytorch/config/config_C500x1x8.py
@@ -0,0 +1,5 @@
+from config_common import *
+
+train_batch_size = 256
+eval_batch_size = 512
+gradient_accumulation_steps = 2
diff --git a/training/metax/vit-pytorch/config/config_C500x2x8.py b/training/metax/vit-pytorch/config/config_C500x2x8.py
@@ -0,0 +1,6 @@
+from config_common import *
+
+train_batch_size = 256
+eval_batch_size = 512
+gradient_accumulation_steps = 2
+epochs = 28
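
Note on the three per-topology configs: they share `train_batch_size = 256` and `gradient_accumulation_steps = 2` and differ only in `epochs`. The following is a rough sketch of the number of samples per optimizer step that each topology would imply, assuming `train_batch_size` is a per-device value and that gradients are accumulated before every optimizer step. Neither assumption is confirmed by this diff, and the helper name is hypothetical.

```python
def effective_batch_size(per_device_batch, accumulation_steps, nodes, devices_per_node):
    """Samples contributing to one optimizer step (hypothetical helper)."""
    return per_device_batch * accumulation_steps * nodes * devices_per_node


# (nodes, devices per node) for each config file in this commit.
topologies = {"C500x1x1": (1, 1), "C500x1x8": (1, 8), "C500x2x8": (2, 8)}

sizes = {
    name: effective_batch_size(256, 2, nodes, devices)
    for name, (nodes, devices) in topologies.items()
}
for name, size in sizes.items():
    print(name, size)
```

If these assumptions hold, the learning rate in `config_common.py` is applied to quite different effective batch sizes across the three topologies.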
diff --git a/training/metax/vit-pytorch/config/config_common.py b/training/metax/vit-pytorch/config/config_common.py
@@ -0,0 +1,18 @@
+vendor = "metax"
+dist_backend = "nccl"
+
+epochs = 300
+opt = "adamw"
+lr = 0.003
+weight_decay = 0.3
+lr_scheduler = "cosineannealinglr"
+lr_warmup_method = "linear" 
+lr_warmup_epochs = 30
+lr_warmup_decay = 0.033 
+amp = False
+label_smoothing = 0.11
+mixup_alpha = 0.2
+auto_augment = "ra"
+clip_grad_norm = 1
+ra_sampler = True
+cutmix_alpha = 1.0
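
Note on the schedule settings: `lr_warmup_method = "linear"`, `lr_warmup_epochs = 30`, `lr_warmup_decay = 0.033`, and `lr_scheduler = "cosineannealinglr"` suggest a linear warmup followed by cosine annealing. The sketch below computes the per-epoch learning rate in closed form, assuming the warmup scales the base rate from `lr_warmup_decay` up to 1.0 (as `torch.optim.lr_scheduler.LinearLR` does with `start_factor`) and that the cosine phase spans the remaining epochs with a minimum of 0. The diff does not show how the benchmark actually wires these values together, so `lr_at` and `LR_MIN` are hypothetical.

```python
import math

# Values copied from config_common.py in this commit.
EPOCHS = 300
LR = 0.003
LR_WARMUP_EPOCHS = 30
LR_WARMUP_DECAY = 0.033
LR_MIN = 0.0  # assumption: no minimum learning rate appears in the config


def lr_at(epoch):
    """Learning rate at the start of an epoch (hypothetical helper)."""
    if epoch < LR_WARMUP_EPOCHS:
        # Linear warmup: the scale factor grows from LR_WARMUP_DECAY to 1.0.
        factor = LR_WARMUP_DECAY + (1.0 - LR_WARMUP_DECAY) * epoch / LR_WARMUP_EPOCHS
        return LR * factor
    # Cosine annealing over the epochs that remain after warmup.
    progress = (epoch - LR_WARMUP_EPOCHS) / (EPOCHS - LR_WARMUP_EPOCHS)
    return LR_MIN + (LR - LR_MIN) * (1.0 + math.cos(math.pi * progress)) / 2.0


for e in (0, 15, 30, 165, 300):
    print(e, lr_at(e))
```

The sketch uses the common config's `epochs = 300`. The per-topology files override `epochs` with values smaller than `lr_warmup_epochs`, so under these assumptions those runs would end while still in the warmup phase.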
diff --git a/training/metax/vit-pytorch/config/environment_variables.sh b/training/metax/vit-pytorch/config/environment_variables.sh
@@ -0,0 +1,6 @@
+# =================================================
+# Export variables
+# =================================================
+
+export METAX_USE_TF32=1
+export PYTORCH_USE_FLASHATTN=1
diff --git a/training/metax/vit-pytorch/config/requirements.txt b/training/metax/vit-pytorch/config/requirements.txt
@@ -0,0 +1,6 @@
+http://repo.metax-tech.com/r/pypi/simple/torch-2.0.0+gite544b36-cp38-cp38-linux_x86_64.whl
+numpy
+tqdm
+schedule
+timm==0.4.12
+pyyaml
diff --git a/training/metax/vit-pytorch/extern/.gitkeep b/training/metax/vit-pytorch/extern/.gitkeep