[kunlunxin] fix Vit and add configs (#362)
* init

* add efficientnet

* modify config

* modify config

* modify config

* add efficientnet

* modify config

* add efficientnet

* bug fix

* add efficientnet

* add efficientnet

* fix code style

* fix code style

* fix code style

* Revert "fix code style"

This reverts commit ae86109.

* fix code style

* fix code style

* fix code style

* fix code style

* fix code style

* bug fix

* add kunlunxin readme

* fix mobilenetv2 on kunlunxin

* add mobilenet config_R300x2x8.py

* fix mobilenetv2 on kunlunxin

* fix vit on kunlunxin

* add vit 1x8 on kunlunxin

* add vit 2x8 1x1 on kunlunxin

* add vit 2x8 1x1 on kunlunxin

* add vit 2x8 1x1 on kunlunxin

* fix code style

* fix code style

---------

Co-authored-by: Feilei Du <[email protected]>
ScoThunder and Feilei Du authored Dec 18, 2023
1 parent 1e54399 commit c7fb7a1
Showing 7 changed files with 45 additions and 7 deletions.
32 changes: 25 additions & 7 deletions training/kunlunxin/vit-pytorch/README.md
@@ -18,15 +18,33 @@


### Run results
| Training resources | Config file | Run time (s) | Target accuracy | Converged accuracy | Steps | Performance (samples/s) |
| -------- | --------------- | ----------- | -------- | -------- | ------- | ---------------- |
| 1 node, 1 card | config_R300x1x1 | / | | / | | |
| 1 node, 8 cards | config_R300x1x8 | | 79.982 | 66.166 | 181380 | |
| 2 nodes, 8 cards | config_R300x2x8 | / | | / | | |
* General metrics

### Convergence curve
![acc](acc.png)
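| Metric | Value | Notes |
| -------------- | --------------------------------------------- | ------------------------------------------- |
| Task type | Image Classification && Semantic Segmentation | |
| Model | Vision Transformer | |
| Dataset | Imagenet2012 1K | |
| Data precision | precision, see "Performance metrics" | One of fp32/amp/fp16 |
| Hyperparameter changes | fix_hp, see "Performance metrics" | Special hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware short name | R300 | |
| Device memory usage | mem, see "Performance metrics" | Commonly called "video memory"; unit is GiB |
| End-to-end time | e2e_time, see "Performance metrics" | Total time plus Perf initialization and similar overhead |
| Overall throughput | p_whole, see "Performance metrics" | Number of images actually trained divided by total time (performance_whole) |
| Training throughput | p_train, see "Performance metrics" | Excludes the evaluation time at the end of each epoch |
| **Compute throughput** | **p_core, see "Performance metrics"** | Excludes data I/O time (p3>p2>p1) |
| Training result | acc, see "Performance metrics" | Top-1 classification accuracy (acc1) |
| Additional modifications | | |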



* Performance metrics

| Configuration | precision | fix_hp | e2e_time | p_whole | p_train | p_core | acc | mem |
| ------------------- | --------- | -------------- | -------- | ------- | ------- | ------ | ------ | --------- |
| R300, 1 node, 1 card (1x1) | fp32 | / | / | | | | / | 23.4/32.0 |
| R300, 1 node, 8 cards (1x8) | fp32 | bs=128,lr=0.0015 | | | | | 79.30% | 24.6/32.0 |
| R300, 2 nodes, 8 cards (2x8) | fp32 | bs=128,lr=0.003 | / | | | | / | 24.0/32.0 |
### License

Apache 2.0 license.
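
The three throughput metrics defined in the README table above (p_whole, p_train, p_core) differ only in which time is excluded from the denominator. The sketch below is one possible reading of those definitions, not the benchmark's actual measurement code. The `RunTimings` class, its field names, and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunTimings:
    """Hypothetical timing breakdown for one training run (seconds)."""
    samples_trained: int   # images actually consumed by training
    total_time: float      # wall-clock time of the whole run
    eval_time: float       # time spent in end-of-epoch evaluation
    data_io_time: float    # time spent on data I/O during training


def throughputs(t: RunTimings) -> dict:
    """Compute the three throughputs in samples per second."""
    # p_train removes evaluation time from the denominator.
    train_time = t.total_time - t.eval_time
    # p_core additionally removes data I/O time.
    core_time = train_time - t.data_io_time
    return {
        "p_whole": t.samples_trained / t.total_time,
        "p_train": t.samples_trained / train_time,
        "p_core": t.samples_trained / core_time,
    }


# Made-up numbers, only to show the ordering of the three metrics.
example = RunTimings(
    samples_trained=1_000_000,
    total_time=5000.0,
    eval_time=500.0,
    data_io_time=500.0,
)
result = throughputs(example)
print(result)
```

Because each metric excludes strictly more time than the previous one, p_core is at least p_train, which is at least p_whole. This is presumably the ordering the README's "(p3>p2>p1)" note refers to.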
6 changes: 6 additions & 0 deletions training/kunlunxin/vit-pytorch/config/config_R300x1x1.py
@@ -0,0 +1,6 @@
from config_common import *

train_batch_size = 128
eval_batch_size = 512
gradient_accumulation_steps = 4
# epochs = 1
1 change: 1 addition & 0 deletions training/kunlunxin/vit-pytorch/config/config_R300x1x8.py
@@ -2,4 +2,5 @@

train_batch_size = 128
eval_batch_size = 512
lr = 0.003 * 0.5
gradient_accumulation_steps = 4
6 changes: 6 additions & 0 deletions training/kunlunxin/vit-pytorch/config/config_R300x2x8.py
@@ -0,0 +1,6 @@
from config_common import *

train_batch_size = 128
eval_batch_size = 512
gradient_accumulation_steps = 2
# epochs = 8
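
A rough way to compare the three configs above is their effective global batch size. The sketch below assumes that `train_batch_size` is a per-device value and that every card runs one data-parallel worker. Neither assumption is stated in this commit, and `effective_batch_size` is a hypothetical helper, not part of the benchmark code.

```python
def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         nodes: int,
                         cards_per_node: int) -> int:
    """Samples contributing to one optimizer step under plain data parallelism."""
    return per_device_batch * accumulation_steps * nodes * cards_per_node


# (train_batch_size, gradient_accumulation_steps, nodes, cards per node)
# taken from the three config files in this commit.
configs = {
    "config_R300x1x1": (128, 4, 1, 1),
    "config_R300x1x8": (128, 4, 1, 8),
    "config_R300x2x8": (128, 2, 2, 8),
}

sizes = {name: effective_batch_size(*args) for name, args in configs.items()}
for name, size in sizes.items():
    print(f"{name}: {size}")
```

Under these assumptions the 1x8 and 2x8 configs both come to 4096 samples per optimizer step, since the 2x8 config halves `gradient_accumulation_steps` while doubling the number of cards. The 1x1 config comes to 512. The `lr = 0.003 * 0.5` line in config_R300x1x8.py evaluates to 0.0015, matching the `lr=0.0015` entry in the README's 1x8 row.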
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
export XACC=1
export BKCL_PCIE_RING=1
export BKCL_TIMEOUT=1800
export XMLIR_D_XPU_L3_SIZE=66060288
2 changes: 2 additions & 0 deletions training/kunlunxin/vit-pytorch/config/requirements.txt
@@ -0,0 +1,2 @@
https://download.pytorch.org/whl/cpu/torchvision-0.13.1%2Bcpu-cp38-cp38-linux_x86_64.whl
tabulate
1 change: 1 addition & 0 deletions training/run_benchmarks/config/test_conf.py
@@ -119,6 +119,7 @@
# "transformer_xl:pytorch:R300:1:8:1": "/raid/dataset/transformer_xl/",
# "glm:pytorch:R300:1:8:1": "/raid/home_datasets_ckpt/glm/train/",
# "mobilenetv2:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "vit:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
# "bert:pytorch:R300:1:8:1": "/raid/dataset/bert_large/train",
# "longformer:pytorch:R300:1:8:1": "/raid/dataset/longformer_train",
# "distilbert:pytorch:R300:1:8:1": "/raid/dataset/distilbert/",
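
The case keys in test_conf.py above are colon-separated strings such as `vit:pytorch:R300:1:8:1`. The sketch below splits such a key into named fields. The field names (`model`, `framework`, `hardware`, `nnodes`, `nproc`, `repeat`) are an assumed reading of the six positions, and `parse_case_key` is a hypothetical helper, not a function from the repository.

```python
from typing import NamedTuple


class CaseKey(NamedTuple):
    """Assumed meaning of the six colon-separated fields."""
    model: str
    framework: str
    hardware: str
    nnodes: int
    nproc: int
    repeat: int


def parse_case_key(key: str) -> CaseKey:
    """Split a case key into typed fields, rejecting malformed keys."""
    parts = key.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}: {key!r}")
    model, framework, hardware, nnodes, nproc, repeat = parts
    return CaseKey(model, framework, hardware, int(nnodes), int(nproc), int(repeat))


case = parse_case_key("vit:pytorch:R300:1:8:1")
print(case)
```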