Multiple nodes training #114

Open · wants to merge 7 commits into base: develop
57 changes: 57 additions & 0 deletions image_classification/Multi_Node_Training/README.md
@@ -0,0 +1,57 @@
# Multiple Node Training
English | [简体中文](./README_cn.md)

PaddleViT also supports multi-node distributed training in collective mode.

Here we provide a simple tutorial on turning the multi-GPU training scripts
of any PaddleViT model into multi-node training scripts.

This folder takes the ViT model as an example.

## Tutorial
For any model in PaddleViT, you can enable multi-node training by modifying
`main_multi_gpu.py`:
1. Add the argument `ips='[host ips]'` to `dist.spawn()`.
2. Run the training script on every host.
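The change can be sketched as follows (a hypothetical excerpt: the worker function `main` and `config.NGPUS` follow the conventions used elsewhere in this repository, and `ips` is passed to `paddle.distributed.spawn` as an option):

```python
import paddle.distributed as dist

# Single node (original form):
#   dist.spawn(main, args=(config,), nprocs=config.NGPUS)

# Multi-node: list every host's IP in `ips`, then launch the same
# script on each host; each host spawns one process per local GPU.
dist.spawn(main, args=(config,), nprocs=config.NGPUS,
           ips='192.168.0.16,192.168.0.17')
```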

## Training example: ViT
Suppose you have 2 hosts (each denoted as a node) with 4 GPUs per machine.
The nodes' IP addresses are `192.168.0.16` and `192.168.0.17`.

1. Modify the following lines of `run_train_multi_node.sh`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 # ids of the GPUs to use on each host

-ips='192.168.0.16,192.168.0.17' # host IPs, separated by commas
```
2. Run the training script on every host:
```shell
sh run_train_multi_node.sh
```

## Multi-node training on one host
It is possible to try multi-node training even when you have only one machine.

1. Install Docker and Paddle. For details, refer to
[the installation guide](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).

2. Create a network between docker containers.
```shell
docker network create -d bridge paddle_net
```
3. Create multiple containers as virtual hosts/nodes. Suppose we create 2 containers
with 2 GPUs assigned to each.
```shell
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
docker run --name paddle1 -it -d --gpus "device=2,3" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
> Note:
> 1. You can assign the same GPU device to multiple containers, but this may cause OOM errors since multiple models will run on the same GPU.
> 2. Use `-v` to mount the PaddleViT repository into each container.
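Combining the two notes above, a fuller `docker run` command might look like this (the host path `/path/to/PaddleViT` is a placeholder); `docker inspect` then reports the container's address on the bridge network:

```shell
# Mount the PaddleViT repository into the container with -v (adjust the host path)
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    -v /path/to/PaddleViT:/workspace/PaddleViT \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash

# Check the container's IP address on paddle_net
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' paddle0
```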

4. Modify `run_train_multi_node.sh` as described above and run the training script in every container.

> Note: You can use the `ping` or `ip addr` commands to check the containers' IP addresses.

45 changes: 45 additions & 0 deletions image_classification/Multi_Node_Training/README_cn.md
@@ -0,0 +1,45 @@
# Multi-Node Distributed Training

Simplified Chinese | [English](./README.md)

PaddleViT also supports collective multi-node distributed training.

## Tutorial
For every model, you can enable multi-node training by modifying
`main_multi_gpu.py` in the corresponding model folder:
1. Add `ips='[host ips]'` to `dist.spawn()`.
2. Run the code on every host.

## Example: ViT
This folder provides the code and shell scripts for distributed training of the ViT model.
Suppose you have 2 hosts with 4 GPUs each. The host IP addresses are `192.168.0.16` and `192.168.0.17`.

1. Modify the arguments in the shell script `run_train_multi_node.sh`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 # ids of the GPUs to use on each host

-ips='192.168.0.16,192.168.0.17' # host IPs, separated by commas
```
2. Run the script on every host:
```shell
sh run_train_multi_node.sh
```

## Distributed training on a single machine
If you only have one machine, you can still run distributed training via Docker.
1. Install Docker and Paddle. The Docker images provided by PaddlePaddle can be downloaded [here](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).
2. Create a network between the Docker containers:
```shell
docker network create -d bridge paddle_net
```
3. Create multiple Docker containers as virtual hosts. Suppose we create 2 containers and assign 2 GPUs to each:
```shell
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
docker run --name paddle1 -it -d --gpus "device=2,3" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
> Note:
> 1. The same GPU can be assigned to multiple containers, but this may cause OOM errors since multiple models will then run on that GPU at the same time.
> 2. Use `-v` to mount the directory containing PaddleViT into the containers.

153 changes: 153 additions & 0 deletions image_classification/Multi_Node_Training/config.py
@@ -0,0 +1,153 @@
# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Configuration

Configuration for data, model archtecture, and training, etc.
Config can be set by .yaml file or by argparser(limited usage)


"""
import os
from yacs.config import CfgNode as CN
import yaml

_C = CN()
_C.BASE = ['']

# data settings
_C.DATA = CN()
_C.DATA.BATCH_SIZE = 256 # train batch_size for single GPU
_C.DATA.BATCH_SIZE_EVAL = 8 # val batch_size for single GPU
_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset
_C.DATA.DATASET = 'imagenet2012' # dataset name
_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune
_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode
_C.DATA.NUM_WORKERS = 2 # number of data loading threads

# model settings
_C.MODEL = CN()
_C.MODEL.TYPE = 'ViT'
_C.MODEL.NAME = 'ViT'
_C.MODEL.RESUME = None
_C.MODEL.PRETRAINED = None
_C.MODEL.NUM_CLASSES = 1000
_C.MODEL.DROPOUT = 0.1
_C.MODEL.DROPPATH = 0.1
_C.MODEL.ATTENTION_DROPOUT = 0.1

# transformer settings
_C.MODEL.TRANS = CN()
_C.MODEL.TRANS.PATCH_SIZE = 32
_C.MODEL.TRANS.EMBED_DIM = 768
_C.MODEL.TRANS.MLP_RATIO= 4.0
_C.MODEL.TRANS.NUM_HEADS = 12
_C.MODEL.TRANS.DEPTH = 12
_C.MODEL.TRANS.QKV_BIAS = True

# training settings
_C.TRAIN = CN()
_C.TRAIN.LAST_EPOCH = 0
_C.TRAIN.NUM_EPOCHS = 300
_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size
_C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune
_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune
_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0
_C.TRAIN.END_LR = 5e-4
_C.TRAIN.GRAD_CLIP = 1.0
_C.TRAIN.ACCUM_ITER = 2 #1

_C.TRAIN.LR_SCHEDULER = CN()
_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine'
_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler
_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler
_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler

_C.TRAIN.OPTIMIZER = CN()
_C.TRAIN.OPTIMIZER.NAME = 'AdamW'
_C.TRAIN.OPTIMIZER.EPS = 1e-8
_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW
_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9

# misc
_C.SAVE = "./output"
_C.TAG = "default"
_C.SAVE_FREQ = 10 # frequency to save checkpoint
_C.REPORT_FREQ = 100 # frequency to log info
_C.VALIDATE_FREQ = 100 # frequency to run validation
_C.SEED = 0
_C.EVAL = False # run evaluation only
_C.AMP = False # mixed precision training
_C.LOCAL_RANK = 0
_C.NGPUS = -1


def _update_config_from_file(config, cfg_file):
config.defrost()
with open(cfg_file, 'r') as infile:
yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader)
for cfg in yaml_cfg.setdefault('BASE', ['']):
if cfg:
_update_config_from_file(
config, os.path.join(os.path.dirname(cfg_file), cfg)
)
print('merging config from {}'.format(cfg_file))
config.merge_from_file(cfg_file)
config.freeze()

def update_config(config, args):
"""Update config by ArgumentParser
Args:
args: ArgumentParser contains options
Return:
config: updated config
"""
if args.cfg:
_update_config_from_file(config, args.cfg)
config.defrost()
if args.dataset:
config.DATA.DATASET = args.dataset
if args.batch_size:
config.DATA.BATCH_SIZE = args.batch_size
if args.image_size:
config.DATA.IMAGE_SIZE = args.image_size
if args.data_path:
config.DATA.DATA_PATH = args.data_path
if args.ngpus:
config.NGPUS = args.ngpus
if args.eval:
config.EVAL = True
config.DATA.BATCH_SIZE_EVAL = args.batch_size
if args.pretrained:
config.MODEL.PRETRAINED = args.pretrained
if args.resume:
config.MODEL.RESUME = args.resume
if args.last_epoch:
config.TRAIN.LAST_EPOCH = args.last_epoch
if args.amp: # only during training
if config.EVAL is True:
config.AMP = False
else:
config.AMP = True

#config.freeze()
return config


def get_config(cfg_file=None):
"""Return a clone of config or load from yaml file"""
config = _C.clone()
if cfg_file:
_update_config_from_file(config, cfg_file)
return config
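The `BASE` mechanism in `_update_config_from_file` merges parent files first, so the child file's values win. A minimal dict-based sketch of that merge order (plain Python, no yacs; `merge`, `load_with_base`, and the in-memory `files` mapping are hypothetical stand-ins):

```python
import copy

def merge(dst, src):
    """Recursively merge src into dst; values from src win (mirrors the merge order)."""
    for key, val in src.items():
        if isinstance(val, dict) and isinstance(dst.get(key), dict):
            merge(dst[key], val)
        else:
            dst[key] = val
    return dst

def load_with_base(files, name):
    """Resolve a config and its BASE parents, parents first (like _update_config_from_file)."""
    cfg = copy.deepcopy(files[name])
    base_files = cfg.pop("BASE", [])
    result = {}
    for parent in base_files:
        merge(result, load_with_base(files, parent))
    return merge(result, cfg)

# Hypothetical in-memory "files": a base config and a finetune config deriving from it.
files = {
    "base.yaml": {"DATA": {"IMAGE_SIZE": 224, "CROP_PCT": 0.875}},
    "finetune.yaml": {"BASE": ["base.yaml"], "DATA": {"IMAGE_SIZE": 384}},
}
cfg = load_with_base(files, "finetune.yaml")
print(cfg["DATA"])  # -> {'IMAGE_SIZE': 384, 'CROP_PCT': 0.875}
```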
@@ -0,0 +1,21 @@
DATA:
IMAGE_SIZE: 224
CROP_PCT: 0.875
MODEL:
TYPE: ViT
NAME: vit_base_patch16_224
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true
TRAIN:
NUM_EPOCHS: 300
WARMUP_EPOCHS: 3
WEIGHT_DECAY: 0.3
BASE_LR: 0.003
WARMUP_START_LR: 1e-6
END_LR: 5e-4
ACCUM_ITER: 2
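Following the `BASE` mechanism in `config.py`, a finetuning config could inherit from a file like this one and override only what changes (the parent filename below is hypothetical):

```yaml
BASE: ['vit_base_patch16_224.yaml']  # hypothetical parent filename
DATA:
  IMAGE_SIZE: 384   # 384 for finetune, per config.py comments
  CROP_PCT: 1.0
TRAIN:
  WEIGHT_DECAY: 0.0 # 0.0 for finetune, per config.py comments
```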
@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_base_patch16_384
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true

@@ -0,0 +1,21 @@
DATA:
IMAGE_SIZE: 224
CROP_PCT: 0.875
MODEL:
TYPE: ViT
NAME: vit_base_patch32_224
TRANS:
PATCH_SIZE: 32
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true
TRAIN:
NUM_EPOCHS: 300
WARMUP_EPOCHS: 3
WEIGHT_DECAY: 0.3
BASE_LR: 0.003
WARMUP_START_LR: 1e-6
END_LR: 5e-4
ACCUM_ITER: 2
@@ -0,0 +1,21 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_base_patch32_384
TRANS:
PATCH_SIZE: 32
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true
TRAIN:
NUM_EPOCHS: 300
WARMUP_EPOCHS: 3
WEIGHT_DECAY: 0.3
BASE_LR: 0.003
WARMUP_START_LR: 1e-6
END_LR: 5e-4
ACCUM_ITER: 2
@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 224
CROP_PCT: 0.875
MODEL:
TYPE: ViT
NAME: vit_large_patch16_224
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 1024
MLP_RATIO: 4.0
DEPTH: 24
NUM_HEADS: 16
QKV_BIAS: true

@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_large_patch16_384
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 1024
MLP_RATIO: 4.0
DEPTH: 24
NUM_HEADS: 16
QKV_BIAS: true

@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_large_patch32_384
TRANS:
PATCH_SIZE: 32
EMBED_DIM: 1024
MLP_RATIO: 4.0
DEPTH: 24
NUM_HEADS: 16
QKV_BIAS: true
