Multiple nodes training #114

Open · wants to merge 7 commits into base: develop
57 changes: 57 additions & 0 deletions image_classification/Multi_Node_Training/README.md
@@ -0,0 +1,57 @@
# Multiple Node Training
English | [简体中文](./README_cn.md)

PaddleViT also supports multi-node distributed training in collective mode.

Here we provide a simple tutorial on turning the multi-GPU training scripts
of any PaddleViT model into multi-node training scripts.

This folder takes the ViT model as an example.

## Tutorial
For any model in PaddleViT, you can enable multi-node training by modifying
`main_multi_gpu.py`:
1. Add the argument `ips='[host ips]'` to `dist.spawn()`.
2. Run the training script on every host.
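The change can be sketched as follows (a hypothetical excerpt: the worker function `main` and `config.NGPUS` follow the conventions used elsewhere in this repository, and `ips` is passed to `paddle.distributed.spawn` as an option):

```python
import paddle.distributed as dist

# Single node (original form):
#   dist.spawn(main, args=(config,), nprocs=config.NGPUS)

# Multi-node: list every host's IP in `ips`, then launch the same
# script on each host; each host spawns one process per local GPU.
dist.spawn(main, args=(config,), nprocs=config.NGPUS,
           ips='192.168.0.16,192.168.0.17')
```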

## Training example: ViT
Suppose you have 2 hosts (each denoted as a node) with 4 GPUs per machine.
The nodes' IP addresses are `192.168.0.16` and `192.168.0.17`.

1. Modify the following lines of `run_train_multi_node.sh`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 # ids of the GPUs to use on each host

-ips='192.168.0.16,192.168.0.17' # host IPs, separated by commas
```
2. Run the training script on every host:
```shell
sh run_train_multi_node.sh
```

## Multi-node training on one host
It is possible to try multi-node training even when you have only one machine.

1. Install Docker and Paddle. For details, refer to
[the installation guide](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).

2. Create a network between docker containers.
```shell
docker network create -d bridge paddle_net
```
3. Create multiple containers as virtual hosts/nodes. Suppose we create 2 containers
with 2 GPUs assigned to each.
```shell
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
docker run --name paddle1 -it -d --gpus "device=2,3" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
> Note:
> 1. You can assign the same GPU device to multiple containers, but this may cause OOM errors since multiple models will run on the same GPU.
> 2. Use `-v` to mount the PaddleViT repository into each container.
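Combining the two notes above, a fuller `docker run` command might look like this (the host path `/path/to/PaddleViT` is a placeholder); `docker inspect` then reports the container's address on the bridge network:

```shell
# Mount the PaddleViT repository into the container with -v (adjust the host path)
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    -v /path/to/PaddleViT:/workspace/PaddleViT \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash

# Check the container's IP address on paddle_net
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' paddle0
```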

4. Modify `run_train_multi_node.sh` as described above and run the training script in every container.

> Note: You can use the `ping` or `ip addr` commands to check the containers' IP addresses.

45 changes: 45 additions & 0 deletions image_classification/Multi_Node_Training/README_cn.md
@@ -0,0 +1,45 @@
# Multi-Node Distributed Training

Simplified Chinese | [English](./README.md)

PaddleViT also supports collective multi-node distributed training.

## Tutorial
For every model, you can enable multi-node training by modifying
`main_multi_gpu.py` in the corresponding model folder:
1. Add `ips='[host ips]'` to `dist.spawn()`.
2. Run the code on every host.

## Example: ViT
This folder provides the code and shell scripts for distributed training of the ViT model.
Suppose you have 2 hosts with 4 GPUs each. The host IP addresses are `192.168.0.16` and `192.168.0.17`.

1. Modify the arguments in the shell script `run_train_multi_node.sh`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 # ids of the GPUs to use on each host

-ips='192.168.0.16,192.168.0.17' # host IPs, separated by commas
```
2. Run the script on every host:
```shell
sh run_train_multi_node.sh
```

## Distributed training on a single machine
If you only have one machine, you can still run distributed training via Docker.
1. Install Docker and Paddle. The Docker images provided by PaddlePaddle can be downloaded [here](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).
2. Create a network between the Docker containers:
```shell
docker network create -d bridge paddle_net
```
3. Create multiple Docker containers as virtual hosts. Suppose we create 2 containers and assign 2 GPUs to each:
```shell
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
docker run --name paddle1 -it -d --gpus "device=2,3" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
> Note:
> 1. The same GPU can be assigned to multiple containers, but this may cause OOM errors since multiple models will then run on that GPU at the same time.
> 2. Use `-v` to mount the directory containing PaddleViT into the containers.

153 changes: 153 additions & 0 deletions image_classification/Multi_Node_Training/config.py
@@ -0,0 +1,153 @@
# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Configuration

Configuration for data, model archtecture, and training, etc.
Config can be set by .yaml file or by argparser(limited usage)


"""
import os
from yacs.config import CfgNode as CN
import yaml

_C = CN()
_C.BASE = ['']

# data settings
_C.DATA = CN()
_C.DATA.BATCH_SIZE = 256 # train batch_size for single GPU
_C.DATA.BATCH_SIZE_EVAL = 8 # val batch_size for single GPU
_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset
_C.DATA.DATASET = 'imagenet2012' # dataset name
_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune
_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode
_C.DATA.NUM_WORKERS = 2 # number of data loading threads

# model settings
_C.MODEL = CN()
_C.MODEL.TYPE = 'ViT'
_C.MODEL.NAME = 'ViT'
_C.MODEL.RESUME = None
_C.MODEL.PRETRAINED = None
_C.MODEL.NUM_CLASSES = 1000
_C.MODEL.DROPOUT = 0.1
_C.MODEL.DROPPATH = 0.1
_C.MODEL.ATTENTION_DROPOUT = 0.1

# transformer settings
_C.MODEL.TRANS = CN()
_C.MODEL.TRANS.PATCH_SIZE = 32
_C.MODEL.TRANS.EMBED_DIM = 768
_C.MODEL.TRANS.MLP_RATIO= 4.0
_C.MODEL.TRANS.NUM_HEADS = 12
_C.MODEL.TRANS.DEPTH = 12
_C.MODEL.TRANS.QKV_BIAS = True

# training settings
_C.TRAIN = CN()
_C.TRAIN.LAST_EPOCH = 0
_C.TRAIN.NUM_EPOCHS = 300
_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size
_C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune
_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune
_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0
_C.TRAIN.END_LR = 5e-4
_C.TRAIN.GRAD_CLIP = 1.0
_C.TRAIN.ACCUM_ITER = 2 #1

_C.TRAIN.LR_SCHEDULER = CN()
_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine'
_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler
_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler
_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler

_C.TRAIN.OPTIMIZER = CN()
_C.TRAIN.OPTIMIZER.NAME = 'AdamW'
_C.TRAIN.OPTIMIZER.EPS = 1e-8
_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW
_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9

# misc
_C.SAVE = "./output"
_C.TAG = "default"
_C.SAVE_FREQ = 10 # frequency to save checkpoint
_C.REPORT_FREQ = 100 # frequency to log info
_C.VALIDATE_FREQ = 100 # frequency to run validation
_C.SEED = 0
_C.EVAL = False # run evaluation only
_C.AMP = False # mixed precision training
_C.LOCAL_RANK = 0
_C.NGPUS = -1


def _update_config_from_file(config, cfg_file):
config.defrost()
with open(cfg_file, 'r') as infile:
yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader)
for cfg in yaml_cfg.setdefault('BASE', ['']):
if cfg:
_update_config_from_file(
config, os.path.join(os.path.dirname(cfg_file), cfg)
)
print('merging config from {}'.format(cfg_file))
config.merge_from_file(cfg_file)
config.freeze()

def update_config(config, args):
"""Update config by ArgumentParser
Args:
args: ArgumentParser contains options
Return:
config: updated config
"""
if args.cfg:
_update_config_from_file(config, args.cfg)
config.defrost()
if args.dataset:
config.DATA.DATASET = args.dataset
if args.batch_size:
config.DATA.BATCH_SIZE = args.batch_size
if args.image_size:
config.DATA.IMAGE_SIZE = args.image_size
if args.data_path:
config.DATA.DATA_PATH = args.data_path
if args.ngpus:
config.NGPUS = args.ngpus
if args.eval:
config.EVAL = True
config.DATA.BATCH_SIZE_EVAL = args.batch_size
if args.pretrained:
config.MODEL.PRETRAINED = args.pretrained
if args.resume:
config.MODEL.RESUME = args.resume
if args.last_epoch:
config.TRAIN.LAST_EPOCH = args.last_epoch
if args.amp: # only during training
if config.EVAL is True:
config.AMP = False
else:
config.AMP = True

#config.freeze()
return config


def get_config(cfg_file=None):
"""Return a clone of config or load from yaml file"""
config = _C.clone()
if cfg_file:
_update_config_from_file(config, cfg_file)
return config
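The `BASE` mechanism in `_update_config_from_file` merges parent files first, so the child file's values win. A minimal dict-based sketch of that merge order (plain Python, no yacs; `merge`, `load_with_base`, and the in-memory `files` mapping are hypothetical stand-ins):

```python
import copy

def merge(dst, src):
    """Recursively merge src into dst; values from src win (mirrors the merge order)."""
    for key, val in src.items():
        if isinstance(val, dict) and isinstance(dst.get(key), dict):
            merge(dst[key], val)
        else:
            dst[key] = val
    return dst

def load_with_base(files, name):
    """Resolve a config and its BASE parents, parents first (like _update_config_from_file)."""
    cfg = copy.deepcopy(files[name])
    base_files = cfg.pop("BASE", [])
    result = {}
    for parent in base_files:
        merge(result, load_with_base(files, parent))
    return merge(result, cfg)

# Hypothetical in-memory "files": a base config and a finetune config deriving from it.
files = {
    "base.yaml": {"DATA": {"IMAGE_SIZE": 224, "CROP_PCT": 0.875}},
    "finetune.yaml": {"BASE": ["base.yaml"], "DATA": {"IMAGE_SIZE": 384}},
}
cfg = load_with_base(files, "finetune.yaml")
print(cfg["DATA"])  # -> {'IMAGE_SIZE': 384, 'CROP_PCT': 0.875}
```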
@@ -0,0 +1,21 @@
DATA:
IMAGE_SIZE: 224
CROP_PCT: 0.875
MODEL:
TYPE: ViT
NAME: vit_base_patch16_224
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true
TRAIN:
NUM_EPOCHS: 300
WARMUP_EPOCHS: 3
WEIGHT_DECAY: 0.3
BASE_LR: 0.003
WARMUP_START_LR: 1e-6
END_LR: 5e-4
ACCUM_ITER: 2
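Following the `BASE` mechanism in `config.py`, a finetuning config could inherit from a file like this one and override only what changes (the parent filename below is hypothetical):

```yaml
BASE: ['vit_base_patch16_224.yaml']  # hypothetical parent filename
DATA:
  IMAGE_SIZE: 384   # 384 for finetune, per config.py comments
  CROP_PCT: 1.0
TRAIN:
  WEIGHT_DECAY: 0.0 # 0.0 for finetune, per config.py comments
```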
@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_base_patch16_384
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true

@@ -0,0 +1,21 @@
DATA:
IMAGE_SIZE: 224
CROP_PCT: 0.875
MODEL:
TYPE: ViT
NAME: vit_base_patch32_224
TRANS:
PATCH_SIZE: 32
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true
TRAIN:
NUM_EPOCHS: 300
WARMUP_EPOCHS: 3
WEIGHT_DECAY: 0.3
BASE_LR: 0.003
WARMUP_START_LR: 1e-6
END_LR: 5e-4
ACCUM_ITER: 2
@@ -0,0 +1,21 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_base_patch32_384
TRANS:
PATCH_SIZE: 32
EMBED_DIM: 768
MLP_RATIO: 4.0
DEPTH: 12
NUM_HEADS: 12
QKV_BIAS: true
TRAIN:
NUM_EPOCHS: 300
WARMUP_EPOCHS: 3
WEIGHT_DECAY: 0.3
BASE_LR: 0.003
WARMUP_START_LR: 1e-6
END_LR: 5e-4
ACCUM_ITER: 2
@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 224
CROP_PCT: 0.875
MODEL:
TYPE: ViT
NAME: vit_large_patch16_224
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 1024
MLP_RATIO: 4.0
DEPTH: 24
NUM_HEADS: 16
QKV_BIAS: true

@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_large_patch16_384
TRANS:
PATCH_SIZE: 16
EMBED_DIM: 1024
MLP_RATIO: 4.0
DEPTH: 24
NUM_HEADS: 16
QKV_BIAS: true

@@ -0,0 +1,14 @@
DATA:
IMAGE_SIZE: 384
CROP_PCT: 1.0
MODEL:
TYPE: ViT
NAME: vit_large_patch32_384
TRANS:
PATCH_SIZE: 32
EMBED_DIM: 1024
MLP_RATIO: 4.0
DEPTH: 24
NUM_HEADS: 16
QKV_BIAS: true
