Add README
skpig committed Dec 9, 2021
1 parent 6ef1f09 commit caf5f7d
Showing 3 changed files with 101 additions and 0 deletions.
56 changes: 56 additions & 0 deletions image_classification/Multi_Node_Training/README.md
@@ -1 +1,57 @@
# Multi-Node Training
English | [简体中文](./README_cn.md)

PaddleViT also supports multi-node distributed training under collective mode.

Here we provide a simple tutorial on turning the multi-GPU training scripts
into multi-node training scripts for any model in PaddleViT.

This folder takes the ViT model as an example.

## Tutorial
For any model in PaddleViT, multi-node training can be enabled by modifying
`main_multi_gpu.py`:
1. Add the argument `ips='[host ips]'` to the `dist.spawn()` call (a minimal sketch is given after this list).
2. Run the training script on every host.
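
Below is a minimal sketch of step 1, assuming PaddlePaddle 2.x. The worker function, the process count, and the IP addresses are illustrative placeholders; in PaddleViT the actual training logic lives in each model's `main_multi_gpu.py`.
```python
# Minimal sketch: launching multi-node training with paddle.distributed.spawn.
# `train_worker`, nprocs, and the IPs below are placeholders for this example.
import paddle.distributed as dist


def train_worker():
    # Each spawned process joins the collective group and reports its rank.
    dist.init_parallel_env()
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")


if __name__ == '__main__':
    # nprocs = GPUs per host; `ips` lists every participating host.
    # The same script must be started on every host listed in `ips`.
    dist.spawn(train_worker, nprocs=4, ips='192.168.0.16,192.168.0.17')
```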

## Training example: ViT
Suppose you have 2 hosts (each counted as one node) with 4 GPUs on each machine.
The nodes' IP addresses are `192.168.0.16` and `192.168.0.17`.

1. Modify the following lines in `run_train_multi_node.sh` (a sketch of how the `-ips` flag is consumed is shown after this list):
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 # GPU ids to use on each host

-ips='192.168.0.16,192.168.0.17' # host IPs, separated by commas
```
2. Run the training script on every host:
```shell
sh run_train_multi_node.sh
```
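
For reference, here is a hedged sketch of how the `-ips` value passed by the shell script could reach `dist.spawn()` inside `main_multi_gpu.py`. The argument parsing below is illustrative only; PaddleViT's real configuration code may handle this differently.
```python
# Illustrative only: wiring a command-line -ips flag into dist.spawn().
import argparse

import paddle.distributed as dist


def train_worker():
    dist.init_parallel_env()
    # ... build the model and data loaders, then run the training loop ...


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-ips', type=str, default=None,
                        help='comma-separated IPs of all participating hosts')
    args = parser.parse_args()

    if args.ips is None:
        # Single-node: spawn one process per visible GPU.
        dist.spawn(train_worker)
    else:
        # Multi-node: every host runs this same script with the same -ips value.
        dist.spawn(train_worker, nprocs=4, ips=args.ips)
```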

## Multi-node training with one host
It is possible to try multi-node training even when you have only one machine, by using Docker containers as virtual nodes.

1. Install Docker and Paddle. For more details, please refer to the guide
[here](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).

2. Create a network between Docker containers:
```shell
docker network create -d bridge paddle_net
```
3. Create multiple containers as virtual hosts/nodes. Suppose we create 2 containers
with 2 GPUs assigned to each:
```shell
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
docker run --name paddle1 -it -d --gpus "device=2,3" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
> Note:
> 1. The same GPU device can be assigned to multiple containers, but this may cause OOM errors since multiple models will run on the same GPU.
> 2. Use `-v` to bind the PaddleViT repository into each container.

4. Modify `run_train_multi_node.sh` as described above and run the training script on every container.

> Note: Use the `ping` or `ip addr` commands to check the containers' IP addresses.
45 changes: 45 additions & 0 deletions image_classification/Multi_Node_Training/README_cn.md
@@ -0,0 +1,45 @@
# Multi-Node Multi-GPU Distributed Training

Simplified Chinese | [English](./README.md)

PaddleViT also supports multi-node, multi-GPU distributed training under collective mode.

## Tutorial
For each model, users can enable multi-node training simply by modifying the
`main_multi_gpu.py` in the corresponding model folder.
1. Add `ips='[host ips]'` to `dist.spawn()`.
2. Run the training script on every host.

## Example: ViT
This folder provides the code and shell scripts for distributed training of the ViT model.
Suppose you have 2 hosts, each with 4 GPUs. The hosts' IP addresses are `192.168.0.16` and `192.168.0.17`.

1. Modify the arguments in the shell script `run_train_multi_node.sh`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 # GPU ids to use on each host

-ips='192.168.0.16,192.168.0.17' # host IPs, separated by commas
```
2. Run the script on every host:
```shell
sh run_train_multi_node.sh
```
## Distributed training on a single machine
If you only have one machine, you can still run distributed training on it through Docker.
1. Install Docker and Paddle. The Docker images provided by PaddlePaddle can be downloaded
[here](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).
2. Create a network between Docker containers:
```shell
docker network create -d bridge paddle_net
```
3. Create multiple Docker containers as virtual hosts. Suppose we create 2 containers and assign 2 GPUs to each:
```shell
docker run --name paddle0 -it -d --gpus "device=0,1" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
docker run --name paddle1 -it -d --gpus "device=2,3" --network paddle_net \
    paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
> Note:
> 1. The same GPU can be assigned to multiple containers at once, but this may cause OOM errors since multiple models will then run on that GPU simultaneously.
> 2. Use `-v` to mount the directory containing PaddleViT.

Binary file removed image_classification/ViT/vit.png
