The goal of this repository is to enable MPI-based DDP in PyTorch. As you may know, PyTorch's built-in DDP only supports the nccl and gloo backends.
You can enable MPI-backend distributed PyTorch training with only two changes:
- add a DistributedSampler to your DataLoader
- wrap your model in DistributedDataParallel
The usage is exactly the same as torch.nn.parallel.DistributedDataParallel(); see the ImageNet example here: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L88 and the sketch below.
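A minimal, self-contained sketch of those two changes, using a toy dataset and model as placeholders. For illustration it wraps the model with the stock torch.nn.parallel.DistributedDataParallel; the MPI-backed wrapper provided by this repository is meant to drop in the same way.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Rank and world size are supplied by the MPI launcher (e.g. mpirun).
dist.init_process_group(backend='mpi')

# Toy dataset and model so the sketch is self-contained (not the repo's code).
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
model = torch.nn.Linear(10, 2)

# Change 1: add a DistributedSampler so each rank trains on its own shard.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, shuffle=False, sampler=sampler)

# Change 2: wrap the model so gradients are all-reduced during backward.
model = torch.nn.parallel.DistributedDataParallel(model)
```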
- PyTorch: build from source (v0.3.1 is recommended)
To launch training, run: bash run.sh
This repository implements strong scaling for MNIST, meaning the global batch size is fixed no matter how many nodes are used. See the Wikipedia article on strong vs. weak scaling for more background.
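To make the strong-scaling setup concrete, the per-node batch size shrinks as nodes are added while the global batch size stays fixed; the numbers below are illustrative only, not the repository's actual settings.

```python
# Illustrative numbers only (assumed values, not the repo's configuration).
global_batch_size = 256                 # fixed regardless of node count
world_size = 4                          # number of MPI ranks/nodes
per_node_batch_size = global_batch_size // world_size  # 64 samples per node per step
```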
Since this is a strong-scaling example, the gradients should be averaged after the all_reduce, which matches the behavior of torch.nn.parallel.DistributedDataParallel.
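A hedged sketch of that averaging step; the helper name average_gradients is ours and not necessarily what this repository uses. It would be called on every rank after loss.backward() and before optimizer.step().

```python
import torch.distributed as dist

def average_gradients(model):
    # Sum each gradient across all ranks, then divide by the number of ranks
    # so every node applies the same averaged gradient.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data)  # default reduction op is SUM
            param.grad.data /= world_size
```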