This tutorial assumes you are familiar with PyTorch and walks through the usage of PyTorch distributed data parallel training. I hope the description is helpful to you.
- Single Node Single GPU Card Training [snsc.py]
- Single Node Multi-GPU Cards Training (with DataParallel) [snmc_dp.py]
- Multiple Nodes Multi-GPU Cards Training (with DistributedDataParallel)
- torch.distributed.launch [mnmc_ddp_launch.py]
- torch.multiprocessing [mnmc_ddp_mp.py]
- Slurm Workload Manager [mnmc_ddp_slurm.py]
- ImageNet training example [imagenet.py]
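As a quick reference for the single-node multi-GPU case above, here is a minimal sketch of `nn.DataParallel` (the model and shapes are illustrative, not taken from `snmc_dp.py`). `DataParallel` replicates the model onto every visible GPU and splits each input batch along dimension 0; on a CPU-only machine the wrapper simply runs the underlying module as-is.

```python
import torch
import torch.nn as nn

# Toy model standing in for a real network; names are illustrative.
model = nn.Linear(10, 2)

# Wrap only when multiple GPUs are available; otherwise train as usual.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    model = model.cuda()

x = torch.randn(4, 10)
if next(model.parameters()).is_cuda:
    x = x.cuda()

out = model(x)  # batch is scattered across GPUs, outputs gathered on device 0
print(out.shape)  # torch.Size([4, 2])
```

Note that `DataParallel` is single-process and multi-threaded, which is why the rest of the tutorial prefers `DistributedDataParallel`.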
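The core `DistributedDataParallel` setup shared by the launch, multiprocessing, and Slurm variants can be sketched as a single process with `world_size=1`. In the real scripts, `RANK`, `WORLD_SIZE`, and the master address come from `torch.distributed.launch` or Slurm environment variables; the hard-coded values below are assumptions so the sketch runs standalone, and the `gloo` backend is used so it works on CPU (use `nccl` on GPU clusters).

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumed values for a standalone run; launchers normally provide these.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# DDP wraps the model; gradients are all-reduced across ranks in backward().
model = DDP(nn.Linear(10, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

out = model(torch.randn(8, 10))
out.sum().backward()  # backward triggers the gradient all-reduce
opt.step()

dist.destroy_process_group()
print(out.shape)  # torch.Size([8, 2])
```

With more than one process, each rank runs this same code with its own `rank` value, and `DistributedSampler` is typically used so each rank sees a distinct shard of the dataset.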
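For the `torch.multiprocessing` variant, the parent process spawns one worker per rank instead of relying on an external launcher. This sketch uses an assumed local address/port and the CPU-friendly `gloo` backend; it is a minimal shape of what `mnmc_ddp_mp.py`-style scripts do, not the script itself.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # Each spawned process joins the same group; addr/port are assumptions.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(nn.Linear(10, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    out = model(torch.randn(8, 10))
    out.sum().backward()  # gradients are averaged across both ranks here
    opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    # mp.spawn passes the rank as the first argument to worker.
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

On a GPU node you would instead pin each rank to one device (`torch.cuda.set_device(rank)`) and pass `device_ids=[rank]` to `DDP`.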