This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with the ImageNet dataset. The same steps can be used for any other dataset.
- Create an EKS cluster with GPU worker nodes.
- Install Kubeflow.
- Create an EFS file system and mount it on each worker node as explained in `efs-on-eks-worker-nodes.md`.
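  A quick way to confirm the mount on a worker node (a minimal check, assuming you can SSH to the node and that EFS is mounted at `/home/ec2-user/efs` as described in the next step):

  ```
  # On a worker node: verify the EFS (NFSv4) mount is present
  df -h /home/ec2-user/efs
  mount | grep nfs4
  ```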
If you work for Amazon, reach out to the authors of this document for access to the data. Otherwise, follow the instructions below.
- Download the ImageNet dataset to the `data` directory on EFS. This would typically be the `/home/ec2-user/efs/data` directory. Use the `Download Original Images (for non-commercial research/educational use only)` option.
- TensorFlow consumes the ImageNet data in a specific format. You can preprocess the data by downloading and modifying the script:

  ```
  curl -O https://raw.githubusercontent.com/aws-samples/deep-learning-models/master/utils/tensorflow/preprocess_imagenet.sh
  chmod +x preprocess_imagenet.sh
  ```

  The following values need to be changed in the script:

  - `[your imagenet account]`
  - `[your imagenet access key]`
  - `[PATH TO TFRECORD TRAINING DATASET]`
  - `[PATH TO RESIZED TFRECORD TRAINING DATASET]`
  - `[PATH TO TFRECORD VALIDATION DATASET]`
  - `[PATH TO RESIZED TFRECORD VALIDATION DATASET]`

  Execute the script:

  ```
  ./preprocess_imagenet.sh
  ```
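  A simple sanity check after preprocessing (the directory below is only an example; use whatever you set for `[PATH TO RESIZED TFRECORD TRAINING DATASET]`):

  ```
  # Count the TFRecord shards produced for the resized training set
  ls /home/ec2-user/efs/data/train | wc -l
  ```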
- Create the namespace:

  ```
  NAMESPACE=kubeflow-dist-train; kubectl create namespace ${NAMESPACE}
  ```
- Create the ksonnet app:

  ```
  APP_NAME=kubeflow-tf-hvd; ks init ${APP_NAME}; cd ${APP_NAME}
  ```
- Set as the default namespace:

  ```
  ks env set default --namespace ${NAMESPACE}
  ```
- Create the Persistent Volume (PV) backed by EFS. You need to update the name of the EFS server in the Kubernetes manifest file. The storage capacity can also be adjusted based on the dataset size and other requirements.

  ```
  kubectl create -f ../training/distributed_training/dist_pv.yaml
  ```
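  If you are unsure of the EFS server name, you can look it up with the AWS CLI; the NFS server value in the manifest is typically the file system's DNS name, of the form `<file-system-id>.efs.<region>.amazonaws.com`:

  ```
  # List EFS file system IDs; combine with your region to form the DNS name,
  # e.g. fs-12345678.efs.us-west-2.amazonaws.com
  aws efs describe-file-systems --query 'FileSystems[*].FileSystemId' --output text
  ```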
- Create the Persistent Volume Claim (PVC) based on EFS. The requested storage capacity may be adjusted in the manifest; it should be at most the storage capacity of the PV.

  ```
  kubectl create -f ../training/distributed_training/dist_pvc.yaml
  ```
- Make sure that the PV has been claimed by the PVC in the same namespace. You can verify this by running:

  ```
  kubectl get pv -n ${NAMESPACE}
  NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS   REASON   AGE
  nfs-data   85Gi       RWX            Retain           Bound    kubeflow-dist-train/nfs-external   nfs-external            58s
  ```
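  You can also check the claim itself; its `STATUS` should be `Bound`:

  ```
  kubectl get pvc -n ${NAMESPACE}
  ```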
- Create a secret for SSH access between the nodes:

  ```
  SECRET=openmpi-secret
  mkdir -p .tmp
  yes | ssh-keygen -N "" -f .tmp/id_rsa
  kubectl delete secret ${SECRET} -n ${NAMESPACE} || true
  kubectl create secret generic ${SECRET} -n ${NAMESPACE} --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub --from-file=authorized_keys=.tmp/id_rsa.pub
  ```
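  To confirm the secret was created with the expected keys (`id_rsa`, `id_rsa.pub`, `authorized_keys`):

  ```
  kubectl describe secret ${SECRET} -n ${NAMESPACE}
  ```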
- Install the Kubeflow openmpi component:

  ```
  VERSION=master
  ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
  ks pkg install kubeflow/openmpi@${VERSION}
  ```
- Build a Docker image for Horovod using the Dockerfile at `training/distributed_training/Dockerfile` and the command `docker image build -t arungupta/horovod .`. Alternatively, you can use the image that already exists on Docker Hub: `IMAGE=rgaut/horovod:latest`.
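  If you build the image yourself, it needs to be available in a registry that the cluster nodes can pull from, and `IMAGE` should point at it (the repository name below is only an example; use your own):

  ```
  # Push the locally built image and reference it in the later steps
  docker push arungupta/horovod
  IMAGE=arungupta/horovod
  ```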
- Define the number of workers (machines) and the number of GPUs available per machine:

  ```
  WORKERS=2; GPU=4
  ```
- Formulate the MPI command based on the official Horovod documentation:

  ```
  EXEC="mpiexec -np 8 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64 --data_name=imagenet --data_dir=/mnt/data --variable_update horovod --horovod_device gpu --use_fp16'"
  ```

  If more than one GPU is used, you may have to replace `NCCL_SOCKET_IFNAME=eth0` with `NCCL_SOCKET_IFNAME=^docker0`. A few parts of the MPI command need explanation: `-np` is the total number of processes, which is equal to `WORKERS * GPU`.
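  For example, the value passed to `-np` can be derived from the variables defined earlier (simple shell arithmetic; the `EXEC` string above hard-codes `8` for 2 workers with 4 GPUs each):

  ```
  NP=$((WORKERS * GPU))   # 2 workers * 4 GPUs = 8 processes
  echo ${NP}
  ```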
- Generate the config:

  ```
  COMPONENT=openmpi
  ks generate openmpi ${COMPONENT} --image ${IMAGE} --secret ${SECRET} --workers ${WORKERS} --gpu ${GPU} --exec "${EXEC}"
  ```
- Set the parameter for the created volume:

  ```
  ks param set ${COMPONENT} volumes '[{ "name": "efs-pvc", "persistentVolumeClaim": { "claimName": "nfs-external" }}]'
  ```
- Set the parameter for volumeMounts:

  ```
  ks param set ${COMPONENT} volumeMounts '[{ "name": "efs-pvc", "mountPath": "/mnt"}]'
  ```

  This command makes the `data` directory available under `/mnt/data` in each pod.
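  Optionally, before deploying, you can review the component parameters and preview the generated manifests with standard ksonnet commands:

  ```
  ks param list ${COMPONENT}
  ks show default -c ${COMPONENT}
  ```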
- Deploy the config to your cluster:

  ```
  ks apply default
  ```
- Check the pod status:

  ```
  kubectl get pod -n ${NAMESPACE} -o wide
  ```
- Save the log:

  ```
  mkdir -p results
  kubectl logs -n ${NAMESPACE} -f ${COMPONENT}-master > results/benchmark_1.out
  ```

  Here is a sample output.
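  To pull the throughput numbers out of the saved log (the benchmark script reports lines containing `images/sec`):

  ```
  grep "images/sec" results/benchmark_1.out | tail
  ```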
- To iterate quickly, remove the pods and the openmpi component, then restart from the `ks generate openmpi` step:

  ```
  ks delete default
  ks component rm openmpi
  ```
- [Optional] EXEC command for 4 workers with 8 GPUs each:

  ```
  WORKERS=4
  GPU=8
  EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64 --data_name=imagenet --data_dir=/mnt/data --variable_update horovod --horovod_device gpu --use_fp16'"
  ```

  Then repeat the steps above starting from generating the config (`ks generate`) through saving the log.
- [Optional] To store checkpoints for the resnet50 model under the EFS-backed `/mnt` directory:

  ```
  EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --num_epochs=25 --batch_size 64 --data_name=imagenet --data_dir=/mnt/data --train_dir=/mnt/checkpoint --variable_update horovod --horovod_device gpu --weight_decay=1e-4 --use_fp16'"
  ```

  To store the checkpoints, which are necessary for inference or model evaluation, specify `--train_dir`. Either `num_epochs` or `num_batches` can be used.
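  Once training is running, you can verify that checkpoints are being written (this assumes the master pod is named `${COMPONENT}-master`, as in the log step above):

  ```
  kubectl exec -n ${NAMESPACE} ${COMPONENT}-master -- ls -l /mnt/checkpoint
  ```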