This document explains how to perform distributed training on Amazon EKS using TensorFlow and Horovod with the ImageNet dataset. The same steps can be used for any other dataset.
- Create an EKS cluster with GPU worker nodes.
- Install Kubeflow.
- Create an EFS file system and mount it on each worker node as explained in `efs-on-eks-worker-nodes.md`.
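  A quick way to confirm the mount on a worker node (a minimal check, assuming you can SSH to the node and that EFS is mounted at `/home/ec2-user/efs` as described in the next step):

  ```
  # On a worker node: verify the EFS (NFSv4) mount is present
  df -h /home/ec2-user/efs
  mount | grep nfs4
  ```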
If you work for Amazon, reach out to the authors of this document for access to the data. Otherwise, follow the instructions below.
- Download the ImageNet dataset to the `data` directory on EFS. This would typically be the `/home/ec2-user/efs/data` directory. Use the `Download Original Images (for non-commercial research/educational use only)` option.
- TensorFlow consumes the ImageNet data in a specific format. You can preprocess the data by downloading and modifying the script:

  ```
  curl -O https://raw.githubusercontent.com/aws-samples/deep-learning-models/master/utils/tensorflow/preprocess_imagenet.sh
  chmod +x preprocess_imagenet.sh
  ```

  The following values need to be changed in the script:

  - `[your imagenet account]`
  - `[your imagenet access key]`
  - `[PATH TO TFRECORD TRAINING DATASET]`
  - `[PATH TO RESIZED TFRECORD TRAINING DATASET]`
  - `[PATH TO TFRECORD VALIDATION DATASET]`
  - `[PATH TO RESIZED TFRECORD VALIDATION DATASET]`

  Execute the script:

  ```
  ./preprocess_imagenet.sh
  ```
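  A simple sanity check after preprocessing (the directory below is only an example; use whatever you set for `[PATH TO RESIZED TFRECORD TRAINING DATASET]`):

  ```
  # Count the TFRecord shards produced for the resized training set
  ls /home/ec2-user/efs/data/train | wc -l
  ```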
- Create the namespace:

  ```
  NAMESPACE=kubeflow-dist-train; kubectl create namespace ${NAMESPACE}
  ```
- Create the ksonnet app:

  ```
  APP_NAME=kubeflow-tf-hvd; ks init ${APP_NAME}; cd ${APP_NAME}
  ```
- Set as the default namespace:

  ```
  ks env set default --namespace ${NAMESPACE}
  ```
- Create the Persistent Volume (PV) backed by EFS. You need to update the name of the EFS server in the Kubernetes manifest file. The storage capacity can also be adjusted based on the dataset size and other requirements.

  ```
  kubectl create -f ../training/distributed_training/dist_pv.yaml
  ```
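  If you are unsure of the EFS server name, you can look it up with the AWS CLI; the NFS server value in the manifest is typically the file system's DNS name, of the form `<file-system-id>.efs.<region>.amazonaws.com`:

  ```
  # List EFS file system IDs; combine with your region to form the DNS name,
  # e.g. fs-12345678.efs.us-west-2.amazonaws.com
  aws efs describe-file-systems --query 'FileSystems[*].FileSystemId' --output text
  ```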
- Create the Persistent Volume Claim (PVC) based on EFS. The requested storage capacity may be adjusted in the manifest; it should be at most the storage capacity of the PV.

  ```
  kubectl create -f ../training/distributed_training/dist_pvc.yaml
  ```
- Make sure that the PV has been claimed by the PVC in the same namespace. You can verify this by running:

  ```
  kubectl get pv -n ${NAMESPACE}
  NAME       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS   REASON   AGE
  nfs-data   85Gi       RWX            Retain           Bound    kubeflow-dist-train/nfs-external   nfs-external            58s
  ```
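  You can also check the claim itself; its `STATUS` should be `Bound`:

  ```
  kubectl get pvc -n ${NAMESPACE}
  ```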
- Create a secret for SSH access between the nodes:

  ```
  SECRET=openmpi-secret
  mkdir -p .tmp
  yes | ssh-keygen -N "" -f .tmp/id_rsa
  kubectl delete secret ${SECRET} -n ${NAMESPACE} || true
  kubectl create secret generic ${SECRET} -n ${NAMESPACE} --from-file=id_rsa=.tmp/id_rsa --from-file=id_rsa.pub=.tmp/id_rsa.pub --from-file=authorized_keys=.tmp/id_rsa.pub
  ```
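  To confirm the secret was created with the expected keys (`id_rsa`, `id_rsa.pub`, `authorized_keys`):

  ```
  kubectl describe secret ${SECRET} -n ${NAMESPACE}
  ```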
- Install the Kubeflow openmpi component:

  ```
  VERSION=master
  ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
  ks pkg install kubeflow/openmpi@${VERSION}
  ```
- Build a Docker image for Horovod using the Dockerfile at `training/distributed_training/Dockerfile` and the command `docker image build -t arungupta/horovod .`. Alternatively, you can use the image that already exists on Docker Hub: `IMAGE=rgaut/horovod:latest`.
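  If you build the image yourself, it needs to be available in a registry that the cluster nodes can pull from, and `IMAGE` should point at it (the repository name below is only an example; use your own):

  ```
  # Push the locally built image and reference it in the later steps
  docker push arungupta/horovod
  IMAGE=arungupta/horovod
  ```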
- Define the number of workers (machines) and the number of GPUs available per machine:

  ```
  WORKERS=2; GPU=4
  ```
- Formulate the MPI command based on the official Horovod documentation:

  ```
  EXEC="mpiexec -np 8 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64 --data_name=imagenet --data_dir=/mnt/data --variable_update horovod --horovod_device gpu --use_fp16'"
  ```

  If more than one GPU is used, you may have to replace `NCCL_SOCKET_IFNAME=eth0` with `NCCL_SOCKET_IFNAME=^docker0`. A few parts of the MPI command need explanation: `-np` is the total number of processes, which is equal to `WORKERS * GPU`.
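  For example, the value passed to `-np` can be derived from the variables defined earlier (simple shell arithmetic; the `EXEC` string above hard-codes `8` for 2 workers with 4 GPUs each):

  ```
  NP=$((WORKERS * GPU))   # 2 workers * 4 GPUs = 8 processes
  echo ${NP}
  ```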
- Generate the config:

  ```
  COMPONENT=openmpi
  ks generate openmpi ${COMPONENT} --image ${IMAGE} --secret ${SECRET} --workers ${WORKERS} --gpu ${GPU} --exec "${EXEC}"
  ```
- Set the parameter for the created volume:

  ```
  ks param set ${COMPONENT} volumes '[{ "name": "efs-pvc", "persistentVolumeClaim": { "claimName": "nfs-external" }}]'
  ```
- Set the parameter for volumeMounts:

  ```
  ks param set ${COMPONENT} volumeMounts '[{ "name": "efs-pvc", "mountPath": "/mnt"}]'
  ```

  This command makes the `data` directory available under `/mnt/data` in each pod.
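  Optionally, before deploying, you can review the component parameters and preview the generated manifests with standard ksonnet commands:

  ```
  ks param list ${COMPONENT}
  ks show default -c ${COMPONENT}
  ```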
- Deploy the config to your cluster:

  ```
  ks apply default
  ```
- Check the pod status:

  ```
  kubectl get pod -n ${NAMESPACE} -o wide
  ```
- Save the log:

  ```
  mkdir -p results
  kubectl logs -n ${NAMESPACE} -f ${COMPONENT}-master > results/benchmark_1.out
  ```

  Here is a sample output.
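  To pull the throughput numbers out of the saved log (the benchmark script reports lines containing `images/sec`):

  ```
  grep "images/sec" results/benchmark_1.out | tail
  ```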
- To iterate quickly, remove the pods and the openmpi component, then restart from the `ks generate openmpi` step:

  ```
  ks delete default
  ks component rm openmpi
  ```
- [Optional] EXEC command for 4 workers with 8 GPUs each:

  ```
  WORKERS=4
  GPU=8
  EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=100 --model vgg16 --batch_size 64 --data_name=imagenet --data_dir=/mnt/data --variable_update horovod --horovod_device gpu --use_fp16'"
  ```

  Then repeat the steps above starting from generating the config (`ks generate`) through saving the log.
- [Optional] To store checkpoints for the resnet50 model under the EFS-backed `/mnt` directory:

  ```
  EXEC="mpiexec -np 32 --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map --tag-output --timestamp-output -mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib sh -c 'NCCL_SOCKET_IFNAME=eth0 NCCL_MIN_NRINGS=8 NCCL_DEBUG=INFO python3.6 /examples/official-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --num_epochs=25 --batch_size 64 --data_name=imagenet --data_dir=/mnt/data --train_dir=/mnt/checkpoint --variable_update horovod --horovod_device gpu --weight_decay=1e-4 --use_fp16'"
  ```

  To store the checkpoints, which are necessary for inference or model evaluation, specify `--train_dir`. Either `num_epochs` or `num_batches` can be used.
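  Once training is running, you can verify that checkpoints are being written (this assumes the master pod is named `${COMPONENT}-master`, as in the log step above):

  ```
  kubectl exec -n ${NAMESPACE} ${COMPONENT}-master -- ls -l /mnt/checkpoint
  ```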