Serverless Distributed Deep Learning Training with Code Engine

This tutorial walks through the steps to train a deep learning model in a serverless and distributed environment with IBM Code Engine.

Components

  • Code Engine: provides the underlying serverless environment.
  • Ray: provides a Ray cluster on Code Engine for easily running distributed applications.
  • Horovod: provides a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Serverless, distributed deep learning training is achieved by running Horovod code on a Ray cluster launched in Code Engine.

Steps

  1. Set up Code Engine.
    1. Create a Code Engine project in IBM Cloud.
    2. Enable the project with Ray. Previously this required sharing your project's namespace ID with the Code Engine team and asking them to enable it; you can now do it yourself by following https://www.ibm.com/cloud/blog/ray-on-ibm-cloud-code-engine
  2. Prepare the Ray cluster file. The Ray cluster is configured and provisioned by a single cluster YAML file; all cluster-related information (node specs, container image, resource allocation, etc.) should be configured in this file. Here is a template.
  3. Prepare training script. Modify your existing training script to use Horovod on Ray.
    1. Modify for Horovod. The modifications vary slightly by framework but are straightforward. You can find more information on Horovod with TensorFlow, Horovod with TensorFlow Keras, Horovod with PyTorch, and Horovod on MXNet. A minimal sketch appears after this list.
    2. Modify for Ray integration. A script that runs Horovod on Ray differs slightly from one that runs Horovod alone, so a few more modifications are required: wrap the code from the previous step in a train() function and execute it with a RayExecutor. You can find the example here; a sketch of this wrapping also appears after this list.
  4. Run it! Running is as simple as a two-line script:
    # Launch Ray cluster
    ray up -y --no-config-cache ray_cluster.yaml 
    # Submit your training
    ray submit --no-config-cache ray_cluster.yaml ray_mnist.py
    You might have to wait a couple of minutes for the cluster launch to complete after running the first command, if you are launching it for the first time (it takes time to pull the image and allocate the resources).
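
To make step 3.1 concrete, here is a minimal sketch of the usual Horovod changes for a TensorFlow Keras script. The model and input pipeline below are placeholders rather than code from this repo; see the Horovod documentation for the framework you actually use.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each worker to a single GPU, if GPUs are present.
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = build_model()      # placeholder: your existing Keras model
dataset = load_dataset()   # placeholder: your existing input pipeline

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
model.compile(optimizer=hvd.DistributedOptimizer(opt),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(dataset,
          epochs=10,
          # Broadcast initial weights from rank 0 so all workers start in sync.
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)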
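
For step 3.2, here is a minimal sketch of the RayExecutor wrapping. The worker count and per-worker resources are illustrative assumptions and should match what your ray_cluster.yaml provisions.

import ray
from horovod.ray import RayExecutor

def train():
    # The Horovod-modified training code from the previous step goes here,
    # starting with hvd.init().
    ...

ray.init(address="auto")  # connect to the Ray cluster launched by `ray up`

settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(settings, num_workers=2, cpus_per_worker=1, use_gpu=False)
executor.start()
executor.run(train)  # executes train() on every Horovod worker
executor.shutdown()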

Examples

You can find examples here.
mnist contains an example of training a Fashion-MNIST model, where the data is read from IBM Cloud Object Storage (COS) and the trained model is saved back to COS.
mnist-transfer-learning contains an example of transfer learning, fine-tuning, or resume-training, where the data is read from COS, the pre-trained model is loaded from COS, and the trained model is saved back to COS.
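
The COS access in these examples can be done with the IBM COS Python SDK (ibm_boto3). Here is a minimal sketch of that access pattern, with placeholder credentials, bucket, and object names; the real values belong in the example scripts such as ray_mnist.py.

import ibm_boto3
from ibm_botocore.client import Config

# All values below are placeholders; substitute your own credentials and bucket.
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="YOUR_API_KEY",
    ibm_service_instance_id="YOUR_SERVICE_INSTANCE_CRN",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

cos.download_file("your-bucket", "fashion-mnist.npz", "/tmp/fashion-mnist.npz")  # fetch training data
cos.upload_file("/tmp/model.h5", "your-bucket", "model.h5")                      # save trained model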

To run the demo (e.g. mnist), just do:

# 1. Connect to your code engine project, e.g.
export KUBECONFIG=/Users/lchu/.bluemix/plugins/code-engine/horovod-5dc3ff50-23e7-46be-92b2-e2de2da9bd71.yaml
# 2. Launch Ray cluster
ray up -y --no-config-cache ray_cluster.yaml 
# 3. Submit your training
cd examples/mnist
### Modify information as needed, e.g. change credentials in ray_mnist.py, change resource allocation in ray_cluster.yaml, etc. ###
ray submit --no-config-cache ray_cluster.yaml ray_mnist.py
