- 2.1. Model Definition Files
- 2.2. Data Formatting
- 2.3. Uploading Data in Object Store
- 2.4. Creating Manifest file
- 2.5. Creating Model zip file
- 2.6. Model Deployment and Training
- 2.6.1. Train models using FfDL CLI
- 2.6.2. Train models using FfDL UI
- 2.6.3. Deploy models using Seldon-Core
- 3.1. FfDL Local Object Store
- 3.2. Cloud Object Store
- You need to have FfDL running on your cluster.
Currently, Fabric for Deep Learning supports following community frameworks
Framework | Versions | Processing Unit |
---|---|---|
tensorflow | 1.4.0, 1.4.0-py3, 1.5.0, 1.5.0-py3, 1.5.1, 1.5.1-py3, 1.6.0, 1.6.0-py3, 1.7.0, 1.7.0-py3, 1.8.0, 1.8.0-py3, 1.9.0, 1.9.0-py3, latest, latest-py3 | CPU |
tensorflow | 1.4.0-gpu, 1.4.0-gpu-py3, 1.5.0-gpu, 1.5.0-gpu-py3, 1.5.1-gpu, 1.5.1-gpu-py3, 1.6.0-gpu, 1.6.0-gpu-py3, 1.7.0-gpu, 1.7.0-gpu-py3, 1.8.0-gpu, 1.8.0-gpu-py3, 1.9.0-gpu, 1.9.0-gpu-py3, latest-gpu, latest-gpu-py3 | GPU |
caffe | cpu, intel | CPU |
caffe | gpu | GPU |
pytorch | v0.2, latest | CPU, GPU |
caffe2 | c2v0.8.1.cpu.full.ubuntu14.04, c2v0.8.0.cpu.full.ubuntu16.04 | CPU |
caffe2 | c2v0.8.1.cuda8.cudnn7.ubuntu16.04, latest | GPU |
h2o3 | latest | CPU |
horovod | 0.13.10-tf1.9.0-torch0.4.0-py2.7, 0.13.10-tf1.9.0-torch0.4.0-py3.5 | CPU, GPU |
You can deploy models based on these frameworks and then train your models using the FfDL CLI or FfDL UI.
To create new models you first need to create model definition files and data files for training and testing.
Different deep learning frameworks support different languages to define their models. For example, Torch models are defined in LuaJIT whereas Caffe models are defined using config files written in Protocol Buffer Language. Details on how to write model definition files is beyond the scope of this document.
Different frameworks require train and test datasets in different formats. For example, Caffe requires datasets in LevelDB or LMDB format while Torch requires datasets in Torch proprietary format. We assume that data is already in the format needed by the specific framework. Details on how to convert raw data to framework specific format is beyond the scope of this document.
Follow the instructions under Object Store for FfDL. You can then use the object store credentials to upload your data. The object store is also used to store the trained model.
The manifest file contains different fields describing the model in FfDL , its object store information, its resource requirements, and several arguments (including hyperparameters) required for model execution during training and testing. Here are the example manifest files for Caffe and TensorFlow models. You can use these templates to create manifest file for your models. Below we describe different fields of the manifest file for FfDL.
-
name:
After a model is deployed in FfDL a unique id for the model is created. The model id is +, where is a string of alphanumeric characters to uniquely identify the deployed model. is a prefix of the model id. You can provide any value to name. -
version:
This is version of the manifest file. This field is currently not used. -
description:
This is for users to keep track of their deployed models. Users can use in future and get information about particular model. FfDL does not interpret it. You can put anything here. -
learners:
Number of learners to use in training. As FfDL supports distributed learning, you can have more than one learner for your training job. -
gpus:
Number of gpus used by each learner during training. -
cpus:
Number of cpus used by each learner during training. The default cpu number is 5. -
memory:
Memory assigned to each learner during training. The default memory is 8Gb. -
data_stores:
You can specify as many data stores as you want in the manifest file. Each data store has the following fields.id:
Data store id (which you make up), to be used when creating a training job.type:
Type of data store, values is "mount_cos" (details below).training_data:
Location of the training data in the data store.container:
The container where the training data resides.
training_results:
Location of the training results in the data store. After training the trained model and training logs will be stored here, under "training-TRAININGJOBID".container:
The container where the training results will be stored. This filed is recommended to define by users. e.g.mnist_trained_model
. If not, the default location of trained models is FfDL object store.
connection:
The connection variables for the data store. The list of connection variables supported is data store type dependent. At present, the following connection variables are supported:- mount_cos:
auth_url
,user_name
(AWS Access Key), andpassword
(AWS Secret Access Key),region
(optional)
- mount_cos:
-
framework:
This field provides deep learning framework specific information.name:
Name of framework, values can be "caffe", "tensorflow" , "pytorch", or "caffe2".version:
Version of framework. List of available versions are in section 1. You must pick the version with the correct processing unit in order to run your jobs in GPU/CPU.command:
This field identifies the main program file along with any arguments that FfDL needs to execute. For example, the command to run a TensorFlow training can be as followspython mnist_with_summaries.py --train_images_file ${DATA_DIR}/train-images-idx3-ubyte.gz --train_labels_file ${DATA_DIR}/train-labels-idx1-ubyte.gz --test_images_file ${DATA_DIR}/t10k-images-idx3-ubyte.gz --test_labels_file ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --max_steps 400 --learning_rate 0.001
wherepython mnist_with_summaries.py
is the model code to execute while the remainder are arguments to the model.train_images_file
,train_labels_file
,test_images_file
,test_labels_file
refers to the dataset path in learner,max_steps
,learning_rate
are training parameters and hyperparameters.
Note: If the user's model and manifest files refer to some training data, they shouldn't use absolute paths. They should either:
* use relative paths, like
--train_images_file ./train-images-idx3-ubyte.gz
--test_images_file ./t10k-images-idx3-ubyte.gz
* or, use the $DATA_DIR environment variable, like
--train_images_file ${DATA_DIR}/train-images-idx3-ubyte.gz
--test_images_file ${DATA_DIR}/t10k-images-idx3-ubyte.gz
Note that FfDL CLI can take both zip or unzip files.
You need to zip all the model definition files and create a model zip file for jobs submitting on FfDL UI. At present, FfDL UI only supports zip format for model files, other compression formats like gzip, bzip, tar etc., are not supported. Note that all model definition files has to be in the first level of the zip file and there are no nested directories in the zip file.
After creating the manifest file and model definition file, you can either use the FfDL CLI or FfDL UI to deploy your model.
Note: Right now FfDL CLI only available on Mac and Linux.
In order to use the FfDL CLI, you will need your FfDL's restapi endpoint. Currently, the FfDL CLI is an executable binary located at cli/bin
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;
Next, you need to get the correct executable binary for FfDL CLI located at cli/bin
.
CLI_CMD=cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
Now, you can use the following command to train your models.
$CLI_CMD train <manifest file location> <model definition zip | model definition directory>
After training your models, you can run $CLI_CMD logs <Job ID>
to view your model's logs and $CLI_CMD list
to view the list of models your had trained. You can also run $CLI_CMD -h
to learn more about the FfDL CLI.
To train your models using FfDL UI, simply upload your manifest file and model definition zip in the correspond fields and click Submit Training Job
Trained models can be deployed and served via REST and gRPC endpoints using Seldon-Core. For examples, see here
We will use the Amazon's S3 command line interface to access the object store. To set up a user environment to access object store, please follow instructions at AWS cli setup page and Using Amazon S3 with the AWS cli.
By default FfDL will use its local object store for storing any training and result data. You need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables using the default Local Object Store credentials before using this client.
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
export AWS_DEFAULT_REGION=us-east-1
# Create your training data and result buckets
aws --endpoint-url=http://$(make --no-print-directory kubernetes-ip):$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}') s3 mb <trainingDataBucket>
aws --endpoint-url=http://$(make --no-print-directory kubernetes-ip):$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}') s3 mb <trainingResultBucket>
Now, upload all you datasets to the training data bucket.
aws --endpoint-url=http://$(make --no-print-directory kubernetes-ip):$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}') s3 cp <data file> s3://<trainingDataBucket>/<data file name>
Provision an S3 based Object Storage from your Cloud provider. Take note of your Authentication Endpoints, Access Key ID and Secret.
For IBM Cloud, you can provision an Object Storage from IBM Cloud Dashboard or from SoftLayer Portal.
You need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables using your Cloud Object Store credentials before using this client.
export AWS_ACCESS_KEY_ID=*****************
export AWS_SECRET_ACCESS_KEY=********************
# Create your training data and result buckets
aws --endpoint-url=http://<object storage Authentication Endpoints> s3 mb <trainingDataBucket>
aws --endpoint-url=http://<object storage Authentication Endpoints> s3 mb <trainingResultBucket>
Now, upload all you datasets to the training data bucket.
aws --endpoint-url=http://<object storage Authentication Endpoints> s3 cp <data file> s3://<trainingDataBucket>/<data file name>