diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 76d4c9e91..c30420637 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -35,7 +35,7 @@ All source code are under `./srcs/<lang>/` where `<lang> := cpp | go | python`.
 * Graph: A directed graph, which may contain self loops. The vertices are numbered from 0 to n - 1.
 
-## Useful commands for development
+## Useful commands
 
 ### Format code
 
@@ -50,6 +50,16 @@ All source code are under `./srcs/<lang>/` where `<lang> := cpp | go | python`.
 pip3 wheel -vvv --no-index .
 ```
 
+### Docker
+
+```bash
+# Build the image (run this command in the KungFu source folder)
+docker build -f docker/Dockerfile.tf-gpu -t kungfu:gpu .
+
+# Run the Docker container
+docker run -it kungfu:gpu
+```
+
 ## Use NVIDIA NCCL
 
 KungFu can use [NCCL](https://developer.nvidia.com/nccl) to leverage GPU-GPU direct communication.
diff --git a/README.md b/README.md
index dbb1b795d..19a75e7eb 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,8 @@ Easy, adaptive and fast distributed machine learning.
 
 KungFu enables users to achieve *fast* and *adaptive* distributed machine learning. This is important because machine learning systems must cope with growing complex models and increasingly complicated deployment environments. KungFu has the following unique features:
 
-* Simplicity: KungFu permits distributed training by adding only one line of code in the training program. KungFu is easy to deploy. It does not require partitioning resources as in parameter servers and heavy dependency like MPI in Horovod.
+* Simplicity: KungFu enables distributed training by adding only one line of code to your existing training program.
+* Easy to deploy: KungFu has minimal dependencies. It does not require heavy dependencies such as MPI (used by Horovod) or external resources such as parameter servers. Check the [GPU](docker/Dockerfile.tf-gpu) and [CPU](docker/Dockerfile.tf-cpu) Docker files.
 * Adaptive distributed training: KungFu provides many advanced [distributed optimizers](srcs/python/kungfu/optimizers/__init__.py) such as communication-efficient [AD-PSGD](https://arxiv.org/abs/1710.06952) and small-batch-efficient [SMA](http://www.vldb.org/pvldb/vol12/p1399-koliousis.pdf) to help you address the cases in which [Synchronous SGD](https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf) does not scale.
 * Monitoring: KungFu supports [distributed SGD metrics](srcs/python/kungfu/optimizers/sync_sgd.py) such as [gradient variance](https://en.wikipedia.org/wiki/Variance) and [gradient noise scale](https://openai.com/blog/science-of-ai/) to help understand the training process with low overhead.
 
@@ -23,7 +24,7 @@ To use KungFu to scale out your TensorFlow training program, you simply need to
 
 1. Wrap the optimizer in ``SynchronousSGDOptimizer`` or another [distributed optimizer](srcs/python/kungfu/optimizers/__init__.py).
 2. Run ``distributed_initializer()`` after calling ``global_variables_initializer()``.
-   The distributed initializer synchronizes the initial variables on all workers.
+   The distributed initializer ensures that the initial variables on all workers are consistent.
 
 ```python
 import tensorflow as tf
@@ -71,7 +72,8 @@ kungfu-run -np $NUM_GPUS \
 
 ## Install
 
-KungFu requires [Python 3](https://www.python.org/downloads/), [CMake 3.5+](https://cmake.org/install/), [Golang 1.11+](https://golang.org/dl/) and [TensorFlow <=1.13.2](https://www.tensorflow.org/install/pip#older-versions-of-tensorflow).
+KungFu requires [Python 3](https://www.python.org/downloads/), [CMake 3.5+](https://cmake.org/install/), [Golang 1.13+](https://golang.org/dl/) and [TensorFlow <=1.13.2](https://www.tensorflow.org/install/pip#older-versions-of-tensorflow).
+You can install KungFu with the following commands, assuming you have installed the above prerequisites.
 
 ```bash
 # Install tensorflow CPU
@@ -96,6 +98,8 @@ GOBIN=$(pwd)/bin go install -v ./srcs/go/cmd/kungfu-run
 ./bin/kungfu-run -help
 ```
 
+You can also use KungFu within a Docker container. Check the Docker files for [GPU](docker/Dockerfile.tf-gpu) and [CPU](docker/Dockerfile.tf-cpu) machines.
+
 ## Benchmark
 
 We benchmark the performance of KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines.
@@ -116,7 +120,7 @@ All benchmark scripts are available [here](benchmarks/system/).
 
 ## Convergence
 
-The synchronization algorithms (``SynchronousSGDOptimizer``, ``PairAveragingOptimizer`` and ``SynchronousAveragingOptimizer``)
+The distributed optimizers (``SynchronousSGDOptimizer``, ``PairAveragingOptimizer`` and ``SynchronousAveragingOptimizer``)
 can reach the same evaluation accuracy as Horovod.
 We validated this with the ResNet-50 and ResNet-101 models in the [TensorFlow benchmark](https://github.com/luomai/benchmarks/tree/cnn_tf_v1.12_compatible_kungfu).
 You can also add your own KungFu distributed optimizer to the benchmark by adding one line of code, see [here](https://github.com/luomai/benchmarks/blob/1eb102a81cdcd42cdbea56d2d19f36a8018e9f80/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L1197).
diff --git a/benchmarks/system/README.md b/benchmarks/system/README.md
index 8da77fca1..52400ddcf 100644
--- a/benchmarks/system/README.md
+++ b/benchmarks/system/README.md
@@ -2,7 +2,13 @@
 
 Distributed training benchmark of KungFu, Horovod and Parameter Servers.
 
-We assume the benchmark runs on a server with 4 GPUs. The Tensorflow version is 1.13.1.
+## Intro
+
+This benchmark requires TensorFlow <=1.13.2, KungFu and Horovod.
+We have run this benchmark on two clusters: one with two DGX-1 machines (each with 8 V100 GPUs) and one with 16 P100 machines. You can see the benchmark results [here](result/).
+
+In the following, we provide sample commands to run the benchmark.
+We assume the benchmark runs on a server with 4 GPUs.
 
 The benchmark imports models from [tf.keras.applications](https://www.tensorflow.org/api_docs/python/tf/keras/applications). You can freely choose different models and batch sizes.
 
@@ -48,7 +54,7 @@ kungfu-run -np 4 python3 benchmark_kungfu.py --kungfu=async-sgd --model=ResNet50
 Use the following shell script to run the parameter server benchmark.
 
 ```bash
-# Configure 1 local parameter server (You can create more parameter servers)
+# Configure 1 local parameter server (we suggest a 1:1 ratio between parameter servers and workers)
 PS_HOSTS="localhost:2220"
 
 # Configure four training workers
diff --git a/docker/Dockerfile.builder-ubuntu18 b/docker/Dockerfile.builder-ubuntu18
index 70789a6d0..8354c5b36 100644
--- a/docker/Dockerfile.builder-ubuntu18
+++ b/docker/Dockerfile.builder-ubuntu18
@@ -16,8 +16,8 @@ RUN apt update && \
 ARG PY_MIRROR='-i https://pypi.tuna.tsinghua.edu.cn/simple'
 RUN pip3 install ${PY_MIRROR} tensorflow
 
-RUN wget -q https://dl.google.com/go/go1.11.linux-amd64.tar.gz && \
-    tar -C /usr/local -xf go1.11.linux-amd64.tar.gz && \
-    rm go1.11.linux-amd64.tar.gz
+RUN wget -q https://dl.google.com/go/go1.13.linux-amd64.tar.gz && \
+    tar -C /usr/local -xf go1.13.linux-amd64.tar.gz && \
+    rm go1.13.linux-amd64.tar.gz
 
 ENV PATH=${PATH}:/usr/local/go/bin
diff --git a/docker/Dockerfile.tf-cpu b/docker/Dockerfile.tf-cpu
new file mode 100644
index 000000000..93ae01e50
--- /dev/null
+++ b/docker/Dockerfile.tf-cpu
@@ -0,0 +1,13 @@
+FROM tensorflow/tensorflow:1.13.1-py3
+
+RUN apt update && apt install -y cmake wget
+RUN wget -q https://dl.google.com/go/go1.13.linux-amd64.tar.gz && \
+    tar -C /usr/local -xf go1.13.linux-amd64.tar.gz && \
+    rm go1.13.linux-amd64.tar.gz
+ENV PATH=${PATH}:/usr/local/go/bin
+
+ADD . /src/kungfu
+WORKDIR /src/kungfu
+
+RUN pip3 install --no-index -U .
+RUN GOBIN=/usr/bin go install -v ./srcs/go/cmd/kungfu-run
diff --git a/docker/Dockerfile.tf-gpu b/docker/Dockerfile.tf-gpu
index 5d5ef765f..35763d763 100644
--- a/docker/Dockerfile.tf-gpu
+++ b/docker/Dockerfile.tf-gpu
@@ -1,19 +1,13 @@
-FROM tensorflow/tensorflow:1.12.0-gpu-py3 AS builder
+FROM tensorflow/tensorflow:1.13.1-gpu-py3
 
-ADD docker/sources.list.aliyun /etc/apt/sources.list
-RUN rm -fr /etc/apt/sources.list.d/* && \
-    apt update && \
-    apt install -y cmake wget
-RUN wget -q https://dl.google.com/go/go1.11.linux-amd64.tar.gz && \
-    tar -C /usr/local -xf go1.11.linux-amd64.tar.gz && \
-    rm go1.11.linux-amd64.tar.gz
+RUN apt update && apt install -y cmake wget
+RUN wget -q https://dl.google.com/go/go1.13.linux-amd64.tar.gz && \
+    tar -C /usr/local -xf go1.13.linux-amd64.tar.gz && \
+    rm go1.13.linux-amd64.tar.gz
 ENV PATH=${PATH}:/usr/local/go/bin
 
-ADD scripts /src/scripts
-RUN PREFIX=/usr /src/scripts/install-nccl.sh && \
-    rm /usr/lib/x86_64-linux-gnu/libnccl.so.2
-
 ADD . /src/kungfu
 WORKDIR /src/kungfu
 
-# RUN pip3 install --no-index -U .
+RUN ldconfig /usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs && pip3 install --no-index -U .
+RUN GOBIN=/usr/bin go install -v ./srcs/go/cmd/kungfu-run
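
Note on the quickstart hunk in README.md above: the two numbered steps correspond to a training script shaped roughly like the sketch below. This is a minimal illustration, not code from this patch: the `kungfu.optimizers` import path mirrors `srcs/python/kungfu/optimizers/__init__.py` linked in the README, while the toy model, loss, and learning rate are assumptions made up for the example.

```python
import tensorflow as tf
from kungfu.optimizers import SynchronousSGDOptimizer

# Toy model (illustrative only): one dense layer fitted to random targets.
x = tf.random.uniform([32, 10])
target = tf.random.uniform([32, 1])
y = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(y - target))

# Step 1: wrap the existing optimizer in a KungFu distributed optimizer.
opt = SynchronousSGDOptimizer(tf.train.GradientDescentOptimizer(0.01))
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Step 2: run the distributed initializer after the global one,
    # so the initial variables on all workers are consistent.
    sess.run(opt.distributed_initializer())
    for _ in range(10):
        sess.run(train_op)
```

Saved as, say, `train.py` (a hypothetical name), this would be launched like the benchmark commands above, e.g. `kungfu-run -np 4 python3 train.py`; each worker runs the script and the wrapped optimizer synchronizes gradients.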
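
The new Docker section in CONTRIBUTING.md only demonstrates the GPU image; the newly added `docker/Dockerfile.tf-cpu` builds the same way. A sketch of the analogous CPU commands (the `kungfu:cpu` tag is an arbitrary choice):

```bash
# Build the CPU image (run in the KungFu source folder)
docker build -f docker/Dockerfile.tf-cpu -t kungfu:cpu .

# Start a container; the Dockerfile installs kungfu-run into /usr/bin,
# so the launcher is directly on PATH.
docker run -it kungfu:cpu kungfu-run -help
```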