Make SMA alpha configurable. (#211)
* bump version to 0.2.0

* make alpha configurable in SMA.

* make SMA an isolated PR.

* update README
luomai authored Nov 10, 2019
1 parent 7d0e1f6 commit 5b70865
Showing 2 changed files with 44 additions and 43 deletions.
78 changes: 39 additions & 39 deletions README.md
@@ -118,41 +118,21 @@ Check out this in an [ImageNet benchmark suite](https://github.com/luomai/benchm

* ***Pose estimation***: Pose estimation models such as [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose) are often batch-size sensitive.
We used KungFu in
a popular [OpenPose implementation](https://github.com/tensorlayer/openpose-plus) and achieved robust speed up in time-to-accuracy after
using the model averaging optimizers which preserves the merits of small batch training.
a popular [OpenPose implementation](https://github.com/tensorlayer/openpose-plus) and improved time-to-accuracy
using the model averaging optimizer, which preserves the merits of a small batch size.

* ***Natural language processing***:
We have an [example](https://github.com/luomai/bert) that shows how you can use a few lines of code to enable distributed training for the Google BERT model.

* ***Adversarial learning***:
Generative adversarial networks (GANs) train multiple networks in parallel and often prefer using small batches for training.
KungFu thus become an attractive option, because of its minimal changes to complex GAN programs
and new optimizers that decouple batch size and system parallelism.
Adversarial learning trains multiple networks in parallel and often prefers small batches for training.
KungFu thus becomes an attractive option because of its minimal changes to GAN programs
and its optimizers that decouple batch size from system parallelism.
See the [CycleGAN example](https://github.com/tensorlayer/cyclegan).

* ***Reinforcement learning***:
We are working on an Alpha Zero distributed training example and will release it soon.

## Benchmark

We benchmark KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines.
The machines are interconnected by a 100 Gbps network. We measure the training throughput of ResNet-50, VGG16 and InceptionV3. These models represent different kinds of training workloads.

In the synchronous training case, we compare KungFu (``SynchronousSGDOptimizer``) with [Horovod](https://github.com/horovod/horovod) (0.16.1). Horovod uses OpenMPI 4.0.0. We evaluate the spectrum of batch size (from 256 to 4096) commonly used by SGD users.
This batch size is evenly shared by the 16 GPUs.
KungFu outperforms Horovod on all tested models, in particular with small batch sizes which significantly raise the
frequency of synchronization.

![sync](benchmarks/system/result/sync-scalability.svg)

In the asynchronous training case, we compare KungFu (``PairAveragingOptimizer``) with TensorFlow parameter servers (1.13.1). We uses the same range of batch sizes as above. KungFu exhibits better scalability as well.

![async](benchmarks/system/result/async-scalability.svg)

We have also run the same benchmark in a 16-server cluster (each has a P100).
KungFu exhibits better scalability in this communication-challenging environment,
and we thus only report the 16 V100 result here. You can find the benchmark scripts [here](benchmarks/system/).

## Choosing the right optimizer

KungFu aims to help users decrease the
@@ -173,27 +153,27 @@ Scalability-wise, all S-SGD workers must exchange all gradients per iteration, making
them hard to deal with limited bandwidth and stragglers;
(ii) accuracy-wise, S-SGD *couples* training batch size with the number of workers,
forcing users to use large batch sizes, which can adversely
affect the generality of a trained model (see [paper](https://arxiv.org/abs/1609.04836).
affect the generality of a trained model (see [paper](https://arxiv.org/abs/1609.04836)).
To compensate for the loss in generality, users must explore various [methods](https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf)
for tuning hyper-parameters.
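
To make the coupling concrete, here is a minimal TensorFlow 1.x-style sketch of wrapping a local optimizer with ``SynchronousSGDOptimizer``; the import path is assumed from the package layout in this commit, and the batch-size numbers are purely illustrative.

```python
import tensorflow as tf

from kungfu.tensorflow.optimizers import SynchronousSGDOptimizer

# Illustrative numbers: 16 workers sharing a global batch of 4096 samples.
num_workers = 16
global_batch_size = 4096
per_worker_batch = global_batch_size // num_workers  # 256 samples per GPU

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
opt = SynchronousSGDOptimizer(opt)  # all-reduces gradients every iteration

# S-SGD couples batch size with parallelism: keeping per_worker_batch fixed
# while adding workers grows the effective batch (num_workers * per_worker_batch),
# which is what pushes users towards large-batch hyper-parameter tuning.
```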

***Model averaging***:
Model averaging is implemented as ``SynchronousAveragingOptimizer`` and
``PairAveragingOptimizer`` in KungFu.
The former realizes the small-batch-efficient [SMA](http://www.vldb.org/pvldb/vol12/p1399-koliousis.pdf)
The former realizes the hyper-parameter-robust [SMA](http://www.vldb.org/pvldb/vol12/p1399-koliousis.pdf)
algorithm; while the latter implements the [AD-PSGD](https://arxiv.org/abs/1710.06952) algorithm
which reduces bandwidth consumption and tolerates stragglers.
In model averaging, each worker updates its local
model using small batch size, and exchange
models to speed up the search for minima.
Model averaging algorithms has a proven convergence guarantee (see [EA-SGD paper](https://arxiv.org/abs/1412.6651))
and often converge quickly (see [Lookahead paper](https://arxiv.org/abs/1907.08610)) with DL models.
A key property of model averaging is that: it decouples
batch size with system parallelism, making
itself *hyper-parameter robust*. We find
this property useful
as AI users often find it hard and expensive to
tune synchronous SGD.
In model averaging, each worker trains its local
model using SGD, and averages
its model with peers to speed up the search for minima.
Model averaging algorithms have a convergence guarantee (see [EA-SGD paper](https://arxiv.org/abs/1412.6651))
and can converge quickly with DL models (see [Lookahead paper](https://arxiv.org/abs/1907.08610)).
A useful property of model averaging is that it decouples
batch size from system parallelism, often making
it *hyper-parameter robust*. We find
this property valuable
as DL users often find it hard and expensive to
tune synchronous SGD at scale.
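
As a rough sketch of how the two model averaging optimizers above might wrap a local optimizer (import paths assumed from the package layout in this commit; the base optimizer is illustrative):

```python
import tensorflow as tf

from kungfu.tensorflow.optimizers import (PairAveragingOptimizer,
                                          SynchronousAveragingOptimizer)

local_opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)

# SMA: each worker keeps its small local batch and, every iteration, mixes the
# cluster-average model into its own model with ratio alpha (made configurable
# in this PR).
sma_opt = SynchronousAveragingOptimizer(local_opt, alpha=0.1)

# AD-PSGD: each worker averages its model with one peer at a time, which
# reduces bandwidth consumption and tolerates stragglers.
adpsgd_opt = PairAveragingOptimizer(local_opt)

# Either wrapper is then used like a regular TensorFlow optimizer, e.g.
# train_op = sma_opt.minimize(loss)
```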

***Convergence evaluation***:
We have tested KungFu optimizers using ResNet-50 and ResNet-101 for ImageNet.
@@ -206,9 +186,29 @@ reach the target 75%.
All these tests use a per-GPU batch size of 64 and [hyper-parameters](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks#getting-started)
suggested by the TensorFlow benchmark authors.

## Benchmark

We benchmark KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines.
The machines are interconnected by a 100 Gbps network. We measure the training throughput of ResNet-50, VGG16 and InceptionV3. These models represent different kinds of training workloads.

In the ***synchronous training*** case, we compare KungFu (``SynchronousSGDOptimizer``) with [Horovod](https://github.com/horovod/horovod) (0.16.1). Horovod uses OpenMPI 4.0.0. We evaluate a spectrum of batch sizes (from 256 to 4096) commonly used by S-SGD users.
Each batch is evenly shared among the 16 GPUs.
KungFu outperforms Horovod on all tested models, in particular with small batch sizes, which significantly raise the
frequency of synchronization.

![sync](benchmarks/system/result/sync-scalability.svg)

In the ***asynchronous training*** case, we compare KungFu (``PairAveragingOptimizer``) with TensorFlow parameter servers (1.13.1). We use the same range of batch sizes as above. KungFu exhibits better scalability as well.

![async](benchmarks/system/result/async-scalability.svg)

We have also run the same benchmark in a 16-server cluster (each with a P100 GPU).
KungFu exhibits better scalability in this communication-challenging environment as well,
so we only report the 16 V100 results here. You can find the benchmark scripts [here](benchmarks/system/).

## Development

KungFu is designed with extensibility in mind.
It has a low-level API and a modular architecture, making
it suitable for implementing new distributed optimizers and monitoring/control algorithms.
it suitable for implementing new distributed training algorithms.
Check out the developer [guideline](CONTRIBUTING.md) for more information.
9 changes: 5 additions & 4 deletions srcs/python/kungfu/tensorflow/optimizers/sma_sgd.py
@@ -8,6 +8,7 @@

def SynchronousAveragingOptimizer(optimizer,
                                  name=None,
                                  alpha=0.1,
                                  use_locking=False,
                                  with_keras=False):
    """SynchronousAveragingOptimizer implements the [SMA]_ algorithm.
@@ -24,6 +25,7 @@ def SynchronousAveragingOptimizer(optimizer,
    Keyword Arguments:
    - name {str} -- name prefix for the operations created when applying gradients. Defaults to "KungFu" followed by the provided optimizer type. (default: {None})
    - alpha {float} -- the ratio of a central model during averaging (Check the SMA and EA-SGD papers for its intuition). (default: {0.1})
    - use_locking {bool} -- Whether to use locking when updating variables. (default: {False})
    - with_keras {bool} -- Runs with pure Keras or not (default: {False})
@@ -33,18 +35,17 @@ def SynchronousAveragingOptimizer(optimizer,
    Returns:
        optimizer {tf.train.Optimizer, tf.keras.optimizers.Optimizer} -- KungFu distributed optimizer
    """
    sma_algo = _SynchronousAveraging()
    sma_algo = _SynchronousAveraging(alpha)
    if not with_keras:
        return _create_kungfu_optimizer(optimizer, sma_algo, name, use_locking)
    else:
        return _create_kungfu_keras_optimizer(optimizer, sma_algo)


class _SynchronousAveraging(_KungFuAlgorithm):
    def __init__(self):
    def __init__(self, alpha):
        self._num_workers = current_cluster_size()
        # self._alpha = 1.0 / current_cluster_size()
        self._alpha = 0.1 # Suggested by [2]
        self._alpha = alpha

    def apply_gradients(self, apply_grads_func, grads_and_vars, **kwargs):
        # It is important to apply model averaging every iteration [2]
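A short usage sketch of the newly configurable ``alpha``; only the ``SynchronousAveragingOptimizer`` signature shown in the diff above is taken as given, and the base optimizers are illustrative.

```python
import tensorflow as tf

from kungfu.tensorflow.optimizers import SynchronousAveragingOptimizer

base_opt = tf.train.AdamOptimizer(learning_rate=0.001)

# Defaults keep the previous behaviour (alpha=0.1, formerly hard-coded).
opt = SynchronousAveragingOptimizer(base_opt)

# alpha can now be tuned, e.g. the 1/cluster_size heuristic that was
# commented out in the old code (16 is an illustrative cluster size).
opt = SynchronousAveragingOptimizer(base_opt, alpha=1.0 / 16)

# with_keras=True selects the Keras code path shown above; alpha is forwarded
# to the averaging algorithm either way.
keras_opt = SynchronousAveragingOptimizer(tf.keras.optimizers.Adam(0.001),
                                          alpha=0.2,
                                          with_keras=True)
```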
