Make SMA alpha configurable. (#211)
* bump version to 0.2.0

* make alpha configurable in SMA.

* make SMA an isolated PR.

* update README
luomai authored Nov 10, 2019
1 parent 7d0e1f6 commit 5b70865
Showing 2 changed files with 44 additions and 43 deletions.
78 changes: 39 additions & 39 deletions README.md
@@ -118,41 +118,21 @@ Check out this in an [ImageNet benchmark suite](https://github.com/luomai/benchm

* ***Pose estimation***: Pose estimation models such as [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose) are often batch-size sensitive.
We used KungFu in
a popular [OpenPose implementation](https://github.com/tensorlayer/openpose-plus) and achieved robust speed up in time-to-accuracy after
using the model averaging optimizers which preserves the merits of small batch training.
a popular [OpenPose implementation](https://github.com/tensorlayer/openpose-plus) and improved time-to-accuracy
using the model averaging optimizer, which preserves the merits of a small batch size.

* ***Natural language processing***:
We have an [example](https://github.com/luomai/bert) that shows how you can use a few lines of code to enable distributed training for the Google BERT model.

* ***Adversarial learning***:
Generative adversarial networks (GANs) train multiple networks in parallel and often prefer using small batches for training.
KungFu thus become an attractive option, because of its minimal changes to complex GAN programs
and new optimizers that decouple batch size and system parallelism.
Adversarial learning trains multiple networks in parallel and often prefers small batches for training.
KungFu thus becomes an attractive option because of its minimal changes to GAN programs
and its optimizers that decouple batch size from system parallelism.
See the [CycleGAN example](https://github.com/tensorlayer/cyclegan).

* ***Reinforcement learning***:
We are working on an Alpha Zero distributed training example and will release it soon.

## Benchmark

We benchmark KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines.
The machines are interconnected by a 100 Gbps network. We measure the training throughput of ResNet-50, VGG16 and InceptionV3. These models represent different kinds of training workloads.

In the synchronous training case, we compare KungFu (``SynchronousSGDOptimizer``) with [Horovod](https://github.com/horovod/horovod) (0.16.1). Horovod uses OpenMPI 4.0.0. We evaluate the spectrum of batch size (from 256 to 4096) commonly used by SGD users.
This batch size is evenly shared by the 16 GPUs.
KungFu outperforms Horovod on all tested models, in particular with small batch sizes which significantly raise the
frequency of synchronization.

![sync](benchmarks/system/result/sync-scalability.svg)

In the asynchronous training case, we compare KungFu (``PairAveragingOptimizer``) with TensorFlow parameter servers (1.13.1). We uses the same range of batch sizes as above. KungFu exhibits better scalability as well.

![async](benchmarks/system/result/async-scalability.svg)

We have also run the same benchmark in a 16-server cluster (each has a P100).
KungFu exhibits better scalability in this communication-challenging environment,
and we thus only report the 16 V100 result here. You can find the benchmark scripts [here](benchmarks/system/).

## Choosing the right optimizer

KungFu aims to help users decrease the
@@ -173,27 +153,27 @@ Scalability-wise, all S-SGD workers must exchange all gradients per iteration, making
them hard to deal with limited bandwidth and stragglers;
(ii) accuracy-wise, S-SGD *couples* training batch size with the number of workers,
forcing users to use large batch sizes, which can adversely
affect the generality of a trained model (see [paper](https://arxiv.org/abs/1609.04836).
affect the generality of a trained model (see [paper](https://arxiv.org/abs/1609.04836)).
To compensate for the loss in generality, users must explore various [methods](https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf)
for tuning hyper-parameters.
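
To make the coupling concrete, here is a minimal TensorFlow 1.x-style sketch of wrapping a local optimizer with ``SynchronousSGDOptimizer``; the import path is assumed from the package layout in this commit, and the batch-size numbers are purely illustrative.

```python
import tensorflow as tf

from kungfu.tensorflow.optimizers import SynchronousSGDOptimizer

# Illustrative numbers: 16 workers sharing a global batch of 4096 samples.
num_workers = 16
global_batch_size = 4096
per_worker_batch = global_batch_size // num_workers  # 256 samples per GPU

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
opt = SynchronousSGDOptimizer(opt)  # all-reduces gradients every iteration

# S-SGD couples batch size with parallelism: keeping per_worker_batch fixed
# while adding workers grows the effective batch (num_workers * per_worker_batch),
# which is what pushes users towards large-batch hyper-parameter tuning.
```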

***Model averaging***:
Model averaging is implemented as ``SynchronousAveragingOptimizer`` and
``PairAveragingOptimizer`` in KungFu.
The former realizes the small-batch-efficient [SMA](http://www.vldb.org/pvldb/vol12/p1399-koliousis.pdf)
The former realizes the hyper-parameter-robust [SMA](http://www.vldb.org/pvldb/vol12/p1399-koliousis.pdf)
algorithm; while the latter implements the [AD-PSGD](https://arxiv.org/abs/1710.06952) algorithm
which reduces bandwidth consumption and tolerates stragglers.
In model averaging, each worker updates its local
model using small batch size, and exchange
models to speed up the search for minima.
Model averaging algorithms has a proven convergence guarantee (see [EA-SGD paper](https://arxiv.org/abs/1412.6651))
and often converge quickly (see [Lookahead paper](https://arxiv.org/abs/1907.08610)) with DL models.
A key property of model averaging is that: it decouples
batch size with system parallelism, making
itself *hyper-parameter robust*. We find
this property useful
as AI users often find it hard and expensive to
tune synchronous SGD.
In model averaging, each worker trains its local
model using SGD, and averages
its model with peers to speed up the search for minima.
Model averaging algorithms have a convergence guarantee (see [EA-SGD paper](https://arxiv.org/abs/1412.6651))
and can converge quickly with DL models (see [Lookahead paper](https://arxiv.org/abs/1907.08610)).
A useful property of model averaging is that it decouples
batch size from system parallelism, often making
it *hyper-parameter robust*. We find
this property valuable
as DL users often find it hard and expensive to
tune synchronous SGD at scale.
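
As a rough sketch of how the two model averaging optimizers above might wrap a local optimizer (import paths assumed from the package layout in this commit; the base optimizer is illustrative):

```python
import tensorflow as tf

from kungfu.tensorflow.optimizers import (PairAveragingOptimizer,
                                          SynchronousAveragingOptimizer)

local_opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)

# SMA: each worker keeps its small local batch and, every iteration, mixes the
# cluster-average model into its own model with ratio alpha (made configurable
# in this PR).
sma_opt = SynchronousAveragingOptimizer(local_opt, alpha=0.1)

# AD-PSGD: each worker averages its model with one peer at a time, which
# reduces bandwidth consumption and tolerates stragglers.
adpsgd_opt = PairAveragingOptimizer(local_opt)

# Either wrapper is then used like a regular TensorFlow optimizer, e.g.
# train_op = sma_opt.minimize(loss)
```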

***Convergence evaluation***:
We have tested KungFu optimizers using ResNet-50 and ResNet-101 for ImageNet.
@@ -206,9 +186,29 @@ reach the target 75%.
All these tests use a per-GPU batch size of 64 and [hyper-parameters](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks#getting-started)
suggested by the TensorFlow benchmark authors.

## Benchmark

We benchmark KungFu in a cluster that has 16 V100 GPUs hosted by 2 DGX-1 machines.
The machines are interconnected by a 100 Gbps network. We measure the training throughput of ResNet-50, VGG16 and InceptionV3. These models represent different kinds of training workloads.

In the ***synchronous training*** case, we compare KungFu (``SynchronousSGDOptimizer``) with [Horovod](https://github.com/horovod/horovod) (0.16.1). Horovod uses OpenMPI 4.0.0. We evaluate a spectrum of batch sizes (from 256 to 4096) commonly used by S-SGD users.
Each batch is evenly shared among the 16 GPUs.
KungFu outperforms Horovod on all tested models, in particular with small batch sizes, which significantly raise the
frequency of synchronization.

![sync](benchmarks/system/result/sync-scalability.svg)

In the ***asynchronous training*** case, we compare KungFu (``PairAveragingOptimizer``) with TensorFlow parameter servers (1.13.1). We use the same range of batch sizes as above. KungFu exhibits better scalability as well.

![async](benchmarks/system/result/async-scalability.svg)

We have also run the same benchmark in a 16-server cluster (each with a P100 GPU).
KungFu exhibits better scalability in this communication-challenging environment as well,
so we only report the 16 V100 results here. You can find the benchmark scripts [here](benchmarks/system/).

## Development

KungFu is designed with extensibility in mind.
It has a low-level API and a modular architecture, making
it suitable for implementing new distributed optimizers and monitoring/control algorithms.
it suitable for implementing new distributed training algorithms.
Check out the developer [guideline](CONTRIBUTING.md) for more information.
9 changes: 5 additions & 4 deletions srcs/python/kungfu/tensorflow/optimizers/sma_sgd.py
@@ -8,6 +8,7 @@

def SynchronousAveragingOptimizer(optimizer,
                                  name=None,
                                  alpha=0.1,
                                  use_locking=False,
                                  with_keras=False):
    """SynchronousAveragingOptimizer implements the [SMA]_ algorithm.
@@ -24,6 +25,7 @@ def SynchronousAveragingOptimizer(optimizer,
    Keyword Arguments:
    - name {str} -- name prefix for the operations created when applying gradients. Defaults to "KungFu" followed by the provided optimizer type. (default: {None})
    - alpha {float} -- the ratio of a central model during averaging (Check the SMA and EA-SGD papers for its intuition). (default: {0.1})
    - use_locking {bool} -- Whether to use locking when updating variables. (default: {False})
    - with_keras {bool} -- Runs with pure Keras or not (default: {False})
@@ -33,18 +35,17 @@ def SynchronousAveragingOptimizer(optimizer,
    Returns:
        optimizer {tf.train.Optimizer, tf.keras.optimizers.Optimizer} -- KungFu distributed optimizer
    """
    sma_algo = _SynchronousAveraging()
    sma_algo = _SynchronousAveraging(alpha)
    if not with_keras:
        return _create_kungfu_optimizer(optimizer, sma_algo, name, use_locking)
    else:
        return _create_kungfu_keras_optimizer(optimizer, sma_algo)


class _SynchronousAveraging(_KungFuAlgorithm):
    def __init__(self):
    def __init__(self, alpha):
        self._num_workers = current_cluster_size()
        # self._alpha = 1.0 / current_cluster_size()
        self._alpha = 0.1 # Suggested by [2]
        self._alpha = alpha

    def apply_gradients(self, apply_grads_func, grads_and_vars, **kwargs):
        # It is important to apply model averaging every iteration [2]
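A short usage sketch of the newly configurable ``alpha``; only the ``SynchronousAveragingOptimizer`` signature shown in the diff above is taken as given, and the base optimizers are illustrative.

```python
import tensorflow as tf

from kungfu.tensorflow.optimizers import SynchronousAveragingOptimizer

base_opt = tf.train.AdamOptimizer(learning_rate=0.001)

# Defaults keep the previous behaviour (alpha=0.1, formerly hard-coded).
opt = SynchronousAveragingOptimizer(base_opt)

# alpha can now be tuned, e.g. the 1/cluster_size heuristic that was
# commented out in the old code (16 is an illustrative cluster size).
opt = SynchronousAveragingOptimizer(base_opt, alpha=1.0 / 16)

# with_keras=True selects the Keras code path shown above; alpha is forwarded
# to the averaging algorithm either way.
keras_opt = SynchronousAveragingOptimizer(tf.keras.optimizers.Adam(0.001),
                                          alpha=0.2,
                                          with_keras=True)
```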
