tf-collective-all-reduce

A lightweight framework for distributed machine learning training that uses Rabit as the communication layer. The TensorFlow optimizer wrapper borrows its design from Horovod.
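To make the borrowed Horovod pattern concrete: every worker computes gradients on its own mini-batch, an allreduce averages the gradients across workers, and each worker applies the identical averaged update. The toy single-process sketch below illustrates that pattern only; the helper names (`allreduce_mean`, `sgd_step`) are illustrative and are not the library's actual API.

```python
# Toy sketch of Horovod-style data-parallel SGD: local gradients,
# allreduce-average, identical update on every worker.
# Pure Python, no TensorFlow or rabit involved.

def allreduce_mean(grads_per_worker):
    """Elementwise mean across workers (an allreduce followed by a divide)."""
    n = len(grads_per_worker)
    return [sum(vals) / n for vals in zip(*grads_per_worker)]

def sgd_step(weights, grad, lr=0.1):
    """One plain SGD update with the (already averaged) gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]

# Two workers hold identical weights but see different mini-batches,
# so their local gradients differ.
weights = [1.0, 2.0]
local_grads = [[0.2, 0.4],   # worker 0
               [0.6, 0.0]]   # worker 1

avg = allreduce_mean(local_grads)      # ~[0.4, 0.2], same on every worker
new_weights = sgd_step(weights, avg)   # ~[0.96, 1.98]
print(new_weights)
```

Because every worker sees the same averaged gradient, all replicas stay in lockstep without a parameter server.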

Prerequisites

tf-collective-all-reduce requires Python 3.6 or later.

Installation

git clone https://github.com/criteo/tf-collective-all-reduce
python3.6 -m venv tf_env
. tf_env/bin/activate
pip install tensorflow==1.12.2
pushd tf-collective-all-reduce
  ./install.sh
  pip install -e .
popd

Run tests

pip install -r tests-requirements.txt
pytest -s

Local run with dmlc-submit

../dmlc-core/tracker/dmlc-submit --cluster local --num-workers 2 python examples/simple/simple_allreduce.py
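For reference, with `--num-workers 2` the allreduce collective leaves every worker holding the same elementwise reduction (here, a sum) of all workers' tensors. A single-process toy illustration of that semantics (not using rabit or the example script):

```python
# Single-process illustration of an allreduce (sum) collective:
# each worker contributes one tensor, and every worker receives
# the same elementwise sum of all contributions.
def allreduce_sum(tensors):
    return [sum(vals) for vals in zip(*tensors)]

worker_tensors = [[1, 2, 3],     # worker 0's tensor
                  [10, 20, 30]]  # worker 1's tensor
result = allreduce_sum(worker_tensors)
print(result)  # [11, 22, 33], identical on every worker
```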

Run on a Hadoop cluster with tf-yarn

Run collective_all_reduce_example

cd examples/tf-yarn
python collective_all_reduce_example.py