Run the central Rabit process on a worker #41

mrocklin · 2019-05-17T01:07:13Z

Currently we run Rabit's central (tracker) process on the scheduler and Rabit's worker processes alongside the Dask workers. This has caused issues in two cases:

Sometimes the scheduler has a more stripped down environment and doesn't have all of the libraries that the workers do.
Sometimes the scheduler's networking position is somewhat different from the workers (see "Cannot assign requested address" #23 and "Use Rabit tracker get_host_ip('auto') to pick best tracker IP address" #40).

We might consider instead running the tracker on a worker. This would also keep the scheduler more isolated. This is awkward if there is data on the worker where we want to run the tracker, but if we're comfortable moving data (as is the case in @RAMitchell's rewrite) then maybe this doesn't matter.

@RAMitchell, I thought I'd bring this up now rather than later in case it affects things.

javabrett · 2019-05-17T01:42:09Z

How would the worker be chosen? Just workers[0]?
Are we currently fault-tolerant in any way should a single worker die? If so, is the likelihood of worker death high enough that it would occur more frequently than scheduler death, given that the scheduler is presumably running less code and load?
Are there any time-sensitive Rabit tracker tasks that would cause problems if the worker hosting the tracker were under load or resource pressure?

RAMitchell · 2019-05-19T20:22:57Z

So for my xgboost integration (dmlc/xgboost#4473) I will try the approach of running the tracker on worker zero and assume the performance load of the tracker is negligible.
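
For illustration, here is a minimal sketch of what "run the tracker on worker zero" could look like from the client side. It is not the code from dmlc/xgboost#4473. The helper names `pick_tracker_worker` and `run_tracker_on_worker` are hypothetical, as is the `start_tracker` callable, which stands in for whatever function actually launches the Rabit tracker. The sketch assumes a `dask.distributed` client and relies only on `Client.scheduler_info()` and the `workers=` argument of `Client.submit`.

```python
def pick_tracker_worker(worker_addresses):
    """Pick one worker deterministically: the first address in sorted order.

    Hypothetical helper. "Worker zero" is interpreted here as the
    lexicographically smallest address, which is an assumption.
    """
    addresses = sorted(worker_addresses)
    if not addresses:
        raise ValueError("no workers available to host the tracker")
    return addresses[0]


def run_tracker_on_worker(client, start_tracker, n_workers=None):
    """Submit `start_tracker` to a single worker instead of the scheduler.

    `client` is expected to behave like a dask.distributed Client.
    `start_tracker` is a placeholder for the function that starts the
    Rabit tracker; it receives the number of Rabit workers.
    Returns the chosen worker address and the resulting future.
    """
    workers = client.scheduler_info()["workers"]  # maps address -> metadata
    address = pick_tracker_worker(workers)
    if n_workers is None:
        n_workers = len(workers)
    # workers=[address] restricts the task to that worker; pure=False
    # prevents Dask from reusing a cached result of an earlier call.
    future = client.submit(start_tracker, n_workers, workers=[address], pure=False)
    return address, future
```

With a real cluster this would be called as `address, future = run_tracker_on_worker(client, start_tracker)`. Because the tracker then runs inside a worker process, it sees the same libraries and network position as the other workers, which is the motivation given at the top of this issue. The sketch does nothing about the fault-tolerance and load questions raised above.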
