This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

Run the central Rabit process on a worker #41

Open
mrocklin opened this issue May 17, 2019 · 2 comments

Comments

@mrocklin
Member

Currently we run Rabit's central process on the scheduler and the worker processes with the dask workers. This has caused issues in two cases:

  1. Sometimes the scheduler has a more stripped-down environment and doesn't have all of the libraries that the workers do.
  2. Sometimes the scheduler's networking position is somewhat different from the workers' (see "Cannot assign requested address" #23 and "Use Rabit tracker get_host_ip('auto') to pick best tracker IP address" #40).

We might consider running the tracker on a worker instead. This would also keep the scheduler more isolated. This is awkward if there is data on the worker where we want to run the tracker, but if we're comfortable moving data (as is the case in @RAMitchell's rewrite) then maybe this doesn't matter.

@RAMitchell, thought I'd bring this up now rather than later in case it affects things.

@javabrett
Contributor

javabrett commented May 17, 2019

  • How would the worker be chosen? Just workers[0]?
  • Are we currently fault-tolerant in any way should a single worker die? And if so, is worker death likely enough that it would occur more frequently than death of the scheduler, which is presumably running less code and load?
  • Are there any time-sensitive Rabit tracker tasks which would cause problems if the tracker's worker was under load or resource pressure?

@RAMitchell

So for my xgboost integration (dmlc/xgboost#4473) I will try the approach of running the tracker on worker zero and assume the performance load of the tracker is negligible.
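
The "worker zero" approach could be sketched roughly as below. This is only an illustration under assumptions, not the code from dmlc/xgboost#4473: `choose_tracker_worker`, `start_tracker_on_worker`, and the `start_tracker` callable are hypothetical names, and "worker zero" is taken to mean the first worker address in sorted order. The `client` is assumed to be a `dask.distributed.Client`, whose `scheduler_info()` and `submit(..., workers=[...])` are used to pin the task to one worker.

```python
from typing import Callable, Iterable, Tuple


def choose_tracker_worker(worker_addresses: Iterable[str]) -> str:
    """Pick one worker address deterministically (hypothetical helper).

    Sorting makes the choice stable across calls, so "worker zero"
    is the same worker every time for a given set of workers.
    """
    ordered = sorted(worker_addresses)
    if not ordered:
        raise ValueError("no workers available to host the tracker")
    return ordered[0]


def start_tracker_on_worker(client, start_tracker: Callable, n_workers: int) -> Tuple[str, object]:
    """Submit start_tracker(n_workers) pinned to the chosen worker.

    `client` is assumed to behave like dask.distributed.Client;
    `start_tracker` is a placeholder for whatever launches the Rabit tracker.
    """
    addresses = client.scheduler_info()["workers"].keys()
    tracker_addr = choose_tracker_worker(addresses)
    # workers=[...] restricts where the task may run; pure=False avoids
    # reusing a cached result, since starting a tracker has side effects.
    future = client.submit(start_tracker, n_workers, workers=[tracker_addr], pure=False)
    return tracker_addr, future
```

This sketch does not address the fault-tolerance or load questions raised above: if the chosen worker dies, the tracker dies with it.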
