Currently we run Rabit's central process on the scheduler and the worker processes with the dask workers. This has caused issues in two cases.

We might consider instead running the tracker on a worker. This would also keep the scheduler more isolated. This is awkward if there is data on the worker where we want to run the tracker, but if we're comfortable moving data (as is the case in @RAMitchell's rewrite) then maybe this doesn't matter.
@RAMitchell, thought I'd bring this up now rather than later in case it affects things.
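For concreteness, here is a minimal sketch of the two placements using the distributed client. `start_tracker` is a hypothetical helper standing in for whatever wraps xgboost's RabitTracker, and the scheduler address and worker count are made up:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # assumed address

def start_tracker(n_workers):
    # Hypothetical helper: start a Rabit tracker bound to the local
    # machine, wait for it to listen, and return its DMLC_* environment
    # (tracker URI, port, worker count). The real RabitTracker
    # constructor signature differs between xgboost versions.
    ...

# Current placement: the tracker runs inside the scheduler process.
env = client.run_on_scheduler(start_tracker, 4)

# Proposed placement: pin the tracker task to a single worker instead,
# keeping Rabit traffic and tracker work off the scheduler.
tracker_worker = sorted(client.scheduler_info()["workers"])[0]
env = client.submit(start_tracker, 4,
                    workers=[tracker_worker], pure=False).result()
```

The `workers=` restriction is what pins the task to that worker, and `pure=False` stops Dask from treating the call as a cacheable pure function.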
Are we currently fault-tolerant in any way should a single worker die? If so, is a worker death likely enough that it would happen more often than a scheduler failure, given that the scheduler is presumably running less code and carrying less load?
Are there any time-sensitive Rabit tracker tasks that would cause problems if the worker hosting the tracker were under load or resource pressure?
So for my xgboost integration (dmlc/xgboost#4473) I will try the approach of running the tracker on worker zero and assume that the tracker's performance overhead is negligible.
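A sketch of that worker-zero approach end to end. Everything here is illustrative: `start_tracker` and `train_part` are hypothetical helpers, the `xgboost.rabit` calls reflect the pre-2.0 Python API that dask-xgboost targets, and data handling and partition alignment are omitted:

```python
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # assumed address
workers = sorted(client.scheduler_info()["workers"])
worker_zero = workers[0]

def start_tracker(n_workers):
    # Hypothetical: start a RabitTracker on this machine and return its
    # DMLC_* environment once it is listening.
    ...

def train_part(env, params, num_rounds):
    # Runs on every worker: join the Rabit ring using the tracker env,
    # then train on whatever partition lives locally (omitted here).
    xgb.rabit.init([f"{k}={v}".encode() for k, v in env.items()])
    try:
        ...  # build a DMatrix from the local partition, call xgb.train
    finally:
        xgb.rabit.finalize()

# 1. Tracker pinned to worker zero rather than the scheduler.
env = client.submit(start_tracker, len(workers),
                    workers=[worker_zero], pure=False).result()

# 2. One training task per worker. Worker zero hosts both the tracker
#    and its own Rabit worker, which is where the "tracker load is
#    negligible" assumption matters.
futures = [client.submit(train_part, env, {"max_depth": 3}, 10,
                         workers=[w], pure=False) for w in workers]
results = client.gather(futures)
```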