I was wondering what the appropriate way is to launch multi-worker distributed training jobs with xmanager. Based on my current understanding, a `Job` must be created for each worker pool, and all of these `Job`s must be combined into a single `JobGroup`, which is then added to the experiment. There also seems to be an option to add `Constraint`s to the `JobGroup`; however, I cannot find what forms these constraints can take besides the provided example of `xm_impl.SameMachine()`.

Furthermore, my current attempt at launching a multi-worker distributed training job raises the following error when creating the distributed strategy with `strategy = tf.distribute.MultiWorkerMirroredStrategy()`: `RuntimeError: Collective ops must be configured at program startup`. Both the `CLUSTER_SPEC` and `TF_CONFIG` environment variables appear to be set correctly, and the distributed strategy is created at the very beginning of the main function, so I was curious whether this error might be due to not setting appropriate `Constraint`s on the `JobGroup`.
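For context, here is a stripped-down sketch of how I am building the `JobGroup`. The container path, module name, executor (`xm_local.Vertex` here), and worker count are all placeholders and may differ between xmanager versions:

```python
# A minimal sketch: one Job per worker, combined into a single JobGroup.
# Path, entrypoint, executor, and worker count are placeholders.
from xmanager import xm
from xmanager import xm_local


def launch() -> None:
  with xm_local.create_experiment(experiment_title='mwms-example') as experiment:
    [executable] = experiment.package([
        xm.Packageable(
            executable_spec=xm.PythonContainer(
                path='.',
                entrypoint=xm.ModuleName('train'),  # placeholder training module
            ),
            executor_spec=xm_local.Vertex.Spec(),
        ),
    ])

    num_workers = 2  # placeholder
    jobs = {
        f'worker_{i}': xm.Job(
            executable=executable,
            executor=xm_local.Vertex(),
            args={'worker_index': i},
        )
        for i in range(num_workers)
    }
    # All worker Jobs grouped so they are scheduled together. My understanding
    # is that constraints would also be passed here, e.g.
    # xm.JobGroup(constraints=[...], **jobs).
    experiment.add(xm.JobGroup(**jobs))


if __name__ == '__main__':
  launch()
```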
Thanks for pointing that out! I modified my code to align with the provided example, the main change being the use of the async/await syntax from Python's asyncio library. However, I am still getting `RuntimeError: Collective ops must be configured at program startup` when calling `strategy = tf.distribute.MultiWorkerMirroredStrategy()` at the very beginning of the main function of the training file. I am unsure what might be causing this error, given that I am creating the strategy before calling any other TensorFlow API, as per https://github.com/tensorflow/tensorflow/blob/3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/python/distribute/collective_all_reduce_strategy.py#L155.
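For completeness, here is a stripped-down version of what I expect each worker's entry point to look like. The host addresses and task index are placeholders (in the real setup the launcher sets `TF_CONFIG`), and in my actual code the strategy is created at the top of `main` rather than at module level:

```python
# Minimal per-worker sketch; cluster addresses and task index are placeholders.
import json
import os

import tensorflow as tf

os.environ.setdefault('TF_CONFIG', json.dumps({
    'cluster': {
        'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
}))

# Constructed before any other TensorFlow API call, since collective ops must
# be configured before the runtime is initialized (no tensors, tf.data
# pipelines, or checkpoint reads before this line).
strategy = tf.distribute.MultiWorkerMirroredStrategy()


def main() -> None:
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
    model.compile(optimizer='sgd', loss='mse')
  # ... build the tf.data input pipeline and call model.fit(...) here ...


if __name__ == '__main__':
  main()
```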