Support for different hardware configurations for different task roles of one distributed job. #5808

siaimes · 2022-10-12T08:23:08Z

What would you like to be added:
Support for different hardware configurations for different task roles of one distributed job.

Why is this needed:
For complex learning tasks, the programs that run on each machine differ considerably, and so do their requirements for CPU/GPU and RAM/GPU memory. At the same time, these machines need to communicate with each other for joint training. For example, a reinforcement learning algorithm consists of several modules: the actor uses the GPU to generate data, the learner uses the GPU to train on that data, and the environment and MCTS use the CPU to generate data in parallel. These modules exchange data in complex ways.

Without this feature, how does the current module work:
Reinforcement learning jobs whose task roles need different hardware cannot be run jointly across multiple computers.

Components that may involve changes:
Job protocol and related.

Move the virtual cluster (VC) setting down from the job level to the task role level:

Allow each task role to have a different SKU type:
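
As a rough illustration of the two changes above, here is a minimal Python sketch of a job in which every task role carries its own virtual cluster and SKU type. It is not the actual job protocol schema: the `TaskRoleSpec` class, its field names, and the VC and SKU names are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRoleSpec:
    """Hypothetical per-task-role hardware request (not the real job protocol)."""
    instances: int
    virtual_cluster: str  # assumed: VC selected per task role instead of per job
    sku_type: str         # assumed: SKU type selected per task role
    sku_num: int          # SKUs requested by each instance


# Hypothetical reinforcement learning job mirroring the roles described above.
job = {
    "learner": TaskRoleSpec(instances=1, virtual_cluster="gpu-vc",
                            sku_type="gpu-large", sku_num=4),
    "actor": TaskRoleSpec(instances=2, virtual_cluster="gpu-vc",
                          sku_type="gpu-small", sku_num=1),
    "environment": TaskRoleSpec(instances=8, virtual_cluster="cpu-vc",
                                sku_type="cpu-only", sku_num=16),
}


def sku_types_used(roles):
    """Return the set of distinct SKU types requested across all task roles."""
    return {spec.sku_type for spec in roles.values()}


def total_skus_by_type(roles):
    """Sum instances * sku_num for each SKU type across all task roles."""
    totals = {}
    for spec in roles.values():
        requested = spec.instances * spec.sku_num
        totals[spec.sku_type] = totals.get(spec.sku_type, 0) + requested
    return totals
```

With a single job-level VC and a single SKU type, `sku_types_used(job)` could only ever contain one entry; the request is for a scheduler that accepts a job where it contains several.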
