WIP: Support Tensorflow distributed training in kfp workflow #3000

Open: wants to merge 4 commits into base: main

Conversation

typhoonzero
Contributor

What changes were proposed in this pull request?

Support running distributed TensorFlow training in KFP workflows.

Only TensorFlow/Keras distributed training with MultiWorkerMirroredStrategy is supported. PyTorch support will be added in other PRs. Distributed training using "parameter servers" is not currently supported.

TODO:

  • Add dependency steps before or after the distributed training step.
  • PyTorch support

  1. Set the number of workers for the training step:

Screenshot 2022-11-07 11 05 24

  2. Run the workflow:

Screenshot 2022-11-07 11 06 18

How was this pull request tested?

TBD.

@elyra-bot

elyra-bot bot commented Nov 7, 2022

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@akchinSTC akchinSTC added status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. component:pipeline-editor pipeline editor platform: pipeline-Kubeflow Related to usage of Kubeflow Pipelines as pipeline runtime labels Nov 7, 2022
@akchinSTC
Member

Thanks @typhoonzero for another contribution!
A few thoughts:

  1. After building the PR, I can't seem to find the new fields in the UI (could just be my env, but I did a purge and fresh build).
    image
  2. It looks like Argo-specific metadata labels are hardcoded; we will need to support KFP Tekton as well. Thanks @ptitzler
  3. Lastly, these changes are very specific to TF. Having them displayed dynamically will probably require us to describe in the metadata what libraries the images contain, which opens up a big can of worms. Maybe have an ENV var as a flag instead to trigger displaying these extra options...shrug...will definitely require more discussion.

@typhoonzero
Contributor Author

@akchinSTC Thanks for the information. Actually, I'm looking for a more generic implementation, like a workflow ParallelFor function, that covers not only distributed data-parallel training but also parallel data processing.

By setting a node as a ParallelFor node (parallel count > 1), Elyra should pass the following environment variables to the user program:

  • rank
  • nranks
  • runtime pod IP address for each rank

Then the user program can set either TF_CONFIG for TensorFlow distributed training or MASTER_ADDR for PyTorch distributed training. In this case, the changes would be:

  1. A parallel-count property for each node.
  2. Connect input and output to dependency nodes.
  3. bootstrapper.py will set those environment variables at runtime.
  4. Some examples for TensorFlow, PyTorch and general data processing.
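
As a rough illustration of the idea above, here is a minimal sketch of how a user program might turn rank, nranks and the per-rank pod IPs into TF_CONFIG for TensorFlow or MASTER_ADDR and related variables for PyTorch. The input variable names (`ELYRA_RANK`, `ELYRA_NRANKS`, `ELYRA_POD_IPS`), the helper `build_distributed_env` and the default port are hypothetical and not part of this PR; only the TF_CONFIG layout and the PyTorch `MASTER_ADDR`/`MASTER_PORT`/`RANK`/`WORLD_SIZE` names come from those frameworks.

```python
import json
import os


def build_distributed_env(rank, nranks, pod_ips, port=29500):
    """Derive framework settings from rank, nranks and per-rank pod IPs.

    Hypothetical helper: the port is an arbitrary choice for this sketch.
    """
    if len(pod_ips) != nranks:
        raise ValueError("expected one pod IP per rank")
    if not 0 <= rank < nranks:
        raise ValueError("rank must be in [0, nranks)")

    # TensorFlow MultiWorkerMirroredStrategy reads the cluster layout
    # and this task's index from the TF_CONFIG JSON string.
    tf_config = {
        "cluster": {"worker": [f"{ip}:{port}" for ip in pod_ips]},
        "task": {"type": "worker", "index": rank},
    }
    return {
        "TF_CONFIG": json.dumps(tf_config),
        # PyTorch's env:// initialization uses rank 0 as the rendezvous host.
        "MASTER_ADDR": pod_ips[0],
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(nranks),
    }


if __name__ == "__main__":
    # Assumed input names; whatever bootstrapper.py ends up exporting may differ.
    rank = int(os.environ.get("ELYRA_RANK", "0"))
    nranks = int(os.environ.get("ELYRA_NRANKS", "1"))
    pod_ips = os.environ.get("ELYRA_POD_IPS", "127.0.0.1").split(",")
    os.environ.update(build_distributed_env(rank, nranks, pod_ips))
```

With this shape, the training script would only need to call the helper before creating the strategy or process group, and a general data processing node could ignore the framework variables and shard its input by rank and nranks directly.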

Labels
component:pipeline-editor pipeline editor platform: pipeline-Kubeflow Related to usage of Kubeflow Pipelines as pipeline runtime status:Needs Discussion status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise.

3 participants