Multiple nodes training #114

Open: skpig wants to merge 7 commits into base: develop

skpig
Collaborator

skpig commented Dec 9, 2021

  1. Add Multi_Node_Training in image_classification.
  2. Write a tutorial about multi-node training in README.md.
  3. Use the ViT model as an example.

@skpig added the enhancement (New feature or request) label Dec 9, 2021
@xperzy
Collaborator

xperzy commented Dec 10, 2021

BTW, in main_multi_gpu.py, please also check the logging/model-saving scheme, such as

The current code does not handle the case where world_size > 1.

@skpig
Collaborator Author

skpig commented Dec 10, 2021

I just checked the details of the spawn() function. For example, suppose we have 2 hosts with 2 processes running on each host. Then
local_rank = dist.get_rank() will return 0, 1, 2, 3 respectively, so only one process across both hosts sees rank 0. I guess the original code works fine?
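
To make the example above concrete, here is a minimal sketch in plain Python (no Paddle required). It assumes, as described in the comment, that dist.get_rank() returns a global rank, and it further assumes the conventional layout rank = node_rank * procs_per_node + local_index. The helper names global_rank and should_save are hypothetical and are not taken from main_multi_gpu.py.

```python
# Sketch only: models how global ranks would be laid out across hosts.
# Assumption: rank = node_rank * procs_per_node + local_index.
# global_rank() and should_save() are hypothetical helpers for illustration.

def global_rank(node_rank: int, procs_per_node: int, local_index: int) -> int:
    """Return the assumed global rank of a process."""
    return node_rank * procs_per_node + local_index


def should_save(rank: int) -> bool:
    """Only the process with global rank 0 writes logs and checkpoints."""
    return rank == 0


num_hosts = 2
procs_per_node = 2

# Enumerate every process on every host.
ranks = [
    global_rank(node, procs_per_node, proc)
    for node in range(num_hosts)
    for proc in range(procs_per_node)
]
savers = [r for r in ranks if should_save(r)]

print(ranks)   # [0, 1, 2, 3]
print(savers)  # [0]
```

Under these assumptions, a check such as local_rank == 0 selects exactly one process in the whole job, so logs and checkpoints would be written once rather than once per host. If the rank were instead local to each host, the same check would select one process per host.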

@xperzy xperzy self-requested a review December 13, 2021 03:23