How to add nodes during training? #26

Open
laochonlam opened this issue Oct 9, 2024 · 3 comments

Comments

@laochonlam

Hi @insujang,

Thanks for open-sourcing Oobleck, great work!

From the paper, it seems the experiments show that Oobleck supports both adding and removing nodes during training.

I successfully ran Oobleck with node failures (removing nodes), but I couldn't find a way to add nodes dynamically during training. Could you let me know how to do that?

Thank you!
Lam

@insujang
Member

insujang commented Oct 9, 2024

Hi @laochonlam,

All experiments in the paper were done with a Bamboo simulator: throughput and reconfiguration overhead were measured for every configuration, and the measurements were then combined. The current code does not include an implementation for adding nodes. This is future work; I think simply running reconfiguration would be enough, but I need to try it.
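
To make the "measure each configuration, then combine" idea concrete, here is a rough sketch of how per-configuration measurements might be combined over a node-availability trace. It is only an illustration under assumptions: `Event`, `simulate`, and all the numbers are hypothetical and are not taken from Oobleck or the Bamboo simulator.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical trace entry: the node count changes at time_s."""
    time_s: float
    num_nodes: int


def simulate(events, end_time_s, throughput, reconfig_overhead_s):
    """Combine per-configuration measurements over a trace.

    events: list of Event sorted by time; the first one is the initial state.
    throughput: samples/s measured for each node count (assumed input).
    reconfig_overhead_s: seconds lost when switching to each node count
        (assumed input).
    Returns the total number of samples processed until end_time_s.
    """
    total = 0.0
    for i, ev in enumerate(events):
        next_t = events[i + 1].time_s if i + 1 < len(events) else end_time_s
        duration = next_t - ev.time_s
        # The initial configuration pays no reconfiguration cost.
        overhead = 0.0 if i == 0 else reconfig_overhead_s[ev.num_nodes]
        # Training makes no progress while reconfiguring.
        useful = max(0.0, duration - overhead)
        total += useful * throughput[ev.num_nodes]
    return total


# Made-up measurements: one node is removed at t=100 s and added back at t=200 s.
trace = [Event(0.0, 4), Event(100.0, 3), Event(200.0, 4)]
measured_throughput = {3: 6.0, 4: 8.0}
measured_overhead = {3: 10.0, 4: 20.0}
print(simulate(trace, 300.0, measured_throughput, measured_overhead))  # 1980.0
```

In this sketch a node addition is handled exactly like a removal: it is just another trace event that triggers a reconfiguration with its own measured overhead.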

@laochonlam
Author

Got it, I'll give that a try. Thank you for your prompt response!

Lam

@insujang
Member

insujang commented Oct 9, 2024

Let me leave it open so that I can work on it later :) You are also welcome to make a PR that adds node addition.

@insujang insujang reopened this Oct 9, 2024