Question: Multi-node training #11
Hi @shawntan, great work on Scatter MoE. As newer models are scaling up in the number of parameters they use, I wanted to ask a question about what you put in the README: "does not include any additional multi-node training infrastructure code."

Comments
What I meant by that was that, unlike Megatron or MegaBlocks, we did not include any additional Expert Parallelism and related infrastructure code in this repo: it's a simple implementation of MoE. So the intention was for it to be used with FSDP, which is how I have been using it myself, and it should work with other parallelisation frameworks. We do intend to eventually add Tensor Parallelism, but I'm kinda tied up at the moment. One thing @yikangshen found was that, at least in the use cases we are looking at, expert parallelism wasn't very effective due to the different tensor sizes that needed to be communicated, so expert parallelism isn't on our roadmap. As for the state of scattermoe as it is now, it seems to work best if your SMoE layer fits on your GPU, but it's mainly the two of us working on this, so it'd be great to hear about other people's experiences as well.
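For concreteness, here is a minimal sketch of that usage pattern, assuming the SMoE layer is an ordinary `nn.Module` inside a transformer-style block and the whole model is wrapped with PyTorch FSDP. `MoEBlock` and the stand-in `nn.Linear` are hypothetical placeholders, not part of the scattermoe API; a ScatterMoE MLP would go where the stand-in sits.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


class MoEBlock(nn.Module):
    """Hypothetical transformer-style block; not part of the scattermoe API."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in for the SMoE layer; a ScatterMoE MLP would replace this.
        self.moe = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.moe(self.norm(x))


def main() -> None:
    # Assumes launch via torchrun (e.g. `torchrun --nproc_per_node=8 train.py`),
    # which sets RANK / WORLD_SIZE / LOCAL_RANK for init_process_group.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = nn.Sequential(*[MoEBlock(1024) for _ in range(4)])

    # Wrap each block as its own FSDP unit: parameters are sharded across ranks
    # and gathered only for that block's forward/backward pass, so each SMoE
    # layer only needs to fit on one GPU while it is in use.
    fsdp_model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({MoEBlock}),
        device_id=torch.cuda.current_device(),
    )

    x = torch.randn(2, 16, 1024, device="cuda")
    fsdp_model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Sharding each block as its own FSDP unit keeps per-GPU memory to one gathered block at a time, which matches the observation above that things work best when the SMoE layer fits on a single GPU.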
That makes sense. You would need a training framework around the actual model, into which you would plug Scatter MoE. I think it would be cool to see Scatter MoE implemented in something like pytorch/torchtune or other frameworks that do the actual training.
I've submitted a pull request to huggingface/nanotron at their suggestion, but I've heard nothing back since.