
Question: Multi-node training #11

Open
casper-hansen opened this issue Apr 13, 2024 · 3 comments

@casper-hansen

Hi @shawntan, great work on Scatter MoE. As newer models keep scaling up in parameter count, I wanted to ask about a line in the README: "does not include any additional multi-node training infrastructure code."

  • Other than using an existing tool, e.g. torch FSDP or DeepSpeed ZeRO-3, are there any further considerations you would make to ensure optimal performance of your kernels?
@shawntan
Owner

What I meant by that is that, unlike Megatron or Megablocks, we did not include any additional expert parallelism or related infrastructure code in this repo: it's a simple implementation of MoE. The intention was for it to be used with FSDP, which is how I have been using it myself, and it should work with other parallelisation frameworks as well.
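
Roughly, something like this is all that's needed on the FSDP side (just a sketch, not copied from my actual setup; `MoETransformerBlock` and `build_model` are placeholders for your own model code, not anything from this repo):

```python
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Hypothetical imports: replace with your own model code. MoETransformerBlock
# is assumed to be the transformer block that contains the scattermoe MLP.
from my_model import MoETransformerBlock, build_model

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = build_model()

# Wrap at the block level so each SMoE layer's parameters are gathered
# as a single unit during forward/backward.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MoETransformerBlock},
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
```

As long as the wrap policy shards each block containing an SMoE layer as one unit, nothing scattermoe-specific should need configuring.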

We do intend to eventually add Tensor Parallelism, but I'm kinda tied up at the moment.

One thing @yikangshen found was that, at least in the use cases we are looking at, expert parallelism wasn't very effective due to the different tensor sizes that needed to be communicated, so it isn't on our roadmap.

As for the state of scattermoe as it is now, it seems to work best if your SMoE layer fits on your GPU, but it's mainly the 2 of us working on this, so it'll be great to know about other people's experiences as well.

@casper-hansen
Author

That makes sense. You would need a training framework around the actual model, which you would plug Scatter MoE into. I think it would be cool to see Scatter MoE implemented in something like pytorch/torchtune or other frameworks that do the actual training.

@shawntan
Owner

I've submitted a pull request to huggingface/nanotron at their suggestion, but I haven't heard anything back since.
