Question: Multi-node training #11
Hi @shawntan, great work on Scatter MoE. As newer models are scaling up in the number of parameters they use, I wanted to ask a question about what you put in the README: "does not include any additional multi-node training infrastructure code."

Comments
What I meant by that was that, unlike Megatron or MegaBlocks, we did not include any additional Expert Parallelism and related infrastructure code in this repo: it's a simple implementation of MoE. So the intention was for it to be used with FSDP, which is how I have been using it myself, and it should work with other parallelisation frameworks. We do intend to eventually add Tensor Parallelism, but I'm kinda tied up at the moment. One thing @yikangshen found was that, at least in the use cases we are looking at, expert parallelism wasn't very effective due to the different tensor sizes that needed to be communicated, so expert parallelism isn't on our roadmap. As for the state of scattermoe as it is now, it seems to work best if your SMoE layer fits on your GPU, but it's mainly the two of us working on this, so it'd be great to hear about other people's experiences as well.
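For concreteness, here is a minimal sketch of that usage pattern, assuming the SMoE layer is an ordinary `nn.Module` inside a transformer-style block and the whole model is wrapped with PyTorch FSDP. `MoEBlock` and the stand-in `nn.Linear` are hypothetical placeholders, not part of the scattermoe API; a ScatterMoE MLP would go where the stand-in sits.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


class MoEBlock(nn.Module):
    """Hypothetical transformer-style block; not part of the scattermoe API."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in for the SMoE layer; a ScatterMoE MLP would replace this.
        self.moe = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.moe(self.norm(x))


def main() -> None:
    # Assumes launch via torchrun (e.g. `torchrun --nproc_per_node=8 train.py`),
    # which sets RANK / WORLD_SIZE / LOCAL_RANK for init_process_group.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = nn.Sequential(*[MoEBlock(1024) for _ in range(4)])

    # Wrap each block as its own FSDP unit: parameters are sharded across ranks
    # and gathered only for that block's forward/backward pass, so each SMoE
    # layer only needs to fit on one GPU while it is in use.
    fsdp_model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({MoEBlock}),
        device_id=torch.cuda.current_device(),
    )

    x = torch.randn(2, 16, 1024, device="cuda")
    fsdp_model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Sharding each block as its own FSDP unit keeps per-GPU memory to one gathered block at a time, which matches the observation above that things work best when the SMoE layer fits on a single GPU.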
That makes sense. You would need a training framework around the actual model, into which you would plug Scatter MoE. I think it would be cool to see Scatter MoE implemented in something like pytorch/torchtune or other frameworks that do the actual training.
I've submitted a pull request to huggingface/nanotron at their suggestion, but I've heard nothing back since.