Feature/load balance add expert replacement feature for MoE model(mixtral) #187
This PR adds a load-balance interval for expert replacement in Mixture of Experts (MoE) models (e.g. Mixtral) to Megatron-LM. At user-specified intervals, experts are redistributed across GPUs so that each GPU processes a similar number of tokens, keeping the computational load balanced across cards.
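The core idea can be illustrated with a short sketch (illustrative only; the function and variable names below are hypothetical and not taken from this PR): given the number of tokens routed to each expert since the last rebalance, experts are greedily reassigned to expert-parallel ranks so that per-rank token counts end up roughly even.

```python
# Illustrative sketch only: greedy reassignment of experts to expert-parallel
# ranks based on recent per-expert token counts. Names are hypothetical and
# do not correspond to the actual implementation in this PR.
from typing import Dict, List


def balance_experts(tokens_per_expert: List[int], num_ranks: int) -> Dict[int, List[int]]:
    """Assign experts to ranks so the per-rank token load is roughly even.

    tokens_per_expert[i] is the number of tokens routed to expert i since the
    last rebalance; num_ranks is the expert-parallel world size.
    """
    placement = {rank: [] for rank in range(num_ranks)}
    load = [0] * num_ranks
    # Place the heaviest experts first, each on the currently lightest rank.
    for expert_id in sorted(range(len(tokens_per_expert)),
                            key=lambda e: tokens_per_expert[e], reverse=True):
        rank = min(range(num_ranks), key=lambda r: load[r])
        placement[rank].append(expert_id)
        load[rank] += tokens_per_expert[expert_id]
    return placement


# Example: 8 experts spread over 2 expert-parallel ranks.
print(balance_experts([900, 120, 300, 80, 640, 200, 50, 410], num_ranks=2))
```

In practice, changing the placement also implies moving the corresponding expert parameters between ranks, which is one reason a periodic interval (rather than per-step rebalancing) makes sense.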
Implementation Details
The load-balance interval for expert replacement is controlled by a new command-line argument, --load-balance-interval, which specifies the number of training steps between redistributions. At each rebalance point the system automatically adjusts the placement of experts to keep the workload evenly distributed, improving the overall efficiency of MoE model training.
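As a rough sketch of how such an interval could be wired in (hypothetical; the actual argument registration and hook names in this PR may differ), the flag would be added to the argument parser and checked against the iteration counter in the training loop:

```python
# Hypothetical sketch of wiring a --load-balance-interval flag into a training
# loop; add_load_balance_args(), maybe_rebalance(), and rebalance_experts()
# are assumed names, not the PR's actual code.
import argparse


def add_load_balance_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    group = parser.add_argument_group("moe load balancing")
    group.add_argument("--load-balance-interval", type=int, default=None,
                       help="Redistribute experts across GPUs every N training steps.")
    return parser


def maybe_rebalance(args, iteration, rebalance_experts):
    """Trigger expert redistribution every args.load_balance_interval steps."""
    if args.load_balance_interval is None:
        return
    if iteration > 0 and iteration % args.load_balance_interval == 0:
        # rebalance_experts() would swap expert parameters (and any optimizer
        # state) between expert-parallel ranks according to the new placement.
        rebalance_experts()
```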
Benefits
Evaluated with parallel strategy tp4pp2ep2 (tensor parallel 4, pipeline parallel 2, expert parallel 2) on 16 GPUs, training from scratch and without the auxiliary load-balancing loss.
How to Use
To enable expert replacement at a load-balance interval, pass the --load-balance-interval N argument to the training script, where N is the number of training steps between expert redistributions.