[moe] feat: enabling expert parallelism in veScale #59
Overview
veScale provides an efficient framework for training Mixture-of-Experts (MoE) models using expert parallelism. Expert parallelism can be deployed with the `parallelize_experts()` function, which simplifies the process of distributing and managing the workload during MoE training.

Function Signature
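The signature code block itself is not reproduced in this excerpt; based on the parameters documented below, it is expected to look roughly like this sketch (the type hints and return type are assumptions):

```python
from typing import Any, Dict, List, Union

import torch.nn as nn

# Sketch inferred from the Parameters list below; the actual signature,
# defaults, and return type in veScale may differ.
def parallelize_experts(
    module: nn.Module,                      # training model to be parallelized
    experts_expr: Union[str, List[str]],    # path(s) to the expert modules
    experts_allocator: "ExpertsAllocator",  # expert parameter allocation policy
    token_dispatcher: "TokenDispatcher",    # token scheduling and distribution
    config: Dict[str, Any],                 # layer count, number of experts, etc.
) -> nn.Module: ...
```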
Parameters
- `module`: The training model (an instance of `nn.Module`) to be parallelized.
- `experts_expr`: Specifies the paths to the expert modules. Can be a string or a list of strings.
- `experts_allocator`: An instance of `ExpertsAllocator`, used for managing expert parameter allocation.
- `token_dispatcher`: An instance of `TokenDispatcher`, responsible for token scheduling and distribution.
- `config`: A dictionary containing the MoE training configuration, including layer count, number of experts, and other relevant settings.
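Putting the parameters together, a hypothetical call could look like the following; the import path, the expert-module path pattern, and the config keys are illustrative assumptions, not values taken from this PR:

```python
# Hypothetical usage sketch. The import path, the experts_expr pattern, and
# the config keys are assumptions for illustration only.
from vescale.moe import ExpertsAllocator, TokenDispatcher, parallelize_experts

model = build_model()  # placeholder for your MoE model (an nn.Module)

model = parallelize_experts(
    module=model,
    experts_expr="decoder.blocks.*.moe.experts",  # assumed path pattern
    experts_allocator=ExpertsAllocator(),         # default: TP-shard all experts
    token_dispatcher=TokenDispatcher(),           # default: random DP rank per expert
    config={"num_layers": 12, "num_experts": 8},  # assumed config keys
)
```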
Custom Scheduling

veScale allows users to define custom scheduling strategies for expert parallelism by implementing the following components (a sketch of both follows the list):

- `ExpertsAllocator`: Manages expert parameter allocation. It can use `collect_performance()` to profile and dynamically adjust the DP x TP device mesh for each expert. By default, veScale shards all expert parameters across devices using tensor parallelism.
- `TokenDispatcher`: Handles token distribution. Using `assign_task()`, it determines workload allocation (e.g., expert IDs and token weights) and adjusts scheduling with `collect_performance()`. The default implementation randomly assigns tokens to a single DP rank for the selected expert.
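As a rough illustration, custom policies would presumably be supplied by subclassing these two components; the import path and method signatures below are assumptions inferred from the method names mentioned above:

```python
# Sketch of custom scheduling hooks. The import path and exact method
# signatures are assumptions based on the names documented above.
from vescale.moe import ExpertsAllocator, TokenDispatcher

class ProfiledExpertsAllocator(ExpertsAllocator):
    def collect_performance(self, perf_stats):
        # Record per-expert profiling data (e.g., runtime per expert).
        self.stats = perf_stats

    def allocate(self):
        # Choose a DP x TP device mesh per expert based on collected stats.
        ...

class WeightedTokenDispatcher(TokenDispatcher):
    def assign_task(self, expert_ids, token_weights):
        # Decide which DP rank of each selected expert receives which tokens.
        ...

    def collect_performance(self, perf_stats):
        # Use runtime feedback to rebalance future token assignments.
        ...
```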
Optimizer Support

Since veScale supports dynamic placement of expert parameters, a dedicated optimizer, `MoEOptimizer`, is required. This optimizer handles the redistribution of expert parameters and their states efficiently. Future updates will integrate these functionalities into the optimizers for static parameters to streamline the process.
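In use, `MoEOptimizer` presumably behaves like a standard PyTorch optimizer over the expert parameters; the import path and constructor arguments in this sketch are assumptions:

```python
# Hypothetical sketch; the import path and constructor arguments are
# assumptions, not taken from this PR.
from vescale.moe import MoEOptimizer

expert_params = [p for n, p in model.named_parameters() if "experts" in n]
optimizer = MoEOptimizer(expert_params, lr=3e-4)

for batch in dataloader:  # model and dataloader are placeholders
    loss = model(batch)
    loss.backward()
    optimizer.step()       # also migrates expert parameters/states on re-placement
    optimizer.zero_grad()
```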
Getting Started
Data Preparation
Prepare the Shakespeare dataset by running:
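The command itself is not reproduced in this excerpt; in nanoGPT-style examples the preparation step typically looks like this (the script path is an assumption):

```bash
# Assumed nanoGPT-style script path; the actual path in this PR may differ.
python data/shakespeare/prepare.py
```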
Training Command
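The training command is likewise not reproduced here; a typical launch for a distributed veScale example would look roughly like the following, where the script name, process count, and flags are all assumptions:

```bash
# Hypothetical launch; the script name, GPU count, and flags are assumptions.
torchrun --standalone --nproc_per_node=8 train.py --dp 2 --tp 4
```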