[moe] feat: enabling expert parallelism in veScale #59
Overview
veScale provides an efficient framework for training Mixture-of-Experts (MoE) models using expert parallelism. Expert parallelism can be deployed with the `parallelize_experts()` function, which simplifies the process of distributing and managing the workload during MoE training.

Function Signature
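The signature code block itself is not reproduced in this excerpt; based on the parameters documented below, it is expected to look roughly like this sketch (the type hints and return type are assumptions):

```python
from typing import Any, Dict, List, Union

import torch.nn as nn

# Sketch inferred from the Parameters list below; the actual signature,
# defaults, and return type in veScale may differ.
def parallelize_experts(
    module: nn.Module,                      # training model to be parallelized
    experts_expr: Union[str, List[str]],    # path(s) to the expert modules
    experts_allocator: "ExpertsAllocator",  # expert parameter allocation policy
    token_dispatcher: "TokenDispatcher",    # token scheduling and distribution
    config: Dict[str, Any],                 # layer count, number of experts, etc.
) -> nn.Module: ...
```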
Parameters
- `module`: The training model (an instance of `nn.Module`) to be parallelized.
- `experts_expr`: Specifies the paths to the expert modules. Can be a string or a list of strings.
- `experts_allocator`: An instance of `ExpertsAllocator`, used for managing expert parameter allocation.
- `token_dispatcher`: An instance of `TokenDispatcher`, responsible for token scheduling and distribution.
- `config`: A dictionary containing the MoE training configuration, including layer count, number of experts, and other relevant settings.
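Putting the parameters together, a hypothetical call could look like the following; the import path, the expert-module path pattern, and the config keys are illustrative assumptions, not values taken from this PR:

```python
# Hypothetical usage sketch. The import path, the experts_expr pattern, and
# the config keys are assumptions for illustration only.
from vescale.moe import ExpertsAllocator, TokenDispatcher, parallelize_experts

model = build_model()  # placeholder for your MoE model (an nn.Module)

model = parallelize_experts(
    module=model,
    experts_expr="decoder.blocks.*.moe.experts",  # assumed path pattern
    experts_allocator=ExpertsAllocator(),         # default: TP-shard all experts
    token_dispatcher=TokenDispatcher(),           # default: random DP rank per expert
    config={"num_layers": 12, "num_experts": 8},  # assumed config keys
)
```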
Custom Scheduling

veScale allows users to define custom scheduling strategies for expert parallelism by implementing the following components (a sketch of both follows the list):

- `ExpertsAllocator`: Manages expert parameter allocation. It can use `collect_performance()` to profile and dynamically adjust the DP x TP device mesh for each expert. By default, veScale shards all expert parameters across devices using tensor parallelism.
- `TokenDispatcher`: Handles token distribution. Using `assign_task()`, it determines workload allocation (e.g., expert IDs and token weights) and adjusts scheduling with `collect_performance()`. The default implementation randomly assigns tokens to a single DP rank for the selected expert.
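As a rough illustration, custom policies would presumably be supplied by subclassing these two components; the import path and method signatures below are assumptions inferred from the method names mentioned above:

```python
# Sketch of custom scheduling hooks. The import path and exact method
# signatures are assumptions based on the names documented above.
from vescale.moe import ExpertsAllocator, TokenDispatcher

class ProfiledExpertsAllocator(ExpertsAllocator):
    def collect_performance(self, perf_stats):
        # Record per-expert profiling data (e.g., runtime per expert).
        self.stats = perf_stats

    def allocate(self):
        # Choose a DP x TP device mesh per expert based on collected stats.
        ...

class WeightedTokenDispatcher(TokenDispatcher):
    def assign_task(self, expert_ids, token_weights):
        # Decide which DP rank of each selected expert receives which tokens.
        ...

    def collect_performance(self, perf_stats):
        # Use runtime feedback to rebalance future token assignments.
        ...
```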
Optimizer Support

Since veScale supports dynamic placement of expert parameters, a dedicated optimizer, `MoEOptimizer`, is required. This optimizer handles the redistribution of expert parameters and their states efficiently. Future updates will integrate these functionalities into the optimizers for static parameters to streamline the process.
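In use, `MoEOptimizer` presumably behaves like a standard PyTorch optimizer over the expert parameters; the import path and constructor arguments in this sketch are assumptions:

```python
# Hypothetical sketch; the import path and constructor arguments are
# assumptions, not taken from this PR.
from vescale.moe import MoEOptimizer

expert_params = [p for n, p in model.named_parameters() if "experts" in n]
optimizer = MoEOptimizer(expert_params, lr=3e-4)

for batch in dataloader:  # model and dataloader are placeholders
    loss = model(batch)
    loss.backward()
    optimizer.step()       # also migrates expert parameters/states on re-placement
    optimizer.zero_grad()
```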
Getting Started
Data Preparation
Prepare the Shakespeare dataset by running:
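The command itself is not reproduced in this excerpt; in nanoGPT-style examples the preparation step typically looks like this (the script path is an assumption):

```bash
# Assumed nanoGPT-style script path; the actual path in this PR may differ.
python data/shakespeare/prepare.py
```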
Training Command
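The training command is likewise not reproduced here; a typical launch for a distributed veScale example would look roughly like the following, where the script name, process count, and flags are all assumptions:

```bash
# Hypothetical launch; the script name, GPU count, and flags are assumptions.
torchrun --standalone --nproc_per_node=8 train.py --dp 2 --tp 4
```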