
Fix moe cpu offload #5220

Merged

Conversation

@RezaYazdaniAminabadi (Contributor) commented Mar 2, 2024

MoE parameter gradient norms don't need to be averaged when they are created on CPU, but only in the single-data-parallel (DP=1) case. For the case where there is data parallelism over the MoE parameters combined with CPU offload, this PR moves the norm tensor back to GPU so that it can be averaged across the data-parallel group.

This PR addresses #5203
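
For context, here is a minimal PyTorch sketch of the pattern this fix describes. This is not the actual DeepSpeed code; `dp_group` and `device` are hypothetical stand-ins for the expert data-parallel process group and the local GPU, and "averaging" is taken to mean sum-all-reduce divided by the group size:

```python
import torch
import torch.distributed as dist

def average_moe_grad_norm(grad_norm: torch.Tensor,
                          dp_group: dist.ProcessGroup,
                          device: torch.device) -> torch.Tensor:
    """Average a (possibly CPU-resident) MoE gradient-norm tensor
    across the data-parallel group. Hypothetical helper, not the
    actual DeepSpeed implementation."""
    dp_world_size = dist.get_world_size(group=dp_group)
    if dp_world_size == 1:
        # DP=1: no averaging is needed, even when CPU offload
        # created the norm tensor on CPU.
        return grad_norm
    # With CPU offload the norm lives on CPU, but NCCL collectives
    # require GPU tensors, so move it to the local device first.
    norm_gpu = grad_norm.to(device)
    dist.all_reduce(norm_gpu, op=dist.ReduceOp.SUM, group=dp_group)
    return norm_gpu / dp_world_size
```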

@mrwyattii added this pull request to the merge queue Mar 4, 2024
Merged via the queue into microsoft:master with commit e6e8c13 Mar 4, 2024
12 checks passed
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this pull request Mar 11, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024