
Fix moe cpu offload #5220

Merged

Conversation

@RezaYazdaniAminabadi (Contributor) commented Mar 2, 2024

MoE parameter gradient norms don't need to be averaged when they are created on CPU, but only in the single-data-parallel (DP=1) case. For the case where there is data parallelism over the MoE parameters combined with CPU offload, this PR moves the norm tensor back to GPU so that it can be averaged across the data-parallel group.

This PR addresses #5203
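
For context, here is a minimal PyTorch sketch of the pattern this fix describes. This is not the actual DeepSpeed code; `dp_group` and `device` are hypothetical stand-ins for the expert data-parallel process group and the local GPU, and "averaging" is taken to mean sum-all-reduce divided by the group size:

```python
import torch
import torch.distributed as dist

def average_moe_grad_norm(grad_norm: torch.Tensor,
                          dp_group: dist.ProcessGroup,
                          device: torch.device) -> torch.Tensor:
    """Average a (possibly CPU-resident) MoE gradient-norm tensor
    across the data-parallel group. Hypothetical helper, not the
    actual DeepSpeed implementation."""
    dp_world_size = dist.get_world_size(group=dp_group)
    if dp_world_size == 1:
        # DP=1: no averaging is needed, even when CPU offload
        # created the norm tensor on CPU.
        return grad_norm
    # With CPU offload the norm lives on CPU, but NCCL collectives
    # require GPU tensors, so move it to the local device first.
    norm_gpu = grad_norm.to(device)
    dist.all_reduce(norm_gpu, op=dist.ReduceOp.SUM, group=dp_group)
    return norm_gpu / dp_world_size
```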

@mrwyattii added this pull request to the merge queue Mar 4, 2024
Merged via the queue into microsoft:master with commit e6e8c13 Mar 4, 2024
12 checks passed
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this pull request Mar 11, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024