Commit
Fix MoE CPU offload (microsoft#5220)
The gradient norms of MoE parameters that are created on CPU can skip averaging only when training with a data-parallel size of 1. When the MoE parameters are data-parallel and CPU offload is enabled, this change moves the norm tensor back to the GPU so the average can be computed.

This PR addresses microsoft#5203

---------

Co-authored-by: Reza Yazdani <[email protected]>
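The idea described above can be illustrated with a small, device-free simulation. This is only a sketch under stated assumptions, not the code from the PR: `FakeTensor`, `fake_all_reduce_sum`, and `average_moe_grad_norms` are hypothetical names, the "tensor" is a plain record with a device tag, and the simulated collective stands in for a real one such as `torch.distributed.all_reduce`. Moving the result back to its original device at the end is also an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: a scalar value plus a device tag."""
    value: float
    device: str

    def to(self, device: str) -> "FakeTensor":
        # Mimics moving a tensor between devices by returning a retagged copy.
        return FakeTensor(self.value, device)


def fake_all_reduce_sum(per_rank: List[FakeTensor]) -> List[FakeTensor]:
    """Simulated sum all-reduce that, like a GPU-only backend, rejects CPU tensors."""
    if any(t.device == "cpu" for t in per_rank):
        raise RuntimeError("simulated backend cannot reduce CPU tensors")
    total = sum(t.value for t in per_rank)
    return [FakeTensor(total, t.device) for t in per_rank]


def average_moe_grad_norms(per_rank: List[FakeTensor],
                           accelerator: str = "cuda") -> List[FakeTensor]:
    """Average one MoE group's gradient norm over its data-parallel ranks.

    `per_rank` holds the norm as seen by each simulated rank (assumption:
    one entry per rank in the expert data-parallel group).
    """
    dp_size = len(per_rank)
    if dp_size == 1:
        # With a single data-parallel rank there is nothing to average.
        return per_rank
    home_devices = [t.device for t in per_rank]
    # Pre-scale by 1/dp_size, then move to the accelerator for the collective.
    scaled = [FakeTensor(t.value / dp_size, t.device).to(accelerator)
              for t in per_rank]
    reduced = fake_all_reduce_sum(scaled)
    # Return each result to the device the norm originally lived on.
    return [t.to(d) for t, d in zip(reduced, home_devices)]
```

For example, two simulated ranks holding CPU-resident norms of 2.0 and 4.0 both end up with the averaged value on the CPU, while a single rank gets its norm back unchanged.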