Describe the bug
Using different dispatchers (i.e., AllGather and AlltoAll) causes training precision differences.
To Reproduce
You can use the following bash command to reproduce the issue.
My dataset is en-wikipedia.
I first observe the difference at iter=100: with the alltoall dispatcher the loss is 1.030484E+01, whereas with the allgather dispatcher it is 1.030483E+01.
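For reference, a minimal sketch of how the divergence point can be located by diffing the loss values of the two runs (the log format and file names below are assumptions, not the actual logs; adjust the regex to whatever the training script prints):

```python
import re

# Hypothetical log line format: "... iteration 100/2000 ... lm loss: 1.030484E+01 ..."
LOSS_RE = re.compile(r"iteration\s+(\d+).*?lm loss:\s*([0-9.E+-]+)", re.IGNORECASE)

def parse_losses(path):
    """Return {iteration: loss} parsed from a training log."""
    losses = {}
    with open(path) as f:
        for line in f:
            m = LOSS_RE.search(line)
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses

def first_divergence(log_a, log_b):
    """Return the first iteration where the two runs report different losses."""
    a, b = parse_losses(log_a), parse_losses(log_b)
    for it in sorted(set(a) & set(b)):
        if a[it] != b[it]:
            return it, a[it], b[it]
    return None

print(first_divergence("train_alltoall.log", "train_allgather.log"))
```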
Stack trace/logs
I have investigated the problem; my thoughts and trials are listed below:
I checked the intermediate activation tensors. The precision difference first appears after an iteration in which an expert is allocated 0 tokens (in my environment, this happens in iteration 37).
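As an illustration, such a zero-token iteration can be detected with a check like the one below; the tensor name and hook point are assumptions, and in practice the check would sit wherever the router produces the per-expert token counts:

```python
import torch

def check_zero_token_experts(tokens_per_expert: torch.Tensor, iteration: int):
    """tokens_per_expert: 1-D tensor of length num_experts holding how many
    tokens the router assigned to each (local) expert in this step."""
    empty = (tokens_per_expert == 0).nonzero(as_tuple=True)[0]
    if empty.numel() > 0:
        print(f"iteration {iteration}: experts {empty.tolist()} received 0 tokens")

# Example: 4 experts, expert 2 receives no tokens in this step.
check_zero_token_experts(torch.tensor([5, 3, 0, 8]), iteration=37)
```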
Then I checked the weights and gradients. In iteration 37, the forward intermediate activations and the model weights are identical for both dispatchers. However, in the backward pass of iteration 37, the module receives different gradients depending on the dispatcher, which is the direct cause of the precision issue.
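A minimal sketch of this kind of comparison (the dump format and file names are assumptions: each run is presumed to save a {param_name: grad_tensor} dict with torch.save after the backward pass of the iteration in question):

```python
import torch

def compare_grad_dumps(path_a, path_b, atol=0.0):
    """Load {param_name: grad_tensor} dicts from two runs and report mismatches."""
    grads_a = torch.load(path_a, map_location="cpu")
    grads_b = torch.load(path_b, map_location="cpu")
    for name in sorted(set(grads_a) & set(grads_b)):
        diff = (grads_a[name] - grads_b[name]).abs().max().item()
        if diff > atol:
            print(f"{name}: max abs grad diff = {diff:.3e}")

compare_grad_dumps("grads_iter37_alltoall.pt", "grads_iter37_allgather.pt")
```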
Environment (please complete the following information):
Additional context
Let me summarize here. I am looking forward to any help in finally finding the root cause.
To be honest, this bug confuses me, since the behavior of the All-Gather Dispatcher and the AlltoAll Dispatcher should be identical.
As mentioned above, the precision issue appears when an expert receives 0 tokens, which may be the core reason for the issue. The forward activation tensors are the same, but the backward weight-gradient tensors are different.
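For context, the zero-token case itself is a valid edge case in PyTorch: an expert that receives an empty batch still runs forward and backward and ends up with all-zero weight gradients, as the toy example below shows. This is only an illustration of the edge case, not the root cause; the open question is why the two dispatchers produce different gradients around it.

```python
import torch
import torch.nn as nn

# Toy "expert" that receives an empty batch (0 tokens, hidden size 4).
expert = nn.Linear(4, 8)
empty_input = torch.zeros(0, 4, requires_grad=True)

out = expert(empty_input)        # shape (0, 8); forward works on 0 tokens
out.sum().backward()             # sum over an empty tensor is 0.0

print(expert.weight.grad.abs().max())  # all-zero gradient for the unused expert
```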