graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero #16655
Conversation
CC @am17an, I think we'll need to update the fusion logic to handle this.

You can perhaps make a PR to this branch, though I don't know if it will automatically rebase on merge then? Perhaps best to make a separate PR afterwards.

Oh yeah this will work
Now that you can make a PR from a branch onto another branch in this repo, it will rebase automatically. 🎉
Unless we have adapted 3rd-party tooling for this, it will likely not be a smooth experience: stacked PRs are poorly supported in git (and thus in GitHub), especially when combined with squash-merges.
@ORippler looks like they saw your comment https://x.com/jaredpalmer/status/1980619222918262842

Interesting. I read this as "we will still restack but will get acceptable run-time complexity by using git reftables", but I must say this is beyond my git knowledge 😄

@ggerganov gentle ping
I initially added the norm topk bias only for BailingMoeV2 in #16063; however, it turns out this is present in all models that use norm_topk_prob, such as:

DeepSeekV3
Dots1
Glm4Moe

It was probably left out because there was no easy way to add the bias at the time.

Edit: Updated to use clamping, as the purpose of this bias is to avoid division by zero.
Edit: Hmmm, ok, Lfm2Moe uses 1e-6, not sure why. I assume 1e-20 was originally chosen as an insignificantly small number to offset 0.0. @tdakhran @paulpak58 @mlabonne?

Edit2: It gets more interesting: some models (like Ernie4_5_Moe) use clamping (to 1e-12) to achieve the same thing.
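For reference, here is a minimal sketch of what the clamped renormalization could look like in the graph-building code. The tensor names (weights, weights_sum, ctx0) and the build_moe_ffn-style structure are assumptions for illustration, not the exact diff:

```cpp
#include <cfloat> // FLT_MAX

// weights: [n_expert_used, n_tokens] -- probabilities of the selected top-k experts
if (norm_topk_prob) {
    // sum the selected expert weights per token -> [1, n_tokens]
    ggml_tensor * weights_sum = ggml_sum_rows(ctx0, weights);

    // clamp the denominator so an all-zero row cannot cause a division by zero
    // (1e-20 shown here; as noted above, some models use 1e-6 or clamp to 1e-12)
    weights_sum = ggml_clamp(ctx0, weights_sum, 1e-20f, FLT_MAX);

    // renormalize so the selected expert weights sum to 1 per token
    weights = ggml_div(ctx0, weights, weights_sum);
}
```

Clamping only bounds the denominator away from zero, whereas adding a bias shifts every sum slightly; for non-degenerate rows the two are numerically indistinguishable at these epsilon values.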