Conversation

CISC (Collaborator) commented Oct 18, 2025

I initially added the norm topk bias only for BailingMoeV2 in #16063, however it turns out this is present in all models that use norm_topk_prob; it was probably left out because there was no easy way to add the bias at the time.

Edit: Updated to use clamping, as the purpose of this bias is to avoid division by zero.

DeepSeekV3
Dots1
Glm4Moe

Edit: Hmmm, ok, Lfm2Moe uses 1e-6; not sure why. I assume 1e-20 was originally chosen as an insignificantly small number to offset 0.0. @tdakhran @paulpak58 @mlabonne?

Edit2: It gets more interesting: some models (like Ernie4_5_Moe) use clamping (to 1e-12) to achieve the same thing.
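
To make the discussion above concrete, here is a minimal, self-contained sketch (not the actual llama.cpp graph code) contrasting the two approaches seen across models: adding a tiny bias to the weight sum versus clamping it. The gate weights are hypothetical; the epsilon values are the ones quoted in this thread.

```cpp
// Minimal sketch (not the actual llama.cpp graph code) of normalizing
// top-k expert weights. With sigmoid-style gates the selected weights can
// in principle all underflow to 0.0f, making the plain division a
// div-by-zero.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical top-k gate weights for one token, all underflowed to zero.
    std::vector<float> weights = {0.0f, 0.0f, 0.0f};

    float sum = 0.0f;
    for (float w : weights) sum += w;

    // Approach 1: bias the denominator, as #16063 did for BailingMoeV2.
    // 1e-20 is an insignificantly small offset; Lfm2Moe uses 1e-6 upstream.
    const float denom_bias = sum + 1e-20f;

    // Approach 2: clamp the denominator, as this PR does and as e.g.
    // Ernie4_5_Moe does upstream (to 1e-12).
    const float denom_clamp = std::max(sum, 1e-12f);

    for (float & w : weights) {
        w /= denom_clamp; // finite even when every selected weight is zero
    }

    std::printf("sum=%g bias denom=%g clamp denom=%g\n", sum, denom_bias, denom_clamp);
    return 0;
}
```

The practical difference: a clamp only alters the degenerate case where the sum is already tiny, while a bias perturbs every division (negligibly at 1e-20, more noticeably at 1e-6), which presumably is why this PR settled on clamping.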

CISC requested a review from ggerganov on October 18, 2025 20:42
jeffbolznv (Collaborator) commented

CC @am17an, I think we'll need to update the fusion logic to handle this.
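
To spell out the fusion concern: backends that fuse the top-k MoE path typically recognize the routing subgraph by matching an exact chain of ops, so inserting a new clamp node into that chain silently breaks the match until the matcher is updated. The sketch below is illustrative only; the op names and the matcher are hypothetical, not ggml's actual fusion code.

```cpp
// Illustrative sketch only: op names and the matcher are hypothetical,
// not ggml's actual fusion code. The point is that an exact-chain pattern
// stops firing once CLAMP appears in the graph.
#include <cstddef>
#include <cstdio>
#include <vector>

enum class Op { SOFTMAX, TOP_K, SUM_ROWS, CLAMP, DIV };

// Returns true if `graph` starting at `pos` matches `pattern` exactly.
static bool matches(const std::vector<Op> & graph, std::size_t pos,
                    const std::vector<Op> & pattern) {
    if (pos + pattern.size() > graph.size()) return false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (graph[pos + i] != pattern[i]) return false;
    }
    return true;
}

int main() {
    // The routing subgraph after this PR: CLAMP now sits between the
    // row-sum and the division that normalizes the expert weights.
    const std::vector<Op> graph = {Op::SOFTMAX, Op::TOP_K, Op::SUM_ROWS, Op::CLAMP, Op::DIV};

    const std::vector<Op> pattern_old = {Op::SOFTMAX, Op::TOP_K, Op::SUM_ROWS, Op::DIV};
    const std::vector<Op> pattern_new = {Op::SOFTMAX, Op::TOP_K, Op::SUM_ROWS, Op::CLAMP, Op::DIV};

    // The old pattern no longer fires (fusion is silently lost); an
    // updated pattern restores it.
    std::printf("old pattern fuses: %d\n", matches(graph, 0, pattern_old));
    std::printf("new pattern fuses: %d\n", matches(graph, 0, pattern_new));
    return 0;
}
```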

am17an (Collaborator) commented Oct 19, 2025

> CC @am17an, I think we'll need to update the fusion logic to handle this.

@CISC if there is a way for me to push to this PR I can make the CUDA changes. If not, I will create a separate PR.

CISC (Collaborator, Author) commented Oct 19, 2025

> > CC @am17an, I think we'll need to update the fusion logic to handle this.
>
> @CISC if there is a way for me to push to this PR I can make the CUDA changes. If not, I will create a separate PR.

You can perhaps make a PR to this branch, though I don't know if it will automatically rebase on merge then. Perhaps it's best to make a separate PR afterwards.

am17an (Collaborator) commented Oct 19, 2025

> You can perhaps make a PR to this branch, though I don't know if it will automatically rebase on merge then

Oh yeah this will work

CISC (Collaborator, Author) commented Oct 19, 2025

> > You can perhaps make a PR to this branch, though I don't know if it will automatically rebase on merge then
>
> Oh yeah this will work

Now that you can make a PR from one branch onto another branch in this repo, it will rebase automatically. 🎉

CISC changed the title from "graph : add missing norm topk bias" to "graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero" on Oct 20, 2025
ORippler (Contributor) commented

> Now that you can make a PR from one branch onto another branch in this repo, it will rebase automatically. 🎉

Unless we have adapted third-party tooling for this, it will likely not be a smooth experience: stacked PRs are poorly supported in git (and thus in GitHub), especially when combined with squash-merges.

am17an (Collaborator) commented Oct 22, 2025

@ORippler looks like they saw your comment: https://x.com/jaredpalmer/status/1980619222918262842

ORippler (Contributor) commented

> @ORippler looks like they saw your comment: https://x.com/jaredpalmer/status/1980619222918262842

Interesting. I read this as "we will still restack but will get acceptable run-time complexity by using git reftables", but I must say this is beyond my git knowledge 😄

CISC (Collaborator, Author) commented Oct 26, 2025

@ggerganov gentle ping

CISC merged commit f696428 into master on Oct 26, 2025; 73 of 74 checks passed.
CISC deleted the cisc/norm-topk-bias branch on October 26, 2025 16:20.
jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Oct 26, 2025