graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero #16655
Conversation
CC @am17an, I think we'll need to update the fusion logic to handle this.

You can perhaps make a PR to this branch, though I don't know if it will automatically rebase on merge then? Perhaps best to make a separate PR afterwards.

Oh yeah this will work
Now that you can make a PR from a branch onto another branch in this repo, it will rebase automatically. 🎉
Unless we have adapted 3rd-party tooling for this, it will likely not be a smooth experience: stacked PRs are poorly supported in git (and thus in GitHub), especially when combined with squash-merges.
@ORippler looks like they saw your comment https://x.com/jaredpalmer/status/1980619222918262842

Interesting. I read this as "we will still restack but will get acceptable run-time complexity by using git reftables", but I must say this is beyond my git knowledge 😄

@ggerganov gentle ping
I initially added the norm topk bias only for BailingMoeV2 in #16063; however, it turns out this is present in all models that use norm_topk_prob, such as:

DeepSeekV3
Dots1
Glm4Moe

It was probably left out because there was no easy way to add the bias at the time.

Edit: Updated to use clamping, as the purpose of this bias is to avoid division by zero.
Edit: Hmmm, ok, Lfm2Moe uses 1e-6, not sure why. I assume 1e-20 was originally chosen as an insignificantly small number to offset 0.0. @tdakhran @paulpak58 @mlabonne?

Edit2: It gets more interesting: some models (like Ernie4_5_Moe) use clamping (to 1e-12) to achieve the same thing.
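For reference, here is a minimal sketch of what the clamped renormalization could look like in the graph-building code. The tensor names (weights, weights_sum, ctx0) and the build_moe_ffn-style structure are assumptions for illustration, not the exact diff:

```cpp
#include <cfloat> // FLT_MAX

// weights: [n_expert_used, n_tokens] -- probabilities of the selected top-k experts
if (norm_topk_prob) {
    // sum the selected expert weights per token -> [1, n_tokens]
    ggml_tensor * weights_sum = ggml_sum_rows(ctx0, weights);

    // clamp the denominator so an all-zero row cannot cause a division by zero
    // (1e-20 shown here; as noted above, some models use 1e-6 or clamp to 1e-12)
    weights_sum = ggml_clamp(ctx0, weights_sum, 1e-20f, FLT_MAX);

    // renormalize so the selected expert weights sum to 1 per token
    weights = ggml_div(ctx0, weights, weights_sum);
}
```

Clamping only bounds the denominator away from zero, whereas adding a bias shifts every sum slightly; for non-degenerate rows the two are numerically indistinguishable at these epsilon values.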