
Invalid gradient and Dx? #3

Open · YicongHong opened this issue Sep 8, 2024 · 9 comments

@YicongHong commented Sep 8, 2024

Hi @Hprairie, I previously built mamba-2/hydra-based models, and I am now trying to replace those layers with your bi-mamba2 module. However, I found that the new model easily runs into invalid gradients (e.g., an infinite gradient norm), which never happened with mamba-2/hydra.

I tested with both torch==2.1.0 / triton==3.0.0 / cu122 and torch==2.4.0 / triton==3.0.0 / cu121. It seems that the more bi-mamba2 layers I stack, or the more processes I use, the more easily the model runs into this problem.

  • I tested single-GPU training with 1~16 layers, and the issue became very pronounced after stacking more than 12 layers.
  • I also tested multi-GPU training with DDP or FSDP. Even with just 2 layers and 2 processes, the problem always appeared.

Any ideas?

Besides, you mentioned that the kernel implements y = SS(x) + flip(SS(flip(x))) + Dx, but in BiMamba2() Line 108, the skip parameters self.D and self.fc_D are not used for Dx. Can I ask how to pass these parameters to bimamba_chunk_scan_combined(), or should we do something similar to what Hydra does?

Thanks!!!
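(For reference, a minimal way to surface the non-finite gradient norms described above; `model` here is just any stack of bi-mamba2 blocks, and this helper is an illustration, not code from this repo.)

```python
import torch

def total_grad_norm(model: torch.nn.Module) -> torch.Tensor:
    """L2 norm over all parameter gradients, computed after loss.backward().

    An inf/NaN return value is the "invalid gradient" symptom reported above.
    """
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm(2)
```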

@Hprairie (Owner) commented Sep 8, 2024

Whoops, that's a great catch. I'll go ahead and push something in a couple of minutes to fix it. All you need to do is pass D=self.D from the module layer when calling the function.

As for the NaN gradients, I'm looking for parts of the kernel that could cause this. I haven't identified anything yet, but I'll let you know.

@Hprairie (Owner) commented Sep 8, 2024

Should be the same for z. My classes just resumed, so I was rather quick to push the layer out. Thanks for the catch :)
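(For reference, the fix described in the two comments above amounts to forwarding the module's skip and gate tensors into the fused kernel call. A minimal sketch, assuming the surrounding arguments mirror the mamba_ssm-style chunked-scan interface; the tensor names here are placeholders, not this repo's exact code.)

```python
# Inside BiMamba2.forward (sketch): the point is only the D= and z= keywords,
# which route the skip and gate parameters through the fused kernel.
y = bimamba_chunk_scan_combined(
    x, dt, A, B, C,          # placeholder names for the scan inputs
    chunk_size=chunk_size,
    D=self.D,                # skip connection, so the output includes the D * x term
    z=z,                     # gating branch, handled inside the kernel as well
)
```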

@Hprairie (Owner) commented Sep 8, 2024

Alright, I pushed a fix. Let me know if NaNs still occur, as this is something I haven't been able to test personally.

@YicongHong (Author) commented Sep 9, 2024

Thanks @Hprairie; after passing D=self.D, I got the following error:
bimamba2/src/ssd/bi/ssd_chunk_scan.py":138:11): error: operation scheduled before its operands

Also, it seems that self.fc_D is still not used? I thought self.fc_D is for Dx and self.D is a bias term?

@Hprairie (Owner) commented Sep 9, 2024

  1. The first error comes from Triton, and I found it occurs only when the kernel compiles for the first time. It shouldn't keep occurring and, from what I have seen, it doesn't affect anything.
  2. You are right, Hydra does use self.D as a bias together with self.fc_D. I am away from a computer but will make a fix tomorrow. I will offer both options, since using D without fc_D is more canonical to Mamba. If you want access to it immediately, just apply F.linear as in Hydra (see the sketch after this comment) and then don't pass D to the optimized kernel.

Thanks again for pointing this out, I learned something new.
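(A rough sketch of that interim workaround, in the spirit of Hydra's data-dependent skip: compute per-head skip coefficients with F.linear using fc_D's weight and D as the bias, apply them outside the kernel, and call the kernel without D. The shapes and head layout below are assumptions, not this repo's or Hydra's exact code.)

```python
import torch
import torch.nn.functional as F
from einops import rearrange

def data_dependent_skip(x: torch.Tensor, y: torch.Tensor,
                        fc_D: torch.nn.Linear, D: torch.Tensor,
                        headdim: int) -> torch.Tensor:
    """Add a Hydra-style data-dependent skip to the scan output y.

    Assumed shapes: x and y are (batch, seqlen, d_inner); fc_D maps d_inner -> nheads;
    D is (nheads,) and acts as the bias term.
    """
    coeff = F.linear(x, fc_D.weight, bias=D)                    # (b, l, nheads)
    x_heads = rearrange(x, "b l (h p) -> b l h p", p=headdim)   # split into heads
    skip = rearrange(coeff.unsqueeze(-1) * x_heads, "b l h p -> b l (h p)")
    return y + skip
```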

@YicongHong (Author)

Thanks @Hprairie,

  1. NaNs still occur easily.
  2. Yes, the error: operation scheduled before its operands only occurs at the start and doesn't stop anything.
  3. "using D without fc_D is more canonical to Mamba", I see, thanks!

@Hprairie (Owner) commented Sep 9, 2024

Hmmm okay I'll try to block out some time to look into the NaN problem.

@GLOMQuyet commented Sep 9, 2024

Yes, I have the same problem no matter how I set the batch size or learning rate; the result is the same. Colab link (a ViT with the attention removed and replaced by Bi-Mamba 2): https://colab.research.google.com/drive/1rgXkwnlevzZ0YPbefQS8qHRe7gFlb4J-?authuser=3. The loss is always NaN; Bi-Mamba 2 produces gradients that are far too large.

@Hprairie (Owner)

I am looking into this and attempting to fix the error: operation scheduled before its operands error. Currently, this is simply an optimization error raised internally by Triton, and it is not well documented, which makes it difficult to dig into. The Triton team has a PR to improve error reporting for it, which will help. As for the NaNs, I have been struggling to reproduce them in any training pipeline. When training on synthetic data with a 12-16 layer deep model in fp32, I am not getting any NaNs. I will keep trying, but if anyone can post a minimal script where the error pops up, that would be great (the Colab link doesn't seem to work for me anymore). Thanks!
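(For anyone wanting to help, a minimal sketch of the kind of repro script being asked for here; the import path and the BiMamba2 constructor arguments are assumptions and will likely need adjusting to the actual repo layout.)

```python
import torch
from bimamba2 import BiMamba2  # assumed import path; adjust to the repo layout

torch.manual_seed(0)
device = "cuda"
d_model, n_layers, batch, seqlen = 256, 12, 4, 1024

# Stack of bi-mamba2 blocks on synthetic fp32 data, matching the depth range
# where the problem is reported.
layers = torch.nn.ModuleList(
    [BiMamba2(d_model=d_model) for _ in range(n_layers)]
).to(device)

x = torch.randn(batch, seqlen, d_model, device=device)
y = x
for layer in layers:
    y = layer(y)
loss = y.square().mean()
loss.backward()

# Report any parameter whose gradient is non-finite.
for name, p in layers.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite grad in {name}")
```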
