
Invalid gradient and Dx? #3

Open · YicongHong opened this issue Sep 8, 2024 · 9 comments

@YicongHong commented Sep 8, 2024

Hi @Hprairie, I previously built mamba-2/hydra-based models, and I am now trying to replace those layers with your bi-mamba2 module. However, I found that the new model easily runs into invalid gradients (e.g., an infinite gradient norm), which never happened with mamba-2/hydra.

I tested with both torch==2.1.0 / triton==3.0.0 / cu122 and torch==2.4.0 / triton==3.0.0 / cu121. It seems that the more bi-mamba2 layers I stack, or the more processes I use, the more easily the model runs into this problem.

  • I tested single-GPU training with 1~16 layers, and the issue became very pronounced after stacking more than 12 layers.
  • I also tested multi-GPU training with DDP or FSDP. Even with just 2 layers and 2 processes, the problem always appeared.

Any ideas?

Besides, you mentioned that the kernel implements y = SS(x) + flip(SS(flip(x))) + Dx, but in BiMamba2() Line 108, the skip parameters self.D and self.fc_D are not used for Dx. Can I ask how to pass these parameters to bimamba_chunk_scan_combined(), or should we do something similar to what Hydra does?

Thanks!!!
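(For reference, a minimal way to surface the non-finite gradient norms described above; `model` here is just any stack of bi-mamba2 blocks, and this helper is an illustration, not code from this repo.)

```python
import torch

def total_grad_norm(model: torch.nn.Module) -> torch.Tensor:
    """L2 norm over all parameter gradients, computed after loss.backward().

    An inf/NaN return value is the "invalid gradient" symptom reported above.
    """
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm(2)
```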

@Hprairie (Owner) commented Sep 8, 2024

Whoops, that's a great catch. I'll go ahead and push something in a couple of minutes to fix it. All you need to do is pass D=self.D from the module layer when calling the function.

As for the NaN gradients, I'm looking for parts of the kernel that could cause this. I haven't identified anything yet, but I'll let you know.

@Hprairie (Owner) commented Sep 8, 2024

Should be the same for z. My classes just resumed, so I was rather quick to push the layer out. Thanks for the catch :)
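(For reference, the fix described in the two comments above amounts to forwarding the module's skip and gate tensors into the fused kernel call. A minimal sketch, assuming the surrounding arguments mirror the mamba_ssm-style chunked-scan interface; the tensor names here are placeholders, not this repo's exact code.)

```python
# Inside BiMamba2.forward (sketch): the point is only the D= and z= keywords,
# which route the skip and gate parameters through the fused kernel.
y = bimamba_chunk_scan_combined(
    x, dt, A, B, C,          # placeholder names for the scan inputs
    chunk_size=chunk_size,
    D=self.D,                # skip connection, so the output includes the D * x term
    z=z,                     # gating branch, handled inside the kernel as well
)
```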

@Hprairie (Owner) commented Sep 8, 2024

Alright, I pushed a fix. Let me know if NaNs still occur, as this is something I haven't been able to test personally.

@YicongHong (Author) commented Sep 9, 2024

Thanks @Hprairie; after passing D=self.D, I got the following error:
bimamba2/src/ssd/bi/ssd_chunk_scan.py":138:11): error: operation scheduled before its operands

Also, it seems that self.fc_D is still not used? I thought self.fc_D is for Dx and self.D is a bias term?

@Hprairie (Owner) commented Sep 9, 2024

  1. The first error comes from Triton, and I found it occurs only when the kernel compiles for the first time. It shouldn't keep occurring and, from what I have seen, it doesn't affect anything.
  2. You are right, Hydra does use self.D as a bias together with self.fc_D. I am away from a computer but will make a fix tomorrow. I will offer both options, since using D without fc_D is more canonical to Mamba. If you want access to it immediately, just apply F.linear as in Hydra (see the sketch after this comment) and then don't pass D to the optimized kernel.

Thanks again for pointing this out, I learned something new.
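(A rough sketch of that interim workaround, in the spirit of Hydra's data-dependent skip: compute per-head skip coefficients with F.linear using fc_D's weight and D as the bias, apply them outside the kernel, and call the kernel without D. The shapes and head layout below are assumptions, not this repo's or Hydra's exact code.)

```python
import torch
import torch.nn.functional as F
from einops import rearrange

def data_dependent_skip(x: torch.Tensor, y: torch.Tensor,
                        fc_D: torch.nn.Linear, D: torch.Tensor,
                        headdim: int) -> torch.Tensor:
    """Add a Hydra-style data-dependent skip to the scan output y.

    Assumed shapes: x and y are (batch, seqlen, d_inner); fc_D maps d_inner -> nheads;
    D is (nheads,) and acts as the bias term.
    """
    coeff = F.linear(x, fc_D.weight, bias=D)                    # (b, l, nheads)
    x_heads = rearrange(x, "b l (h p) -> b l h p", p=headdim)   # split into heads
    skip = rearrange(coeff.unsqueeze(-1) * x_heads, "b l h p -> b l (h p)")
    return y + skip
```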

@YicongHong (Author)

Thanks @Hprairie,

  1. NaNs still occur easily.
  2. Yes, the error: operation scheduled before its operands only occurs at the start and doesn't stop anything.
  3. "using D without fc_D is more canonical to Mamba", I see, thanks!

@Hprairie (Owner) commented Sep 9, 2024

Hmmm okay I'll try to block out some time to look into the NaN problem.

@GLOMQuyet commented Sep 9, 2024

Yes, I have the same problem no matter how I set the batch size or learning rate; the result is the same. Colab link (a ViT with the attention removed and replaced by Bi-Mamba 2): https://colab.research.google.com/drive/1rgXkwnlevzZ0YPbefQS8qHRe7gFlb4J-?authuser=3. The loss is always NaN; Bi-Mamba 2 produces gradients that are far too large.

@Hprairie (Owner)

I am looking into this and attempting to fix the error: operation scheduled before its operands error. Currently, this is simply an optimization error raised internally by Triton, and it is not well documented, which makes it difficult to dig into. The Triton team has a PR to improve error reporting for it, which will help. As for the NaNs, I have been struggling to reproduce them in any training pipeline. When training on synthetic data with a 12-16 layer deep model in fp32, I am not getting any NaNs. I will keep trying, but if anyone can post a minimal script where the error pops up, that would be great (the Colab link doesn't seem to work for me anymore). Thanks!
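(For anyone wanting to help, a minimal sketch of the kind of repro script being asked for here; the import path and the BiMamba2 constructor arguments are assumptions and will likely need adjusting to the actual repo layout.)

```python
import torch
from bimamba2 import BiMamba2  # assumed import path; adjust to the repo layout

torch.manual_seed(0)
device = "cuda"
d_model, n_layers, batch, seqlen = 256, 12, 4, 1024

# Stack of bi-mamba2 blocks on synthetic fp32 data, matching the depth range
# where the problem is reported.
layers = torch.nn.ModuleList(
    [BiMamba2(d_model=d_model) for _ in range(n_layers)]
).to(device)

x = torch.randn(batch, seqlen, d_model, device=device)
y = x
for layer in layers:
    y = layer(y)
loss = y.square().mean()
loss.backward()

# Report any parameter whose gradient is non-finite.
for name, p in layers.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite grad in {name}")
```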
