How to debug CUDNN_STATUS_EXECUTION_FAILED? #1116
Comments
Is there some chance that I need to use a specific stride? I know my shapes are correct, but it's definitely possible my stride is wrong. |
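One way to rule the stride question in or out is to compare the actual strides against what a plain row-major (contiguous) layout would give for the same shape. The sketch below is pure Python and only illustrates that check. The example shape and the (seq, batch, heads, head_dim) ordering are assumptions for illustration, and nothing in this thread confirms that the fused attention backend requires contiguous inputs.

```python
def contiguous_strides(shape):
    """Return row-major (C-contiguous) strides, in elements, for a shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if strides match the row-major layout (size-1 dims are ignored)."""
    expected = contiguous_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


# Hypothetical shape: (seq, batch, heads, head_dim)
shape = (128, 2, 16, 64)
print(contiguous_strides(shape))  # (2048, 1024, 64, 1)

# Strides of a (batch, seq, heads, head_dim) tensor viewed with the first
# two dims swapped: same shape as above, but not a contiguous layout.
print(is_contiguous(shape, (1024, 131072, 64, 1)))  # False
```

With PyTorch tensors, the same comparison can be made by passing tuple(t.shape) and t.stride() for each of the query, key and value tensors.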
@vedantroy Could you post more information about your environment, most importantly the TE, CUDA and cuDNN versions? Also, could you try the failing case with |
CUDA version:
cuDNN + Transformer Engine versions:
More logs using the command
|
OK, further updates. It looks like it's failing on the backward pass only. And if I use only 2 layers in my model instead of 4, it doesn't fail. Is it possible I'm hitting a CUDA OOM? (Seems unlikely, since I run this model with 48+ layers when using FA2.) |
Hi @vedantroy , I tried to reproduce your config, and it seemed to pass my tests.
Could you extract a small reproducer code with just the Thanks, |
@cyanguwa -- I'll try to make a minimal reproduction soon. For now, a few more details
|
I'm also facing this error. I am using Transformer Engine 1.10.0+08a85d3.
|
Are you using the same data or fixed data? |
I'm running my code with:
and getting errors like:
I'm using a pretty standard DotProductAttention:
and I'm also calling it in a pretty standard way (all the assertions pass):
I'm kind of stuck on how to debug this. Seems like something is wrong with reading the inputs? Not sure. How should I proceed in debugging this?
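A common first step with errors like this is to make kernel launches synchronous, because CUDA reports many failures asynchronously and the Python traceback can point at a later, unrelated call. Below is a minimal sketch of a launcher that sets CUDA_LAUNCH_BLOCKING=1 for a child process. The helper names and the train.py script are hypothetical, and this only helps localize the failing call; it does not explain why cuDNN rejected it.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Copy an environment mapping and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING=1 makes each launch block until the kernel finishes,
    # so an error surfaces at the call that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_debug_env(argv):
    """Run a command (a list of strings) under the debug environment."""
    return subprocess.run(argv, env=debug_env()).returncode


# Hypothetical usage; train.py stands in for the failing script:
# exit_code = run_with_debug_env([sys.executable, "train.py"])
```

Expect the run to be noticeably slower with this variable set, so it is best used on the smallest configuration that still fails.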