"Unknown: Failed to get convolution algorithm." error and how to solve it #100
Comments
Ooh, that's good to know! @VolkerH, would you mind explaining what TF_FORCE_GPU_ALLOW_GROWTH=true does? Are you able to run N2V training and prediction in one run without having to restart the kernel?
There is some background on how the environment variable changes the behaviour here: https://www.tensorflow.org/guide/gpu under the subheading "Limiting GPU memory growth". I cannot have both the training and the prediction notebook running at the same time due to memory limitations. My understanding is that without this option, tensorflow grabs pretty much all available GPU memory initially. If it then needs to allocate even a tiny bit more, the allocation fails.
Hi @VolkerH, thank you for reporting this issue here as well. I am wondering if there might be a compatibility issue between the keras, tensorflow, CUDA, and nvidia-driver versions.
@citypalmtree, it is interesting that your issue disappears if you rerun the notebook without changes. Do you restart the kernel between runs?
Regarding the training and prediction notebooks: it is as @VolkerH says. As long as you stay in the same kernel/notebook, you can run training and prediction sequentially. Unfortunately, tensorflow allocates all available GPU memory (if not told otherwise), which means that a second kernel/notebook will find no GPU memory left and fail. One way to limit GPU memory would be to cap the fraction that tensorflow may allocate at 0.5, so that it only claims 50% of the GPU memory. Doing this would allow you to train two models in parallel on one GPU, but most likely it will take much longer. First, you can only use half of the data, and second, data for two independent models has to be transferred to the GPU, which could lead to inefficient processing.
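The snippet originally posted with this comment is not preserved here. As a rough sketch of the idea, and assuming a TensorFlow 1.14/1.15 install (the versions mentioned in this thread), the per-process memory cap could be set as below. The helper name make_limited_session is hypothetical; per_process_gpu_memory_fraction is the TensorFlow option that limits the share of GPU memory a process may allocate.

```python
def make_limited_session(fraction=0.5):
    """Return a TF 1.x session that may use at most `fraction` of GPU memory.

    Hypothetical helper for illustration; not part of n2v or TensorFlow.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")

    # Imported lazily so the argument check above works without TensorFlow.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    # Cap the share of GPU memory this process is allowed to allocate.
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return tf.compat.v1.Session(config=config)
```

The session would then have to be registered with the Keras backend before any model is built, for example with keras.backend.set_session(make_limited_session(0.5)) when standalone Keras with the TensorFlow backend is used; whether that matches a given n2v install is an assumption.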
I spent yesterday afternoon installing n2v on a new machine (Ryzen 5, RTX 2060, Ubuntu 20.04, conda) and ran into tensorflow-related issues with n2v.
The error occurs when running training in any of the example notebooks.
The error message was cuDNN related (see full traceback below), so I suspected a library version problem.
I tried various versions of tensorflow-gpu (such as 1.14 and 1.15), installed both with pip and from conda using the anaconda and conda-forge channels. I also tried various versions of the CUDA toolkit, as well as python 3.6 and 3.7. All without success.
While there was enough GPU VRAM available, it turns out that this is related to GPU memory management in tensorflow.
Setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true fixed the issue. This is not specific to n2v; in fact, I found the answer in a thread related to DeepLabCut (https://forum.image.sc/t/could-not-create-cudnn-handle/24276/17).
I am putting this here so that others who run into the issue can find it. I am not sure how common it is (I did not encounter the issue when installing n2v on Windows) and whether it warrants mentioning in the README.md file.
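For notebook users, the variable TF_FORCE_GPU_ALLOW_GROWTH discussed in this thread can also be set from Python instead of the shell. This is a minimal sketch, on the assumption that the variable must be in place before tensorflow initializes the GPU; setting it before the first tensorflow import in a fresh kernel is the safe choice.

```python
import os

# Ask tensorflow to grow its GPU memory allocation on demand instead of
# reserving nearly all GPU memory up front.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Import tensorflow (or n2v, which imports it) only after this point, e.g.:
# import tensorflow as tf
```

Exporting the variable in the shell before launching Jupyter has the same effect for every kernel started from that shell.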