
CUDA_ERROR_OUT_OF_MEMORY #20

Open
petergerten opened this issue Jun 21, 2021 · 4 comments
@petergerten

I always get out-of-memory errors, even when using all the defaults and training at low resolution.
Hardware: 8 × V100 16 GB

@petergerten
Author

Trying to train on 1 GPU, I get stuck here:


  File "/usr/local/lib/python3.6/dist-packages/numpy/lib/shape_base.py", line 577, in expand_dims
    if axis > a.ndim or axis < -a.ndim - 1:
TypeError: '>' not supported between instances of 'list' and 'int'
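The comparison in the traceback fails because `np.expand_dims` received a list as `axis`, and NumPy versions before 1.18 only accept a single int there. A minimal sketch of a version-independent workaround (an assumption for illustration, not the repository's actual fix) is to insert the new axes one at a time:

```python
import numpy as np

a = np.zeros((3, 4))

# NumPy < 1.18 raises exactly this TypeError when `axis` is a list,
# because it compares the list against an int (`axis > a.ndim`).
# Inserting the axes one at a time, in ascending order, works on
# both old and new NumPy versions:
for ax in sorted([0, 2]):
    a = np.expand_dims(a, axis=ax)

print(a.shape)  # (1, 3, 1, 4)
```

On NumPy 1.18+, passing a tuple such as `axis=(0, 2)` directly is also supported and gives the same shape.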

@dorarad
Owner

dorarad commented Jun 23, 2021

Hi, thank you for your interest in the work! I have a couple of deadlines over the next few days, but I will definitely try to get back to you by the end of the week!

@chalure

chalure commented Jul 3, 2021

What is the minimum number of GPUs required for training?

@dorarad
Owner

dorarad commented Feb 3, 2022

Hi, most sincere apologies for not getting back to this earlier!
The model can be trained on even a single GPU.

On which line of the code did you get the error? Did you, by any chance, make changes to the implementation? The error seems to indicate a small bug, so further information would be helpful.

A couple more points:

  • To train on 8 GPUs, pass --gpus 0,1,2,3,4,5,6,7 (make sure not to pass e.g. --gpus 8).
  • Consider using --batch-gpu with a lower value, e.g. 1, to fit the model training into GPU memory.
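The reason a lower --batch-gpu helps can be sketched as gradient accumulation: the total batch is split into several rounds, so only batch_gpu samples per GPU are resident in memory at once. The function and variable names below are illustrative assumptions, not the repository's actual code:

```python
def accumulation_rounds(total_batch: int, batch_gpu: int, num_gpus: int) -> int:
    """Forward/backward passes needed per optimizer step when the
    total batch is split across GPUs and accumulation rounds."""
    per_round = batch_gpu * num_gpus   # samples processed in one round
    assert total_batch % per_round == 0, "total batch must divide evenly"
    return total_batch // per_round

# e.g. a total batch of 32 on 8 GPUs with --batch-gpu 1:
print(accumulation_rounds(32, 1, 8))  # 4 rounds per step
```

Memory use per GPU scales with batch_gpu rather than with the total batch, at the cost of more rounds per optimizer step.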

I hope one of these might resolve the issue!
