Multiple GPU errors #9
RuntimeError: Sizes of tensors must match except in dimension 0. Got 32 and 16 (The offending index is 0)

This error appears during the 'Validation sanity check'. When I use multi-GPU training, the above error appears. I think it is a bug in the data generation rather than a dimension mismatch, but I can't fix it. Do you have any idea about that?
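For context, this failure mode is easy to reproduce in plain PyTorch: with 2 GPUs, a batch of 32 is scattered into per-device chunks of 16, and any step that concatenates a per-device tensor with a full-batch tensor fails. A minimal sketch of the error class (illustrative only, not this repo's code):

```python
import torch

# Stand-ins: one tensor shaped for the full batch of 32, one for the
# 16-sample chunk a single GPU sees after the batch is split across 2 GPUs.
full_batch = torch.zeros(8, 32)
gpu_chunk = torch.zeros(8, 16)

# Concatenating along dim 0 requires every other dim to match, so this
# raises: RuntimeError: Sizes of tensors must match except in dimension 0.
torch.cat([full_batch, gpu_chunk], dim=0)
```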
Thanks for your interest in our work. We currently only support single-GPU training with a batch size of 32. Are you able to train fine on a single GPU? Multi-GPU training is not supported in this codebase, since we can already train in a reasonable time, i.e. around 1-2 days (both synthetic training and real fine-tuning), with 13 GB of GPU memory. That said, please feel free to open a pull request or feature request for multi-GPU training. We can try to look into it but cannot promise this feature will be available soon. Hope it helps!
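As a side note for anyone debugging this, a standard way to rule out multi-GPU issues, independent of this repo, is to pin the process to a single device before CUDA initializes. A minimal sketch:

```python
import os

# Make only the first GPU visible; must be set before torch creates a
# CUDA context, so do it before importing torch in the training script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # -> 1: the training code now sees one GPU
```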
Ok, I see, thank you for your help.
Thanks for the first issue!
Hi @dongho-Han, are you able to train with a smaller batch size than 32? We were able to fit CenterSnap and ShAPO in 13 GB of memory with a batch size of 32. Let us know if a smaller batch size works for you. For multi-GPU training, apologies, we don't support it currently, but please feel free to open a pull request if you are able to make this enhancement. I would start by looking into adding the default PyTorch Lightning distributed-training functionality by adding a flag here, but we are using a slightly outdated PL version, so this might break things. Please feel free to create a PR if that works on your end.
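For anyone attempting that PR, a rough sketch of what such a flag might toggle, assuming a PL 1.x-era Trainer; the argument names are an assumption and changed across PL releases (older versions used distributed_backend, later ones accelerator and then strategy), so check the version pinned by this repo:

```python
import pytorch_lightning as pl

# Hypothetical flag to expose multi-GPU training; not part of the repo yet.
use_multi_gpu = False

# In PL ~1.x, DDP is requested through Trainer arguments. Verify these
# names against the (slightly outdated) PL version this codebase pins.
trainer = pl.Trainer(
    gpus=2 if use_multi_gpu else 1,
    accelerator="ddp" if use_multi_gpu else None,
)
```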
Thank you for the answer.
You can change the batch size in the
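For illustration, lowering the batch size is just a change wherever the training script builds its DataLoader; a minimal, self-contained sketch with a stand-in dataset (not the repo's actual dataset or file):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the repo's real training dataset would go here.
train_dataset = TensorDataset(torch.zeros(128, 3, 64, 64))

# Halving the batch size from 32 roughly halves activation memory,
# at the cost of noisier gradient estimates per step.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
```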
Wow! Thank you for the meaningful advice! Thank you!
Awesome, great to know that a lower batch size works for you.