
Multiple gpu errors #9

Open
yssjglg-elder opened this issue Jul 3, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@yssjglg-elder

RuntimeError: Sizes of tensors must match except in dimension 0. Got 32 and 16 (The offending index is 0)

This error appears during the 'Validation sanity check' step.

The error above only shows up when I train with multiple GPUs. I think it's a bug in how the data is generated rather than a genuine dimension mismatch, but I can't fix it myself. Do you have any idea what might cause it?
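For reference, here is a minimal sketch that raises the same class of error (the shapes are purely illustrative, not taken from the actual code): concatenating a tensor built for the full batch of 32 with one built for a 16-sample per-GPU shard fails because the non-concatenated dimensions disagree.

```python
import torch

# Illustrative shapes only: 32 = full batch, 16 = a per-GPU shard of it.
full = torch.zeros(4, 32)   # e.g. a buffer sized for the full batch
shard = torch.zeros(4, 16)  # e.g. a buffer sized for one GPU's shard
torch.cat([full, shard], dim=0)
# RuntimeError: Sizes of tensors must match except in dimension 0 ...
```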

@zubair-irshad
Owner

Thanks for your interest in our work. We currently only support single-GPU training with a batch size of 32. Are you able to train fine on a single GPU?

Multi-GPU training is currently not supported by this codebase, since we are able to train in a reasonable time, i.e., around 1-2 days (both synthetic training and real fine-tuning), with 13 GB of GPU memory. Having said this, please feel free to open a pull request or feature request for multi-GPU training. We can try to look into it, but cannot promise that this feature will be available soon.

Hope it helps!

@yssjglg-elder
Author

Ok, I see, thank you for your help.

@zubair-irshad zubair-irshad added the enhancement New feature or request label Sep 5, 2022
@dongho-Han

Thanks for the first issue!
As I found in the author's ShAPO project, there is the same issue: training has to be run on a GPU with more than 13 GB of memory.
Could you address this limitation? Because I have multiple GPUs with 12 GB each, I can't train the ShAPO or CenterSnap models.
Thanks!

@zubair-irshad zubair-irshad reopened this Jan 19, 2023
@zubair-irshad
Owner

Hi @dongho-Han,

Are you able to train with a smaller batch size than 32? We were able to fit CenterSnap and ShAPO in 13 GB of memory with a batch size of 32. Let us know if a smaller batch size works for you.

As for multi-GPU training, apologies, we don't currently support it, but please feel free to open a pull request if you are able to make this enhancement. I would start by looking into adding PyTorch Lightning's default distributed-training functionality via a flag here, but we are using a slightly outdated PL version, so this might break things. Please feel free to create a PR if that works on your end.
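For anyone attempting this, a rough sketch of the kind of flag change meant above. The exact argument names depend on the pinned PL version (`distributed_backend` in older 1.x releases was later renamed `accelerator`), and every value below is illustrative, not the repo's actual setup:

```python
import pytorch_lightning as pl

# Sketch only: turn on PL's built-in DDP via Trainer flags.
# Flag names are version-dependent assumptions.
trainer = pl.Trainer(
    gpus=2,                     # number of GPUs to train on
    distributed_backend="ddp",  # older-PL spelling; later 1.x uses accelerator="ddp"
    max_epochs=50,              # illustrative value
)
# trainer.fit(model)  # where `model` is the project's LightningModule
```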

@dongho-Han

Thank you for the answer.
For multi-GPU training, I got your point.
In fact, I have a question about the ShAPO project.
The ShAPO model is trained from the output pickle files of prepare_data/distributed_generate_data.py, unlike the CenterSnap models.
If I want to change the batch size, is editing configs/net_config.txt and running net_train.py enough? I thought prepare_data/distributed_generate_data.py would have to be run again, so I hesitated to do that.
But for CenterSnap, as you mentioned, I think editing configs/net_config.txt and running net_train.py will be enough.
Thanks!

@zubair-irshad
Owner

You can change the batch size in configs/net_config.txt for both CenterSnap and ShAPO. Both models load individual pickle files, and irrespective of how the data is generated, you can train with any batch size by changing it in the respective config file.
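For illustration, and assuming net_config.txt uses argparse-style `--flag=value` lines (the exact key name is an assumption, so check the file itself), the change would look something like:

```
--batch_size=16
```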

@dongho-Han

dongho-Han commented Jan 19, 2023

Wow! Thank you for the meaningful advice!
I have a few questions.

  1. As we can see in the training code, training is performed on 'pickle' files, not the original RGB-D (RGB + depth) images. Is there any reason for that? Can you share the intuition?
  2. You mentioned that CenterSnap and ShAPO both load individual pickle files. Then, are CenterSnap's attached datasets (e.g., the Real dataset link) the same as the output pickle files of ShAPO's prepare_data/distributed_generate_data.py? If they are different, can you share what the differences are?
  3. As you mentioned in Regarding the evaluation of your pre-trained model shapo#8, ShAPO's uploaded pre-trained model is not the optimized model for evaluation. Does that also apply to CenterSnap (only trained enough for visualization)?
  4. Continuing from question 3: how many epochs are needed to reproduce the performance (e.g., mAP) reported in the papers for CenterSnap and ShAPO? It is not mentioned on the GitHub project pages.
  5. Can I change the batch size to an arbitrary number (e.g., 7, 10, ...) rather than a divisor of 32 (e.g., 8, 16)?
  • Thanks for your help!
    I could train the ShAPO and CenterSnap models on my GPU by changing the batch size to 16. Memory usage: almost 6500 MiB.

Thank you!

@zubair-irshad
Owner

zubair-irshad commented Jan 20, 2023

Awesome, great to know that a lower batch size works for you.

  1. We do indeed store RGB and depth (for input) and poses and masks (for supervision only) in these datapoint pickle files. It is a compact way to store all the information we need for training (see the sketch after this list).

  2. The difference is described in each paper: in these datapoint pickles we store SDF latent codes and texture latent codes for ShAPO, and pointcloud latent codes for CenterSnap. The rest of the information is the same.

  3. Correct; in CenterSnap we do not perform any post-optimization, and this is one of the contributions of ShAPO.

  4. Yes, please feel free to play around with the batch size and choose the number that works for you.
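To make the datapoint idea above concrete, here is a minimal sketch of how such per-sample pickle files could be consumed; the field names and file layout are assumptions for illustration, not the repo's actual schema:

```python
import glob
import pickle

from torch.utils.data import Dataset


class PickleDatapointDataset(Dataset):
    """Loads one training sample per .pickle file (illustrative sketch)."""

    def __init__(self, pattern="data/train/*.pickle"):  # hypothetical path
        self.paths = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            dp = pickle.load(f)
        # Inputs (RGB + depth) and supervision (poses, masks); the key
        # names below are assumed, not taken from the actual repo.
        return dp["rgb"], dp["depth"], dp["poses"], dp["masks"]
```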
