Out of memory with batch_size 1 and 4GB VRAM #49

Open

fjodborg opened this issue Feb 22, 2021 · 0 comments
Hello, I have a problem where I run out of memory when running `python train.py --threed_match_dir ~/dataset/threedmatch/ --batch_size 1`.
At first I ran out of memory before even starting the first epoch, so I changed the batch_size to 1 (a batch_size of 2 was still too much). Partway through the first epoch (around iteration 1440) I started getting "out of memory" errors like:

```
INFO - 2021-02-22 12:51:28,348 - trainer - Train Epoch: 1 [1440/7317], Current Loss: 1.157e+00 Pos: 0.365 Neg: 0.792  Data time: 0.0536, Train time: 0.5614, Iter time: 0.6150
Traceback (most recent call last):
  File "train.py", line 84, in <module>
    main(config)
  File "train.py", line 63, in main
    trainer.train()
  File "/home/f/repos/FCGF/lib/trainer.py", line 132, in train
    self._train_epoch(epoch)
  File "/home/f/repos/FCGF/lib/trainer.py", line 492, in _train_epoch
    self.config.batch_size)
  File "/home/f/repos/FCGF/lib/trainer.py", line 427, in contrastive_hardest_negative_loss
    D01 = pdist(posF0, subF1, dist_type='L2')
  File "/home/f/repos/FCGF/lib/metrics.py", line 24, in pdist
    D2 = torch.sum((A.unsqueeze(1) - B.unsqueeze(0)).pow(2), 2)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 3.82 GiB total capacity; 744.27 MiB already allocated; 43.38 MiB free; 814.00 MiB reserved in total by PyTorch)
```
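For reference on where the memory goes: the failing line in lib/metrics.py broadcasts A against B, so it materializes the full (N, M, C) difference tensor before summing over the feature dimension, which means peak memory scales with N·M·C regardless of batch_size. Below is a rough sketch of a lower-memory way to get the same squared distances by processing A in row chunks, so no (N, M, C) intermediate is ever built. The name `pdist_chunked` and the chunk size are illustrative only, not part of FCGF, and this skips the `dist_type` handling the real `pdist` has:

```python
import torch

def pdist_chunked(A, B, chunk=256):
    """Squared L2 distances between rows of A (N, C) and rows of B (M, C),
    computed in row chunks of A so only (chunk, M) blocks live at once."""
    rows = []
    for start in range(0, A.shape[0], chunk):
        a = A[start:start + chunk]  # (n, C) with n <= chunk
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, no (n, M, C) intermediate
        d2 = (a.pow(2).sum(1, keepdim=True)
              + B.pow(2).sum(1)
              - 2.0 * a @ B.t()).clamp_(min=0)
        rows.append(d2)
    return torch.cat(rows, 0)  # (N, M)
```

torch.cdist(A, B) (which returns distances rather than squared distances) would serve a similar purpose if its memory behaviour is acceptable on this GPU.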

Currently my system takes up about 500 MiB of VRAM on my GTX 1650 (4 GB) and the rest is used by PyTorch. I'm running PyTorch 1.7 in a Python 3.7 conda environment. I tried building MinkowskiEngine against CUDA 11.2, and I'm currently running CUDA 10.2, but both gave the same error.
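To see how the rest of the card splits between live tensors and the caching allocator's reserved pool at the point of failure, a generic snippet like this could be dropped in right before the failing call (these are standard torch.cuda calls, nothing FCGF-specific):

```python
import torch

# What PyTorch itself holds on GPU 0: live tensors ("allocated") vs.
# memory the caching allocator keeps around for reuse ("reserved").
dev = torch.device("cuda:0")
print(torch.cuda.get_device_name(dev))
print(f"allocated: {torch.cuda.memory_allocated(dev) / 2**20:.1f} MiB")
print(f"reserved : {torch.cuda.memory_reserved(dev) / 2**20:.1f} MiB")
print(torch.cuda.memory_summary(dev, abbreviated=True))
```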

On a side note: isn't it bad to train with a batch size of only 1? Wouldn't that cause poor convergence?
