CUDA out of memory #205

Open
itzzy opened this issue Aug 28, 2024 · 0 comments
itzzy commented Aug 28, 2024

I am running the code on a single machine with an A100 (80 GB of GPU memory), and I hit the following error:
Traceback (most recent call last):
  File "main_fft_pretrain.py", line 302, in <module>
    main(args)
  File "main_fft_pretrain.py", line 270, in main
    train_stats = train_one_epoch(
  File "/data0/zhiyong/code/github/mae/engine_pretrain.py", line 48, in train_one_epoch
    loss, _, _ = model(samples, mask_ratio=args.mask_ratio)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data0/zhiyong/code/github/mae/models_fft_2.py", line 641, in forward
    latent, mask, ids_restore = self.forward_encoder(imgs, mask_ratio)
  File "/data0/zhiyong/code/github/mae/models_fft_2.py", line 545, in forward_encoder
    x_combined = blk(x_combined)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 165, in forward
    x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhiyongzhang/anaconda3/envs/mae/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 99, in forward
    attn = attn.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 8.90 GiB (GPU 0; 79.21 GiB total capacity; 60.10 GiB already allocated; 7.09 GiB free; 60.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
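
The figures in the error message can be cross-checked with a little arithmetic. The sketch below is illustrative only: it parses the message text quoted above with regular expressions and derives three gaps from it. The helper name `parse_oom` and the dictionary keys are made up for this example and are not part of PyTorch.

```python
import re

# Text copied from the RuntimeError above (only the part with the numbers).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 8.90 GiB (GPU 0; 79.21 GiB total "
    "capacity; 60.10 GiB already allocated; 7.09 GiB free; 60.19 GiB reserved "
    "in total by PyTorch)"
)


def parse_oom(message):
    """Hypothetical helper: pull the GiB figures out of an OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(pat, message).group(1))
            for key, pat in patterns.items()}


stats = parse_oom(OOM_MESSAGE)

# Memory cached by the allocator but not in use (a rough fragmentation hint).
cached_unused = stats["reserved"] - stats["allocated"]
# How far the failed request exceeds what the GPU reports as free.
shortfall = stats["requested"] - stats["free"]
# Memory that is neither free nor reserved by this process's allocator.
outside_allocator = stats["total"] - stats["reserved"] - stats["free"]

print(f"cached but unused:  {cached_unused:.2f} GiB")
print(f"request shortfall:  {shortfall:.2f} GiB")
print(f"outside allocator:  {outside_allocator:.2f} GiB")
```

With these numbers, reserved exceeds allocated by only about 0.09 GiB, so the "reserved memory is >> allocated memory" condition in the message does not appear to hold, and `max_split_size_mb` would probably not help much here. The request exceeds free memory by about 1.81 GiB, and roughly 11.9 GiB is held outside this process's PyTorch allocator (possibly the CUDA context or another process; the report does not say which).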
My args:
python main_fft_pretrain.py --world_size 2 --batch_size 4 --model mae_vit_fft_base_patch16 --norm_pix_loss --mask_ratio 0.75 --epochs 800 --warmup_epochs 40 --blr 1.5e-4 --weight_decay 0.05 --data_path /data0/zhiyong/data/imagenetResize
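
The allocation fails at `attn.softmax(dim=-1)`, where timm's attention holds a tensor of shape (batch, heads, tokens, tokens). A rough sketch of how large that tensor gets, and of the sequence length that a single 8.90 GiB allocation would imply, is given below. It rests on assumptions not stated in the report: 12 attention heads (the usual ViT-Base setting, which the custom `mae_vit_fft_base_patch16` model may or may not keep), 4-byte float32 elements, and `--batch_size 4` being the per-GPU batch. The function names are hypothetical.

```python
import math

GIB = 2 ** 30  # bytes per GiB


def attention_matrix_gib(batch, heads, tokens, bytes_per_elem=4):
    """Size in GiB of one (batch, heads, tokens, tokens) attention tensor."""
    return batch * heads * tokens * tokens * bytes_per_elem / GIB


def tokens_for_allocation(alloc_gib, batch, heads, bytes_per_elem=4):
    """Sequence length whose attention tensor would occupy alloc_gib."""
    return math.sqrt(alloc_gib * GIB / (batch * heads * bytes_per_elem))


# Assumed values: batch 4 from the command line, 12 heads for a Base model.
implied_tokens = tokens_for_allocation(8.90, batch=4, heads=12)
print(f"implied sequence length: about {implied_tokens:.0f} tokens")

# For comparison: 50 tokens, i.e. 49 visible patches plus a class token,
# as in a 224x224 patch-16 encoder at mask ratio 0.75.
print(f"50-token attention tensor: {attention_matrix_gib(4, 12, 50):.6f} GiB")
```

Under these assumptions the failed allocation corresponds to roughly 7,000 tokens per sample. Because the tensor grows with the square of the token count, the length of `x_combined` in `forward_encoder` is worth checking before lowering the batch size further.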
