OutOfMemoryError #71

Open
shih-huai opened this issue Dec 15, 2024 · 0 comments

Hello, when I train this code on my single RTX 4090, it fails with an out-of-memory (OOM) error. Even with the batch size set to 2, the problem persists. Does anyone know how to solve it, or is there something I forgot to do?

Thanks to anyone who reads this issue. I can run Rectifiedflow successfully, but I can't get this code to work, which is frustrating, so I opened this issue.

Traceback (most recent call last):
  File "./main.py", line 68, in <module>
    app.run(main)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "./main.py", line 59, in main
    run_lib.train(FLAGS.config, FLAGS.workdir)
  File "/media/user/2tb/score_sde_pytorch/run_lib.py", line 131, in train
    loss = train_step_fn(state, batch)
  File "/media/user/2tb/score_sde_pytorch/losses.py", line 195, in step_fn
    loss = loss_fn(model, batch)
  File "/media/user/2tb/score_sde_pytorch/losses.py", line 118, in loss_fn
    score = model_fn(perturbed_data, labels)
  File "/media/user/2tb/score_sde_pytorch/models/utils.py", line 124, in model_fn
    return model(x, labels)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/user/2tb/score_sde_pytorch/models/ncsnpp.py", line 275, in forward
    h = modules[m_idx](hs[-1], temb)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/user/2tb/score_sde_pytorch/models/layerspp.py", line 265, in forward
    h = self.Dropout_0(h)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 59, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.64 GiB total capacity; 1.86 GiB already allocated; 62.00 MiB free; 1.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-12-15 16:50:22.245628: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
[ble: exit 1][ble: elapsed 3.615s (CPU 452.1%)] python ./main.py --config ./configs/ve/cifar10_ncsnpp.py --eval_folder eval --mode train --workdir ./logs
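
A rough reading of the numbers in the error message may help narrow this down. PyTorch reports only 1.87 GiB reserved and 62 MiB free on a 23.64 GiB card, so most of the memory appears to be held by something other than PyTorch's allocator. The sketch below just redoes that arithmetic from the message text; the helper names `parse_gib` and `unaccounted_gib` are hypothetical and not part of this repository. What actually holds the remaining memory is not shown in the log. Since the TensorFlow warning indicates TensorFlow is loaded in the same process, one possibility (an assumption, not confirmed here) is that it, or another process on the GPU, has claimed it.

```python
import re

# Error text copied from the traceback above (only the part with the numbers).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 64.00 MiB "
    "(GPU 0; 23.64 GiB total capacity; 1.86 GiB already allocated; "
    "62.00 MiB free; 1.87 GiB reserved in total by PyTorch)"
)

# Conversion factors to GiB for the two units that appear in the message.
UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_gib(message, label):
    """Return the size that directly precedes `label` in the message, in GiB."""
    match = re.search(r"([\d.]+) (MiB|GiB) " + re.escape(label), message)
    if match is None:
        raise ValueError(f"no size found before {label!r}")
    return float(match.group(1)) * UNIT_TO_GIB[match.group(2)]


def unaccounted_gib(message):
    """Memory that is neither free nor reserved by PyTorch, in GiB."""
    total = parse_gib(message, "total capacity")
    free = parse_gib(message, "free")
    reserved = parse_gib(message, "reserved in total by PyTorch")
    return total - free - reserved


if __name__ == "__main__":
    print(f"{unaccounted_gib(OOM_MESSAGE):.2f} GiB held outside PyTorch's allocator")
```

For the message above this comes to roughly 21.7 GiB, which suggests checking what else is using GPU 0 (for example with `nvidia-smi`) before lowering the batch size further.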