Hello, when I train the code on my single RTX 4090, it hits an OOM error. Even after setting the batch size to 2, the problem persists. Does anyone know how to fix this, or is there something I forgot to do?
Thanks to anyone who reads this issue. I can run Rectifiedflow successfully, but I can't get this code to work, so I'm opening this issue.
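For reference, this is roughly the only change I made to reduce the batch size (a minimal sketch; the module path and the training.batch_size field are what I assume the repo's default CIFAR-10 config uses):

from configs.default_cifar10_configs import get_default_configs

def get_config():
  # Start from the repo's default CIFAR-10 settings and only shrink the batch size.
  config = get_default_configs()
  config.training.batch_size = 2  # reduced from the default, but the OOM still happens
  return config

Here is the full traceback: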
Traceback (most recent call last):
File "./main.py", line 68, in <module>
app.run(main)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "./main.py", line 59, in main
run_lib.train(FLAGS.config, FLAGS.workdir)
File "/media/user/2tb/score_sde_pytorch/run_lib.py", line 131, in train
loss = train_step_fn(state, batch)
File "/media/user/2tb/score_sde_pytorch/losses.py", line 195, in step_fn
loss = loss_fn(model, batch)
File "/media/user/2tb/score_sde_pytorch/losses.py", line 118, in loss_fn
score = model_fn(perturbed_data, labels)
File "/media/user/2tb/score_sde_pytorch/models/utils.py", line 124, in model_fn
return model(x, labels)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/user/2tb/score_sde_pytorch/models/ncsnpp.py", line 275, in forward
h = modules[m_idx](hs[-1], temb)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/user/2tb/score_sde_pytorch/models/layerspp.py", line 265, in forward
h = self.Dropout_0(h)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 59, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.64 GiB total capacity; 1.86 GiB already allocated; 62.00 MiB free; 1.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-12-15 16:50:22.245628: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
The process exits with status 1 (elapsed 3.615 s). The command I ran:
python ./main.py --config ./configs/ve/cifar10_ncsnpp.py --eval_folder eval --mode train --workdir ./logs
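One thing that confuses me about the OOM message: PyTorch reports only ~1.87 GiB allocated/reserved out of the 23.64 GiB on the card, yet only 62 MiB is free, so something outside PyTorch seems to be holding most of the memory. Since the data pipeline in this repo goes through TensorFlow, I suspect TF may be reserving the GPU for itself. A minimal sketch of what I mean (tf.config.set_visible_devices is standard TF 2.x; I'm not sure whether the repo already does this somewhere):

# Keep TensorFlow (only used here for the tf.data input pipeline) off the GPU entirely,
# so PyTorch can use the full 24 GiB. This would go near the top of main.py.
import tensorflow as tf
tf.config.set_visible_devices([], 'GPU')  # hide all GPUs from TensorFlow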