Out of memory error #1

Open
ghost opened this issue Jun 7, 2016 · 1 comment

ghost commented Jun 7, 2016

When I execute

CUDA_VISIBLE_DEVICES=0 th feedforward_neural_doodle.lua -model_name skip_noise_4 -masks_hdf5 data/starry/gen_doodles.hdf5 -batch_size 4 -num_mask_noise_times 0 -num_noise_channels 0 -learning_rate 1e-1 -half false

I get the following result:

[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded data/pretrained/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Setting up style layer 2 : relu1_1
Replacing max pooling at layer 5 with average pooling
Setting up style layer 7 : relu2_1
Replacing max pooling at layer 10 with average pooling
Setting up style layer 12 : relu3_1
Replacing max pooling at layer 19 with average pooling
Setting up style layer 21 : relu4_1
Replacing max pooling at layer 28 with average pooling
Setting up style layer 30 : relu5_1
Optimize
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-7288/cutorch/lib/THC/generic/THCStorage.cu line=41 error=2 : out of memory
/home/andrew/torch/install/bin/luajit: /home/andrew/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
/home/andrew/torch/install/share/lua/5.1/nn/THNN.lua:109: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7288/cutorch/lib/THC/generic/THCStorage.cu:41
stack traceback:
[C]: in function 'v'
/home/andrew/torch/install/share/lua/5.1/nn/THNN.lua:109: in function 'SpatialReplicationPadding_updateGradInput'
...h/install/share/lua/5.1/nn/SpatialReplicationPadding.lua:41: in function 'updateGradInput'
/home/andrew/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/andrew/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/andrew/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/andrew/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/andrew/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/andrew/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/andrew/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
feedforward_neural_doodle.lua:167: in function 'opfunc'
/home/andrew/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'optim_method'
feedforward_neural_doodle.lua:199: in main chunk
[C]: in function 'dofile'
...drew/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

I'm running with multiple GTX 980s, so GPU memory should not be an issue.

I have tried running with both -backend cudnn and -backend nn, with no difference in the outcome.

I have been able to run the fast-neural-doodle project with no problems on my machine, so prerequisites such as Python, Torch and CUDA appear to have been set up correctly.

Any idea of the cause of this problem?

DmitryUlyanov (Owner) commented Jun 7, 2016

Hello, I tested everything on a 12 GB card, so all the parameters are tuned to work in my setup. You can try decreasing batch_size to 1 to see whether it still fails, but training will be much worse with such a small batch size.
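
For example, the command from the report with only -batch_size lowered to 1 (assuming no other flags need to change):

CUDA_VISIBLE_DEVICES=0 th feedforward_neural_doodle.lua -model_name skip_noise_4 -masks_hdf5 data/starry/gen_doodles.hdf5 -batch_size 1 -num_mask_noise_times 0 -num_noise_channels 0 -learning_rate 1e-1 -half false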

You can also reduce the image size to decrease memory consumption. I used 512px images, but you can go to any dimensions that are a multiple of 32; try 384px, for example, with the same batch_size.
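
As a rough estimate: 384 = 12 × 32, so it satisfies the multiple-of-32 constraint, and since activation memory grows roughly with the number of pixels, a 384x384 input needs about (384/512)² ≈ 56% of the memory of a 512x512 input at the same batch size.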
