Memory Issue #10

Open · MatteoTomassetti opened this issue Feb 21, 2017 · 7 comments

@MatteoTomassetti
Hi,
Thank you for sharing your code publicly, but I'm having some memory issues when running it on AWS.

I'm spinning up a g2.2xlarge instance on AWS and trying to run your code on only the first 1000 lines of news.2011.en.shuffled.

Have you ever gotten an error message like the one below? If so, is there a way to change the parameters to avoid it, or should I select another type of AWS instance?

Just for completeness, these are the parameters I was trying to test:

NUMBER_OF_ITERATIONS = 20000
EPOCHS_PER_ITERATION = 5
RNN = recurrent.LSTM
INPUT_LAYERS = 2
OUTPUT_LAYERS = 2
AMOUNT_OF_DROPOUT = 0.3
BATCH_SIZE = 500
HIDDEN_SIZE = 700
INITIALIZATION = "he_normal" # : Gaussian initialization scaled by fan_in (He et al., 2014)
MAX_INPUT_LEN = 40
MIN_INPUT_LEN = 3
INVERTED = True
AMOUNT_OF_NOISE = 0.2 / MAX_INPUT_LEN
NUMBER_OF_CHARS = 100 # 75

And this is the error that I'm getting:

Iteration 1
Train on 3376 samples, validate on 376 samples
Epoch 1/5
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
RuntimeError: Cuda error: GpuElemwise node_m71c627ae87c918771aac75471af66509_0 Add: out of memory.
    n_blocks=30 threads_per_block=256
   Call: kernel_Add_node_m71c627ae87c918771aac75471af66509_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, local_dims[0], local_dims[1], i0_data, local_str[0][0], local_str[0][1], i1_data, local_str[1][0], local_str[1][1], o0_data, local_ostr[0][0], local_ostr[0][1])


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in main_news
  File "<stdin>", line 8, in iterate_training
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/models.py", line 672, in fit
    initial_epoch=initial_epoch)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1196, in fit
    initial_epoch=initial_epoch)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 891, in _fit_loop
    outs = f(ins_batch)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 959, in __call__
    return self.function(*inputs)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
RuntimeError: Cuda error: GpuElemwise node_m71c627ae87c918771aac75471af66509_0 Add: out of memory.
    n_blocks=30 threads_per_block=256
   Call: kernel_Add_node_m71c627ae87c918771aac75471af66509_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, local_dims[0], local_dims[1], i0_data, local_str[0][0], local_str[0][1], i1_data, local_str[1][0], local_str[1][1], o0_data, local_ostr[0][0], local_ostr[0][1])

Apply node that caused the error: GpuElemwise{add,no_inplace}(GpuDot22.0, GpuDimShuffle{x,0}.0)
Toposort index: 207
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, row)]
Inputs shapes: [(20000, 700), (1, 700)]
Inputs strides: [(700, 1), (0, 1)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuReshape{3}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
@parth126 commented Feb 27, 2017

@MatteoTomassetti This is an out-of-memory issue. I believe reducing the batch size to 25-50 should solve it.
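
For illustration, that would just be a change to the configuration in the original post (32 is one arbitrary value in the suggested range, and the fit() call in the comment is only an example of where BATCH_SIZE ends up being used):

BATCH_SIZE = 32  # was 500; each training step now keeps far fewer sequences in GPU memory

# BATCH_SIZE is later handed to Keras fit(), e.g.:
# model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=EPOCHS_PER_ITERATION)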

@MajorTal (Owner) commented Feb 28, 2017 via email

@MatteoTomassetti (Author) commented Feb 28, 2017

Thanks @parth126 and @MajorTal! I was wondering, based on your experience, what average running time should I expect for one epoch when training on the entire news.2011.en.shuffled dataset?
My problem is that when I ran the code for just one epoch and extrapolated the time it would take to reach 20,000 iterations, I was left with years of training!

@MajorTal (Owner)

I just moved to news.2013.en.shuffled (much larger) - I'll update the code to reflect that.
It is so large that I split the epochs into mini-epochs that each cover about 1% of the data (because I save the model after each epoch, and because it is taking so long...).
These mini-epochs are configured to run for about 30 minutes each.
After about 2 hours you already see meaningful results (about 85% accuracy).
I used this AMI to train the system: https://aws.amazon.com/marketplace/pp/B06VSPXKDX
on an AWS EC2 p2.xlarge instance (currently at $0.90 per hour).
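
(For illustration, a minimal sketch of what such a mini-epoch loop could look like; the corpus path, the vectorize helper, and the checkpoint filename are hypothetical and only stand in for the repository's actual routines:)

# Hypothetical mini-epoch loop: train on roughly 1% of the corpus at a time
# and checkpoint after every slice, so long runs can be inspected early.
FRACTION_PER_MINI_EPOCH = 0.01

with open("news.2013.en.shuffled") as corpus_file:
    lines = corpus_file.read().splitlines()

slice_size = max(1, int(len(lines) * FRACTION_PER_MINI_EPOCH))

for iteration in range(NUMBER_OF_ITERATIONS):
    start = (iteration * slice_size) % len(lines)
    chunk = lines[start:start + slice_size]
    # 'vectorize' stands in for whatever routine turns text lines into the
    # (noisy input, clean target) tensors the model expects.
    X_chunk, y_chunk = vectorize(chunk)
    model.fit(X_chunk, y_chunk,
              batch_size=BATCH_SIZE,
              nb_epoch=1,            # 'epochs=1' in Keras 2
              validation_split=0.1)
    model.save("keras_spell_iter_{:05d}.h5".format(iteration))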

@FMFluke commented Nov 16, 2017

@MatteoTomassetti @MajorTal I was running this exact code with the default news.2013.en.shuffled dataset (I changed almost nothing except updating the Keras API calls to a newer version and adapting the code to be Python 3 compatible). After almost 2 days of training (at a reasonable speed, using Azure with a K80), the accuracy is stuck at about 47-48%. I also noticed that while it was able to fix many spelling mistakes, it always repeats the last character or just adds trailing periods to the prediction, and the prediction is therefore marked as wrong. Do you have any idea what could be happening? I have been looking around and could not find a good answer.

@MajorTal (Owner) commented Nov 17, 2017 via email

@FMFluke commented Nov 18, 2017

OK, but then how did you make the model exclude those periods when calculating the accuracy? How exactly did you strip them off?
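
(For illustration only - not the author's confirmed method - one common way to exclude trailing padding when scoring, assuming '.' or a space is what trails the predictions:)

def strip_trailing_pad(text, pad_chars=". "):
    # Hypothetical helper: drop trailing padding/period characters before scoring.
    return text.rstrip(pad_chars)

def sequence_accuracy(predictions, targets):
    # Count a prediction as correct only if it matches the target exactly
    # after both have had their trailing padding stripped.
    correct = sum(strip_trailing_pad(p) == strip_trailing_pad(t)
                  for p, t in zip(predictions, targets))
    return correct / float(len(targets))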
