Instability during training #6

Open
rrsayao opened this issue Jul 5, 2018 · 4 comments

rrsayao commented Jul 5, 2018

I'm fairly new to this, and for some reason I'm having crazy instability issues during training. At some points I've seen validation accuracy drop by more than 10%.

It's a many-to-many problem similar to POS tagging (but with a much smaller vocabulary). The input is an array of 40 integers (zero-padded), and the output is an array of 40 one-hot vectors. Any idea what I'm doing wrong?

from keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout
from keras.models import Model
# AttentionDecoder is the custom layer from this repository

max_seqlen = 40
s_vocabsize = 17
t_vocabsize = 124

embed_size = 64
hidden_size = 128

# Integer-encoded, zero-padded input sequences of length 40
input_ = Input(shape=(max_seqlen,), dtype='float32')
input_embed = Embedding(s_vocabsize, embed_size, input_length=max_seqlen, mask_zero=True)(input_)

bi_lstm = Bidirectional(LSTM(hidden_size, dropout=0.2, recurrent_dropout=0.2, return_sequences=True), merge_mode='concat')(input_embed)
dropout = Dropout(0.8)(bi_lstm)

y_hat = AttentionDecoder(hidden_size, alphabet_size=t_vocabsize, embedding_dim=embed_size)(dropout)

model = Model(inputs=input_, outputs=y_hat)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

asmekal commented Jul 6, 2018

Sounds like gradient explosion. Common recommendations are to apply gradient clipping and/or reduce the learning rate, for example optimizer=Adam(lr=5e-4, clipnorm=5).
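For reference, a minimal sketch of what that suggestion looks like when compiling the model from the first comment (a sketch only, assuming standalone Keras 2.x; clipnorm clips each gradient tensor's L2 norm):

from keras.optimizers import Adam

# Lower learning rate plus gradient norm clipping to tame exploding gradients
optimizer = Adam(lr=5e-4, clipnorm=5.0)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])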


rrsayao commented Jul 8, 2018

I'm still trying but so far haven't been able to solve it.

I forgot to mention that I'm also getting a warning when I call fit_generator on this model:

Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Can you tell whether I'm using your code incorrectly in any way? E.g. are there restrictions such as masking 0 (padding) in the input, one-hot outputs only, etc.?


asmekal commented Jul 9, 2018

Dropout(0.8) is probably too much; I haven't seen anything above 0.5 used anywhere so far. Also, in the input embedding layer your embedding size is ~4 times larger than s_vocabsize; what is that for? I suspect a one-hot input encoding would give better performance/convergence. But none of the above should affect AttentionDecoder itself.
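For illustration, a minimal sketch of the one-hot input idea (an assumption-laden sketch, not code from this repo: it assumes standalone Keras 2.x with a TensorFlow backend and reuses the variable names from the first comment; note that a Lambda layer does not carry the mask that mask_zero=True would provide):

from keras.layers import Input, Lambda
from keras import backend as K

input_ = Input(shape=(max_seqlen,), dtype='int32')
# Represent each token index as a fixed one-hot vector instead of a trainable embedding
input_onehot = Lambda(lambda x: K.one_hot(K.cast(x, 'int32'), s_vocabsize),
                      output_shape=(max_seqlen, s_vocabsize))(input_)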

The warning you receive is caused by my somewhat inefficient implementation of embeddings in AttentionDecoder; it may slow down training, but it should not cause the instability you mentioned.

The problem is probably caused by masking; can you try removing mask_zero=True? Actually, I cannot say that for sure, as I haven't experimented with masking on the layer. Could you please share the results afterwards?

PS: I cannot see the picture you attached to the initial comment ("crazy instability issues"), so I don't fully understand how severe the instability is.


rrsayao commented Jul 12, 2018

Sorry for the late reply.

While I can remove mask_zero=True, it would lead to bad results: my whole sequence is zero-padded, and all the correctly predicted padding zeros would distort the loss.

I should test it out tomorrow. This is the picture I mentioned in the first post.
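For completeness, one common way to drop mask_zero=True without letting the padded positions dominate the loss is per-timestep sample weighting. A minimal sketch (assuming standalone Keras 2.x; X is the zero-padded integer input array and Y the one-hot targets, both hypothetical names):

# Weight each timestep: 0 for padding positions, 1 for real tokens
sample_weight = (X != 0).astype('float32')  # shape: (num_samples, max_seqlen)

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'],
              sample_weight_mode='temporal')  # enable per-timestep weights

model.fit(X, Y, batch_size=32, epochs=10, sample_weight=sample_weight)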
