Instability during training #6

Open
rrsayao opened this issue Jul 5, 2018 · 4 comments

rrsayao commented Jul 5, 2018

I'm fairly new to this, and for some reason I'm having crazy instability issues during training. At some points I've seen validation accuracy drop by more than 10%.

It's a many-to-many problem similar to POS tagging (but with a much smaller vocabulary). The input is an array of 40 integers (zero-padded), and the output is an array of 40 one-hot vectors. Any idea what I'm doing wrong?

from keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout
from keras.models import Model
# AttentionDecoder is the custom layer from this repository

max_seqlen = 40
s_vocabsize = 17
t_vocabsize = 124

embed_size = 64
hidden_size = 128

# Integer-encoded, zero-padded input sequences of length 40
input_ = Input(shape=(max_seqlen,), dtype='float32')
input_embed = Embedding(s_vocabsize, embed_size, input_length=max_seqlen, mask_zero=True)(input_)

bi_lstm = Bidirectional(LSTM(hidden_size, dropout=0.2, recurrent_dropout=0.2, return_sequences=True), merge_mode='concat')(input_embed)
dropout = Dropout(0.8)(bi_lstm)

y_hat = AttentionDecoder(hidden_size, alphabet_size=t_vocabsize, embedding_dim=embed_size)(dropout)

model = Model(inputs=input_, outputs=y_hat)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

asmekal commented Jul 6, 2018

Sounds like gradient explosion. Common recommendations are to apply gradient clipping and/or reduce the learning rate, for example optimizer=Adam(lr=5e-4, clipnorm=5).
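For reference, a minimal sketch of what that suggestion looks like when compiling the model from the first comment (a sketch only, assuming standalone Keras 2.x; clipnorm clips each gradient tensor's L2 norm):

from keras.optimizers import Adam

# Lower learning rate plus gradient norm clipping to tame exploding gradients
optimizer = Adam(lr=5e-4, clipnorm=5.0)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])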


rrsayao commented Jul 8, 2018

I'm still trying but so far haven't been able to solve it.

I forgot to mention that I'm also getting a warning when I call fit_generator on this model:

Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Can you tell whether I'm using your code incorrectly in any way? E.g. are there restrictions such as masking 0 (padding) in the input, one-hot outputs only, etc.?


asmekal commented Jul 9, 2018

Dropout(0.8) is probably too much; I haven't seen anything above 0.5 used anywhere so far. Also, in the input embedding layer your embedding size is ~4 times larger than s_vocabsize; what is that for? I suspect a one-hot input encoding would give better performance/convergence. But none of the above should affect AttentionDecoder itself.
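For illustration, a minimal sketch of the one-hot input idea (an assumption-laden sketch, not code from this repo: it assumes standalone Keras 2.x with a TensorFlow backend and reuses the variable names from the first comment; note that a Lambda layer does not carry the mask that mask_zero=True would provide):

from keras.layers import Input, Lambda
from keras import backend as K

input_ = Input(shape=(max_seqlen,), dtype='int32')
# Represent each token index as a fixed one-hot vector instead of a trainable embedding
input_onehot = Lambda(lambda x: K.one_hot(K.cast(x, 'int32'), s_vocabsize),
                      output_shape=(max_seqlen, s_vocabsize))(input_)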

The warning you receive is caused by my somewhat inefficient implementation of embeddings in AttentionDecoder; it may slow down training, but it should not cause the instability you mentioned.

The problem is probably caused by masking; can you try removing mask_zero=True? Actually, I cannot say that for sure, as I haven't experimented with masking on the layer. Could you please share the results afterwards?

PS: I cannot see the picture you attached to the initial comment ("crazy instability issues"), so I don't fully understand how severe the instability is.


rrsayao commented Jul 12, 2018

Sorry for the late reply.

While I can remove mask_zero=True, it would lead to bad results: my whole sequence is zero-padded, and all the correctly predicted padding zeros would distort the loss.

I should test it out tomorrow. This is the picture I mentioned in the first post.
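For completeness, one common way to drop mask_zero=True without letting the padded positions dominate the loss is per-timestep sample weighting. A minimal sketch (assuming standalone Keras 2.x; X is the zero-padded integer input array and Y the one-hot targets, both hypothetical names):

# Weight each timestep: 0 for padding positions, 1 for real tokens
sample_weight = (X != 0).astype('float32')  # shape: (num_samples, max_seqlen)

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'],
              sample_weight_mode='temporal')  # enable per-timestep weights

model.fit(X, Y, batch_size=32, epochs=10, sample_weight=sample_weight)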
