cannot replicate convergence #3

Open
denizyuret opened this issue Oct 11, 2018 · 6 comments

@denizyuret
Contributor

@OsmanMutlu when I try to train from scratch I do not seem to get the convergence behavior described in README.md. Can you try as well?

denizyuret assigned and unassigned denizyuret on Oct 11, 2018
@denizyuret
Contributor Author

@ereday Thinking that I may have broken something, I went back to Julia 0.6 and tried training the model with the last commit before the Julia 1.0 transition. I got to 59% at the end of 30 epochs; with the latest commit I got to 64%. Do you remember the specific version/commit I can use to replicate your results from the README, so I can debug what is going on?

@ereday
Member

ereday commented Oct 12, 2018

Hi, I checked the Knet version I am using: according to NEWS.md, it is Knet v0.9.1. Unfortunately, when I run git log I get a "fatal: your current branch appears to be broken" error on the cluster. For AutoGrad, the commit is 823ea162c829402b0aaf7a7d9e4145f170fdd79b. After your issue, I sent another job today to train the model from scratch (now using a slower GPU than the K80; 1 epoch takes ~30 mins). It is currently on the 19th epoch and its dev set accuracy is 56.27%. I'll let you know when it is over. You can find the log file and the saved model with the specified accuracy at the following path: /kuacc/users/edayanik16/relnet/saved_models.

@ereday
Member

ereday commented Oct 16, 2018

I ran a couple of experiments using exactly the same script and code as in the repository (environment: Julia 0.6.2, Knet v0.9.1). The chart below shows the results I obtained. As you said, they are not the same as the one shared in the README; however, the model did not get stuck around ~60%. By the end of training I generally obtained around ~91% accuracy on the dev set.

I remembered that I trained this model (and obtained the corresponding learning curve) on the old cluster (somon & kuacctest), which means I might have used even older versions of Knet & AutoGrad. One possible cause is the change in dropout usage. The forget gate bias values of the LSTM might also affect the results: as far as I remember, I was setting them to 1.0 manually on the old cluster (by changing the Knet source code). If one of these is the problem, playing with the hyperparameters and the seed might be enough to recover the lost performance, which is what I am currently doing. If I get an improvement, I'll post it here too. I don't think anything serious happened, since we are still able to reach 91%. The saved models can be found here.

[validation accuracy chart]
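For reference, a minimal sketch of the forget-gate bias trick mentioned above, assuming a hand-rolled LSTM whose gate pre-activations are stacked as [input; forget; cell; output] blocks of size hidden each. This only illustrates the idea; it is not the actual edit that was made inside Knet's source:

```julia
# Sketch: initialize LSTM parameters with the forget gate bias set to 1.0.
# Assumes a hand-rolled LSTM whose gates are stacked as
# [input; forget; cell; output] blocks of size `hidden` each (an assumption,
# not Knet's internal parameter layout).
function initlstm(input::Int, hidden::Int; winit=0.1, forget_bias=1.0)
    W = winit * randn(4hidden, input + hidden)  # input + recurrent weights
    b = zeros(4hidden)                          # all gate biases start at 0
    b[hidden+1:2hidden] .= forget_bias          # forget-gate slice -> 1.0
    return W, b
end

W, b = initlstm(300, 128)
```

Initializing the forget gate bias to 1.0 keeps the cell state from being forgotten too aggressively early in training, which is the usual motivation for this trick.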

@denizyuret
Contributor Author

denizyuret commented Oct 16, 2018 via email

@denizyuret
Contributor Author

I can confirm similar results with Julia 1.0. Here are the new numbers alongside the old values for comparison; the accuracy never exceeds 90%. Could this be a dropout problem? (I no longer decide automatically when to apply dropout; see the sketch after the table below.)

Epoch | Val accuracy (old) | Val accuracy (Julia 1.0)
  1   | 44.07%             | 43.63%
  5   | 47.50%             | 46.55%
 15   | 57.69%             | 54.23%
 25   | 79.60%             | 57.95%
 40   | 93.21%             | 69.91%
 65   | 94.50%             | 87.25%
100   | -                  | 89.88%
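To make that concrete, here is a minimal sketch of handling dropout with an explicit training flag instead of relying on any automatic detection; predict, w, and pdrop are placeholder names, not the actual code in this repository:

```julia
using Knet

# Sketch: apply dropout only while training, identity at evaluation time.
# `predict`, `w`, and `pdrop` are placeholders, not this repository's code.
function predict(w, x; train=false, pdrop=0.5)
    h = relu.(w[1] * x .+ w[2])
    if train
        h = dropout(h, pdrop)   # Knet's dropout; skipped at test time here
    end
    return w[3] * h .+ w[4]
end
```

Making the flag explicit removes any dependence on how the framework detects training vs. evaluation internally.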

@ereday
Member

ereday commented Oct 23, 2018

I was also thinking dropout at first, but then I compared the training set loss values of the model stated in the README against the 3 models I shared above. On the one hand, all 3 models have higher loss values during training, which might be a sign of too much dropout; on the other hand, if we decrease the dropout rate the models start to overfit even more. So I started to think something else might be causing the drop in validation set performance. I have also tried smaller dropout rates to check this empirically, and I did not get 94% accuracy on the val set. Could the initialization of the RNN's forget gates (as I said above) be the reason?
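For completeness, a small sketch of the kind of dropout-rate/seed sweep described above; train_model is a hypothetical stand-in for this repository's training entry point and is assumed to return the best dev-set accuracy of a run:

```julia
using Random

# Sketch of a small grid search over dropout rate and random seed.
# `train_model` is a hypothetical stand-in for the repository's training
# entry point; it is assumed to return the best dev-set accuracy of a run.
function sweep(train_model)
    best_acc, best_cfg = -Inf, (0.0, 0)
    for pdrop in (0.0, 0.1, 0.3, 0.5), seed in (1, 2, 3)
        Random.seed!(seed)            # Julia 1.0; use srand(seed) on 0.6
        acc = train_model(pdrop=pdrop, seed=seed)
        if acc > best_acc
            best_acc, best_cfg = acc, (pdrop, seed)
        end
    end
    return best_acc, best_cfg
end
```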
