cannot replicate convergence #3

Open
denizyuret opened this issue Oct 11, 2018 · 6 comments

@denizyuret
Contributor

@OsmanMutlu when I try to train from scratch I do not seem to get the convergence behavior described in README.md. Can you try as well?

denizyuret assigned and unassigned denizyuret on Oct 11, 2018
@denizyuret
Contributor Author

@ereday Thinking that I may have broken something, I went back to Julia 0.6 and tried training the model with the last commit before the Julia 1.0 transition. I got to 59% at the end of 30 epochs; with the latest commit I got to 64%. Do you remember the specific version/commit I can use to replicate your results from the README, so I can debug what is going on?

@ereday
Member

ereday commented Oct 12, 2018

Hi, I checked the Knet version I am using: according to NEWS.md, it is Knet v0.9.1. Unfortunately, when I run git log I get a "fatal: your current branch appears to be broken" error on the cluster. For AutoGrad, the commit is 823ea162c829402b0aaf7a7d9e4145f170fdd79b. After your issue, I sent another job today to train the model from scratch (now using a slower GPU than the K80; 1 epoch takes ~30 mins). It is currently on the 19th epoch and its dev set accuracy is 56.27%. I'll let you know when it is over. You can find the log file and the saved model with the specified accuracy at the following path: /kuacc/users/edayanik16/relnet/saved_models.

@ereday
Member

ereday commented Oct 16, 2018

I ran a couple of experiments using exactly the same script and code as in the repository (environment: Julia 0.6.2, Knet v0.9.1). The chart below shows the results I obtained. As you said, they are not the same as the one shared in the README; however, the model did not get stuck around ~60%. By the end of training I generally obtained around ~91% accuracy on the dev set.

I remembered that I trained this model (and obtained the corresponding learning curve) on the old cluster (somon & kuacctest), which means I might have used even older versions of Knet & AutoGrad. One possible cause is the change in dropout usage. The forget gate bias values of the LSTM might also affect the results: as far as I remember, I was setting them to 1.0 manually on the old cluster (by changing the Knet source code). If one of these is the problem, playing with the hyperparameters and the seed might be enough to recover the lost performance, which is what I am currently doing. If I get an improvement, I'll post it here too. I don't think anything serious happened, since we are still able to reach 91%. The saved models can be found here.

[validation accuracy chart]
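For reference, a minimal sketch of the forget-gate bias trick mentioned above, assuming a hand-rolled LSTM whose gate pre-activations are stacked as [input; forget; cell; output] blocks of size hidden each. This only illustrates the idea; it is not the actual edit that was made inside Knet's source:

```julia
# Sketch: initialize LSTM parameters with the forget gate bias set to 1.0.
# Assumes a hand-rolled LSTM whose gates are stacked as
# [input; forget; cell; output] blocks of size `hidden` each (an assumption,
# not Knet's internal parameter layout).
function initlstm(input::Int, hidden::Int; winit=0.1, forget_bias=1.0)
    W = winit * randn(4hidden, input + hidden)  # input + recurrent weights
    b = zeros(4hidden)                          # all gate biases start at 0
    b[hidden+1:2hidden] .= forget_bias          # forget-gate slice -> 1.0
    return W, b
end

W, b = initlstm(300, 128)
```

Initializing the forget gate bias to 1.0 keeps the cell state from being forgotten too aggressively early in training, which is the usual motivation for this trick.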

@denizyuret
Contributor Author

denizyuret commented Oct 16, 2018 via email

@denizyuret
Contributor Author

I can confirm similar results with Julia 1.0. Here are the new numbers alongside the old values for comparison; the accuracy never exceeds 90%. Could this be a dropout problem? (I no longer decide automatically when to apply dropout; see the sketch after the table below.)

Epoch | Val accuracy (old) | Val accuracy (Julia 1.0)
  1   | 44.07%             | 43.63%
  5   | 47.50%             | 46.55%
 15   | 57.69%             | 54.23%
 25   | 79.60%             | 57.95%
 40   | 93.21%             | 69.91%
 65   | 94.50%             | 87.25%
100   | -                  | 89.88%
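To make that concrete, here is a minimal sketch of handling dropout with an explicit training flag instead of relying on any automatic detection; predict, w, and pdrop are placeholder names, not the actual code in this repository:

```julia
using Knet

# Sketch: apply dropout only while training, identity at evaluation time.
# `predict`, `w`, and `pdrop` are placeholders, not this repository's code.
function predict(w, x; train=false, pdrop=0.5)
    h = relu.(w[1] * x .+ w[2])
    if train
        h = dropout(h, pdrop)   # Knet's dropout; skipped at test time here
    end
    return w[3] * h .+ w[4]
end
```

Making the flag explicit removes any dependence on how the framework detects training vs. evaluation internally.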

@ereday
Member

ereday commented Oct 23, 2018

I was also thinking dropout at first, but then I compared the training set loss values of the model stated in the README against the 3 models I shared above. On the one hand, all 3 models have higher loss values during training, which might be a sign of too much dropout; on the other hand, if we decrease the dropout rate the models start to overfit even more. So I started to think something else might be causing the drop in validation set performance. I have also tried smaller dropout rates to check this empirically, and I did not get 94% accuracy on the val set. Could the initialization of the RNN's forget gates (as I said above) be the reason?
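For completeness, a small sketch of the kind of dropout-rate/seed sweep described above; train_model is a hypothetical stand-in for this repository's training entry point and is assumed to return the best dev-set accuracy of a run:

```julia
using Random

# Sketch of a small grid search over dropout rate and random seed.
# `train_model` is a hypothetical stand-in for the repository's training
# entry point; it is assumed to return the best dev-set accuracy of a run.
function sweep(train_model)
    best_acc, best_cfg = -Inf, (0.0, 0)
    for pdrop in (0.0, 0.1, 0.3, 0.5), seed in (1, 2, 3)
        Random.seed!(seed)            # Julia 1.0; use srand(seed) on 0.6
        acc = train_model(pdrop=pdrop, seed=seed)
        if acc > best_acc
            best_acc, best_cfg = acc, (pdrop, seed)
        end
    end
    return best_acc, best_cfg
end
```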
