
Train from logical error #66

Open
hanskrupakar opened this issue Oct 28, 2016 · 6 comments

Comments

@hanskrupakar

hanskrupakar commented Oct 28, 2016

I am trying to resume training from a checkpoint file. Even though the script says the model was loaded, the perplexity restarts at the weight-initialization level, and the translation accuracy when I use evaluate.lua also suggests that the model is simply reinitializing the vectors instead of loading from the checkpoint.

Is this an issue with the API? What am I doing wrong?
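For reference, the two invocations involved would normally look like the following (a sketch using the paths from the log below; the resume flag name `-train_from` is the one discussed later in this thread and should be checked against the seq2seq-attn README):

```shell
# Initial training run (as in the log below)
th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 \
   -savefile demo-model

# Resuming from the saved epoch-4 checkpoint
th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 \
   -savefile demo-model -train_from demo-model_epoch4.00_2958.31.t7
```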

.......
Epoch: 4, Batch: 11850/11961, Batch size: 16, LR: 0.1000, PPL: 2565.87, |Param|: 5479.77, |GParam|: 44.02, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11900/11961, Batch size: 16, LR: 0.1000, PPL: 2573.56, |Param|: 5480.11, |GParam|: 46.07, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11950/11961, Batch size: 16, LR: 0.1000, PPL: 2580.50, |Param|: 5480.42, |GParam|: 90.12, Training: 134/65/69 total/source/target tokens/sec   
Train   2582.1220978721 
Valid   2958.3082902242 
saving checkpoint to demo-model_epoch4.00_2958.31.t7    
Script started on Monday 24 October 2016 08:55:52 AM IST
hans@hans-Lenovo-IdeaPad-Y500:~/seq2seq-attn-master$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...
loading data...
done!
Source vocab size: 50004, Target vocab size: 150004
Source max sent len: 50, Target max sent len: 52
Number of additional features on source side: 0
Switching on memory preallocation
loading demo-model_epoch4.00_2958.31.t7...
Number of parameters: 84236504 (active: 84236504)
Epoch: 5, Batch: 50/11961, Batch size: 16, LR: 0.0500, PPL: 375825299.43, |Param|: 5407.84, |GParam|: 503.37, Training: 131/61/69 total/source/target tokens/sec
Epoch: 5, Batch: 100/11961, Batch size: 16, LR: 0.0500, PPL: 145308733.29, |Param|: 5407.19, |GParam|: 130.81, Training: 132/63/69 total/source/target tokens/sec
Epoch: 5, Batch: 150/11961, Batch size: 16, LR: 0.0500, PPL: 85249666.69, |Param|: 5406.86, |GParam|: 1190.36, Training: 133/64/69 total/source/target tokens/sec
@guillaumekln
Contributor

I can't reproduce this on the latest revision.

  • What are the command lines you used to start the training and to resume it?
  • Did you make any changes to the code?

@hanskrupakar
Author

I didn't make any changes except specifying the checkpoint to load from. I have attached a log file showing the training and resume commands, which are identical except for the checkpoint file I specify when resuming.

log.txt

@guillaumekln
Contributor

Something is not right. According to your log file, you always run the same command:

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model

Is it the case?

  • If not, can you share the actual command lines you ran?
  • If yes, make sure you don't have local modifications in your source code. The logs you are getting do not reflect this command.

@hanskrupakar
Author

I ran it again from the beginning after you said it was strange. Attached is the log file for that run, along with the train.lua and preprocess.py I used.
preprocess.py.docx
train.lua.docx
error.txt

@guillaumekln
Contributor

It seems that AdaGrad does not play nicely with the train_from option at the moment. I would advise you to stick with the default SGD, which works well.

Also, please don't set your options within the code. It is error-prone and makes it harder for whoever assists you to know what you are doing.
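A toy AdaGrad update (a sketch, not the seq2seq-attn implementation) illustrates why resuming can misbehave if the optimizer's accumulated squared-gradient state is not restored from the checkpoint: AdaGrad's effective step size shrinks as the accumulator grows, so a reset accumulator makes the first resumed steps far too large.

```python
import math

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    # AdaGrad: scale the step by the running sum of squared gradients.
    accum = accum + grad ** 2
    param = param - lr * grad / (math.sqrt(accum) + eps)
    return param, accum

# Simulate 100 training steps with a constant gradient of 0.5.
p, a = 1.0, 0.0
for _ in range(100):
    p, a = adagrad_step(p, 0.5, a)

# Compare the next step size when the accumulator is restored on resume
# versus reset to zero (i.e. the state was lost).
step_restored = 0.1 * 0.5 / (math.sqrt(a + 0.5 ** 2) + 1e-8)
step_reset = 0.1 * 0.5 / (math.sqrt(0.5 ** 2) + 1e-8)
print(step_reset > 5 * step_restored)  # the reset step is roughly 10x larger
```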

@hanskrupakar
Author

Will remember not to make inline changes from now on.
I switched to SGD and train_from now works as expected.
Thanks.
