
Train from logical error #66

Open
hanskrupakar opened this issue Oct 28, 2016 · 6 comments

Comments

@hanskrupakar

hanskrupakar commented Oct 28, 2016

I am trying to resume training from a checkpoint file. Even though the script says the model was loaded, the perplexity restarts at the weight-initialization level, and the translation accuracy when I use evaluate.lua also suggests that the model is simply reinitializing the vectors instead of loading from the checkpoint.

Is this an issue with the API? What am I doing wrong?
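For reference, the two invocations involved would normally look like the following (a sketch using the paths from the log below; the resume flag name `-train_from` is the one discussed later in this thread and should be checked against the seq2seq-attn README):

```shell
# Initial training run (as in the log below)
th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 \
   -savefile demo-model

# Resuming from the saved epoch-4 checkpoint
th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 \
   -savefile demo-model -train_from demo-model_epoch4.00_2958.31.t7
```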

.......
Epoch: 4, Batch: 11850/11961, Batch size: 16, LR: 0.1000, PPL: 2565.87, |Param|: 5479.77, |GParam|: 44.02, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11900/11961, Batch size: 16, LR: 0.1000, PPL: 2573.56, |Param|: 5480.11, |GParam|: 46.07, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11950/11961, Batch size: 16, LR: 0.1000, PPL: 2580.50, |Param|: 5480.42, |GParam|: 90.12, Training: 134/65/69 total/source/target tokens/sec   
Train   2582.1220978721 
Valid   2958.3082902242 
saving checkpoint to demo-model_epoch4.00_2958.31.t7    
Script started on Monday 24 October 2016 08:55:52 AM IST
hans@hans-Lenovo-IdeaPad-Y500:~/seq2seq-attn-master$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...
loading data...
done!
Source vocab size: 50004, Target vocab size: 150004
Source max sent len: 50, Target max sent len: 52
Number of additional features on source side: 0
Switching on memory preallocation
loading demo-model_epoch4.00_2958.31.t7...
Number of parameters: 84236504 (active: 84236504)
Epoch: 5, Batch: 50/11961, Batch size: 16, LR: 0.0500, PPL: 375825299.43, |Param|: 5407.84, |GParam|: 503.37, Training: 131/61/69 total/source/target tokens/sec
Epoch: 5, Batch: 100/11961, Batch size: 16, LR: 0.0500, PPL: 145308733.29, |Param|: 5407.19, |GParam|: 130.81, Training: 132/63/69 total/source/target tokens/sec
Epoch: 5, Batch: 150/11961, Batch size: 16, LR: 0.0500, PPL: 85249666.69, |Param|: 5406.86, |GParam|: 1190.36, Training: 133/64/69 total/source/target tokens/sec
@guillaumekln
Contributor

I can't reproduce this on the latest revision.

  • What are the command lines you used to start the training and to resume it?
  • Did you make any changes to the code?

@hanskrupakar
Author

I didn't make any changes except specifying the checkpoint to load from. I have attached a log file showing the training and resume commands, which are identical except for the checkpoint file I specify when resuming.

log.txt

@guillaumekln
Contributor

Something is not right. According to your log file, you always run the same command:

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model

Is it the case?

  • If not, can you share the actual command lines you ran?
  • If yes, make sure you don't have local modifications in your source code. The logs you are getting do not reflect this command.

@hanskrupakar
Author

I ran it again from the beginning after you said it was strange. Attached is the log file for that run, along with the train.lua and preprocess.py I used.
preprocess.py.docx
train.lua.docx
error.txt

@guillaumekln
Contributor

It seems that AdaGrad does not play nicely with the train_from option at the moment. I would advise you to stick with the default SGD, which works well.

Also, please don't set your options within the code. It is error-prone and makes it harder for whoever assists you to know what you are doing.
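A toy AdaGrad update (a sketch, not the seq2seq-attn implementation) illustrates why resuming can misbehave if the optimizer's accumulated squared-gradient state is not restored from the checkpoint: AdaGrad's effective step size shrinks as the accumulator grows, so a reset accumulator makes the first resumed steps far too large.

```python
import math

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    # AdaGrad: scale the step by the running sum of squared gradients.
    accum = accum + grad ** 2
    param = param - lr * grad / (math.sqrt(accum) + eps)
    return param, accum

# Simulate 100 training steps with a constant gradient of 0.5.
p, a = 1.0, 0.0
for _ in range(100):
    p, a = adagrad_step(p, 0.5, a)

# Compare the next step size when the accumulator is restored on resume
# versus reset to zero (i.e. the state was lost).
step_restored = 0.1 * 0.5 / (math.sqrt(a + 0.5 ** 2) + 1e-8)
step_reset = 0.1 * 0.5 / (math.sqrt(0.5 ** 2) + 1e-8)
print(step_reset > 5 * step_restored)  # the reset step is roughly 10x larger
```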

@hanskrupakar
Author

Will remember not to make inline changes from now on.
I switched to SGD and train_from now works as expected.
Thanks.
