
The OCR loss goes to NaN #53

Open
GlowingHorse opened this issue Sep 17, 2019 · 2 comments


I use my own Japanese dataset and crop all words into single-word images.
Then I use train_ocr to train the OCR network (using e2e-mltrctw.h5 as the pretrained model, but changing the output size of the model from 7500 to 4748, which is the number of word types in my dataset). However, the loss becomes NaN very quickly. Is there a reason for that? Thanks!

This is the training log:
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
epoch 0[0], loss: 55.214, lr: 0.00010
epoch 0[500], loss: 54.610, lr: 0.00010
epoch 0[1000], loss: 14.609, lr: 0.00010
epoch 0[1500], loss: 7.219, lr: 0.00010
epoch 0[2000], loss: 6.109, lr: 0.00010
epoch 0[2500], loss: 5.536, lr: 0.00010
epoch 0[3000], loss: 4.826, lr: 0.00010
epoch 0[3500], loss: 4.030, lr: 0.00010
epoch 0[4000], loss: 3.301, lr: 0.00010
epoch 0[4500], loss: nan, lr: 0.00010
epoch 1[5000], loss: nan, lr: 0.00010
save model: backup2/E2E_5000.h5
epoch 1[5500], loss: nan, lr: 0.00010
epoch 1[6000], loss: nan, lr: 0.00010
epoch 1[6500], loss: nan, lr: 0.00010
epoch 1[7000], loss: nan, lr: 0.00010
epoch 1[7500], loss: nan, lr: 0.00010
epoch 1[8000], loss: nan, lr: 0.00010
epoch 1[8500], loss: nan, lr: 0.00010
epoch 1[9000], loss: nan, lr: 0.00010
epoch 1[9500], loss: nan, lr: 0.00010
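
For reference, a common way for a CTC-style loss to become NaN after shrinking the output layer is an encoded label id that falls outside the new range (greater than or equal to the output size, or colliding with the reserved blank index). Below is a minimal sanity check, sketched with an assumed num_classes of 4748 and a hypothetical check_labels helper; it is not part of train_ocr itself.

```python
# Sanity check for encoded OCR labels before training.
# `num_classes` and the example label lists are assumptions for illustration.
num_classes = 4748  # the new output size of the OCR head


def check_labels(labels, num_classes=num_classes):
    """Raise if any integer label id falls outside the classifier output.

    Out-of-range ids (e.g. an id equal to num_classes) are a common way
    for CTC loss to return NaN after the output layer has been shrunk.
    """
    bad = [label for label in labels if not 0 <= label < num_classes]
    if bad:
        raise ValueError(
            f"{len(bad)} label ids outside [0, {num_classes}), e.g. {bad[:5]}"
        )


# Example with hypothetical encoded transcriptions:
check_labels([12, 407, 4747])      # OK
# check_labels([12, 4748])         # would raise: 4748 == num_classes
```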


GlowingHorse commented Sep 17, 2019

That's weird. When I increase the network output from 4748 to 4900, the network can be trained longer. So far, 'nan' has not appeared. I will report back tomorrow.

GlowingHorse commented

It seems to have been solved. I just set the output channel count to be bigger than the number of target classes (don't just add one; try about one hundred more than the target number).
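
A sketch of that workaround, assuming the charset is stored one entry per line in a hypothetical data/codec.txt and that the model takes its OCR class count as a constructor argument (the file name and constructor call below are illustrative, not the repo's actual API):

```python
# Derive the OCR output size from the dataset charset, with the safety
# margin described above. Path and constructor name are assumptions.
SAFETY_MARGIN = 100   # "one hundred more than the target number"


def load_charset(path="data/codec.txt"):
    """Read one charset entry per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


charset = load_charset()
ocr_output_size = len(charset) + 1 + SAFETY_MARGIN  # +1 keeps room for the CTC blank

print(f"{len(charset)} character classes -> OCR output size {ocr_output_size}")
# model = build_ocr_model(nclass=ocr_output_size)  # hypothetical; use the repo's actual constructor
```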
