
Training Cost NAN #27

Open
jiangqy opened this issue Apr 17, 2016 · 11 comments

Comments

@jiangqy

jiangqy commented Apr 17, 2016

Hi, I am trying to train AlexNet on ImageNet, but after 20 iterations the training cost becomes NaN.
Here are the details:

[screenshot of the training log attached]

Should I set a smaller learning rate? Could you give me some suggestions?

Thank you~

@hma02
Contributor

hma02 commented Apr 22, 2016

@jiangqy
What is your batch size and current learning rate?

@heipangpang

I have run into the same problem; my batch size is 256 and my learning rate is 0.01. Do you have any ideas?

@jiangqy
Author

jiangqy commented May 31, 2016

@hma02 My batch size is 256 and learning rate is 0.01, too.

@hma02
Contributor

hma02 commented May 31, 2016

@jiangqy @heipangpang
It looks like you are running the single-GPU train.py, so the problem is not related to weight exchanging.

The cost should be around 6.9 initially, which is the cross-entropy of a uniform guess over the 1000 ImageNet classes (ln(1000) ≈ 6.9).
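A one-line check of that figure, assuming the usual 1000-class softmax cross-entropy:

```python
# With random initial weights the softmax output is roughly uniform,
# so the expected starting cost is -ln(1/1000).
import math
print(-math.log(1.0 / 1000))  # ~6.9078
```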

The unbounded cost value may be caused by gradient explosion. I have run into similar situations when initializing a deep network with weight arrays of large variance and mean. Learning rates and batch sizes that are too large can also produce strong gradient oscillation.

Also, do check the input images when loading them to see whether they are preprocessed correctly and correspond to the loaded labels.
You can display them using tricks similar to the ones here. As another check, try feeding a stack of image_means as the input data.
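A minimal sketch of these input checks, assuming the batches were saved as hickle (.hkl) arrays with a separate label array and an image-mean .npy file; every file name below is a placeholder for whatever your config.yaml actually points at:

```python
# Sketch: sanity-check one preprocessed training batch (paths are placeholders).
import numpy as np
import hickle as hkl

batch = hkl.load('train_hkl/0000.hkl')     # hypothetical batch file
labels = np.load('train_labels.npy')       # hypothetical label file
img_mean = np.load('img_mean.npy')         # hypothetical image-mean file

print('batch:', batch.shape, batch.dtype)
print('contains NaN:', np.isnan(batch).any())
print('all zeros:', not batch.any())
print('value range:', batch.min(), batch.max())
print('first labels:', labels[:10])        # eyeball these against the actual images

# Optional: replace the real batch with a stack of image means; if the rest of
# the pipeline is sound, the cost should then stay near ln(1000) ~ 6.9.
# (The axis ordering below is an assumption; match it to your batch layout.)
mean_batch = np.repeat(img_mean[..., np.newaxis], batch.shape[-1], axis=-1).astype('float32')
print('mean batch:', mean_batch.shape)
```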

@heipangpang

@hma02
I will try it. Thank you very much.

@heipangpang

@hma02
When I checked the output of every layer, I found that layer_input was a zero matrix, which may be why I get such a large training loss.

@hma02
Contributor

hma02 commented Jun 2, 2016

@heipangpang
Yes, this is probably the reason you got a large cost. Make sure you set use_data_layer to False in config.yaml. Then layer_input should be equal to x, the input batch, as shown here. If x is a zero matrix, there is something wrong with the preprocessed image batches.
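A quick way to check this, as a sketch; `shared_x` below is a placeholder name for whatever shared variable train.py copies the preprocessed batch into before calling the train function:

```python
# Sketch: verify that the batch actually reaching the model is non-zero.
import numpy as np

val = shared_x.get_value(borrow=True)   # `shared_x` is a placeholder name
print('shape/dtype:', val.shape, val.dtype)
print('all zeros:', not np.any(val))
print('mean/std:', float(val.mean()), float(val.std()))
```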

@heipangpang

@hma02
But when I load the batches by hand in Python, it seems that I get the correct results.
Thank you very much.

@heipangpang

@hma02
I am getting the correct results now, thank you very much.

@liaocs2008

I had the same problem here. If "para_load" is set to False, I can train normally. But isn't the parallel loading one of the main contributions of this work?
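For context, here is a rough, self-contained sketch of the double-buffering idea behind parallel loading (this is not the repository's actual loading code; names, sizes, and the stand-in data are made up): one worker process prepares the next batch while the main process trains on the current one.

```python
# Conceptual double-buffered loading sketch (stand-in data, not the real loader).
import multiprocessing as mp
import numpy as np

def loader(queue, n_batches):
    # In the real code this would load and preprocess the next batch from disk.
    for i in range(n_batches):
        batch = np.random.rand(3, 32, 32, 8).astype('float32')  # small stand-in batch
        queue.put((i, batch))
    queue.put(None)  # end-of-epoch marker

if __name__ == '__main__':
    q = mp.Queue(maxsize=2)  # loader stays at most two batches ahead of training
    worker = mp.Process(target=loader, args=(q, 10))
    worker.start()
    while True:
        item = q.get()
        if item is None:
            break
        idx, batch = item
        # the training step would run here, overlapping with the next load
        print('training on batch', idx, batch.shape)
    worker.join()
```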

@Magotraa

@heipangpang
Can you please share what change exactly made it possible for you to get correct results?

As you wrote
"I am getting the correct results now, thank you very much."
