
loss is nan during training HD-CNN #2

Open
hokkun-dayo opened this issue Oct 5, 2015 · 4 comments

@hokkun-dayo

Hi,
I saw this page: https://sites.google.com/site/homepagezhichengyan/home/hdcnn/code
and tried training on CIFAR-100, but during training the displayed loss is nan, although the accuracy seems to improve little by little. Could you kindly explain this?

@hokkun-dayo
Author

After the whole training phase, the accuracy becomes < 0.01.
I am talking about this part:

Train a CNN using the 'train_train' set as training data and the 'train_val' set as testing data.
command: ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh

@stephenyan1231
Owner

Hi,

I observed a similar problem and found that cuDNN was to blame: it introduced unknown bugs on the CIFAR-100 dataset. The way I fixed it was to disable cuDNN by setting 'USE_CUDNN := 0' in Makefile.config. Can you try this?
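
For reference, a minimal sketch of that change, assuming a standard Caffe-style Makefile.config and build setup (the exact build targets may differ in this fork):

# In Makefile.config, turn the cuDNN switch off:
#     USE_CUDNN := 0
# then rebuild Caffe so the setting takes effect:
make clean
make all -j8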

@hokkun-dayo
Author

Thank you for your kind advice. In my setup, USE_CUDNN is already 0 (to be precise, it is commented out).

By the way, I changed the number of GPUs used for training from 2 to 1, and that solved the problem (accuracy is 0.6). I think the multi-GPU part has a slight problem.
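
One generic way to force single-GPU training, assuming the script simply uses whatever devices the driver exposes (the script's own GPU flags, if any, are a separate matter), is to limit device visibility before launching it:

# Expose only GPU 0 to the process, so the multi-GPU code path is never taken
CUDA_VISIBLE_DEVICES=0 ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh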

@stephenyan1231
Owner

It might be a multi-GPU issue. Fortunately, on the CIFAR-100 dataset the training speed with a single GPU is fine.
Usually multi-GPU training works when the two GPUs have peer-to-peer access, which can be verified with one of the examples in the CUDA samples installation folder.

Thanks
Zhicheng "Stephen"
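
One way to run that check, assuming the CUDA samples are installed under /usr/local/cuda/samples (the exact path and sample layout vary by CUDA version):

# Build and run the peer-to-peer sample; it reports whether each pair of GPUs
# can access each other's memory directly (P2P)
cd /usr/local/cuda/samples/0_Simple/simpleP2P
make
./simpleP2P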

