
loss is nan during training HD-CNN #2

Open
hokkun-dayo opened this issue Oct 5, 2015 · 4 comments

@hokkun-dayo

Hi,
I saw this page: https://sites.google.com/site/homepagezhichengyan/home/hdcnn/code
and tried training on CIFAR-100, but during training the displayed loss is nan, although the accuracy seems to improve little by little. Could you kindly explain this?

@hokkun-dayo
Author

After the whole training phase, the accuracy becomes < 0.01.
I am talking about this part:

Train a CNN using the 'train_train' set as training data and the 'train_val' set as testing data.
command: ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh

@stephenyan1231
Owner

Hi,

I observed a similar problem and found that cuDNN was to blame: it introduced unknown bugs on the CIFAR-100 dataset. The way I fixed it was to disable cuDNN by setting 'USE_CUDNN := 0' in Makefile.config. Can you try this?
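
For reference, a minimal sketch of that change, assuming a standard Caffe-style Makefile.config and build setup (the exact build targets may differ in this fork):

# In Makefile.config, turn the cuDNN switch off:
#     USE_CUDNN := 0
# then rebuild Caffe so the setting takes effect:
make clean
make all -j8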

@hokkun-dayo
Author

Thank you for your kind advice. In my setup, USE_CUDNN is already 0 (to be precise, it is commented out).

By the way, I changed the number of GPUs used for training from 2 to 1, and that solved the problem (accuracy is 0.6). I think the multi-GPU part has a slight problem.
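
One generic way to force single-GPU training, assuming the script simply uses whatever devices the driver exposes (the script's own GPU flags, if any, are a separate matter), is to limit device visibility before launching it:

# Expose only GPU 0 to the process, so the multi-GPU code path is never taken
CUDA_VISIBLE_DEVICES=0 ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh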

@stephenyan1231
Owner

It might be a multi-GPU issue. Fortunately, on the CIFAR-100 dataset the training speed with a single GPU is fine.
Usually multi-GPU training works when the two GPUs have peer-to-peer access, which can be verified with one of the examples in the CUDA samples installation folder.

Thanks
Zhicheng "Stephen"
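
One way to run that check, assuming the CUDA samples are installed under /usr/local/cuda/samples (the exact path and sample layout vary by CUDA version):

# Build and run the peer-to-peer sample; it reports whether each pair of GPUs
# can access each other's memory directly (P2P)
cd /usr/local/cuda/samples/0_Simple/simpleP2P
make
./simpleP2P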

