Error when running cifar100 examples #4

Open
deercoder opened this issue Mar 3, 2016 · 4 comments

@deercoder

Hi, Zhicheng,

I successfully built Caffe following your tutorial here: https://sites.google.com/site/homepagezhichengyan/home/hdcnn/code. However, when I run the cifar100 example in the 2nd step (./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh), it crashes with the segmentation fault shown below. I suspect the problem is related to using multiple GPUs, since I can run Caffe's official example on a single GPU. Could you please give me some advice? Thank you very much!

I0302 18:59:01.165179 5920 caffe.cpp:105] Use GPUs with device IDs below
I0302 18:59:01.165335 5920 caffe.cpp:107] device id 0
I0302 18:59:01.165354 5920 caffe.cpp:107] device id 1
I0302 18:59:01.165369 5920 caffe.cpp:117] Starting Optimization
I0302 18:59:11.525671 5920 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0302 18:59:11.525739 5920 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0302 18:59:11.526916 5920 solver.cpp:80] create net
I0302 18:59:11.527045 5920 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0302 18:59:11.527104 5920 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0302 18:59:11.527258 5920 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0302 18:59:11.530076 5920 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0302 18:59:11.530103 5920 data_manager.cpp:97] new database cursor
I0302 18:59:11.530953 5920 data_manager.cpp:99] new database transaction
*** Aborted at 1456963151 (unix time) try "date -d @1456963151" if you are using GNU date ***
PC: @ 0x7f9d86a53644 leveldb::(anonymous namespace)::MergingIterator::key()
*** SIGSEGV (@0x18) received by PID 5920 (TID 0x7f9d8c7339c0) from PID 24; stack trace: ***
@ 0x7f9d80d81670 (unknown)
@ 0x7f9d86a53644 leveldb::(anonymous namespace)::MergingIterator::key()
@ 0x7f9d86a3dc7e leveldb::(anonymous namespace)::DBIter::key()
@ 0x4fb102 caffe::db::LevelDBCursor::key()
@ 0x535fa1 caffe::DataManager<>::DataManager()
@ 0x5496b2 caffe::Net<>::InitDataManager()
@ 0x5676be caffe::Net<>::Init()
@ 0x567880 caffe::Net<>::Net()
@ 0x573eab caffe::Solver<>::InitTrainNet()
@ 0x574ebc caffe::Solver<>::Init()
@ 0x575046 caffe::Solver<>::Solver()
@ 0x4229f0 caffe::GetSolver<>()
@ 0x41c1f8 train()
@ 0x414091 main
@ 0x7f9d80d6db15 __libc_start_main
@ 0x41bd6d (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 5: 5920 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver.prototxt
I0302 18:59:13.376158 6839 caffe.cpp:105] Use GPUs with device IDs below
I0302 18:59:13.376302 6839 caffe.cpp:107] device id 0
I0302 18:59:13.376323 6839 caffe.cpp:107] device id 1
I0302 18:59:13.376339 6839 caffe.cpp:117] Starting Optimization
I0302 18:59:24.075724 6839 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0302 18:59:24.075798 6839 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0302 18:59:24.076957 6839 solver.cpp:80] create net
I0302 18:59:24.077093 6839 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0302 18:59:24.077142 6839 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0302 18:59:24.077369 6839 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0302 18:59:24.080384 6839 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0302 18:59:24.080411 6839 data_manager.cpp:97] new database cursor
I0302 18:59:24.081198 6839 data_manager.cpp:99] new database transaction
*** Aborted at 1456963164 (unix time) try "date -d @1456963164" if you are using GNU date ***
PC: @ 0x7f9b196d2644 leveldb::(anonymous namespace)::MergingIterator::key()
*** SIGSEGV (@0x18) received by PID 6839 (TID 0x7f9b1f3b29c0) from PID 24; stack trace: ***
@ 0x7f9b13a00670 (unknown)
@ 0x7f9b196d2644 leveldb::(anonymous namespace)::MergingIterator::key()
@ 0x7f9b196bcc7e leveldb::(anonymous namespace)::DBIter::key()
@ 0x4fb102 caffe::db::LevelDBCursor::key()
@ 0x535fa1 caffe::DataManager<>::DataManager()
@ 0x5496b2 caffe::Net<>::InitDataManager()
@ 0x5676be caffe::Net<>::Init()
@ 0x567880 caffe::Net<>::Net()
@ 0x573eab caffe::Solver<>::InitTrainNet()
@ 0x574ebc caffe::Solver<>::Init()
@ 0x575046 caffe::Solver<>::Solver()
@ 0x4229f0 caffe::GetSolver<>()
@ 0x41c1f8 train()
@ 0x414091 main
@ 0x7f9b139ecb15 __libc_start_main
@ 0x41bd6d (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 8: 6839 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr1.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_100000.solverstate
I0302 18:59:25.059556 7868 caffe.cpp:105] Use GPUs with device IDs below
I0302 18:59:25.059707 7868 caffe.cpp:107] device id 0
I0302 18:59:25.059726 7868 caffe.cpp:107] device id 1
I0302 18:59:25.059741 7868 caffe.cpp:117] Starting Optimization
I0302 18:59:35.792002 7868 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0302 18:59:35.792084 7868 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0302 18:59:35.793494 7868 solver.cpp:80] create net
I0302 18:59:35.793649 7868 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0302 18:59:35.793747 7868 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0302 18:59:35.793915 7868 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0302 18:59:35.796743 7868 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0302 18:59:35.796777 7868 data_manager.cpp:97] new database cursor
I0302 18:59:35.797546 7868 data_manager.cpp:99] new database transaction
*** Aborted at 1456963175 (unix time) try "date -d @1456963175" if you are using GNU date ***
PC: @ 0x7f10c8d4a644 leveldb::(anonymous namespace)::MergingIterator::key()
*** SIGSEGV (@0x18) received by PID 7868 (TID 0x7f10cea2a9c0) from PID 24; stack trace: ***
@ 0x7f10c3078670 (unknown)
@ 0x7f10c8d4a644 leveldb::(anonymous namespace)::MergingIterator::key()
@ 0x7f10c8d34c7e leveldb::(anonymous namespace)::DBIter::key()
@ 0x4fb102 caffe::db::LevelDBCursor::key()
@ 0x535fa1 caffe::DataManager<>::DataManager()
@ 0x5496b2 caffe::Net<>::InitDataManager()
@ 0x5676be caffe::Net<>::Init()
@ 0x567880 caffe::Net<>::Net()
@ 0x573eab caffe::Solver<>::InitTrainNet()
@ 0x574ebc caffe::Solver<>::Init()
@ 0x575046 caffe::Solver<>::Solver()
@ 0x4229f0 caffe::GetSolver<>()
@ 0x41c1f8 train()
@ 0x414091 main
@ 0x7f10c3064b15 __libc_start_main
@ 0x41bd6d (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 11: 7868 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr2.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_115000.solverstate

@stephenyan1231
Owner

You can run on a single GPU. To do this, open solver.prototxt, keep only the line 'device_id: 0', and remove the other line, 'device_id: 1'.
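
For reference, the edit described above could be scripted roughly as follows. This is only a sketch: the function name `single_gpu_solver` is hypothetical, and it assumes each `device_id` entry sits on its own line in the solver file. The log above shows the training script calling three solver files (solver.prototxt, solver_lr1.prototxt, solver_lr2.prototxt), so each of them may need the same change.

```shell
# Hypothetical helper (not part of the repository): print a solver file
# with every non-zero 'device_id: N' line removed, keeping 'device_id: 0'.
# Assumes one 'device_id' entry per line. The input file is not modified.
single_gpu_solver() {
    # $1: path to a solver .prototxt file
    grep -v -E '^[[:space:]]*device_id[[:space:]]*:[[:space:]]*[1-9][0-9]*[[:space:]]*$' "$1"
}

# Example usage, with the directory taken from the log in this thread:
# for f in models/cifar100_NIN_float_crop_v2/train_val/solver*.prototxt; do
#     single_gpu_solver "$f" > "$f.single" && mv "$f.single" "$f"
# done
```

After the change, the startup log should list only `device id 0` under "Use GPUs with device IDs below".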

@deercoder
Author

Thanks for your reply. @stephenyan1984

I have actually tried your method, but it doesn't work either, so I'd like to figure out where the problem is. With only one GPU, it still fails with the following error:

[cliu@ycao-hadoop3 HD-CNN]$ ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh
I0309 22:07:28.194317 7029 caffe.cpp:105] Use GPUs with device IDs below
I0309 22:07:28.194476 7029 caffe.cpp:107] device id 0
I0309 22:07:28.194507 7029 caffe.cpp:117] Starting Optimization
I0309 22:07:38.190814 7029 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0309 22:07:38.190878 7029 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0309 22:07:38.191999 7029 solver.cpp:80] create net
I0309 22:07:38.192157 7029 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0309 22:07:38.192248 7029 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0309 22:07:38.192414 7029 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0309 22:07:38.195355 7029 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0309 22:07:38.195408 7029 data_manager.cpp:97] new database cursor
*** Aborted at 1457579258 (unix time) try "date -d @1457579258" if you are using GNU date ***
PC: @ 0x7f1234f941e0 (unknown)
*** SIGSEGV (@0x7f1234f941e0) received by PID 7029 (TID 0x7f123aa5b9c0) from PID 888750560; stack trace: ***
@ 0x7f122f0a9670 (unknown)
@ 0x7f1234f941e0 (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 5: 7029 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver.prototxt
I0309 22:07:39.626389 8122 caffe.cpp:105] Use GPUs with device IDs below
I0309 22:07:39.626590 8122 caffe.cpp:107] device id 0
I0309 22:07:39.626626 8122 caffe.cpp:117] Starting Optimization
I0309 22:07:50.246525 8122 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0309 22:07:50.246598 8122 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0309 22:07:50.247752 8122 solver.cpp:80] create net
I0309 22:07:50.247891 8122 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0309 22:07:50.247942 8122 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0309 22:07:50.248123 8122 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0309 22:07:50.251242 8122 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0309 22:07:50.251271 8122 data_manager.cpp:97] new database cursor
*** Aborted at 1457579270 (unix time) try "date -d @1457579270" if you are using GNU date ***
PC: @ 0x7fc6efd921e0 (unknown)
*** SIGSEGV (@0x7fc6efd921e0) received by PID 8122 (TID 0x7fc6f58599c0) from PID 18446744073438568928; stack trace: ***
@ 0x7fc6e9ea7670 (unknown)
@ 0x7fc6efd921e0 (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 8: 8122 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr1.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_100000.solverstate
I0309 22:07:51.091190 8949 caffe.cpp:105] Use GPUs with device IDs below
I0309 22:07:51.091320 8949 caffe.cpp:107] device id 0
I0309 22:07:51.091341 8949 caffe.cpp:117] Starting Optimization
I0309 22:08:01.450184 8949 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0309 22:08:01.450248 8949 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0309 22:08:01.451361 8949 solver.cpp:80] create net
I0309 22:08:01.451508 8949 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0309 22:08:01.451594 8949 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0309 22:08:01.451751 8949 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0309 22:08:01.454588 8949 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0309 22:08:01.454617 8949 data_manager.cpp:97] new database cursor
*** Aborted at 1457579281 (unix time) try "date -d @1457579281" if you are using GNU date ***
PC: @ 0x7f57302ad1e0 (unknown)
*** SIGSEGV (@0x7f57302ad1e0) received by PID 8949 (TID 0x7f5735d749c0) from PID 808112608; stack trace: ***
@ 0x7f572a3c2670 (unknown)
@ 0x7f57302ad1e0 (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 11: 8949 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr2.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_115000.solverstate

Any new ideas or suggestions? Thanks.

@stephenyan1231
Owner

From the error message, you should check whether the leveldb database exists. Its path is 'examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb'.
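
A quick way to go a bit further than checking that the folder exists is sketched below. The function name `check_leveldb_dir` is hypothetical. It relies only on LevelDB's on-disk layout, where a file named CURRENT holds the name of the active MANIFEST file. It does not prove the records inside are readable, so a database can pass this check and still be corrupt.

```shell
# Hypothetical helper (not part of the repository): report whether a
# directory looks like a LevelDB database by checking for CURRENT and the
# MANIFEST file that CURRENT names. It does not read any records.
check_leveldb_dir() {
    db="$1"
    if [ ! -d "$db" ]; then
        echo "missing: $db is not a directory"
        return 1
    fi
    if [ ! -s "$db/CURRENT" ]; then
        echo "incomplete: $db/CURRENT is missing or empty"
        return 1
    fi
    # CURRENT contains the name of the active manifest, e.g. MANIFEST-000002
    manifest=$(cat "$db/CURRENT")
    if [ ! -f "$db/$manifest" ]; then
        echo "incomplete: manifest '$manifest' named in CURRENT not found"
        return 1
    fi
    echo "ok: $db has CURRENT and $manifest"
}

# Example usage, with the path from the log in this thread:
# check_leveldb_dir examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
```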

@deercoder
Author

Thanks for your advice. @stephenyan1984

I have checked, and the folder exists. I also tried removing the repo and redoing all the steps, but it still doesn't work. I'm not sure whether the LevelDB files you provide are broken. I used examples/cifar100/get_cifar100_float_train-train-val-leveldb.sh to fetch them.

I will try to generate my own LevelDB files and see if that works. Thanks.
