Unable to train raptor and dog character #17

Open
RexDrac opened this issue Mar 2, 2017 · 19 comments

Comments

RexDrac commented Mar 2, 2017

Hi, I have tried training both the raptor and the dog using the original code, and neither was successful due to large memory usage: Ubuntu keeps killing the program after it runs out of memory and swap.
I tried reducing memory usage by changing trainer_replay_mem_size and trainer_num_init_samples, and now I get this error:

Actor Iter 0
Update Net 0:
F0302 12:29:19.209728 6773 memory_data_layer.cpp:93] Check failed: labels.size() / label_size_ == num (210847752626046076 vs. 32) Number of labels must be the same as data.
*** Check failure stack trace: ***
Aborted (core dumped)

I can run the simulation but can't train it. I didn't change any of the code. Can you tell me what went wrong? Is this caused by the version of caffe?

Thanks a lot!
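
(For reference, the check that fires here is in Caffe's memory_data_layer.cpp. Reconstructed from the error text above, it looks roughly like the sketch below; the surrounding code may differ between Caffe versions, so treat the names as coming from the log rather than from this repo.)

```cpp
// Sketch reconstructed from the error message above (memory_data_layer.cpp:93).
// `labels` holds the flattened labels for the batch, `label_size_` is the
// per-sample label dimension, and `num` is the batch size (32 in the log),
// so a garbage labels.size() or label_size_ produces the huge number seen here.
CHECK_EQ(labels.size() / label_size_, num)
    << "Number of labels must be the same as data.";
```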

Neo-X (Collaborator) commented Mar 2, 2017

How much memory does your computer have? How many threads are you running? That label size is very wrong. Did you recompile the Caffe code in the external library? If not, try that.

RexDrac (Author) commented Mar 2, 2017

Hi Neo-X,
16 GB memory + 32 GB swap. I'm not sure about the number of threads. Yes, I recompiled the Caffe in the external library.
I also tried replacing caffe.proto, memory_data_layer.hpp, and memory_data_layer.cpp with the ones in caffe_mods before compiling, but that didn't seem to work either.

RexDrac (Author) commented Mar 4, 2017

@Neo-X @xbpeng I am wondering whether it is a problem with the version of Ubuntu or CUDA I am using. Can you tell me which versions of Ubuntu, gcc, and CUDA you use?

Thanks

Neo-X (Collaborator) commented Mar 4, 2017

Your computer has plenty of memory. I have run the code successfully on Ubuntu 14.04 and 16.04. The code doesn't really use CUDA. I have seen this issue before, and it always has to do with a mismatch between the network dimensions and the controller's action/state output/input sizes. Double-check that your new biped controller action dimensions match the network description file.
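
One way to catch that kind of mismatch early is to compare the network's per-sample output dimension against the controller's action size before training starts. This is only a sketch against the stock Caffe API; the `action_size` argument and where you would call this are assumptions, not code from this repo:

```cpp
#include <string>
#include <caffe/caffe.hpp>

// Sanity-check sketch: load a network description file and verify that its
// per-sample output dimension matches the controller's action size.
// `action_size` is whatever your controller reports (accessor not shown here).
void CheckNetMatchesController(const std::string& net_file, int action_size) {
  caffe::Net<double> net(net_file, caffe::TEST);
  // count(1) is the product of all blob dimensions after the batch axis.
  const int net_output_size = net.output_blobs()[0]->count(1);
  CHECK_EQ(net_output_size, action_size)
      << "Network output size does not match the controller's action dimension";
}
```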

RexDrac (Author) commented Mar 5, 2017

Hi Neo-X, thanks for the reply, but the problem is that I can't even train the original raptor and dog characters using the original code. I downloaded it from git and compiled it straight away, with no changes.

Is there possibly a mismatch between the dog and raptor network description and controller?

xbpeng (Owner) commented Mar 5, 2017

It's not immediately obvious why the label size would be such a large value. Maybe you can step into the code and see what is causing the label size to be so large. It is probably the reason for the large memory consumption as well, since it's allocating memory to store such large labels.
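
If stepping through with a debugger is awkward, a low-tech option is to log the three quantities right before the failing check in memory_data_layer.cpp. The names below come straight from the error message, though the surrounding code in your copy may differ slightly:

```cpp
// Hypothetical debug line to drop just above the CHECK around
// memory_data_layer.cpp:93; LOG(INFO) is the glog macro Caffe already uses.
LOG(INFO) << "AddData: labels.size() = " << labels.size()
          << ", label_size_ = " << label_size_
          << ", num = " << num;
```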

Neo-X (Collaborator) commented Mar 5, 2017

You might need to rebuild the protoc-generated files in Caffe and recompile Caffe again.

RexDrac (Author) commented Mar 6, 2017

@Neo-X You mean regenerate caffe.pb.cc and caffe.pb.h?

Neo-X (Collaborator) commented Mar 6, 2017

Yes.
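
In case it helps anyone else reading this, the regeneration would look something like the sketch below. The `external/caffe` path and the stock Caffe source layout are assumptions about this setup, so adjust to your checkout:

```sh
# Sketch only: assumes the bundled Caffe lives in external/caffe and uses the
# stock Caffe source layout.
cd external/caffe
# remove stale generated protobuf sources so the old ones can't be picked up
find . \( -name 'caffe.pb.cc' -o -name 'caffe.pb.h' \) -delete
# regenerate caffe.pb.cc / caffe.pb.h from the (modded) caffe.proto
protoc src/caffe/proto/caffe.proto --cpp_out=.
# rebuild Caffe so everything compiles against the fresh generated files
make clean && make -j8
```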

RexDrac (Author) commented Mar 6, 2017

@xbpeng I think I have solved the label size issue; it was something to do with AddData() in memory_data_layer.cpp. But I still have memory issues.

I am confused about the relationship between trainer_replay_memory_size and the number of tuples. What is the upper limit on the total number of tuples? It doesn't seem to be set by trainer_replay_memory_size, because from what is shown in the terminal, "Num Tuples:" will exceed trainer_replay_memory_size.

xbpeng (Owner) commented Mar 7, 2017

trainer_replay_memory_size is the size of the replay memory, the number of the most recent tuples to store. We don't currently set any limits on the maximum number of tuples. You can set the maximum number of training iterations with trainer_max_iter.
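
So the "Num Tuples" counter in the terminal is a cumulative count of tuples collected, while the replay memory only keeps the most recent trainer_replay_memory_size of them. A minimal ring-buffer sketch of that pattern (just the usual idea, not the repo's actual class):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal ring-buffer sketch of a replay memory: `capacity` corresponds to
// trainer_replay_memory_size, while num_tuples_seen_ grows without bound,
// which is why "Num Tuples" in the log can exceed the replay size.
template <typename Tuple>
class ReplayMemorySketch {
 public:
  explicit ReplayMemorySketch(std::size_t capacity)
      : buffer_(capacity), next_(0), filled_(0), num_tuples_seen_(0) {}

  void Add(const Tuple& t) {
    buffer_[next_] = t;                                  // overwrite the oldest slot
    next_ = (next_ + 1) % buffer_.size();
    filled_ = std::min(filled_ + 1, buffer_.size());     // capped at capacity
    ++num_tuples_seen_;                                  // cumulative, never shrinks
  }

  std::size_t StoredTuples() const { return filled_; }            // <= capacity
  std::size_t TotalTuplesSeen() const { return num_tuples_seen_; }  // unbounded

 private:
  std::vector<Tuple> buffer_;
  std::size_t next_, filled_, num_tuples_seen_;
};
```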

nwcora commented Oct 27, 2018

[screenshot]
I am running into this problem as well. Can you tell me how to solve it? Thank you.

Neo-X (Collaborator) commented Oct 29, 2018

This looks like integer wrap around. I think I fixed this before by rebuilding the protoc stuff for caffe.

nwcora commented Oct 30, 2018

I've recompiled Caffe, but there seem to be other problems in the source code. I found something wrong in the step() function during training: after the sample initialization, the network doesn't update and the program stops. I have no idea what to modify.

Neo-X (Collaborator) commented Oct 31, 2018

If you can provide the output and some more details I may be able to help.

nwcora commented Nov 2, 2018

[screenshot]

When the sample initialization finishes, it doesn't update the network.

Neo-X (Collaborator) commented Nov 3, 2018

Is this compiled in Debug mode? There should be more output. Or maybe the number of training updates is set to 0 in the arg_file?

nwcora commented Nov 4, 2018

[screenshot]
Yeah, I compiled using "make config=debug64 -j8", and I modified the parameters in the args file to be smaller so I could watch its progress. I found something wrong in the AddData() function; it did not work. Can you run your source code when training with Q or CACLA? I followed all your steps but just can't run your source code.

nwcora commented Nov 4, 2018

I remember I ran into some problems when uncompressing the external files, but the issue still exists.
