Demos' common issues

Darcy edited this page Oct 25, 2017 · 8 revisions

reader is not properly set

In label_semantic_roles, the test reader originally pointed directly at the train reader, and the conll05 dataset has only one batch of data. Once one pass is done, the iterator feeds no more data, which leads to a "division by zero" error. The fix is to use the dataset's own reader and wrap it with paddle.batch.

whole job is waiting but doesn't print anything

One of the hardest problems is telling a user what our framework is doing when it is waiting and prints nothing. When a trainer's pod is deleted, the whole job waits for that trainer's task to time out, even though the other trainers have completed their tasks. When the other trainers request a task for the next pass from the master, this check applies:

    if passID < s.state.CurPass {
        return ErrPassBefore
    }
    if passID > s.state.CurPass {
        // Client may get run to pass after master when one client faster than the
        // other
        return ErrPassAfter
    }

Since an RPC function is an action, it should log its input, context, and result when it completes or returns, when it runs in a loop, and when it may block for a long time. Then we can see what the framework is doing; otherwise a user may think, "Oh no, is it hanging somewhere?" I'll add more logs later. The unit tests should perhaps also set up error conditions and check the resulting error or warning logs.
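As a rough illustration of the logging suggested here, the pass check from the snippet above could record its input and its decision before every return. This is only a sketch, not the master's actual code: `passState`, `checkPass`, the `trainer` argument, and the error message strings are hypothetical names made up for the example. Only `ErrPassBefore`, `ErrPassAfter`, and `CurPass` come from the snippet.

```go
package main

import (
	"errors"
	"log"
)

// Error values named after those in the snippet above; the message
// strings are illustrative assumptions.
var (
	ErrPassBefore = errors.New("pass number smaller than master")
	ErrPassAfter  = errors.New("pass number larger than master")
)

// passState is a hypothetical stand-in for the master's state.
type passState struct {
	CurPass int
}

// checkPass is a hypothetical helper that performs the same comparison
// as the snippet above, but logs the input and the result before each
// return so that a waiting trainer leaves a trace in the master's log.
func checkPass(st passState, trainer string, passID int) error {
	switch {
	case passID < st.CurPass:
		log.Printf("trainer=%s passID=%d curPass=%d: returning ErrPassBefore",
			trainer, passID, st.CurPass)
		return ErrPassBefore
	case passID > st.CurPass:
		// The trainer is ahead of the master and will have to wait.
		log.Printf("trainer=%s passID=%d curPass=%d: returning ErrPassAfter",
			trainer, passID, st.CurPass)
		return ErrPassAfter
	}
	log.Printf("trainer=%s passID=%d curPass=%d: pass matches",
		trainer, passID, st.CurPass)
	return nil
}
```

With a log line on every branch, a trainer that keeps receiving `ErrPassAfter` shows up repeatedly in the master's log, which makes it visible that the job is waiting rather than hung.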

Core dump related to matrix CHECK_LT(index[i], (int)tableSize) fail

This issue is caused by paddle.infer not being given the feed setting. It is fixed in book, but the fix has not yet been synced to the cloud repo. See this fix for more detail.