
Implementation of CTC in pure theano with custom gradient #108

Open
wants to merge 8 commits into master
Conversation

@nlgranger commented Aug 26, 2017

Unfortunately, this comes a bit late, as Theano has recently merged a PR adding bindings to warp-ctc (Theano/Theano#5949). But I wanted to finish this anyway, so here it is :-).

This implementation:

  • is written in pure Theano
  • uses an overridden gradient computation that is more resilient to precision issues
  • is fairly compact (suggestions for improvements and better readability are welcome)
  • works in log space for the most part to prevent precision issues, as warp-ctc does. Note that I haven't used the rescaling trick, though (I don't know whether warp-ctc uses it); a minimal log-sum-exp sketch follows this list.
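For reference, the kind of numerically stable log-sum-exp used to stay in log space could look like the sketch below. This is only an illustration, not the exact helper from this PR; the function name and axis handling are mine.

```python
import theano.tensor as T

def log_sum_exp(x, axis=None):
    """Compute log(sum(exp(x), axis)) stably by factoring out the maximum."""
    x_max = T.max(x, axis=axis, keepdims=True)
    return T.log(T.sum(T.exp(x - x_max), axis=axis)) + T.max(x, axis=axis)
```

Shifting by the maximum keeps the exponentials in [0, 1], so the sum cannot overflow, and the shift is added back outside the logarithm.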

I think it can still be useful to anyone who wants to modify the original cost function, and it runs without extra dependencies on any platform where Theano already runs.

Notes:

  • I haven't battle-tested the code, only run tests so far. It seems to give results very close to warp-ctc, as it should (about a 10^-7 difference on the gradients).
  • The code uses OpFromGraph, which is relatively recent in the Theano codebase (see the sketch after this list for how the gradient override fits in).
  • I have no demo so far; contributions are welcome for that. I think the test script is a poor replacement for a real demo.
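For readers who have not used it, OpFromGraph wraps a sub-graph as a single op and lets you substitute a hand-written gradient for the automatically derived one via grad_overrides. The toy sketch below (squaring a vector) only illustrates the mechanism, not the actual CTC op in this PR, and the override interface may differ slightly between Theano versions.

```python
import theano
import theano.tensor as T

x = T.vector('x')
y = T.sqr(x)  # forward sub-graph: element-wise square

def square_grad(inputs, output_grads):
    # Hand-written gradient: d(x^2)/dx = 2*x, chained with the incoming gradient.
    inp, = inputs
    out_grad, = output_grads
    return [2 * inp * out_grad]

square_op = theano.OpFromGraph([x], [y], grad_overrides=square_grad)

# The wrapped op can be used like any other op in a larger graph.
z = T.vector('z')
cost = T.sum(square_op(z))
grad = theano.grad(cost, z)  # uses square_grad instead of the automatic gradient
f = theano.function([z], [cost, grad])
```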

@f0k (Member) commented Nov 29, 2017

Sorry for the late reply, and thanks for the heads up on the mailing list. Looks cool at first glance! Not quite sure if this belongs in papers or examples... adding a replication of a result from the original paper would place it in the former category, a toy example would probably place it in the latter. Is there a toy example you could come up with, just to demonstrate how to use it, or would you rather just see it merged the way it is?

@nlgranger (Author) commented
The model from the paper and the data pre-processing part are not overly complicated at first sight, but the prediction algorithm (prefix search) might require some work. I'll try to look into it this weekend.

@f0k (Member) commented Dec 1, 2017

but the prediction algorithm (prefix search) might require some work

What about a toy example that uses a less complex prediction method in the end (e.g., just sampling)?

@nlgranger (Author) commented
It seems there are some precision issues on real-world data (TIMIT speech). I need to investigate that first. When I get it to work reliably, I think I will run the model with a simple prediction scheme (greedy decoding) for the demo.
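For what it's worth, greedy (best-path) decoding amounts to taking the argmax class at every frame, collapsing consecutive repeats, and dropping the blank. Here is a minimal NumPy sketch, assuming a (time, n_classes) matrix of per-frame probabilities and blank index 0 (both assumptions are mine, not taken from this PR):

```python
import numpy as np

def greedy_decode(probs, blank=0):
    """Best-path (greedy) CTC decoding.

    probs: array of shape (time, n_classes) with per-frame class probabilities.
    Returns the label sequence with repeats collapsed and blanks removed.
    """
    best_path = np.argmax(probs, axis=1)
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```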

@nlgranger (Author) commented
The latest commit should fix most precision issues, but there is still some divergence: the final output layer values (before the softmax) explode at some point during training. This happens before any useful output is obtained; the network just learns to predict the blank class all the time.

I have added a TensorFlow implementation for the sake of comparison, along with a test notebook that compares the TF implementation of CTC with mine. The loss values of my implementation seem correct, but the gradients are a bit off; I was not able to track down the reason any further.

If anyone is interested in getting CTC in pure Theano, some help would be very welcome ;-)

@f0k (Member) commented Jan 23, 2018

This happens before any useful output is obtained, the network just learns to predict the blank class all the time.

This (predicting blanks) seems to be a common effect, so maybe this is a good sign ;)

I have added a Tensorflow implementation for the sake of the comparison and a test notebook to compare the TF implementation of CTC with mine.

And the TF implementation works well with the same dataset?

PS: Looking at your notebook, when you call pickle.dump(), you should pass -1 as the third (protocol) argument. This will result in smaller files and much shorter dumping and loading times.
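For reference, the suggested change amounts to the following (the file and variable names are just placeholders):

```python
import pickle

with open('features.pkl', 'wb') as f:
    pickle.dump(dataset, f, -1)  # protocol -1 = highest available binary protocol
```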

@nlgranger (Author) commented Jan 24, 2018

OK, the damn error is fixed now; both the loss and the gradient are now in line with TensorFlow's implementation.

Thanks for reading through this; I have corrected the pickle line. I will let training run for a long time to see if it goes past predicting blanks all the time, since this seems to be the expected behaviour.

I'm now waiting for some help from the Theano people because the binary variables I use in some places seem to break the graph optimization when the target device is a GPU.

@f0k (Member) commented Feb 19, 2018

Ok, the damn error is fixed now

Great! Bad luck -- I think Theano would have optimized the log-sum-exp expression by itself, but I'm not sure whether that optimization breaks depending on keepdims or the like.

I'm now waiting for some help from the Theano people because the binary variables I use in some places seem to break the graph optimization when the target device is a GPU.

Any progress on this? Do you need some advice? If you don't need those variables for advanced indexing, you may get away with simply casting them to floatX.
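The suggested cast would look roughly like this (the variables are illustrative, not taken from the PR):

```python
import theano
import theano.tensor as T

labels = T.ivector('labels')
is_repeated = T.eq(labels[1:], labels[:-1])                 # int8 "binary" comparison result
is_repeated_f = T.cast(is_repeated, theano.config.floatX)   # float mask, friendlier to the GPU graph optimizer
```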

@nlgranger (Author) commented Feb 19, 2018

I wanted something that would remain robust even with optimizations turned off, especially since I was debugging it myself ;-), so the hand-written logsumexp is safer when properly implemented.

As for the optimization errors, I have opened a discussion on the theano-users mailing list, but activity there is a bit low right now. It's actually not too serious, because it only triggers a warning with the default .theanorc settings.

I think the CTC part is done, but the results of the experimental demo are not good. There must be an issue with the model, the parameters, or the data: some difference between this code and the paper. If somebody familiar with CTC-trained models could have a look, that would be great. Meanwhile, I will keep at it when I have some time.
