
Implementation of CTC in pure theano with custom gradient #108

Open
wants to merge 8 commits into master
Conversation

@nlgranger commented Aug 26, 2017

Unfortunately, this comes a bit late, as Theano has recently merged a PR adding bindings to warp-ctc (Theano/Theano#5949). But I wanted to finish this anyway, so here it is :-).

This implementation:

  • is written in pure Theano
  • uses an overridden gradient computation that is more resilient to precision issues
  • is fairly compact (suggestions for improvements and better readability are welcome)
  • works in log space for the most part to prevent precision issues, as warp-ctc does. Note that I haven't used the rescaling trick, though (I don't know whether warp-ctc uses it); a minimal log-sum-exp sketch follows this list.
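For reference, the kind of numerically stable log-sum-exp used to stay in log space could look like the sketch below. This is only an illustration, not the exact helper from this PR; the function name and axis handling are mine.

```python
import theano.tensor as T

def log_sum_exp(x, axis=None):
    """Compute log(sum(exp(x), axis)) stably by factoring out the maximum."""
    x_max = T.max(x, axis=axis, keepdims=True)
    return T.log(T.sum(T.exp(x - x_max), axis=axis)) + T.max(x, axis=axis)
```

Shifting by the maximum keeps the exponentials in [0, 1], so the sum cannot overflow, and the shift is added back outside the logarithm.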

I think it can still be useful to anyone who wants to modify the original cost function, and it runs without extra dependencies on any platform where Theano already runs.

Notes:

  • I haven't battle-tested the code, only run tests so far. It seems to give results very close to warp-ctc, as it should (about a 10^-7 difference on the gradients).
  • The code uses OpFromGraph, which is relatively recent in the Theano codebase (see the sketch after this list for how the gradient override fits in).
  • I have no demo so far; contributions are welcome for that. I think the test script is a poor replacement for a real demo.
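For readers who have not used it, OpFromGraph wraps a sub-graph as a single op and lets you substitute a hand-written gradient for the automatically derived one via grad_overrides. The toy sketch below (squaring a vector) only illustrates the mechanism, not the actual CTC op in this PR, and the override interface may differ slightly between Theano versions.

```python
import theano
import theano.tensor as T

x = T.vector('x')
y = T.sqr(x)  # forward sub-graph: element-wise square

def square_grad(inputs, output_grads):
    # Hand-written gradient: d(x^2)/dx = 2*x, chained with the incoming gradient.
    inp, = inputs
    out_grad, = output_grads
    return [2 * inp * out_grad]

square_op = theano.OpFromGraph([x], [y], grad_overrides=square_grad)

# The wrapped op can be used like any other op in a larger graph.
z = T.vector('z')
cost = T.sum(square_op(z))
grad = theano.grad(cost, z)  # uses square_grad instead of the automatic gradient
f = theano.function([z], [cost, grad])
```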

@f0k (Member) commented Nov 29, 2017

Sorry for the late reply, and thanks for the heads up on the mailing list. Looks cool at first glance! Not quite sure if this belongs in papers or examples... adding a replication of a result from the original paper would place it in the former category, a toy example would probably place it in the latter. Is there a toy example you could come up with, just to demonstrate how to use it, or would you rather just see it merged the way it is?

@nlgranger (Author) commented
The model from the paper and the data pre-processing part are not overly complicated at first sight, but the prediction algorithm (prefix search) might require some work. I'll try to look into it this weekend.

@f0k (Member) commented Dec 1, 2017

but the prediction algorithm (prefix search) might require some work

What about a toy example that uses a less complex prediction method in the end (e.g., just sampling)?

@nlgranger (Author) commented
It seems there are some precision issues on real-world data (TIMIT speech). I need to investigate that first. When I get it to work reliably, I think I will run the model with a simple prediction scheme (greedy decoding) for the demo.
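For what it's worth, greedy (best-path) decoding amounts to taking the argmax class at every frame, collapsing consecutive repeats, and dropping the blank. Here is a minimal NumPy sketch, assuming a (time, n_classes) matrix of per-frame probabilities and blank index 0 (both assumptions are mine, not taken from this PR):

```python
import numpy as np

def greedy_decode(probs, blank=0):
    """Best-path (greedy) CTC decoding.

    probs: array of shape (time, n_classes) with per-frame class probabilities.
    Returns the label sequence with repeats collapsed and blanks removed.
    """
    best_path = np.argmax(probs, axis=1)
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```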

@nlgranger (Author) commented
The latest commit should fix most precision issues, but there is still some divergence: the final output layer values (before the softmax) explode at some point during training. This happens before any useful output is obtained; the network just learns to predict the blank class all the time.

I have added a TensorFlow implementation for the sake of comparison, along with a test notebook that compares the TF implementation of CTC with mine. The loss values of my implementation seem correct, but the gradients are a bit off; I was not able to track down the reason any further.

If anyone is interested in getting CTC in pure Theano, some help would be very welcome ;-)

@f0k (Member) commented Jan 23, 2018

This happens before any useful output is obtained, the network just learns to predict the blank class all the time.

This (predicting blanks) seems to be a common effect, so maybe this is a good sign ;)

I have added a Tensorflow implementation for the sake of the comparison and a test notebook to compare the TF implementation of CTC with mine.

And the TF implementation works well with the same dataset?

PS: Looking at your notebook, when you call pickle.dump(), you should pass -1 as the third (protocol) argument. This will result in smaller files and much shorter dumping and loading times.
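For reference, the suggested change amounts to the following (the file and variable names are just placeholders):

```python
import pickle

with open('features.pkl', 'wb') as f:
    pickle.dump(dataset, f, -1)  # protocol -1 = highest available binary protocol
```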

@nlgranger (Author) commented Jan 24, 2018

OK, the damn error is fixed now; both the loss and the gradient are now in line with TensorFlow's implementation.

Thanks for reading through this; I have corrected the pickle line. I will let training run for a long time to see if it goes past predicting blanks all the time, since this seems to be the expected behaviour.

I'm now waiting for some help from the Theano people because the binary variables I use in some places seem to break the graph optimization when the target device is a GPU.

@f0k (Member) commented Feb 19, 2018

Ok, the damn error is fixed now

Great! Bad luck -- I think Theano would have optimized the log-sum-exp expression by itself, but I'm not sure whether that optimization breaks depending on keepdims or the like.

I'm now waiting for some help from the Theano people because the binary variables I use in some places seem to break the graph optimization when the target device is a GPU.

Any progress on this? Do you need some advice? If you don't need those variables for advanced indexing, you may get away with simply casting them to floatX.
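The suggested cast would look roughly like this (the variables are illustrative, not taken from the PR):

```python
import theano
import theano.tensor as T

labels = T.ivector('labels')
is_repeated = T.eq(labels[1:], labels[:-1])                 # int8 "binary" comparison result
is_repeated_f = T.cast(is_repeated, theano.config.floatX)   # float mask, friendlier to the GPU graph optimizer
```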

@nlgranger (Author) commented Feb 19, 2018

I wanted something that would remain robust even with optimizations turned off, especially since I was debugging it myself ;-), so the hand-written logsumexp is safer when properly implemented.

As for the optimization errors, I have opened a discussion on the theano-users mailing list, but activity there is a bit low right now. It's actually not too serious, because it only triggers a warning with the default .theanorc settings.

I think the CTC part is done, but the results of the experimental demo are not good. There must be an issue with the model, the parameters, or the data: some difference between this code and the paper. If somebody familiar with CTC-trained models could have a look, that would be great. Meanwhile, I will keep at it when I have some time.
