Subsequence Probability #149

Open
dedcode opened this issue Jan 17, 2016 · 3 comments
Comments

@dedcode

dedcode commented Jan 17, 2016

Hi,
I am using char-rnn to sample only a small number of characters (e.g., -length 20) given some seed text.
Is it possible to compute the probability with which a generated sub-sequence was chosen over all the other options at each character?
My goal is to compute a confidence score on a generated word.
Thanks !

@FragLegs

Take a look at my pull request: #151

@vinhqdang

Hi,

Thanks for your answer @FragLegs, but I am not sure how I should use your code.

Let's say I have a trained model and the text "abcd", and I want to predict the next character, with output like:

a: 0.4
b: 0.1
c: 0.2
d: 0.3

where each number is the probability that the corresponding character appears as the 5th character.

@FragLegs

Hi @vinhqdang. My pull request is designed to do something slightly different. You can use it to do what you are trying to accomplish, but you might be better served editing the code yourself.

My PR is intended to give the probability of a string of characters (both the seed and the characters generated by the rnn). So, let's say you want the (log) probability of "abcda". You can get that via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcda" -length 0

Similarly, for "abcdb" you can call th sample.lua cv/my_checkpointed_model.t7 -primetext "abcdb" -length 0 and so on.

In order to determine the probability of each of those characters in the 5th position, you'll also need to know the probability of the 4 leading characters via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcd" -length 0

For a language model such as this one, the probability of c_0, c_1, c_2, ... c_-2, c_-1, c equals the probability of c given c_0, c_1, c_2, ... c_-2, c_-1 times the probability of c_0, c_1, c_2, ... c_-2, c_-1. So, to get the probability of character c given c_0, c_1, c_2, ... c_-2, c_-1, simply divide the probability of c_0, c_1, c_2, ... c_-2, c_-1, c by the probability of c_0, c_1, c_2, ... c_-2, c_-1. To make that more concrete, in your example above:

a: P(abcda) / P(abcd)
b: P(abcdb) / P(abcd)
c: P(abcdc) / P(abcd)
d: P(abcdd) / P(abcd)

Since my script outputs log probabilities, simply subtract the value you get via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcd" -length 0 from the value you get via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcda" -length 0 to get the log probability of "a" given "abcd".
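
To make the subtraction concrete, here is a minimal Python sketch of that last step. It does not call sample.lua; the log probabilities below are made-up placeholder values (chosen to reproduce the 0.4 / 0.1 / 0.2 / 0.3 example above), and it assumes the script reports natural logs. The helper name conditional_log_prob is hypothetical, not part of the PR.

```python
import math


def conditional_log_prob(log_p_full, log_p_prefix):
    """Return log P(suffix | prefix) = log P(prefix + suffix) - log P(prefix)."""
    return log_p_full - log_p_prefix


# Placeholder values standing in for numbers you would read off the script's
# output; they are NOT real model results. Natural logs are assumed.
log_p_prefix = math.log(0.02)  # log P("abcd")
log_p_full = {
    "a": math.log(0.008),  # log P("abcda")
    "b": math.log(0.002),  # log P("abcdb")
    "c": math.log(0.004),  # log P("abcdc")
    "d": math.log(0.006),  # log P("abcdd")
}

# Subtract in log space, then exponentiate to get an ordinary probability.
next_char_probs = {
    ch: math.exp(conditional_log_prob(lp, log_p_prefix))
    for ch, lp in log_p_full.items()
}

for ch, p in next_char_probs.items():
    print(f"{ch}: {p:.1f}")
```

The same subtraction should also cover the original question: for a generated word of several characters, log P(seed + word) minus log P(seed) is the log probability of the whole word given the seed, which could serve as a confidence score. If the script uses a different log base, exponentiate with that base instead of math.exp.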
