
Why is the count of polyphones in cedict larger than that in the corpus? #5

Open
JohnHerry opened this issue Aug 21, 2020 · 2 comments

Comments

@JohnHerry

Hi,
I found that the count of polyphonic characters in the corpus is 623, while the count of polyphonic characters in cedict is over 700. What is the reason for the gap?
I mean, at prediction time a polyphone in a sentence may not be in the set of 623 polyphones but only in the 700+ set. How will the model predict its pinyin in that case?
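
For reference, the cedict-side count can be reproduced with a short script like the one below. This is a minimal sketch, assuming a standard CEDICT file (lines of the form `traditional simplified [pin1 yin1] /gloss/`); the file name `cedict_ts.u8` is a placeholder, and the exact count depends on the CEDICT release.

```python
# Minimal sketch: count polyphonic characters in a CEDICT-format file.
# "cedict_ts.u8" is a placeholder path; point it at your local copy.
from collections import defaultdict

pinyins = defaultdict(set)
with open("cedict_ts.u8", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#"):
            continue  # skip header comments
        # CEDICT line format: "traditional simplified [pin1 yin1] /gloss/"
        head, _, rest = line.partition("[")
        parts = head.split()
        if len(parts) < 2 or len(parts[1]) != 1:
            continue  # keep only single-character entries
        # Lower-case so proper-noun readings (e.g. "Zhong1") do not
        # count as a separate pronunciation.
        pinyins[parts[1]].add(rest.split("]", 1)[0].strip().lower())

polys = [ch for ch, readings in pinyins.items() if len(readings) > 1]
print(len(polys))  # should land in the 700+ range mentioned above
```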

@seanie12
Contributor

Hi, as mentioned in the previous issue, our dataset does not cover all possible Chinese polyphonic characters. We collected Chinese sentences from Wikipedia and labeled them, so some polyphonic characters are missing from our data.
The final output of our model is a probability distribution over all possible pinyins. But as you point out, the model never sees some polyphonic characters during training, so it is highly likely that it fails to predict the correct pinyin in those cases. I believe such cases are quite rare, though.
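
For anyone who wants to see this behavior directly, here is a minimal usage sketch (assuming the package is installed with `pip install g2pM`; the `tone` and `char_split` flags follow the README usage). The model still emits one pinyin per character even for a polyphone it never saw labeled in training; there is simply no training signal behind that choice.

```python
# Minimal sketch of running the released model on a sentence.
from g2pM import G2pM

model = G2pM()

# "重" is a common polyphone (zhong4 "heavy" vs. chong2 "again");
# the model picks one pinyin per character from its output distribution.
sentence = "他重新称了体重"
print(model(sentence, tone=True, char_split=False))
```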

@JohnHerry
Author

In our tests, g2pM was not accurate enough for production use. The CPP dataset may need more samples.
