Hi,
I found that the number of polyphonic characters in the corpus is 623, while the number of polyphonic characters in CC-CEDICT is over 700. What is the reason for the difference?
In other words, at prediction time a polyphonic character in a sentence may not be in the set of 623 polyphones, but only in the set of 700+. How will the model predict its pinyin in that case?
Hi, as mentioned in the previous issue, our dataset does not cover all possible Chinese polyphonic characters. We collect Chinese sentences from Wikipedia and label them, so some polyphonic characters are missing from our data.
The final output of our model is a probability distribution over all possible pinyins. But as you point out, the model never sees some polyphonic characters during training, so it is highly likely that the model fails to predict the correct pinyin in such cases. I believe such cases are quite rare, though.
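One common way to handle polyphones missing from the training set is to fall back to a dictionary's most frequent reading. The sketch below is an illustration only, not the repo's actual code: `model_polyphones`, `cedict_pinyins`, and `predict_with_model` are hypothetical names, and the tiny character sets stand in for the real 623-polyphone corpus and CC-CEDICT data.

```python
# Hypothetical fallback sketch: use the trained model only for polyphonic
# characters it saw during training; otherwise take the most frequent
# pinyin from a dictionary such as CC-CEDICT.

# Stand-in for the 623 polyphones covered by the training corpus.
model_polyphones = {"行", "长", "重"}

# Stand-in for per-character most-frequent readings derived from CC-CEDICT.
cedict_pinyins = {
    "行": "xing2",
    "长": "chang2",
    "重": "zhong4",
    "还": "hai2",   # a polyphone absent from the training set
}

def predict_with_model(char, sentence):
    """Placeholder for the trained model; the real model would use
    the sentence context to pick among the character's readings."""
    return cedict_pinyins[char]

def predict_pinyin(char, sentence):
    # Characters the model was trained on go through the model;
    # everything else falls back to the dictionary default.
    if char in model_polyphones:
        return predict_with_model(char, sentence)
    return cedict_pinyins.get(char)

print(predict_pinyin("还", "他还没来"))  # unseen polyphone -> dictionary fallback
```

This keeps the model's context-sensitive predictions for covered polyphones while guaranteeing some reading for the 80-odd characters outside the training set.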