training for Persian #19

Open
MahdiEsrafili opened this issue Apr 10, 2022 · 2 comments

Comments

@MahdiEsrafili

Hello, and thanks for your great work. I want to train the model on Persian data. In Persian, some words are linked to the following word, depending on context, by 'Ezafe', which is pronounced but not written. For example, here are two words and their phonemes:
کیف: kif
من: man
But the phrase 'کیف من' is read as 'kife man', not 'kif man' (Persian is written right to left). In addition, a word's pronunciation can differ depending on its meaning.
My question is: how can I change the model to take these issues into account?
Thanks

@cschaefer26
Collaborator

Hi, these context dependencies are generally not easy to solve. One option would be to train the model on word n-grams (e.g. produce training data with three words at a time, i.e. trigrams) in which the ambiguity is already resolved, and then apply the model to the text in the same way. Another option would be to distinguish the words via some kind of flag or added text (e.g. use 'kife' instead of 'kif', according to the pronunciation), resolving the ambiguity before you use the phonemizer. We are currently working on a similar problem: finding English inclusions in German text and phonemizing them in the correct language. We went for the latter solution, first finding the English inclusions with a NER system and then letting the standard phonemizer do its job word by word.
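
For the first option, building n-gram training data could look roughly like the sketch below. It assumes each sentence is already available as aligned lists of words and context-resolved phonemes (so 'kife' rather than 'kif' where Ezafe applies). The function name `make_ngram_examples` and the (text, phonemes) pair format are illustrative assumptions, not part of any library.

```python
from typing import List, Tuple


def make_ngram_examples(
    words: List[str], phonemes: List[str], n: int = 3
) -> List[Tuple[str, str]]:
    """Slide a window of n words over one aligned sentence.

    `phonemes[i]` must be the pronunciation of `words[i]` *in this
    sentence*, i.e. with context effects such as Ezafe already resolved.
    """
    if len(words) != len(phonemes):
        raise ValueError("words and phonemes must be aligned one-to-one")
    if not words:
        return []
    if len(words) < n:
        # Sentence shorter than the window: keep it as a single example.
        return [(" ".join(words), " ".join(phonemes))]
    return [
        (" ".join(words[i:i + n]), " ".join(phonemes[i:i + n]))
        for i in range(len(words) - n + 1)
    ]


# The two-word example from this thread, used as a bigram (n=2).
examples = make_ngram_examples(["کیف", "من"], ["kife", "man"], n=2)
```

At inference time the text would have to be split into windows of the same size, and overlapping predictions for the same word would need to be reconciled, which this sketch does not cover.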

@MahdiEsrafili
Author

@cschaefer26 Thanks for your reply. It seems that resolving the ambiguity before using the phonemizer will work better.
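
A minimal sketch of that pre-resolution idea, under heavy assumptions: the `_EZ` flag, the toy `LEXICON` dictionary (standing in for a trained word-level phonemizer), and the pair-lookup rule in `needs_ezafe` are all hypothetical placeholders. A real system would probably replace the lookup rule with a tagger that predicts Ezafe from context.

```python
from typing import Callable, Dict, List

# Hypothetical marker appended to a word that should be read with Ezafe.
EZAFE_FLAG = "_EZ"

# Toy lexicon standing in for a trained word-level phonemizer (assumption).
# The flagged entry carries the Ezafe pronunciation from the example above.
LEXICON: Dict[str, str] = {
    "کیف": "kif",
    "کیف" + EZAFE_FLAG: "kife",
    "من": "man",
}

# Toy context rule: (word, next word) pairs known to take Ezafe.
EZAFE_PAIRS = {("کیف", "من")}


def needs_ezafe(words: List[str], i: int) -> bool:
    """Placeholder detector: look up the word and its successor."""
    return i + 1 < len(words) and (words[i], words[i + 1]) in EZAFE_PAIRS


def mark_ezafe(
    words: List[str],
    detector: Callable[[List[str], int], bool] = needs_ezafe,
) -> List[str]:
    """Step 1: resolve the ambiguity by flagging words that take Ezafe."""
    return [w + EZAFE_FLAG if detector(words, i) else w
            for i, w in enumerate(words)]


def phonemize(text: str) -> str:
    """Step 2: phonemize word by word, treating flagged words as distinct."""
    marked = mark_ezafe(text.split())
    return " ".join(LEXICON[w] for w in marked)


print(phonemize("کیف من"))  # kife man
```

With this split, the word-level model only ever sees unambiguous tokens, so the same flagging would have to be applied to the training data as well.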
