training for Persian #19

MahdiEsrafili · 2022-04-10T13:37:21Z

Hello. Thanks for your great work. I want to train the model on Persian data. In Persian, some words are linked based on context using the 'Ezafe', which is pronounced but not written. For example, here are two words and their phonemes:
کیف: kif
من: man
But we read the phrase 'کیف من' as 'kife man', not 'kif man' (Persian is written right to left). Also, a word's pronunciation can differ depending on its meaning.
My question is: how can I change the model to account for these issues?
Thanks

cschaefer26 · 2022-04-12T12:24:50Z

Hi, these context dependencies are generally not easy to solve. One option is to train the model on n-grams of words (e.g., produce training data with 3 words at once, i.e. trigrams) in which the ambiguity is already resolved, and then apply the model to the text in the same n-gram fashion. Another option is to distinguish the words via some kind of flag or added text (e.g., use 'kife' instead of 'kif' according to the pronunciation), so that the ambiguity is resolved before you use the phonemizer. We are currently working on a similar problem, namely finding English inclusions in German text and phonemizing them in the correct language. We went for the latter solution: we first find the English inclusions with a NER system and then let the standard phonemizer do its job word-wise.
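As a rough illustration of the first option, here is a minimal sketch of how trigram training examples could be built from sentences whose pronunciation has already been resolved by hand. The function name `make_ngram_examples`, the language code `"fa"`, and the `(lang, text, phonemes)` tuple layout are assumptions made for this example, not something confirmed in this thread; adapt them to whatever format your training pipeline expects.

```python
from typing import List, Tuple

Example = Tuple[str, str, str]  # (language code, text, phonemes); layout is an assumption


def make_ngram_examples(
    words: List[str], phonemes: List[str], n: int = 3, lang: str = "fa"
) -> List[Example]:
    """Slide a window of n words over a sentence with word-aligned phonemes.

    `phonemes[i]` must be the context-resolved pronunciation of `words[i]`,
    e.g. 'kife' (not 'kif') when the word carries an Ezafe in this sentence.
    """
    if len(words) != len(phonemes):
        raise ValueError("words and phonemes must be aligned one-to-one")
    if not words:
        return []
    # Sentences shorter than n words are kept as a single example.
    if len(words) <= n:
        return [(lang, " ".join(words), " ".join(phonemes))]
    return [
        (lang, " ".join(words[i:i + n]), " ".join(phonemes[i:i + n]))
        for i in range(len(words) - n + 1)
    ]


# The two-word phrase from the question, with the Ezafe already resolved.
examples = make_ngram_examples(["کیف", "من"], ["kife", "man"])
```

At inference time the text would have to be split into the same kind of windows, so the model sees each word together with its neighbours.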

MahdiEsrafili · 2022-04-12T14:10:14Z

@cschaefer26 Thanks for your reply. It seems that resolving the ambiguity before using the phonemizer will work better.
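A minimal sketch of what resolving the ambiguity first might look like, assuming a separate upstream step (for example a tagger) has already decided which word positions carry an Ezafe. Everything here is hypothetical: `mark_ezafe`, `phonemize_wordwise`, the use of the kasra character as the flag, and the toy lexicon that stands in for a trained word-wise phonemizer.

```python
from typing import Dict, List, Set

# Assumption: the kasra diacritic (U+0650) is used as an explicit Ezafe flag.
EZAFE_MARK = "\u0650"


def mark_ezafe(words: List[str], ezafe_positions: Set[int]) -> List[str]:
    """Append the Ezafe flag to every word whose index is in ezafe_positions."""
    return [
        word + EZAFE_MARK if index in ezafe_positions else word
        for index, word in enumerate(words)
    ]


def phonemize_wordwise(words: List[str], lexicon: Dict[str, str]) -> str:
    """Toy stand-in for a word-wise phonemizer: a plain dictionary lookup."""
    return " ".join(lexicon[word] for word in words)


# Toy lexicon: the flagged and unflagged forms are distinct entries.
TOY_LEXICON = {
    "کیف": "kif",
    "کیف" + EZAFE_MARK: "kife",
    "من": "man",
}

# Position 0 (the first word) is assumed to have been tagged as carrying an Ezafe.
marked = mark_ezafe(["کیف", "من"], {0})
result = phonemize_wordwise(marked, TOY_LEXICON)
```

With this split, the phonemizer only ever sees unambiguous tokens, and the hard part (deciding where the Ezafe goes) lives entirely in the upstream step.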
