Question about grapheme set #14

Open
kkp15 opened this issue Jan 13, 2022 · 11 comments

Comments

@kkp15

kkp15 commented Jan 13, 2022

Hello. Thank you for this amazing repository!
I have a question though. What’s the easiest way to get a unique grapheme set for a specific language? How did you get that list when training a multilingual model?

@cschaefer26
Collaborator

Hi, you can just extract it from the training data, e.g. collect the set of characters from it and paste the result into the config. That's basically how I proceeded for the trained models (I filtered some graphemes though).
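
The extraction described above could be sketched as follows. This is a minimal sketch, not the code used for the trained models: the `grapheme_set` helper, the sample words, and the `min_count` filter are all assumptions made for illustration (the thread does not say how graphemes were filtered).

```python
from collections import Counter


def grapheme_set(words, min_count=1):
    """Return the sorted characters that occur at least min_count times."""
    counts = Counter(ch for word in words for ch in word)
    return sorted(ch for ch, n in counts.items() if n >= min_count)


# Hypothetical training words; replace them with the words from your own dataset.
words = ["haus", "maus", "straße"]

# Join the result into one string that can be pasted into the config.
print("".join(grapheme_set(words)))

# Assumed filtering rule: drop characters seen fewer than two times.
print("".join(grapheme_set(words, min_count=2)))
```

Raising `min_count` is one simple way to drop rare characters that are likely noise, at the cost of losing any word that contains them.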

@skanda1005

Hi @cschaefer26, I want to train the model for Hindi, but I'm unsure how to set up the config file, especially the input and output, because I'm getting an index out of range error. Thanks!

@cschaefer26
Collaborator

cschaefer26 commented Jun 9, 2022

Hi, you can use the standard config file, but you will have to adjust the language and:

text_symbols
phoneme_symbols

according to the symbols that occur in your dataset!
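
Before training, it may help to check that the symbol lists actually cover the dataset. The sketch below assumes samples in a (language, word, phonemes) form and uses a hypothetical helper, `missing_symbols`; the Hindi sample and the deliberately incomplete symbol lists are made up. Whether uncovered symbols are what caused the index out of range error above is not confirmed in this thread.

```python
def missing_symbols(samples, text_symbols, phoneme_symbols):
    """Collect symbols in the data that the two symbol lists do not cover."""
    text_known, phon_known = set(text_symbols), set(phoneme_symbols)
    missing_text, missing_phon = set(), set()
    for _lang, word, phonemes in samples:
        # set() splits a string into characters and a list into its elements.
        missing_text.update(set(word) - text_known)
        missing_phon.update(set(phonemes) - phon_known)
    return missing_text, missing_phon


# Hypothetical sample and deliberately incomplete symbol lists.
samples = [("hi", "नम", "nəm")]
text_symbols = "न"
phoneme_symbols = "nə"

missing_text, missing_phon = missing_symbols(samples, text_symbols, phoneme_symbols)
print("missing text symbols:", sorted(missing_text))
print("missing phoneme symbols:", sorted(missing_phon))
```

Both sets should be empty once `text_symbols` and `phoneme_symbols` match the data.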

@skanda1005

Got it working, thanks!

@cschaefer26
Collaborator

Nice, let me know if you run into issues.

@skanda1005

Hi, I realized that when some phonemes in my phoneme set consist of multiple characters, they don't get parsed: those multi-character phones are either removed or replaced after preprocessing.
Any solutions to this issue?

@cschaefer26
Collaborator

cschaefer26 commented Jun 10, 2022

Hi, multiple characters shouldn't be a problem. The cmudict model has multi-char phonemes: https://github.com/as-ideas/DeepPhonemizer#:~:text=en_us_cmudict_forward

You can pass each sample as a tuple of (str, str, list), e.g. ('en', 'word', ['p', 'h', 'o', 'neme'])
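
As an illustration of the list form, the sketch below builds a few samples with multi-character phonemes and collects the distinct phoneme tokens. The sample data and the `phoneme_inventory` helper are hypothetical, and it is an assumption that every distinct token then needs its own entry in `phoneme_symbols`.

```python
# Hypothetical samples in the (language, word, phonemes) form described above.
samples = [
    ("en", "cat", ["K", "AE", "T"]),
    ("en", "cheese", ["CH", "IY", "Z"]),
]


def phoneme_inventory(samples):
    """Return every distinct phoneme token, keeping multi-char tokens whole."""
    inventory = set()
    for _lang, _word, phonemes in samples:
        inventory.update(phonemes)  # list elements, not single characters
    return sorted(inventory)


print(phoneme_inventory(samples))
```

Because the phonemes are given as a list, a token such as `AE` stays one unit instead of being split into `A` and `E`.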

@skanda1005

So, I am training it on Hindi, and phones like t͡ʃ and ẽː don't get parsed. I used these as inputs for the tokenizer and got no output, meaning they don't get tokenized.
PS: t͡ʃ is actually 3 chars, not 2. Would that cause a problem?

@cschaefer26
Collaborator

cschaefer26 commented Jun 10, 2022

No, that should be fine. Actually, your example looks like it should produce three phoneme characters as output rather than a single phoneme instance incorporating all three characters (t͡ʃ). Just make sure the symbols are present in the config (phoneme_symbols, e.g. '͡').
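
To see which individual characters a phone is made of, and therefore which symbols have to appear in `phoneme_symbols`, the standard `unicodedata` module can list the code points. This is a small sketch; the `code_points` helper is hypothetical, and how ẽ is encoded in the dataset above is unknown, so the NFD step only shows one possible encoding.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


affricate = "t\u0361\u0283"  # t͡ʃ written with explicit escapes
print(len(affricate))  # 3
for cp, name in code_points(affricate):
    print(cp, name)

# A precomposed ẽ (U+1EBD) followed by the length mark ː (U+02D0).
# NFD normalization splits ẽ into 'e' plus a combining tilde (U+0303).
nasal_long = unicodedata.normalize("NFD", "\u1ebd\u02d0")
for cp, name in code_points(nasal_long):
    print(cp, name)
```

Whichever normalization form is chosen, applying the same one to both the dataset and the symbol lists keeps the two consistent.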

@skanda1005

Oh, so should I separate the characters of that phone into three different elements in the list?
e.g. ['t', '͡', 'ʃ']

@cschaefer26
Collaborator

Yes, that's also how the standard config is set. You can then simply provide the phonemized words as strings.


3 participants