Include stresses prediction #13

Open
stasbel opened this issue Oct 14, 2021 · 4 comments

Comments

@stasbel

stasbel commented Oct 14, 2021

Hi, @cschaefer26
Cool lib!

I was just wondering: is there any particular reason you don't include stress prediction in the pipeline?
Both "cmudict-ipa" and "wikipron" have stress labels included.
The phoneme tokenizers from the pretrained checkpoints lack the ' and , symbols (this was probably done to avoid a collision with punctuation, but that is pretty easy to avoid).
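
To illustrate why the collision is avoidable: assuming the stress marks meant here are the IPA primary and secondary stress characters (ˈ, U+02C8 and ˌ, U+02CC), they are different code points from the ASCII apostrophe and comma, so a tokenizer can tell them apart. A minimal sketch (the constant and function names are hypothetical, not part of the library):

```python
import unicodedata

# Assumption: the stress marks in question are the IPA modifier letters,
# not the ASCII apostrophe and comma they visually resemble.
IPA_PRIMARY_STRESS = "\u02c8"    # ˈ
IPA_SECONDARY_STRESS = "\u02cc"  # ˌ
ASCII_PUNCTUATION = {"'", ","}


def describe(chars):
    """Return (character, code point, Unicode name) for each character."""
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


stress_marks = {IPA_PRIMARY_STRESS, IPA_SECONDARY_STRESS}

# The two sets share no characters, so nothing collides at the code-point level.
collision = stress_marks & ASCII_PUNCTUATION

for row in describe([IPA_PRIMARY_STRESS, "'", IPA_SECONDARY_STRESS, ","]):
    print(row)
```

If a dataset instead writes stress with a plain apostrophe or comma, the characters would need to be normalized before this distinction holds.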

@cschaefer26
Collaborator

Hi stasbel,

The stresses are intentionally excluded because they are quite hard to predict and make the overall result worse (they are also commonly excluded from benchmarks in the literature). If you want to train a model with stresses, you can simply add them to the symbols and proceed with preprocessing and training. If I have time, I will try to train a model purely on stress prediction (phonemes in, phonemes + stress out), which I believe would make the overall performance quite good.
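
The "phonemes in, phonemes + stress out" idea can be sketched as a data-preparation step: each stressed transcription becomes the target, and the same transcription with stress marks removed becomes the input. This is only an illustration under assumptions. The helper names are hypothetical, the stress marks are assumed to be the IPA characters ˈ and ˌ, and the sample transcription is made up.

```python
from typing import Iterable, Tuple

# Assumption: stress is marked with the IPA characters ˈ and ˌ. For
# ARPAbet-style data the markers would be the digits 0, 1 and 2 instead.
STRESS_MARKS = ("\u02c8", "\u02cc")


def strip_stress(phonemes: str, marks: Iterable[str] = STRESS_MARKS) -> str:
    """Remove every stress mark from a phoneme string."""
    marks = set(marks)
    return "".join(ch for ch in phonemes if ch not in marks)


def make_stress_pair(stressed: str) -> Tuple[str, str]:
    """Build one (input, target) example: unstressed phonemes in, stressed out."""
    return strip_stress(stressed), stressed


# Made-up transcription, used only to show the shape of a training pair.
pair = make_stress_pair("ˈfoʊnˌiːm")
print(pair)
```

Pairs built this way could then be fed to the usual preprocessing and training steps as if the unstressed string were the "text" side.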

@stasbel
Author

stasbel commented Oct 15, 2021

This is very interesting, as stresses are very important for a number of tasks.
Looking forward to hearing from you!

@lorinczb

lorinczb commented Nov 5, 2021

Hi @cschaefer26,
I have added the numbers (that mark the stress) to the symbols as you suggested above, and changed the list of phones to include the accented phones, but at prediction time I still get the unaccented phones. Sorry, I have not spent much time looking into the model, but maybe you have a hint as to why that might happen.
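
One way to narrow this kind of problem down is to check that every phone token in the training transcriptions is actually present in the configured symbol list. The sketch below is generic Python, not the library's own API: the function name is hypothetical, and it assumes each transcription is already split into phone tokens (for example ARPAbet-style tokens with stress digits).

```python
def missing_symbols(transcriptions, symbols):
    """Return the phone tokens used in the data but absent from the symbol list."""
    known = set(symbols)
    used = {token for transcription in transcriptions for token in transcription}
    return sorted(used - known)


# Made-up data: the symbol list has no stress-bearing phones yet.
symbols = ["HH", "AH", "L", "OW"]
transcriptions = [["HH", "AH0", "L", "OW1"]]

print(missing_symbols(transcriptions, symbols))
```

An empty result means the symbol list covers the data; anything it returns is a token the model's vocabulary would not know about.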

@cschaefer26
Collaborator

cschaefer26 commented Jan 21, 2022

Hi, did you preprocess the data with the updated config and train a new model? You could check whether the processed data looks correct in datasets/combined_dataset.txt
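
A quick way to run that check is to count how many lines of the processed file contain a stress marker at all. This is a rough sketch under assumptions: it treats datasets/combined_dataset.txt as plain UTF-8 text with one entry per line (the actual format may differ), the function name is hypothetical, and the digit markers assume ARPAbet-style stress numbers, so digits that appear elsewhere in a line would also be counted.

```python
from pathlib import Path


def count_stressed_lines(lines, markers):
    """Return (total lines, lines containing at least one stress marker)."""
    total = stressed = 0
    for line in lines:
        total += 1
        if any(marker in line for marker in markers):
            stressed += 1
    return total, stressed


# Path taken from the comment above; the file format is an assumption.
path = Path("datasets/combined_dataset.txt")
if path.exists():
    with path.open(encoding="utf-8", errors="replace") as f:
        total, stressed = count_stressed_lines(f, markers=("0", "1", "2"))
    print(f"{stressed} of {total} lines contain a stress marker")
```

If the second number is zero, the stress marks were most likely lost before or during preprocessing rather than in the model.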
