A couple of questions #37
Comments
Hi, glad you like it. How much data do you have?
Thanks for the info, I will check that out! I have around 7,000 samples of my voice (11 hours 45 minutes in total) and I trained at 48 kHz. It is my own dataset: about 4,500 sentences from the LJSpeech corpus, 500 from Alice in Wonderland, and 2,000 questions from Wikipedia. Here are some examples using WaveRNN. One thing that I find strange is that my sentences go flat toward the end. Most sentences suffer from this.
Nice. Sounds quite good already, but imo the WaveRNN could still improve a bit (the gnarling/hissing). How many steps is this for the vocoder and the TTS model? The hissing could also come from poor durations if the Tacotron attention is off. I've seen some problems with ending pitch for some datasets, mainly male ones. Did you look at the pitch loss? Maybe it's overfitting. It could also be a problem of trailing durations being a bit off, so trimming some silence might help (I just fixed the missing trimming functions in the preprocessing on master). If that doesn't help, you could try to mess around with the pitch loss function and scale it up at the end of the batch (e.g. multiply the loss by an increasing factor). We tried this already and it seemed to help with ending pitch.
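The idea of multiplying the pitch loss by an increasing factor could be sketched roughly as below. This is only an illustration in plain Python, not the code used in the repo: the function name `ramped_l1_loss`, the linear ramp, and the `max_factor` parameter are all assumptions, and it reads "end of the batch" as "toward the end of each sequence".

```python
def ramped_l1_loss(pred, target, max_factor=2.0):
    """Mean absolute pitch error with per-position weights.

    Hypothetical sketch: weights rise linearly from 1.0 at the first
    position to max_factor at the last, so errors near the end of the
    sequence cost more than errors near the start.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    n = len(pred)
    if n == 0:
        return 0.0
    total = 0.0
    for i, (p, t) in enumerate(zip(pred, target)):
        # Fraction of the way through the sequence, 0.0 .. 1.0
        frac = i / (n - 1) if n > 1 else 0.0
        weight = 1.0 + (max_factor - 1.0) * frac
        total += weight * abs(p - t)
    return total / n


# The same error is penalised more at the end than at the start.
start_error = ramped_l1_loss([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], max_factor=3.0)
end_error = ramped_l1_loss([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], max_factor=3.0)
```

In a real training loop the same weighting would be applied to the framework's tensors and masked for padding; the ramp shape and maximum factor would need tuning.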
Hi, I retrained using the latest repo and it was a bit better, but it still wound up getting the end pitches wrong by the end of training. It starts out alright but eventually, I guess, it overfits or something. However, I did try something interesting. I modified the scripts a bit so that the pitches for each phoneme came from the LJSpeech model, and got great results like this! This is using my own pitch conditioning: Notice the endings become very monotone. Now here is the same thing, but with the pitches fed in from LJSpeech: https://vocaroo.com/19ltwQ1gBJOJ To me it sounds much better! I think this could have some interesting applications. It could allow high-quality voices with potentially less forced-alignment data. In any case, I think adding the option to use a different model for duration and/or pitch prediction could be interesting!
Hi, very cool. This is something on my list. I will also try to train multispeaker models, which I hope will improve the pitch prediction. I am pretty sure that some transfer learning will benefit the pitch prediction. So far it seems to me that the pitches of male speakers are harder for the models to pick up; maybe they are harder to extract in the first place (to me the female mel specs are much clearer than the male ones).
Hi, one thing you mentioned was adding silence to the mel spectrogram. I thought I could add silence by playing with the duration of spaces, but it turns out that most words don't actually contain 'silence' phonemes. However, if I insert something like '...' between words, it completely messes up / changes the spectrogram. Is there a token I can insert that adds silence without altering the mels in any other way? If not, would you be able to point me to the part of the code where I could inject my own silence after a phoneme? Thanks! I'm working on a little program that will let me insert pauses and alter the length and pitch of words / phonemes through a user interface, and this would be very helpful! I was also wondering if you knew the meaning of the duration values, that is, whether there is a way to convert those values to milliseconds. For example, if I want a word to last exactly 2 seconds and I know what a duration of 1.0 is in ms, I can easily figure out what constant to multiply the durations of that word by so that it lasts the length of time I want. Same question for pitch: is it possible to target a specific fundamental frequency for a given phoneme? (That would require knowing the base fundamental frequency generated by the network.) Update: I managed to align the phonemes to a grid: https://vocaroo.com/1d2EZ8aXR8AF
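On the duration question, a rough sketch of the arithmetic, assuming (this is not confirmed in the thread) that a duration of 1.0 corresponds to one mel frame, so that one unit lasts `hop_length / sample_rate` seconds. The 48 kHz sample rate comes from the thread; the `hop_length` of 600 and both function names are hypothetical and should be replaced with the values from your own hparams.

```python
def frames_to_ms(frames, hop_length, sample_rate):
    """Convert a duration in mel frames to milliseconds.

    Assumes one duration unit equals one mel frame (an assumption).
    """
    return frames * hop_length / sample_rate * 1000.0


def scale_for_target(durations, target_ms, hop_length, sample_rate):
    """Constant to multiply a word's phoneme durations by so that the
    word lasts target_ms in total."""
    current_ms = frames_to_ms(sum(durations), hop_length, sample_rate)
    if current_ms <= 0:
        raise ValueError("durations must sum to a positive value")
    return target_ms / current_ms


SAMPLE_RATE = 48000  # from the thread
HOP_LENGTH = 600     # hypothetical value, check your hparams

# Three phonemes of one word, in frames (made-up numbers for illustration).
word_durations = [4.0, 6.0, 10.0]
one_frame_ms = frames_to_ms(1.0, HOP_LENGTH, SAMPLE_RATE)
factor = scale_for_target(word_durations, 2000.0, HOP_LENGTH, SAMPLE_RATE)
stretched = [d * factor for d in word_durations]
```

With these assumed values one frame is 12.5 ms, so the 20-frame word lasts 250 ms and needs a factor of 8 to reach 2 seconds. The model may round scaled durations to whole frames, so the final length can be off by a few frames.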
Hey @jmasterx, amazing work here. Thanks for the insights. Your results look great. I am wondering how your dataset was collected. I am a beginner in the TTS area, so I am looking for some best practices. Could you describe it, please? How many samples do you think are enough? For many TTS datasets, except LJSpeech, I couldn't find that many hours from the same speaker. I directed the question to @jmasterx, but anyone please feel free to contribute. Thanks
My dataset was collected by me speaking into a Rode NT1 microphone. I used a tool that I wrote to make it easier to record the samples. The data was recorded at 16-bit, 96 kHz, then downsampled to 48 kHz. However, the model you hear here is very noisy. For the new one I am training, with the same samples, I have processed the data as follows: Noise suppression I have attached my hparams for the new way I'm training, which addresses hop size, max frequency of the spectrograms, etc., for 48 kHz.
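The attached hparams are not reproduced in this thread. As a rough illustration of why hop size and maximum frequency need revisiting at 48 kHz, the sketch below keeps the frame duration of a reference configuration constant and caps the maximum frequency at the Nyquist limit. The reference values (hop size 256 at 22,050 Hz, a common LJSpeech-style setting) and the function names are assumptions, not the values used above.

```python
def scale_hop_length(ref_hop, ref_sample_rate, new_sample_rate):
    """Hop size (in samples) that keeps roughly the same frame duration
    when moving from ref_sample_rate to new_sample_rate."""
    return round(ref_hop * new_sample_rate / ref_sample_rate)


def nyquist(sample_rate):
    """Highest frequency representable at this sample rate, an upper
    bound for the spectrogram's maximum frequency."""
    return sample_rate // 2


# Assumed reference configuration (not taken from the thread).
REF_HOP, REF_SR = 256, 22050
NEW_SR = 48000  # from the thread

new_hop = scale_hop_length(REF_HOP, REF_SR, NEW_SR)
max_freq_limit = nyquist(NEW_SR)
```

In practice the hop size is often rounded to a convenient value, and the window and FFT sizes usually need to grow by a similar ratio.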
@jmasterx Thank you very much for the detailed answer. It will be very useful for me :)
@jmasterx Have you been able to insert pauses into the text? If so, could you please point me in some direction?
Hi!
I have tried the latest version and I am quite pleased with the results; there is some great progress happening on this repository!
I am using 7,000 samples of my own voice at 48 kHz.
I am very happy with pronunciation.
I have a couple of questions:
When I have many sentences together, it does not seem to pause between them and sounds like it is rushing through the sentences. Is this normal, and is there a workaround? My current one is to write '...' instead of '.'
My other question is: are there plans for tokenizable pitch, to be able to do things like emphasize a specific word or give a particular word a specific tone (specified in the text input, not automatically)?
Thanks!