python train.py #7
I have found it.

Running `python train.py -c ckpt/config.json -m mymodel` fails with:

Process 0 terminated with the following error:
Hello. The filelist path should be specified in the config. Each text file should list the paths for the wav, F0, and normalized F0 files. Thank you.
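For what it's worth, a filelist in this style usually has one utterance per line. The pipe-separated layout and the directory names below are an assumption based on similar TTS/VC repos, not something confirmed by this one:

```
DUMMY/wavs/spk1_0001.wav|DUMMY/f0/spk1_0001.pt|DUMMY/norm_f0/spk1_0001.pt
DUMMY/wavs/spk1_0002.wav|DUMMY/f0/spk1_0002.pt|DUMMY/norm_f0/spk1_0002.pt
```

Check the repo's data loader to confirm the actual separator and column order before building your filelists.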
Hello, thank you for your response. Could you please let me know which dataset you are using? Is it possible to share it? Additionally, do we need to generate the wav, F0, and normalized F0 files ourselves? Thank you.
Yes, you have to collect the wav files yourself; there are lots of datasets online. Then use something like crepe to generate the pitch embedding from them. The code seems to expect 2-dimensional pitch embeddings, so I'm just unsqueezing a new zero dimension and hoping that works. Then you collect the mean and standard deviation and z-score them for the normalized pitch embeddings. All of that seems to mostly work, keeping in mind there is a minimum length the wavs have to be, based on the segment size. I still hit an issue, though: during evaluation, it fails in the encoder because the pitch embedding mask somehow ends up being about twice the length of the pitch embedding, and the multiplication between them fails. I have not been able to figure out what is wrong yet. Perhaps unsqueezing a dimension for the pitch embedding is not correct, and the pitch embedding is supposed to be some different 2-dimensional structure.
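The z-score step described above can be sketched like this. This is a minimal NumPy sketch, not the repo's actual preprocessing: it assumes per-utterance statistics, unvoiced frames marked with 0.0 (as crepe-style extractors commonly do), and the same `unsqueeze(0)` workaround mentioned above for the extra dimension.

```python
import numpy as np

def normalize_f0(f0: np.ndarray):
    """Z-score a 1-D F0 track (Hz) over voiced frames only.

    Frames with value 0.0 are treated as unvoiced and left at zero.
    Per-utterance statistics are an assumption; the repo may expect
    speaker- or corpus-level stats instead.
    """
    voiced = f0 > 0
    mean = float(f0[voiced].mean())
    std = float(f0[voiced].std())
    norm = np.zeros_like(f0)
    norm[voiced] = (f0[voiced] - mean) / std
    # Add a leading axis so the result is 2-D, mirroring the
    # "unsqueeze a new zero dimension" workaround described above.
    return norm[None, :], mean, std

f0 = np.array([0.0, 220.0, 230.0, 225.0, 0.0])
norm, mean, std = normalize_f0(f0)
print(norm.shape, mean)  # (1, 5) 225.0
```

Whether statistics should be taken per utterance or per speaker is exactly the kind of detail the paper or config would need to settle.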
During evaluation, the length of the pitch embedding mask does not match the length of the pitch embedding itself, leading to a failure. How can this issue be resolved? If you have a solution, could you please share it? |
I have no clue. All I can guess is that the pitch embeddings are supposed to be in some format I don't know. I made some changes that get it past evaluation, but then it dies with a similar issue in training. Obviously either the code or the data is wrong, so I'm guessing it's the pitch embeddings. This code is based on the GradTTS code, like a couple dozen other voice conversion models, and I typically haven't had much of an issue with the pitch embeddings in those other models, so I don't know what's up.
As described in the paper, we use F0 information at four times higher resolution than the mel spectrogram. Therefore, the F0 mask is four times longer than the mel segment mask. Since the hop size is 320, we used `segment_length // 80` in the data loader.
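The arithmetic behind this works out as follows (the segment size is a hypothetical value for illustration; only the hop size of 320 and the 4x F0 resolution come from the comment above):

```python
# Mel frames use a hop of 320 samples; F0 is extracted at 4x the mel
# resolution, i.e. one F0 frame every 320 // 4 = 80 samples.
hop_size = 320
f0_hop = hop_size // 4            # 80 samples per F0 frame
segment_size = 8960               # hypothetical training segment, in samples

mel_len = segment_size // hop_size   # mel frames in the segment
f0_len = segment_size // f0_hop      # F0 frames in the same segment

# The F0 mask must therefore be exactly 4x the mel segment mask.
assert f0_len == 4 * mel_len
print(mel_len, f0_len)  # 28 112
```

This is why a loader that builds the F0 mask at mel resolution (or with the wrong divisor) produces a mask whose length disagrees with the pitch embedding, which matches the multiplication failure reported above.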
Yeah, that's my fault. I haven't read the paper in months. Thank you. |
Thanks for making the training code available, by the way! I'm really looking forward to playing with this model. I just have to produce about 700,000 new pitch embeddings, and I'm only using 2 RTX 3090s, so I'm sure I have quite a bit of training time ahead of me. I converted the diffusion model in GradSVC to use diffusers (discrete time steps) and latent space to dramatically speed up training and to use the diffusers schedulers, so I may take a look at doing that here. If it's not relatively straightforward to reuse that work, though, I'll probably just eat the very long training time.
You're doing interesting work! I plan to use diffusers as well.
I modified the data loader: my F0 files are .npy files, so `f0 = torch.load(f0_path)` was not working. I still get:

Process 0 terminated with the following error:

Any fixes?
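If the F0 files were saved with `np.save`, they have to be read back with `np.load` and converted to a tensor; `torch.load` only understands files written by `torch.save`. A minimal sketch of a loader that handles both (the `.npy` fallback is an assumption about this setup, not the repo's actual code):

```python
import os
import numpy as np
import torch

def load_f0(f0_path: str) -> torch.Tensor:
    """Load an F0 track saved either with torch.save or np.save.

    torch.load fails on raw .npy files, so for that extension we go
    through np.load and wrap the array in a float tensor instead.
    """
    if os.path.splitext(f0_path)[1] == ".npy":
        return torch.from_numpy(np.load(f0_path)).float()
    return torch.load(f0_path)
```

Note that any dtype or shape mismatch (e.g. float64 arrays, or a missing leading dimension) between the `.npy` data and what the model expects would still have to be fixed separately, which may be what the truncated error above is about.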