Creating new voices by mixing two voices #282
Replies: 15 comments 44 replies
-
Very interesting.
-
You would have 4 models A, B, C, D and 3 parameters (w1, w2, w3). Then your mix, assuming w1, w2, w3 > 0 and w1 + w2 + w3 < 1, would be something like:
mix = w1*A + w2*B + w3*C + (1 - w1 - w2 - w3)*D
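Concretely, that mix is a convex combination of the generators' weights. Here is a minimal sketch in PyTorch; the so-vits-svc-style checkpoint layout (state_dict under a 'model' key) is an assumption:

```python
import torch

def mix_state_dicts(state_dicts, weights):
    """Return a weighted average of several model state_dicts.

    All state_dicts must share the same keys and shapes. Weights should
    be non-negative and sum to 1 so the result stays in the convex hull
    of the trained models.
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    return {
        k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
        for k in state_dicts[0]
    }

# Hypothetical usage with so-vits-svc-style checkpoints (assumed layout:
# the state_dict sits under the 'model' key of the saved file):
# ckpts = [torch.load(p, map_location="cpu") for p in ("A.pth", "B.pth", "C.pth", "D.pth")]
# w = (0.4, 0.3, 0.2, 0.1)   # last weight = 1 - w1 - w2 - w3
# ckpts[0]["model"] = mix_state_dicts([c["model"] for c in ckpts], w)
# torch.save(ckpts[0], "mix.pth")
```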
-
@philmccarty I'm not a specialist in SVC. I'm not a specialist in ML either, and all this is above my head too. I was just playing with these G_xxxx.pth files, looking at what is inside. You can do it too.
You'll see that what you get is a dictionary, and you can list its keys. Then you can play with it. For example, the pitch contour involves the layers related to f0, so you could restrict the interpolation to those layers, which would only act on the part of the generator that controls the pitch.

I tried that. Pretty funny. With the above voices, you hear the girl's pitch going lower and lower, while keeping the female character of the voice.

But, since you seem to talk about the "4 models" thing: I don't think it is a good idea. Already with 2 voices, we hear the quality decreasing with the amount of mixing (the mixing is maximal at 0.5). So, mixing even more voices would probably sound terrible. In fact, if you averaged an infinite number of trained generators (G_xxxx.pth), the result might well be close to the original G_0.pth pre-trained model.

To make the story short: I'm not sure that such a simple averaging/weighting scheme is the right way. It's "OK" if you add just a bit of voice B to voice A. I'm working on a different idea which might give better results. But I'd prefer not to talk about it now, since it is probably a silly idea...
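A hedged sketch of that kind of selective blending: interpolate only the layers whose names match a filter, leaving the rest of model A untouched. The 'model' key layout and the "f0" naming convention are assumptions based on so-vits-svc-style checkpoints:

```python
import torch

def interpolate_layers(sd_a, sd_b, alpha, key_filter=lambda name: True):
    """Blend two state_dicts: out = (1 - alpha) * A + alpha * B, but only
    for keys accepted by key_filter; other keys keep model A's weights."""
    return {
        name: ((1 - alpha) * t.float() + alpha * sd_b[name].float())
        if key_filter(name) else t
        for name, t in sd_a.items()
    }

# Hypothetical usage (assumed checkpoint layout, hypothetical file names):
# ckpt_a = torch.load("G_A.pth", map_location="cpu")
# ckpt_b = torch.load("G_B.pth", map_location="cpu")
# print(ckpt_a.keys())                      # inspect the top-level dictionary
# print(list(ckpt_a["model"].keys())[:10])  # inspect layer names
# ckpt_a["model"] = interpolate_layers(
#     ckpt_a["model"], ckpt_b["model"], alpha=0.5,
#     key_filter=lambda name: "f0" in name)  # pitch-related layers only
# torch.save(ckpt_a, "G_mixed_f0.pth")
```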
-
Ahh, I wasn't the 4-models guy; that was a different poster. Thanks so much for the explanation. That's helpful to get me started down the road of exploration. However, I'm a bit curious how you know that f0 is pitch, for example. Here's a second question: is there a way to modify the extent to which the inference is applied? I understand how you can average weights between two models to get a blend of them, but what if I wanted a voice to sound "a bit" like the model, instead of -all- of it?
-
Hmm, thanks! I'll take a peek at Netron. In my imagination, it seems possible to find the different elements of a voice, so that one could adjust dials for which aspects of a voice to infer, right? Or maybe that's senseless and not at all how these things work.
-
Want some help? I'm a middling developer but can usually hack and slash my way towards something that works.

…On Tuesday, May 2, 2023, sbersier wrote:
Yes, that's the idea. I'm currently working on it using principal components analysis. I already have some interesting results but it needs more work.
-
This was just a short run to see if it might work. Now I have selected good audio files from Librivox: no noise, no reverb, no boxy voices (i.e., no excess in the low mids), and long enough. It is indeed a bit challenging on Librivox, but possible.
-
So, I trained 38 models (18 male / 20 female voices). Here is a link to the audio generated by the models, as they were trained: Not exceptional but "OK" (~300 samples, ~5 seconds long, trained for 60 epochs, audio from Librivox. Note: I selected the input audio as carefully as I could.)

And here is a link to the audio result for 14 randomly generated voices: The PCA model includes 26 principal components, which are supposed to account for 75% of the observed variance in the parameters of the models.

I'm not sure that increasing the number of models or the training time will make a big difference. In fact, the more I think about it, the more I think that the whole thing was a crappy idea from the start.
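The random-voice idea described above can be sketched as follows: flatten each trained model into one parameter vector, fit a PCA over the collection of vectors, then draw new coefficients in PC space and map them back to parameter space. A minimal NumPy sketch under those assumptions (the flattening of a state_dict to a vector, and the reshaping back, are omitted):

```python
import numpy as np

def fit_pca(param_matrix, n_components):
    """PCA via SVD on mean-centred model parameter vectors.

    param_matrix: (n_models, n_params) array, one flattened model per row.
    Returns the mean vector, the leading singular values, and the leading
    principal directions (rows of vt).
    """
    mean = param_matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(param_matrix - mean, full_matrices=False)
    return mean, s[:n_components], vt[:n_components]

def sample_voice(mean, s, vt, rng, n_models, scale=1.0):
    """Draw a random model in PC space, with per-component standard
    deviation matching the spread observed across the trained models,
    then map the sample back to parameter space."""
    std = s / np.sqrt(n_models - 1)           # singular values -> std devs
    coeffs = rng.standard_normal(len(s)) * std * scale
    return mean + coeffs @ vt
```

With scale < 1 the samples stay closer to the mean model, which may trade diversity for quality.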
-
Thank you @sbersier for this idea. I think we should interpolate only the speaker embeddings, which are under the 'emb_g.weight' key. It is possible to train one model for many voices; these are then identified by speaker ID at inference. Every ID has an associated embedding vector of size 256. So we can interpolate 2 or more of these vectors into another one (one of them). I have just tried it with only two female voices and too little training, interpolated at 50%. The results are inconclusive: it was not trained enough. Here is what I did: in the config, set the number of speakers (there are 200 in the template):

Data are in 44k as different subdirectories A, B, C.

Similarities between voices can be somehow measured by the cosine similarity of the embeddings.
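A minimal sketch of both operations, assuming the (n_speakers, 256) embedding table sits under the 'emb_g.weight' key of the checkpoint's state_dict (helper names are hypothetical):

```python
import torch
import torch.nn.functional as F

def blend_speaker(emb, src_ids, weights, dst_id):
    """Overwrite the embedding row of dst_id with a weighted mix of the
    rows of src_ids. emb: the (n_speakers, dim) 'emb_g.weight' tensor."""
    emb[dst_id] = sum(w * emb[i] for w, i in zip(weights, src_ids))
    return emb

def speaker_similarity(emb, i, j):
    """Cosine similarity between two speaker embeddings (1 = identical
    direction, 0 = orthogonal)."""
    return F.cosine_similarity(emb[i], emb[j], dim=0).item()

# Hypothetical usage on a checkpoint with the assumed layout:
# ckpt = torch.load("G_latest.pth", map_location="cpu")
# emb = ckpt["model"]["emb_g.weight"]
# blend_speaker(emb, src_ids=[0, 1], weights=[0.5, 0.5], dst_id=2)
# torch.save(ckpt, "G_blended.pth")
```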
-
Regarding VRAM, it's the same; only the epochs are very long.
-
Great! And the PCA might well give more pertinent results this time.
-
One thing before you start training 38 voices.
-
I messed up. Now I am impressed. In case someone makes the same mistake as I did: don't skip it; recompute it when you change the speaker table in the config!! I had everything set to zero. Even then, the model was somehow able to distinguish the 3 speakers, but they were closer than they should be. Now the interpolation is working perfectly.
-
Mmmh... That's weird... I forgot to change the number of speakers in the config file. It was left at the default 200 instead of the actual number of speakers: 38. I trained for just 20 epochs, then stopped to have a look. By the way, I don't understand why the number of speakers is not set automatically. config.json contains the full list of speakers; it's just a matter of counting the number of elements in this list... Any reason for not doing it? Anyway, since I'm not sure, I will restart from zero.
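Counting the speakers could indeed be a one-liner, assuming the config keeps the speaker table under a 'spk' key mapping names to IDs (layout assumed from so-vits-svc-style configs):

```python
import json

def count_speakers(config_path):
    """Derive n_speakers from the speaker table of a config.json,
    instead of trusting the template default (assumed 'spk' layout)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return len(cfg["spk"])
```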
-
For those interested, here is an example of random voice generation. It lacks a bit of diversity, but that's probably due to my choice of the original dataset (from Librivox). Note: the sound starts at 39 seconds. test_pca_svc.mov
-
I was playing with the idea of mixing voices in order to create voices that don't exist.
I found a very easy way to do that by just interpolating the weights and biases between two models.
It looks like it works...
res.mp4
Same sentence, interpolating between model A and model B with factors 0.00, 0.25, 0.50, 0.75 and 1.00
(0 = 100% model A, 1 = 100% model B).
Below, a little Python script to do that.
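A minimal sketch of such an interpolation script, assuming so-vits-svc-style checkpoints where the weights sit under the 'model' key (file names in the usage example are hypothetical):

```python
import torch

def interpolate(path_a, path_b, alpha, out_path):
    """Blend every weight and bias of two generator checkpoints:
    out = (1 - alpha) * A + alpha * B.
    alpha = 0 -> 100% model A, alpha = 1 -> 100% model B."""
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    for key in a["model"]:
        a["model"][key] = ((1 - alpha) * a["model"][key].float()
                           + alpha * b["model"][key].float())
    torch.save(a, out_path)

# Example (hypothetical file names); sweeping alpha over
# 0.00, 0.25, 0.50, 0.75, 1.00 reproduces the five-step demo above:
# interpolate("G_A.pth", "G_B.pth", 0.25, "G_mix_025.pth")
```

Note that interpolating two independently trained networks weight-by-weight is not guaranteed to land on a good model in general; it appears to work here presumably because both models were fine-tuned from the same pre-trained G_0.pth and so stay in the same basin.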