Creating new voices by mixing two voices #282
Replies: 15 comments 44 replies
-
Very interesting.
-
You would have 4 models A, B, C, D and 3 parameters (w1, w2, w3). Then your mix, assuming w1, w2, w3 > 0 and w1 + w2 + w3 < 1, would be something like:
mix = w1*A + w2*B + w3*C + (1 - w1 - w2 - w3)*D
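Concretely, that mix is a convex combination of the generators' weights. Here is a minimal sketch in PyTorch; the so-vits-svc-style checkpoint layout (state_dict under a 'model' key) is an assumption:

```python
import torch

def mix_state_dicts(state_dicts, weights):
    """Return a weighted average of several model state_dicts.

    All state_dicts must share the same keys and shapes. Weights should
    be non-negative and sum to 1 so the result stays in the convex hull
    of the trained models.
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    return {
        k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
        for k in state_dicts[0]
    }

# Hypothetical usage with so-vits-svc-style checkpoints (assumed layout:
# the state_dict sits under the 'model' key of the saved file):
# ckpts = [torch.load(p, map_location="cpu") for p in ("A.pth", "B.pth", "C.pth", "D.pth")]
# w = (0.4, 0.3, 0.2, 0.1)   # last weight = 1 - w1 - w2 - w3
# ckpts[0]["model"] = mix_state_dicts([c["model"] for c in ckpts], w)
# torch.save(ckpts[0], "mix.pth")
```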
-
@philmccarty I'm not a specialist in SVC. I'm not a specialist in ML either, and all this is above my head too. I was just playing with these G_xxxx.pth files, looking at what is inside. You can do it too.
You'll see that what you get is a dictionary, and you can list its keys. Then you can play with it. For example, the pitch contour involves the layers related to f0, so you could restrict the interpolation to those layers, which would only act on the part of the generator that controls the pitch.

I tried that. Pretty funny. With the above voices, you hear the girl's pitch going lower and lower, while keeping the female character of the voice.

But, since you seem to talk about the "4 models" thing: I don't think it is a good idea. Already with 2 voices, we hear the quality decreasing with the amount of mixing (the mixing is maximal at 0.5). So, mixing even more voices would probably sound terrible. In fact, if you averaged an infinite number of trained generators (G_xxxx.pth), the result might well be close to the original G_0.pth pre-trained model.

To make the story short: I'm not sure that such a simple averaging/weighting scheme is the right way. It's "OK" if you add just a bit of voice B to voice A. I'm working on a different idea which might give better results. But I'd prefer not to talk about it now, since it is probably a silly idea...
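A hedged sketch of that kind of selective blending: interpolate only the layers whose names match a filter, leaving the rest of model A untouched. The 'model' key layout and the "f0" naming convention are assumptions based on so-vits-svc-style checkpoints:

```python
import torch

def interpolate_layers(sd_a, sd_b, alpha, key_filter=lambda name: True):
    """Blend two state_dicts: out = (1 - alpha) * A + alpha * B, but only
    for keys accepted by key_filter; other keys keep model A's weights."""
    return {
        name: ((1 - alpha) * t.float() + alpha * sd_b[name].float())
        if key_filter(name) else t
        for name, t in sd_a.items()
    }

# Hypothetical usage (assumed checkpoint layout, hypothetical file names):
# ckpt_a = torch.load("G_A.pth", map_location="cpu")
# ckpt_b = torch.load("G_B.pth", map_location="cpu")
# print(ckpt_a.keys())                      # inspect the top-level dictionary
# print(list(ckpt_a["model"].keys())[:10])  # inspect layer names
# ckpt_a["model"] = interpolate_layers(
#     ckpt_a["model"], ckpt_b["model"], alpha=0.5,
#     key_filter=lambda name: "f0" in name)  # pitch-related layers only
# torch.save(ckpt_a, "G_mixed_f0.pth")
```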
-
Ahh, I wasn't the 4-models guy; that was a different poster. Thanks so much for the explanation. That's helpful to get me started down the road of exploration. However, I'm a bit curious how you know that f0 is pitch, for example. Here's a second question: is there a way to modify the extent to which the inference is applied? I understand how you can average weights between two models to get a blend of them, but what if I wanted a voice to sound "a bit" like the model, instead of -all- of it?
-
Hmm, thanks! I'll take a peek at Netron. In my imagination, it seems possible to find the different elements of a voice, so that one could adjust dials for which aspects of a voice to infer, right? Or maybe that's senseless and not at all how these things work.
-
Want some help? I'm a middling developer but can usually hack and slash my way towards something that works.

…On Tuesday, May 2, 2023, sbersier wrote:
Yes, that's the idea. I'm currently working on it using principal components analysis. I already have some interesting results but it needs more work.
-
This was just a short run to see if it might work. Now I have selected good audio files from Librivox: no noise, no reverb, no boxy voices (i.e., no excess in the low mids), and long enough. It is indeed a bit challenging on Librivox, but possible.
-
So, I trained 38 models (18 male / 20 female voices). Here is a link to the audio generated by the models, as they were trained: Not exceptional but "OK" (~300 samples, ~5 seconds long, trained for 60 epochs, audio from Librivox. Note: I selected the input audio as carefully as I could.)

And here is a link to the audio result for 14 randomly generated voices: The PCA model includes 26 principal components, which are supposed to account for 75% of the observed variance in the parameters of the models.

I'm not sure that increasing the number of models or the training time will make a big difference. In fact, the more I think about it, the more I think that the whole thing was a crappy idea from the start.
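The random-voice idea described above can be sketched as follows: flatten each trained model into one parameter vector, fit a PCA over the collection of vectors, then draw new coefficients in PC space and map them back to parameter space. A minimal NumPy sketch under those assumptions (the flattening of a state_dict to a vector, and the reshaping back, are omitted):

```python
import numpy as np

def fit_pca(param_matrix, n_components):
    """PCA via SVD on mean-centred model parameter vectors.

    param_matrix: (n_models, n_params) array, one flattened model per row.
    Returns the mean vector, the leading singular values, and the leading
    principal directions (rows of vt).
    """
    mean = param_matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(param_matrix - mean, full_matrices=False)
    return mean, s[:n_components], vt[:n_components]

def sample_voice(mean, s, vt, rng, n_models, scale=1.0):
    """Draw a random model in PC space, with per-component standard
    deviation matching the spread observed across the trained models,
    then map the sample back to parameter space."""
    std = s / np.sqrt(n_models - 1)           # singular values -> std devs
    coeffs = rng.standard_normal(len(s)) * std * scale
    return mean + coeffs @ vt
```

With scale < 1 the samples stay closer to the mean model, which may trade diversity for quality.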
-
Thank you @sbersier for this idea. I think we should interpolate only the speaker embeddings, which are under the 'emb_g.weight' key. It is possible to train one model for many voices; these are then identified by speaker ID at inference. Every ID has an associated embedding vector of size 256. So we can interpolate 2 or more of these vectors into another one (one of them). I have just tried it with only two female voices and too little training, interpolated at 50%. The results are inconclusive: it was not trained enough. Here is what I did: in the config, set the number of speakers (there are 200 in the template):

Data are in 44k as different subdirectories A, B, C.

Similarities between voices can be somehow measured by the cosine similarity of the embeddings.
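A minimal sketch of both operations, assuming the (n_speakers, 256) embedding table sits under the 'emb_g.weight' key of the checkpoint's state_dict (helper names are hypothetical):

```python
import torch
import torch.nn.functional as F

def blend_speaker(emb, src_ids, weights, dst_id):
    """Overwrite the embedding row of dst_id with a weighted mix of the
    rows of src_ids. emb: the (n_speakers, dim) 'emb_g.weight' tensor."""
    emb[dst_id] = sum(w * emb[i] for w, i in zip(weights, src_ids))
    return emb

def speaker_similarity(emb, i, j):
    """Cosine similarity between two speaker embeddings (1 = identical
    direction, 0 = orthogonal)."""
    return F.cosine_similarity(emb[i], emb[j], dim=0).item()

# Hypothetical usage on a checkpoint with the assumed layout:
# ckpt = torch.load("G_latest.pth", map_location="cpu")
# emb = ckpt["model"]["emb_g.weight"]
# blend_speaker(emb, src_ids=[0, 1], weights=[0.5, 0.5], dst_id=2)
# torch.save(ckpt, "G_blended.pth")
```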
-
Regarding VRAM, it's the same; only the epochs are very long.
-
Great! And the PCA might well give more pertinent results this time.
-
One thing before you start training 38 voices.
-
I messed up. Now I am impressed. In case someone makes the same mistake as I did: don't skip it; recompute it when you change the speaker table in the config!! I had everything set to zero. Even then, the model was somehow able to distinguish the 3 speakers, but they were closer than they should be. Now the interpolation is working perfectly.
-
Mmmh... That's weird... I forgot to change the number of speakers in the config file. It was left at the default 200 instead of the actual number of speakers: 38. I trained for just 20 epochs, then stopped to have a look. By the way, I don't understand why the number of speakers is not set automatically. config.json contains the full list of speakers; it's just a matter of counting the number of elements in this list... Any reason for not doing it? Anyway, since I'm not sure, I will restart from zero.
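Counting the speakers could indeed be a one-liner, assuming the config keeps the speaker table under a 'spk' key mapping names to IDs (layout assumed from so-vits-svc-style configs):

```python
import json

def count_speakers(config_path):
    """Derive n_speakers from the speaker table of a config.json,
    instead of trusting the template default (assumed 'spk' layout)."""
    with open(config_path) as f:
        cfg = json.load(f)
    return len(cfg["spk"])
```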
-
For those interested, here is an example of random voice generation. It lacks a bit of diversity, but that's probably due to my choice of the original dataset (from Librivox). Note: the sound starts at 39 seconds. test_pca_svc.mov
-
I was playing with the idea of mixing voices in order to create voices that don't exist.
I found a very easy way to do that by just interpolating the weights and biases between two models.
It looks like it works...
res.mp4
Same sentence, interpolating between model A and model B with factors 0.00, 0.25, 0.50, 0.75 and 1.00
(0 = 100% model A, 1 = 100% model B).
Below, a little Python script to do that.
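A minimal sketch of such an interpolation script, assuming so-vits-svc-style checkpoints where the weights sit under the 'model' key (file names in the usage example are hypothetical):

```python
import torch

def interpolate(path_a, path_b, alpha, out_path):
    """Blend every weight and bias of two generator checkpoints:
    out = (1 - alpha) * A + alpha * B.
    alpha = 0 -> 100% model A, alpha = 1 -> 100% model B."""
    a = torch.load(path_a, map_location="cpu")
    b = torch.load(path_b, map_location="cpu")
    for key in a["model"]:
        a["model"][key] = ((1 - alpha) * a["model"][key].float()
                           + alpha * b["model"][key].float())
    torch.save(a, out_path)

# Example (hypothetical file names); sweeping alpha over
# 0.00, 0.25, 0.50, 0.75, 1.00 reproduces the five-step demo above:
# interpolate("G_A.pth", "G_B.pth", 0.25, "G_mix_025.pth")
```

Note that interpolating two independently trained networks weight-by-weight is not guaranteed to land on a good model in general; it appears to work here presumably because both models were fine-tuned from the same pre-trained G_0.pth and so stay in the same basin.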