How to change speaker encoder to one-hot encoder #102
Hi @Jwaminju, not sure if you still need it, but this might be helpful for anyone looking to do the same. The `emb` variable is indeed the right one to change. The embeddings currently used are created using the GE2E loss. I haven't tested it (and wrote this quite quickly), but something like this should work. The original loop in `make_metadata.py` is:

```python
for speaker in sorted(subdirList):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)
    _, _, fileList = next(os.walk(os.path.join(dirName, speaker)))
    # make speaker embedding
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    embs = []
```

Replace it with:

```python
# use enumerate to get the speaker index
for i, speaker in enumerate(sorted(subdirList)):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)
    # -----
    # one-hot embedding
    # create a zero array of shape (256,); note that this shape is right,
    # since squeeze effectively changes the (1, 256) output into shape (256,)
    emb = np.zeros(256, dtype=np.float32)
    # set speaker id
    emb[i] = 1
    utterances.append(emb)
```

The whole second for loop can be removed here, since we no longer need the mel spectrogram to create the embeddings. If you have more than 256 speakers, or you want to change the embedding size to match the number of speakers you have, you'll have to pass the `--dim_emb` parameter on main.
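To make the scheme above concrete, here is a minimal standalone sketch of the same idea. The function name `one_hot_speaker_embeddings` and the speaker folder names are hypothetical, not part of the repo; it just shows that each speaker gets a distinct unit vector of the embedding dimension:

```python
import numpy as np

def one_hot_speaker_embeddings(speaker_names, dim_emb=256):
    """Build a one-hot embedding of size dim_emb for each speaker.

    speaker_names is assumed to be the list of speaker folder names;
    dim_emb must be at least the number of speakers.
    """
    assert len(speaker_names) <= dim_emb
    embeddings = {}
    for i, speaker in enumerate(sorted(speaker_names)):
        emb = np.zeros(dim_emb, dtype=np.float32)
        emb[i] = 1.0  # position i marks speaker i
        embeddings[speaker] = emb
    return embeddings

# usage with three hypothetical speaker folders
embs = one_hot_speaker_embeddings(['p225', 'p226', 'p227'])
print(embs['p226'].shape)  # (256,)
```

Because the speakers are sorted before enumeration, the mapping from speaker name to index is deterministic across runs, which matters if you regenerate the metadata later.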
Hi @yenebeb,
@yenebeb Thanks a lot for the comment!
@WGQ123-code Short answer: yes, it's important to train with the one-hot embedding. Somewhat longer answer: @WildFire212
@yenebeb Thank you for the clarification!
@yenebeb Thank you very much for your guidance! Wish you a happy life!
Hi, I'm interested in this project, and I'm looking forward to running it with my Korean audio files.
But I'm an undergraduate student with little knowledge of audio processing programming.
I've read a lot of issues in this repo, but I was confused, so I opened this issue.
The zero-shot model demo produced a result, but I want to run AutoVC-One-Hot to compare.
Maybe I have to change the make_metadata.py file to use a one-hot encoder.
I tried to change the speaker encoder to one-hot using tf.one_hot, but the printed shape of the variable `emb` (which was [1, 128, 80, 256]) did not match the result of C(melsp) (which was [1, 256]).
I used the same data as the demo wavs file.
Could you help me with how to code the one-hot encodings? Thank you.
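For anyone hitting the same shape mismatch: `tf.one_hot` appends a new axis of size `depth` to whatever tensor you pass it, so applying it to the mel spectrogram (shape [1, 128, 80]) produces [1, 128, 80, 256]. The encoder output C(melsp) has shape [1, 256] because it represents one speaker, so the one-hot replacement should encode the scalar speaker index, not the spectrogram. A minimal NumPy sketch of both cases (the index value 3 is hypothetical):

```python
import numpy as np

# one-hot of a scalar speaker index: matches C(melsp)'s [1, 256] shape
speaker_idx = 3  # hypothetical index of the current speaker
emb = np.zeros((1, 256), dtype=np.float32)
emb[0, speaker_idx] = 1.0
print(emb.shape)  # (1, 256)

# one-hot of the mel spectrogram itself appends a depth axis instead;
# this reproduces the [1, 128, 80, 256] shape described in the question
melsp = np.zeros((1, 128, 80), dtype=np.int64)
wrong = np.eye(256, dtype=np.float32)[melsp]
print(wrong.shape)  # (1, 128, 80, 256)
```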