Unintelligible Results on Custom Data and on the sample /Wavs from this Repo #124

jvel07 commented May 7, 2024

@auspicious3000
Hey there! :)

We are trying to reproduce the results of your paper, both on custom data and on combinations of the /wavs shipped with this repo, but the outputs are not acceptable. Only with specific wav samples from the repo, such as p225 (input) and p256 (reference), does the output come close to usable. (We are using your pretrained models.)

As for the custom samples, I generated the data as follows:

  1. After running make_spect.py on a couple of custom samples, I wrote the script below to build a metadata file whose entries are [filename, speaker embedding, spectrogram]:
import os
import pickle
from model_bl import D_VECTOR
from collections import OrderedDict
import numpy as np
import torch

C = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
# loading the speaker encoder model
c_checkpoint = torch.load('ckpt/3000000-BL.ckpt')
new_state_dict = OrderedDict()
for key, val in c_checkpoint['model_b'].items():
    new_key = key[7:]  # strip the 'module.' prefix added by DataParallel
    new_state_dict[new_key] = val
C.load_state_dict(new_state_dict)
num_uttrs = 2   # utterances averaged into each speaker embedding
len_crop = 128  # crop length in mel frames fed to the speaker encoder


# Directory containing mel-spectrograms
rootDir = './spmel'
dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)


# create metadata that contains [filename, speaker embedding, spectrogram]
speakers = []
for speaker in sorted(subdirList):
    print('Processing speaker: %s' % speaker)
    _, _, fileList = next(os.walk(os.path.join(dirName, speaker)))

    # make speaker embedding
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)

    embs = []
    for i in range(num_uttrs):
        path_file = os.path.join(dirName, speaker, fileList[idx_uttrs[i]])
        print(path_file)
        tmp = np.load(path_file)
        # indices of the remaining utterances, used as fallbacks
        candidates = np.delete(np.arange(len(fileList)), idx_uttrs)
        # choose another utterance if the current one is too short
        while tmp.shape[0] < len_crop:
            print("short")
            idx_alt = np.random.choice(candidates)
            tmp = np.load(os.path.join(dirName, speaker, fileList[idx_alt]))
            candidates = np.delete(candidates, np.argwhere(candidates==idx_alt))
        # +1 so an utterance of exactly len_crop frames does not crash
        # np.random.randint with an empty range
        left = np.random.randint(0, tmp.shape[0]-len_crop+1)
        melsp = torch.from_numpy(tmp[np.newaxis, left:left+len_crop, :]).cuda()
        emb = C(melsp)
        embs.append(emb.detach().squeeze().cpu().numpy())
    # average the utterance embeddings into a single speaker embedding
    spk_emb = np.mean(embs, axis=0)

    # [filename, speaker embedding, spectrogram]
    # filename is really the speaker directory name; os.path.basename is
    # portable, unlike splitting on '\\' (which only works on Windows)
    filename = os.path.basename(os.path.dirname(path_file))
    # note: melsp is only the last 128-frame crop of the last sampled
    # utterance, not the full spectrogram of a chosen file
    speakers.append([filename,
                     spk_emb,
                     melsp.squeeze().cpu().numpy()])

with open(os.path.join(rootDir, 'cremad_metadata.pkl'), 'wb') as handle:
    pickle.dump(speakers, handle)
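
To sanity-check the resulting pickle before conversion, here is a quick inspection of the stored shapes (with the script above, each entry should hold a (256,) speaker embedding and a (128, 80) mel crop):

import pickle

with open('./spmel/cremad_metadata.pkl', 'rb') as handle:
    metadata = pickle.load(handle)

for filename, emb, melsp in metadata:
    # expect a (256,) speaker embedding and a (128, 80) mel crop per speaker
    print(filename, emb.shape, melsp.shape)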
  2. I then ran a sample conversion with a female voice as input and a male voice as reference, using the models available for download (roughly the steps sketched after this list). The result does not match the target style, and the speech content is unintelligible.
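
For reference, a minimal sketch of that conversion step, adapted from the repo's conversion.ipynb; it assumes the downloadable autovc.ckpt checkpoint, the model_vc.Generator class from this repo, and the metadata format produced above, with the first entry taken as source and the second as target:

import pickle
from math import ceil

import numpy as np
import torch

from model_vc import Generator


def pad_seq(x, base=32):
    # pad the mel-spectrogram so its length is a multiple of the
    # Generator's downsampling factor
    len_out = int(base * ceil(x.shape[0] / base))
    len_pad = len_out - x.shape[0]
    assert len_pad >= 0
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad


device = 'cuda:0'
G = Generator(32, 256, 512, 32).eval().to(device)
g_checkpoint = torch.load('autovc.ckpt', map_location=device)
G.load_state_dict(g_checkpoint['model'])

with open('./spmel/cremad_metadata.pkl', 'rb') as handle:
    metadata = pickle.load(handle)
src, trg = metadata[0], metadata[1]  # [filename, speaker embedding, spectrogram]

x_org, len_pad = pad_seq(src[2])
uttr_org = torch.from_numpy(x_org[np.newaxis, :, :]).to(device)
emb_org = torch.from_numpy(src[1][np.newaxis, :]).to(device)
emb_trg = torch.from_numpy(trg[1][np.newaxis, :]).to(device)

with torch.no_grad():
    _, x_identic_psnt, _ = G(uttr_org, emb_org, emb_trg)

# strip the padding again before passing the result to the vocoder
if len_pad == 0:
    uttr_trg = x_identic_psnt[0, 0, :, :].cpu().numpy()
else:
    uttr_trg = x_identic_psnt[0, 0, :-len_pad, :].cpu().numpy()

uttr_trg then goes to the WaveNet vocoder, as in the notebook.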

Is there anything else I am missing?
Do you have a script of your own that can be used instead of the referenced script from KnurpsBram?
