This repository is an improved version of Real-Time-Voice-Cloning.
- Install ffmpeg. This is necessary for reading audio files.
- Create a new conda environment with
  conda create -n rtvc python=3.7.13
- Install PyTorch. Pick the proposed CUDA version if you have a GPU, otherwise pick CPU. My torch version:
  torch=1.9.1+cu111
  torchvision=0.10.1+cu111
- Install the remaining requirements with
  pip install -r requirements.txt
- Install the spaCy model en_core_web_sm with
  python -m spacy download en_core_web_sm
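For reference, a complete setup on a fresh machine might look like the sketch below. The ffmpeg install command is platform-dependent (the Ubuntu/Debian variant is shown) and the PyTorch wheel index is an assumption for the CUDA 11.1 build; adjust both to your system.

  # install ffmpeg (Ubuntu/Debian; use your platform's package manager otherwise)
  sudo apt-get install ffmpeg
  # create and activate the environment
  conda create -n rtvc python=3.7.13
  conda activate rtvc
  # PyTorch 1.9.1 with CUDA 11.1 (pick the CPU wheels instead if you have no GPU)
  pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  # remaining dependencies and the spaCy model
  pip install -r requirements.txt
  python -m spacy download en_core_web_sm
  # quick sanity check that PyTorch sees the GPU
  python -c "import torch; print(torch.__version__, torch.cuda.is_available())"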
Download dataset:
- LibriSpeech: train-other-500 for training, dev-other for validation (extract as <datasets_root>/LibriSpeech/<dataset_name>)
- VoxCeleb1: Dev A - D for training, Test for validation, as well as the metadata file vox1_meta.csv (extract as <datasets_root>/VoxCeleb1/ and <datasets_root>/VoxCeleb1/vox1_meta.csv)
- VoxCeleb2: Dev A - H for training, Test for validation (extract as <datasets_root>/VoxCeleb2/)
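After extraction, the encoder data should sit under <datasets_root> roughly as follows (the contents of the VoxCeleb folders are omitted here, since they depend on how the archives unpack):

  <datasets_root>/
    LibriSpeech/
      train-other-500/
      dev-other/
    VoxCeleb1/
      vox1_meta.csv
      ...
    VoxCeleb2/
      ...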
Encoder preprocessing:
python encoder_preprocess.py <datasets_root>
Encoder training:
it is recommended to start a visdom server to monitor training with
visdom
then start training with
python encoder_train.py <model_id> <datasets_root>/SV2TTS/encoder
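Visdom serves its dashboard on port 8097 by default, so assuming this fork uses visdom's default settings, one way to run both is:

  # dashboard will be available at http://localhost:8097 by default
  visdom &
  python encoder_train.py <model_id> <datasets_root>/SV2TTS/encoder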
Download dataset:
- LibriSpeech: train-clean-100 and train-clean-360 for training, dev-clean for validation (extract as <datasets_root>/LibriSpeech/<dataset_name>)
- LibriSpeech alignments: merge the directory structure with the LibriSpeech datasets you have downloaded (do not take the alignments for datasets you haven't downloaded, or the scripts will think you have them); see the sketch after this list
- VCTK: used for training and validation
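One way to merge the alignments into the audio folders is to copy them subset by subset. The archive name and its internal layout below are assumptions; adapt the paths to what you actually downloaded, and only copy the subsets you have:

  # hypothetical paths: copy alignment files into the matching LibriSpeech subsets
  rsync -a LibriSpeech-Alignments/LibriSpeech/train-clean-100/ <datasets_root>/LibriSpeech/train-clean-100/
  rsync -a LibriSpeech-Alignments/LibriSpeech/train-clean-360/ <datasets_root>/LibriSpeech/train-clean-360/
  rsync -a LibriSpeech-Alignments/LibriSpeech/dev-clean/ <datasets_root>/LibriSpeech/dev-clean/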
Synthesizer preprocessing:
python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer
Synthesizer training:
python synthesizer_train.py <model_id> <datasets_root>/SV2TTS/synthesizer --use_tb
if you want to monitor the training progress, run
tensorboard --logdir log/vc/synthesizer --host localhost --port 8088
Download dataset:
Same as the synthesizer. You can skip this if you have already downloaded the synthesizer training dataset.
Vocoder preprocessing:
python vocoder_preprocess.py <datasets_root>
Vocoder training:
python vocoder_train.py <model_id> <datasets_root> --use_tb
if you want to monitor the training progress, run
tensorboard --logdir log/vc/vocoder --host localhost --port 8080
Note:
Training checkpoints are saved periodically, so if training is interrupted you can rerun the training command and it will resume from the latest checkpoint.
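For example, assuming checkpoints are looked up by the <model_id> you pass (my_run below is a placeholder), re-running the same command picks up where the last run left off:

  # first run: trains from scratch, saving checkpoints periodically
  python encoder_train.py my_run <datasets_root>/SV2TTS/encoder
  # after an interruption: the same command resumes from the latest checkpoint
  python encoder_train.py my_run <datasets_root>/SV2TTS/encoder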
Terminal:
python demo_cli.py
First enter the number of reference audio files, then the audio file paths, and then the text to synthesize. The attention alignments and mel spectrograms are stored in syn_results/. The generated audio is stored in out_audios/.
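As an illustration, and assuming demo_cli.py reads its prompts from standard input, the whole interaction can also be scripted with a here-document (the file paths and text below are placeholders):

  # inputs in order: number of reference audios, their paths, then the text
  python demo_cli.py <<'EOF'
  2
  /path/to/speaker1_utt1.wav
  /path/to/speaker1_utt2.wav
  Hello, this is a cloned voice speaking.
  EOF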
GUI demo:
python demo_toolbox.py
Download dataset:
- LibriSpeech: test-other (extract as <datasets_root>/LibriSpeech/<dataset_name>)
Preprocessing:
python encoder_test_preprocess.py <datasets_root>
Visualization:
python encoder_test_visualization.py <model_id> <datasets_root>
The results are saved in dim_reduction_results/.
You can download the pretrained model from this and extract it as saved_models/default.
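Assuming the download is a zip archive (the file name below is a placeholder), extraction could look like this; adjust the target path so the model files end up directly under saved_models/default:

  unzip pretrained_models.zip -d saved_models/default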
The audio results are here.