Voice Conversion Experiments for the THUHCSI Course: <Digital Processing of Speech Signals>
- Install sox from http://sox.sourceforge.net/ or via apt install sox
- Install ffmpeg from https://www.ffmpeg.org/download.html#build-linux or via apt install ffmpeg
- Set up the Python environment:
python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
pip3 install -r dpss-exp3-VC-BNF/requirement_torch19.txt
# or use this if you prefer the torch 1.8 version
pip3 install -r dpss-exp3-VC-BNF/requirement_torch18.txt
Tip: You can also set up your own environment depending on the CUDA version you have. We recommend PyTorch 1.9.0 with the matching CUDA version to avoid bugs.
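As a quick sanity check after installation, you can confirm that the intended PyTorch build is active and that your GPU is visible (a minimal sketch; the expected version string depends on the build you installed):

# Sanity check: print the installed PyTorch version and CUDA visibility.
import torch

print(torch.__version__)          # e.g. 1.9.0+cu111, depending on your build
print(torch.cuda.is_available())  # True if a usable GPU/CUDA setup is found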
- Download the bzn/mst-male/mst-female corpus from here
- Extract the dataset (via tar -xzvf sub_dataset.tar.gz) and organize your data directories as follows:
dataset/
├── mst-female
├── mst-male
└── bzn
- Download the pretrained ASR model from here
- Move final.pt to ./pretrained_model/asr_model
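To confirm the ASR checkpoint is in place and readable, you can try loading it (a minimal sketch, assuming final.pt is an ordinary PyTorch checkpoint; CPU is fine for this):

# Sanity check: load the pretrained ASR checkpoint without a GPU.
import torch

ckpt = torch.load('./pretrained_model/asr_model/final.pt', map_location='cpu')
print(type(ckpt))  # typically a state_dict (an OrderedDict of tensors)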
Extract features for the target speaker (here, bzn):
CUDA_VISIBLE_DEVICES=0 python preprocess.py --data_dir /path/to/dataset/bzn --save_dir /path/to/save_data/bzn/
Your extracted features will be organized as follows:
bzn/
├── dev_meta.csv
├── f0s
│ ├── bzn_000001.npy
│ ├── ...
├── linears
│ ├── bzn_000001.npy
│ ├── ...
├── mels
│ ├── bzn_000001.npy
│ ├── ...
├── BNFs
│ ├── bzn_000001.npy
│ ├── ...
├── test_meta.csv
└── train_meta.csv
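The extracted features are plain NumPy arrays, so you can inspect them directly (a minimal sketch; the exact shapes depend on the preprocessing configuration):

# Inspect one utterance's extracted features; f0 is per-frame, while
# mels/linears/BNFs are (frames, dims) -- exact dims depend on the config.
import numpy as np

base = '/path/to/save_data/bzn'
utt = 'bzn_000001'
for feat in ('f0s', 'mels', 'linears', 'BNFs'):
    arr = np.load(f'{base}/{feat}/{utt}.npy')
    print(feat, arr.shape, arr.dtype)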
Tip: If you get 'Could not find a version for torch==1.9.0+cu111', run the following command to solve the problem. For more details, refer to: https://jishuin.proginn.com/p/763bfbd5e54b.
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
If you have a GPU (one typical GPU is enough, about 1 s/batch):
CUDA_VISIBLE_DEVICES=0 python train_to_one.py --model_dir ./exps/model_dir_to_bzn --test_dir ./exps/test_dir_to_bzn --data_dir /path/to/save_data/bzn/
If you have no GPU (about 5 s/batch):
python train_to_one.py --model_dir ./exps/model_dir_to_bzn --test_dir ./exps/test_dir_to_bzn --data_dir /path/to/save_data/bzn/
After training, convert a source utterance to the target speaker's voice:
CUDA_VISIBLE_DEVICES=0 python inference_to_one.py --src_wav /path/to/source/xx.wav --ckpt ./exps/model_dir_to_bzn/bnf-vc-to-one-49.pt --save_dir ./test_dir/
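To verify that the conversion ran, you can list and inspect the generated audio in the save directory (a sketch assuming the script writes .wav files there; requires the soundfile package):

# List converted wavs and print their duration and sample rate.
import glob
import soundfile as sf  # assumed available; pip install soundfile if not

for path in sorted(glob.glob('./test_dir/*.wav')):
    audio, sr = sf.read(path)
    print(path, f'{len(audio) / sr:.2f} s @ {sr} Hz')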
# In the any-to-many VC task, we use all 3 speakers above as the target speaker set.
CUDA_VISIBLE_DEVICES=0 python preprocess.py --data_dir /path/to/dataset/ --save_dir /path/to/save_data/exp3-data-all
Your extracted features will be organized as follows:
exp3-data-all/
├── dev_meta.csv
├── f0s
│ ├── bzn_000001.npy
│ ├── ...
├── linears
│ ├── bzn_000001.npy
│ ├── ...
├── mels
│ ├── bzn_000001.npy
│ ├── ...
├── BNFs
│ ├── bzn_000001.npy
│ ├── ...
├── test_meta.csv
└── train_meta.csv
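Before training the any-to-many model, you may want to confirm that all three speakers made it into the merged feature set (a sketch assuming the filenames keep the speaker prefix, as in bzn_000001.npy):

# Count extracted utterances per speaker by grouping on the filename prefix.
import glob
import os
from collections import Counter

counts = Counter(
    os.path.basename(p).split('_')[0]
    for p in glob.glob('/path/to/save_data/exp3-data-all/mels/*.npy')
)
print(counts)  # expect entries for bzn, mst-male and mst-female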
If you have a GPU (one typical GPU is enough, about 1 s/batch):
CUDA_VISIBLE_DEVICES=0 python train_to_many.py --model_dir ./exps/model_dir_to_many --test_dir ./exps/test_dir_to_many --data_dir /path/to/save_data/exp3-data-all
If you have no GPU (about 5 s/batch):
python train_to_many.py --model_dir ./exps/model_dir_to_many --test_dir ./exps/test_dir_to_many --data_dir /path/to/save_data/exp3-data-all
# Here we use 'mst-male' as the target speaker for inference. You can set the tgt_spk argument to any of the 3 speakers above (bzn, mst-female or mst-male).
CUDA_VISIBLE_DEVICES=0 python inference_to_many.py --src_wav /path/to/source/*.wav --tgt_spk mst-male --ckpt ./exps/model_dir_to_many/bnf-vc-to-many-49.pt --save_dir ./test_dir/
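If you want to convert a whole directory of source utterances, a simple driver loop works (a sketch; it assumes inference_to_many.py accepts one --src_wav per call, matching the command above):

# Batch driver: run inference_to_many.py once per source wav.
import glob
import subprocess

for wav in sorted(glob.glob('/path/to/source/*.wav')):
    subprocess.run(
        ['python', 'inference_to_many.py',
         '--src_wav', wav,
         '--tgt_spk', 'mst-male',
         '--ckpt', './exps/model_dir_to_many/bnf-vc-to-many-49.pt',
         '--save_dir', './test_dir/'],
        check=True,  # stop on the first failing conversion
    )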
This project is a vanilla voice conversion system based on BNFs (bottleneck features extracted with the pretrained ASR model).
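Conceptually, the pipeline looks like the sketch below (all function names are hypothetical, for intuition only; see the actual scripts for the real interfaces):

# Conceptual BNF-based VC pipeline:
# 1. a pretrained ASR encoder maps source speech to largely speaker-independent BNFs;
# 2. a conversion model trained on the target speaker(s) maps BNFs to mel features;
# 3. a vocoder turns the mel features back into a waveform.
def convert(src_wav, asr_encoder, conversion_model, vocoder):
    bnfs = asr_encoder(src_wav)    # content features from the source utterance
    mel = conversion_model(bnfs)   # acoustic features in the target voice
    return vocoder(mel)            # converted waveform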
When you encounter problems while finishing your project, search the existing issues first to see whether a similar problem has already been reported. If not, you can open a new issue and state your problem clearly.