Quickstart

Audio Samples

The audio example webpage is still under development

Quickstart

DATASET refers to the names of datasets such as LJSpeech and VCTK in the following documents.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You can send me a private message to obtain the processed model and put them in

``output/ckpt

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported, try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --model MODEL --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the volume by 20 % by

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Please note that the controllability is originated from FastSpeech2 and not a vital interest of Ours.

Training

Datasets

The supported datasets are

Preprocessing

For a multi-speaker TTS with external speaker embedder, download ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and locate it in ./deepspeaker/pretrained_models/.
Run
```
python3 prepare_align.py --dataset DATASET
```
for some preparations.

For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternately, you can run the aligner by yourself.

After that, run the preprocessing script by
```
python3 preprocess.py --dataset DATASET
```

Training

You can train model:

Training :

Train with

python3 train.py --model naive --dataset DATASET

TensorBoard

Use

tensorboard --logdir output/log/DATASET

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audios are shown.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
__pycache__		__pycache__
audio		audio
config		config
deepspeaker		deepspeaker
hifigan		hifigan
lexicon		lexicon
model		model
preprocessor		preprocessor
text		text
utils		utils
=1.20.3		=1.20.3
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
dataset.py		dataset.py
evaluate.py		evaluate.py
prepare_align.py		prepare_align.py
preprocess.py		preprocess.py
requirements.txt		requirements.txt
synthesize.py		synthesize.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Audio Samples

Quickstart

Dependencies

Inference

Batch Inference

Controllability

Training

Datasets

Preprocessing

Training

TensorBoard

About

Releases

Packages

Languages

License

ICIG/MIXDIFF

Folders and files

Latest commit

History

Repository files navigation

Audio Samples

Quickstart

Dependencies

Inference

Batch Inference

Controllability

Training

Datasets

Preprocessing

Training

TensorBoard

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages