ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation (Interspeech 2023)
Paper: https://arxiv.org/pdf/2305.18028.pdf
Speaker adaptation in text-to-speech is challenging for languages that are not widely spoken and for speakers whose accents or dialects are under-represented in the training data. To address this issue, we propose a "mixture of adapters" (MoA) method. This approach adds multiple adapters within a backbone-model layer to learn the unique characteristics of different speakers. Our approach outperforms the baseline, with a 5% improvement in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11% of the total model parameters), making this one of the first parameter-efficient approaches to speaker adaptation of its kind. Overall, our proposed approach offers a promising solution for speech synthesis, particularly for adapting to speakers from diverse backgrounds.
The MoA module comprises N residual adapters. Each adapter selects the k tokens closest to it and processes them; the same token can be processed by multiple adapters, and the adapter outputs are combined. The architecture of the standard residual adapter is illustrated on the right of the same diagram.
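As a concrete illustration, here is a minimal PyTorch sketch of such a layer. The router, bottleneck size, gating weights, and top-k routing details are illustrative assumptions, not the exact configuration used in the paper:

```python
# Minimal mixture-of-adapters sketch (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn


class ResidualAdapter(nn.Module):
    """Standard residual adapter: down-project, non-linearity, up-project."""

    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        # The residual connection is added by the caller.
        return self.up(torch.relu(self.down(x)))


class MixtureOfAdapters(nn.Module):
    """N adapters; each processes the k tokens scored closest to it by a
    learned router, and per-token adapter outputs are accumulated."""

    def __init__(self, d_model: int, n_adapters: int = 4, k: int = 16):
        super().__init__()
        self.adapters = nn.ModuleList(
            [ResidualAdapter(d_model) for _ in range(n_adapters)])
        self.router = nn.Linear(d_model, n_adapters)  # token-to-adapter scores
        self.k = k

    def forward(self, x):                    # x: (T, d_model), one utterance
        scores = self.router(x)              # (T, n_adapters)
        out = torch.zeros_like(x)
        k = min(self.k, x.size(0))
        for i, adapter in enumerate(self.adapters):
            # Each adapter picks its k highest-scoring tokens; a token may be
            # chosen by several adapters, and their outputs accumulate. The
            # softmax gating over the selected tokens is our assumption.
            weights, idx = scores[:, i].topk(k)
            out[idx] += weights.softmax(dim=0).unsqueeze(-1) * adapter(x[idx])
        return x + out                       # residual connection
```

For a sequence of T tokens, `MixtureOfAdapters(d_model=256)(torch.randn(50, 256))` returns a tensor of the same shape, so the module can be dropped into a backbone layer without changing its interface.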
You can install the Python dependencies with
```
pip3 install -r requirements.txt
```
The supported datasets are:
- LTS100
- VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
In the following commands, DATASET refers to the name of a dataset, such as LTS100 or VCTK.
- Run
```
python3 prepare_align.py --dataset DATASET
python3 preprocess.py --dataset DATASET
```
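For example, to preprocess VCTK:
```
python3 prepare_align.py --dataset VCTK
python3 preprocess.py --dataset VCTK
```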
Train your model with
```
python3 train.py --dataset DATASET
```
- Vocoder: HiFi-GAN.
For multi-speaker TTS, run
```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
```
The dictionary of learned speakers can be found at `preprocessed_data/DATASET/speakers.json`, and the generated utterances will be put in `output/result/`.
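If you are unsure which SPEAKER_ID values are available, a quick way to inspect them is to read `speakers.json`. This snippet assumes, as in the Comprehensive-Transformer-TTS-style preprocessing this repo builds on, that the file maps speaker names to integer IDs (VCTK is used as an example):

```python
import json

# Load the learned-speaker dictionary produced by preprocessing.
# Assumed format: {"speaker_name": integer_id, ...}.
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)

# Print each speaker name with the ID to pass as --speaker_id.
for name, idx in sorted(speakers.items(), key=lambda kv: kv[1]):
    print(idx, name)
```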
Batch inference is also supported; try
```
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
```
to synthesize all utterances in `preprocessed_data/DATASET/val.txt`.
We borrow code from the https://github.com/keonlee9420/Comprehensive-Transformer-TTS repository. We thank the author for open-sourcing their code.