
ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation (Interspeech 2023)

Paper: https://arxiv.org/pdf/2305.18028.pdf

Abstract

There are significant challenges for speaker adaptation in text-to-speech for languages that are not widely spoken or for speakers with accents or dialects that are not well-represented in the training data. To address this issue, we propose the use of the "mixture of adapters" method. This approach involves adding multiple adapters within a backbone-model layer to learn the unique characteristics of different speakers. Our approach outperforms the baseline, with a noticeable improvement of 5% observed in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters (11% of the total model parameters). This is a significant achievement in parameter-efficient speaker adaptation, and one of the first models of its kind. Overall, our proposed approach offers a promising solution for speech synthesis, particularly for adapting to speakers from diverse backgrounds.
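Following the adapter paradigm here means the backbone is kept frozen and only the adapter weights receive gradient updates. Below is a minimal PyTorch sketch of that idea; the convention that adapter submodule names contain "adapter" is an illustrative assumption and may not match this codebase:

```python
import torch

def freeze_backbone_except_adapters(model: torch.nn.Module) -> None:
    """Freeze every parameter, then re-enable gradients for adapter weights only."""
    for name, param in model.named_parameters():
        # Assumed naming convention: adapter submodules have "adapter" in their name.
        param.requires_grad = "adapter" in name

# The optimizer then only updates the trainable (adapter) parameters:
# optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```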

[Figure: the MoA module (left) and the standard residual adapter (right)]

The MoA module comprises N residual adapters. Each adapter selects the k tokens closest to it and processes them; the same token can therefore be processed by multiple adapters. The outputs of the adapters are then combined. The architecture of the standard residual adapter is illustrated on the right of the figure.
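A minimal PyTorch sketch of this routing scheme, where a learned router score stands in for the "closeness" of a token to an adapter. This illustrates adapter-choice top-k routing as described above, not the repository's exact implementation; all dimensions and names are assumptions:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class MixtureOfAdapters(nn.Module):
    """N adapters; each adapter picks its k highest-scoring tokens, so one token
    may be routed to several adapters. Adapter contributions are added back to
    the token stream, weighted by the routing scores."""
    def __init__(self, d_model: int, d_bottleneck: int, n_adapters: int, k: int):
        super().__init__()
        self.adapters = nn.ModuleList(
            ResidualAdapter(d_model, d_bottleneck) for _ in range(n_adapters)
        )
        self.router = nn.Linear(d_model, n_adapters)  # token-to-adapter affinity
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model); scores: (seq_len, n_adapters)
        scores = torch.softmax(self.router(x), dim=-1)
        out = x.clone()
        for i, adapter in enumerate(self.adapters):
            top = torch.topk(scores[:, i], k=min(self.k, x.size(0)))
            chosen = top.indices                    # the k tokens this adapter picks
            delta = adapter(x[chosen]) - x[chosen]  # the adapter's residual contribution
            out[chosen] = out[chosen] + top.values.unsqueeze(-1) * delta
        return out

# Example instantiation (hypothetical sizes):
# layer = MixtureOfAdapters(d_model=256, d_bottleneck=32, n_adapters=4, k=16)
```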

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Training

Datasets

The supported datasets are

  • LTS100: the LibriTTS train-clean-100 subset (multi-speaker TTS).
  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents (multi-speaker TTS). Each speaker reads about 400 sentences selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

Preprocessing

In the following commands, DATASET refers to the name of a supported dataset, i.e. LTS or VCTK.

  • Run
    python3 prepare_align.py --dataset DATASET
    
    python3 preprocess.py --dataset DATASET
    

Training

Train your model with

python3 train.py --dataset DATASET
  • HiFi-GAN is used as the vocoder.

Inference

For multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be saved in output/result/.
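To look up a SPEAKER_ID before running the command above, you can inspect speakers.json directly. A small sketch, assuming the file maps speaker names to integer ids (the actual schema may differ), using VCTK as an example DATASET:

```python
import json

# Example path for the VCTK dataset; substitute your DATASET name.
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)

# Under the assumed schema this prints something like {"p225": 0, "p226": 1, ...}
print(speakers)
```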

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Acknowledgement

Our code builds on the https://github.com/keonlee9420/Comprehensive-Transformer-TTS repository. We thank the author for open-sourcing their code.
