This is an evolving repo for the paper "Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey".
Method | ZS | Pit. | Ene. | Spe. | Pro. | Tim. | Emo. | Env. | Des. | Acoustic Model | Vocoder | Acoustic Feature | Release Time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FastSpeech | ✓ | ✓ | Transformer | WaveGlow | MelS | 2019.05 | |||||||
DWAPI | ✓ | ✓ | ✓ | DNN | Straight | MelS + F0 + Intensity | 2020.04 | ||||||
FastSpeech 2 | ✓ | ✓ | ✓ | ✓ | Transformer | Parallel WaveGAN | MelS | 2020.06 | |||||
FastPitch | ✓ | ✓ | Transformer | WaveGlow | MelS | 2020.06 | |||||||
Parallel Tacotron | ✓ | Transformer + CNN | WaveRNN | MelS | 2020.10 | ||||||||
StyleTagging-TTS | ✓ | ✓ | ✓ | Transformer + CNN | HiFi-GAN | MelS | 2021.04 | ||||||
SC-GlowTTS | ✓ | ✓ | Transformer + Conv | HiFi-GAN | MelS | 2021.06 | |||||||
Meta-StyleSpeech | ✓ | ✓ | Transformer | MelGAN | MelS | 2021.06 | |||||||
DelightfulTTS | ✓ | ✓ | ✓ | Transformer + CNN | HiFiNet | MelS | 2021.11 | ||||||
YourTTS | ✓ | ✓ | Transformer | HiFi-GAN | LinS | 2021.12 | |||||||
DiffGAN-TTS | ✓ | ✓ | ✓ | Diffusion + GAN | HiFi-GAN | MelS | 2022.01 | ||||||
StyleTTS | ✓ | ✓ | CNN + RNN + GAN | HiFi-GAN | MelS | 2022.05 | |||||||
GenerSpeech | ✓ | ✓ | Transformer + Flow-based | HiFi-GAN | MelS | 2022.05 | |||||||
NaturalSpeech 2 | ✓ | ✓ | Diffusion | Codec Decoder | Token | 2022.05 | |||||||
Cauliflow | ✓ | ✓ | BERT + Flow | UP WaveNet | MelS | 2022.06 | |||||||
CLONE | ✓ | ✓ | ✓ | Transformer + CNN | WaveNet | MelS + LinS | 2022.07 | ||||||
PromptTTS | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Transformer | HiFi-GAN | MelS | 2022.11 | |||
Grad-StyleSpeech | ✓ | ✓ | Score-based Diffusion | HiFi-GAN | MelS | 2022.11 | |||||||
PromptStyle | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | VITS | HiFi-GAN | MelS | 2023.05 | |||
StyleTTS 2 | ✓ | ✓ | ✓ | ✓ | Diffusion + GAN | HiFi-GAN / iSTFTNet | MelS | 2023.06 | |||||
VoiceBox | ✓ | ✓ | Flow Matching Diffusion | HiFi-GAN | MelS | 2023.06 | |||||||
MegaTTS 2 | ✓ | ✓ | ✓ | Diffusion + GAN | HiFi-GAN | MelS | 2023.07 | ||||||
PromptTTS 2 | ✓ | ✓ | ✓ | ✓ | ✓ | Diffusion | Codec Decoder | Token | 2023.09 | ||||
VoiceLDM | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Diffusion | HiFi-GAN | MelS | 2023.09 | |||
DurIAN-E | ✓ | ✓ | ✓ | CNN + RNN | HiFi-GAN | MelS | 2023.09 | ||||||
PromptTTS++ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Transformer + Diffusion | BigVGAN | MelS | 2023.09 | |||
SpeechFlow | ✓ | ✓ | Flow Matching Diffusion | HiFi-GAN | MelS | 2023.10 | |||||||
P-Flow | ✓ | ✓ | Flow Matching | HiFi-GAN | MelS | 2023.10 | |||||||
E3 TTS | ✓ | ✓ | Diffusion | / | Waveform | 2023.11 | |||||||
HierSpeech++ | ✓ | ✓ | Hierarchical Conditional VAE | BigVGAN | MelS | 2023.11 | |||||||
Audiobox | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Flow Matching | EnCodec | MelS | 2023.12 | ||
FlashSpeech | ✓ | ✓ | Latent Consistency Model | EnCodec | Token | 2024.04 | |||||||
NaturalSpeech 3 | ✓ | ✓ | ✓ | ✓ | Diffusion | EnCodec | Token | 2024.04 | |||||
InstructTTS | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Transformer + Diffusion | HiFi-GAN | Token | 2024.05 | |||
ControlSpeech | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Transformer + Diffusion | FACodec Decoder | Token | 2024.06 | |
AST-LDM | ✓ | ✓ | ✓ | Diffusion | HiFi-GAN | MelS | 2024.06 | ||||||
SimpleSpeech | ✓ | ✓ | Transformer Diffusion | SQ Decoder | Token | 2024.06 | |||||||
DiTTo-TTS | ✓ | ✓ | ✓ | DiT | BigVGAN | Token | 2024.06 | ||||||
E2 TTS | ✓ | ✓ | Flow Matching Transformer | BigVGAN | MelS | 2024.06 | |||||||
MobileSpeech | ✓ | ✓ | ConFormer Decoder | Vocos | Token | 2024.06 | |||||||
DEX-TTS | ✓ | ✓ | Diffusion | HiFi-GAN | MelS | 2024.06 | |||||||
ArtSpeech | ✓ | ✓ | RNN + CNN | HiFi-GAN | MelS | 2024.07 | |||||||
CCSP | ✓ | ✓ | Diffusion | Codec Decoder | Token | 2024.07 | |||||||
SimpleSpeech 2 | ✓ | ✓ | ✓ | Flow-based Transformer Diffusion | SQ Decoder | Token | 2024.08 | ||||||
E1 TTS | ✓ | ✓ | DiT | BigVGAN | Token | 2024.09 | |||||||
VoiceGuider | ✓ | ✓ | Diffusion | BigVGAN | MelS | 2024.09 | |||||||
StyleTTS-ZS | ✓ | ✓ | Diffusion + GAN | HiFi-GAN / iSTFTNet | Token | 2024.09 | |||||||
NansyTTS | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Transformer | NANSY++ | MelS | 2024.09 | |||
NanoVoice | ✓ | ✓ | Diffusion | BigVGAN | MelS | 2024.09 | |||||||
MS$^{2}$KU-VTTS | ✓ | ✓ | Diffusion | BigVGAN | MelS | 2024.10 | |||||||
MaskGCT | ✓ | ✓ | ✓ | Masked Generative Transformers | DAC + Vocos | Token | 2024.10 |
Abbreviations: Z(ero-)S(hot), Pit(ch), Ene(rgy)=Volume=Loudness, Spe(ed)=Duration, Pro(sody), Tim(bre), Emo(tion), Env(ironment), Des(cription). Timbre involves gender and age. MelS and LinS represent Mel Spectrogram and Linear Spectrogram respectively.
Method | ZS | Pit. | Ene. | Spe. | Pro. | Tim. | Emo. | Env. | Des. | Acoustic Model | Vocoder | Acoustic Feature | Release Time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Prosody-Tacotron | ✓ | ✓ | RNN | WaveNet | MelS | 2018.03 | |||||||
GST-Tacotron | ✓ | ✓ | CNN + RNN | Griffin-Lim | LinS | 2018.03 | |||||||
GMVAE-Tacotron | ✓ | ✓ | ✓ | ✓ | CNN + RNN | WaveRNN | MelS | 2018.12 | |||||
VAE-Tacotron | ✓ | ✓ | ✓ | CNN + RNN | WaveNet | MelS | 2019.02 | ||||||
DurIAN | ✓ | ✓ | ✓ | CNN + RNN | MB-WaveRNN | MelS | 2019.09 | ||||||
Flowtron | ✓ | ✓ | ✓ | CNN + RNN | WaveGlow | MelS | 2020.07 | ||||||
MsEmoTTS | ✓ | ✓ | ✓ | CNN + RNN | WaveRNN | MelS | 2022.01 | ||||||
VALL-E | ✓ | ✓ | LLM | EnCodec | Token | 2023.01 | |||||||
SpearTTS | ✓ | ✓ | LLM | SoundStream | Token | 2023.02 | |||||||
VALL-E X | ✓ | ✓ | LLM | EnCodec | Token | 2023.03 | |||||||
Make-a-voice | ✓ | ✓ | LLM | BigVGAN | Token | 2023.05 | |||||||
TorToise | ✓ | Transformer + DDPM | UnivNet | MelS | 2023.05 | ||||||||
MegaTTS | ✓ | ✓ | LLM + GAN | HiFi-GAN | MelS | 2023.06 | |||||||
SC VALL-E | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | LLM | EnCodec | Token | 2023.07 | |||
Salle | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | LLM | Codec Decoder | Token | 2023.08 | ||
UniAudio | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | LLM | EnCodec | Token | 2023.10 | |||
ELLA-V | ✓ | ✓ | LLM | EnCodec | Token | 2024.01 | |||||||
BaseTTS | ✓ | ✓ | LLM | UnivNet | Token | 2024.02 | |||||||
ClaM-TTS | ✓ | ✓ | LLM | BigVGAN | MelS+Token | 2024.04 | |||||||
RALL-E | ✓ | ✓ | LLM | SoundStream | Token | 2024.05 | |||||||
ARDiT | ✓ | ✓ | ✓ | Decoder-only Diffusion Transformer | BigVGAN | MelS | 2024.06 | ||||||
VALL-E R | ✓ | ✓ | LLM | Vocos | Token | 2024.06 | |||||||
VALL-E 2 | ✓ | ✓ | LLM | Vocos | Token | 2024.06 | |||||||
Seed-TTS | ✓ | ✓ | ✓ | LLM + Diffusion Transformer | / | Token | 2024.06 | ||||||
VoiceCraft | ✓ | ✓ | LLM | HiFi-GAN | Token | 2024.06 | |||||||
XTTS | ✓ | ✓ | LLM + GAN | HiFi-GAN | MelS+Token | 2024.06 | |||||||
CosyVoice | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | LLM + Conditional Flow Matching | HiFi-GAN | Token | 2024.07 | ||
MELLE | ✓ | ✓ | LLM | HiFi-GAN | MelS | 2024.07 | |||||||
Bailing TTS | ✓ | ✓ | LLM + Diffusion Transformer | / | Token | 2024.08 | |||||||
VoxInstruct | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | LLM | Vocos | Token | 2024.08 | |
Emo-DPO | ✓ | ✓ | LLM | HiFi-GAN | Token | 2024.09 | |||||||
FireRedTTS | ✓ | ✓ | ✓ | LLM + Conditional Flow Matching | BigVGAN-v2 | Token | 2024.09 | ||||||
CoFi-Speech | ✓ | ✓ | LLM | BigVGAN | Token | 2024.09 | |||||||
Takin | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | LLM | HiFi-Codec | Token | 2024.09 | ||
HALL-E | ✓ | ✓ | LLM | EnCodec | Token | 2024.10 |
Abbreviations: Z(ero-)S(hot), Pit(ch), Ene(rgy)=Volume=Loudness, Spe(ed)=Duration, Pro(sody), Tim(bre), Emo(tion), Env(ironment), Des(cription). Timbre involves gender and age. MelS and LinS represent Mel Spectrogram and Linear Spectrogram respectively.
A summary of open-source datasets for controllable TTS:
Dataset | Hours | #Speakers | Pit. | Ene. | Spe. | Age | Gen. | Emo. | Emp. | Acc. | Top. | Des. | Env. | Dia. | Lang | Release Time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Taskmaster-1 | / | / | ✓ | en | 2019.09 | |||||||||||
Libri-light | 60,000 | 9,722 | ✓ | en | 2019.12 | |||||||||||
AISHELL-3 | 85 | 218 | ✓ | ✓ | ✓ | zh | 2020.10 | |||||||||
ESD | 29 | 10 | ✓ | en,zh | 2021.05 | |||||||||||
GigaSpeech | 10,000 | / | ✓ | en | 2021.06 | |||||||||||
WenetSpeech | 10,000 | / | ✓ | zh | 2021.07 | |||||||||||
PromptSpeech | / | / | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2022.11 | |||||||
DailyTalk | 20 | 2 | ✓ | ✓ | ✓ | en | 2023.05 | |||||||||
TextrolSpeech | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2023.08 | ||||||
VoiceLDM | / | / | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2023.09 | |||||||
VccmDataset | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024.06 | ||||||
MSceneSpeech | 13 | 13 | ✓ | zh | 2024.07 | |||||||||||
SpeechCraft | 2,391 | 3,200 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en,zh | 2024.08 |
Abbreviations: Pit(ch), Ene(rgy)=volume=loudness, Spe(ed)=duration, Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).
Common objective and subjective evaluation metrics:
Metric | Type | Eval Target | GT Required |
---|---|---|---|
MCD | Objective | Acoustic similarity | ✓ |
PESQ | Objective | Perceptual quality | ✓ |
WER | Objective | Intelligibility | ✓ |
MOS | Subjective | Preference | |
CMOS | Subjective | Preference | |
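
To make the two objective metrics that have closed-form definitions concrete, here is a minimal Python sketch of WER and MCD. It is an illustration only, not the survey's evaluation code: the function names `word_error_rate` and `mel_cepstral_distortion` are hypothetical, WER is computed on whitespace-split words without text normalization, and MCD assumes the two mel-cepstral sequences are already time-aligned (for example by DTW) and that the caller has dropped the 0th (energy) coefficient if desired.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def mel_cepstral_distortion(
    ref_frames: Sequence[Sequence[float]],
    syn_frames: Sequence[Sequence[float]],
) -> float:
    """Mean MCD in dB over frames that are assumed to be time-aligned."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("both sequences need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_diff = sum((a - b) ** 2 for a, b in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(mel_cepstral_distortion([[0.0, 1.0]], [[0.0, 0.0]]))
```

Both functions need the ground-truth transcript or recording, which is why the table marks these metrics as GT-required. Lower is better for both.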
@article{xie2024controllablespeechsynthesisera,
title={Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey},
author={Tianxin Xie and Yan Rong and Pengfei Zhang and Li Liu},
journal={arXiv preprint arXiv:2412.06602},
year={2024},
}