This is an evolving repo for the paper "Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey".

imxtx/awesome-controllabe-speech-synthesis

Awesome Controllable Speech Synthesis

This is an evolving repo for the paper Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey.

Figures: summary; pipeline; control strategies.

🚀 Non-autoregressive Controllable TTS

Method ZS Pit. Ene. Spe. Pro. Tim. Emo. Env. Des. Acoustic Model Vocoder Acoustic Feature Release Time
FastSpeech Transformer WaveGlow MelS 2019.05
DWAPI DNN Straight MelS + F0 + Intensity 2020.04
FastSpeech 2 Transformer Parallel WaveGAN MelS 2020.06
FastPitch Transformer WaveGlow MelS 2020.06
Parallel Tacotron Transformer + CNN WaveRNN MelS 2020.10
StyleTagging-TTS Transformer + CNN HiFi-GAN MelS 2021.04
SC-GlowTTS Transformer + Conv HiFi-GAN MelS 2021.06
Meta-StyleSpeech Transformer MelGAN MelS 2021.06
DelightfulTTS Transformer + CNN HiFiNet MelS 2021.11
YourTTS Transformer HiFi-GAN LinS 2021.12
DiffGAN-TTS Diffusion + GAN HiFi-GAN MelS 2022.01
StyleTTS CNN + RNN + GAN HiFi-GAN MelS 2022.05
GenerSpeech Transformer + Flow-based HiFi-GAN MelS 2022.05
NaturalSpeech 2 Diffusion Codec Decoder Token 2022.05
Cauliflow BERT + Flow UP WaveNet MelS 2022.06
CLONE Transformer + CNN WaveNet MelS + LinS 2022.07
PromptTTS Transformer HiFi-GAN MelS 2022.11
Grad-StyleSpeech Score-based Diffusion HiFi-GAN MelS 2022.11
PromptStyle VITS HiFi-GAN MelS 2023.05
StyleTTS 2 Diffusion + GAN HiFi-GAN / iSTFTNet MelS 2023.06
VoiceBox Flow Matching Diffusion HiFi-GAN MelS 2023.06
MegaTTS 2 Diffusion + GAN HiFi-GAN MelS 2023.07
PromptTTS 2 Diffusion Codec Decoder Token 2023.09
VoiceLDM Diffusion HiFi-GAN MelS 2023.09
DurIAN-E CNN + RNN HiFi-GAN MelS 2023.09
PromptTTS++ Transformer + Diffusion BigVGAN MelS 2023.09
SpeechFlow Flow Matching Diffusion HiFi-GAN MelS 2023.10
P-Flow Flow Matching HiFi-GAN MelS 2023.10
E3 TTS Diffusion / Waveform 2023.11
HierSpeech++ Hierarchical Conditional VAE BigVGAN MelS 2023.11
Audiobox Flow Matching EnCodec MelS 2023.12
FlashSpeech Latent Consistency Model EnCodec Token 2024.04
NaturalSpeech 3 Diffusion EnCodec Token 2024.04
InstructTTS Transformer + Diffusion HiFi-GAN Token 2024.05
ControlSpeech Transformer + Diffusion FACodec Decoder Token 2024.06
AST-LDM Diffusion HiFi-GAN MelS 2024.06
SimpleSpeech Transformer Diffusion SQ Decoder Token 2024.06
DiTTo-TTS DiT BigVGAN Token 2024.06
E2 TTS Flow Matching Transformer BigVGAN MelS 2024.06
MobileSpeech ConFormer Decoder Vocos Token 2024.06
DEX-TTS Diffusion HiFi-GAN MelS 2024.06
ArtSpeech RNN + CNN HiFi-GAN MelS 2024.07
CCSP Diffusion Codec Decoder Token 2024.07
SimpleSpeech 2 Flow-based Transformer Diffusion SQ Decoder Token 2024.08
E1 TTS DiT BigVGAN Token 2024.09
VoiceGuider Diffusion BigVGAN MelS 2024.09
StyleTTS-ZS Diffusion + GAN HiFi-GAN / iSTFTNet Token 2024.09
NansyTTS Transformer NANSY++ MelS 2024.09
NanoVoice Diffusion BigVGAN MelS 2024.09
MS$^{2}$KU-VTTS Diffusion BigVGAN MelS 2024.10
MaskGCT Masked Generative Transformers DAC + Vocos Token 2024.10

Abbreviations: Z(ero-)S(hot), Pit(ch), Ene(rgy)=Volume=Loudness, Spe(ed)=Duration, Pro(sody), Tim(bre), Emo(tion), Env(ironment), Des(cription). Timbre involves gender and age. MelS and LinS represent Mel Spectrogram and Linear Spectrogram respectively.

🎞️ Autoregressive Controllable TTS

Method ZS Pit. Ene. Spe. Pro. Tim. Emo. Env. Des. Acoustic Model Vocoder Acoustic Feature Release Time
Prosody-Tacotron RNN WaveNet MelS 2018.03
GST-Tacotron CNN + RNN Griffin-Lim LinS 2018.03
GMVAE-Tacotron CNN + RNN WaveRNN MelS 2018.12
VAE-Tacotron CNN + RNN WaveNet MelS 2019.02
DurIAN CNN + RNN MB-WaveRNN MelS 2019.09
Flowtron CNN + RNN WaveGlow MelS 2020.07
MsEmoTTS CNN + RNN WaveRNN MelS 2022.01
VALL-E LLM EnCodec Token 2023.01
SpearTTS LLM SoundStream Token 2023.02
VALL-E X LLM EnCodec Token 2023.03
Make-a-voice LLM BigVGAN Token 2023.05
TorToise Transformer + DDPM UnivNet MelS 2023.05
MegaTTS LLM + GAN HiFi-GAN MelS 2023.06
SC VALL-E LLM EnCodec Token 2023.07
Salle LLM Codec Decoder Token 2023.08
UniAudio LLM EnCodec Token 2023.10
ELLA-V LLM EnCodec Token 2024.01
BaseTTS LLM UnivNet Token 2024.02
ClaM-TTS LLM BigVGAN MelS+Token 2024.04
RALL-E LLM SoundStream Token 2024.05
ARDiT Decoder-only Diffusion Transformer BigVGAN MelS 2024.06
VALL-E R LLM Vocos Token 2024.06
VALL-E 2 LLM Vocos Token 2024.06
Seed-TTS LLM + Diffusion Transformer / Token 2024.06
VoiceCraft LLM HiFi-GAN Token 2024.06
XTTS LLM + GAN HiFi-GAN MelS+Token 2024.06
CosyVoice LLM + Conditional Flow Matching HiFi-GAN Token 2024.07
MELLE LLM HiFi-GAN MelS 2024.07
Bailing TTS LLM + Diffusion Transformer / Token 2024.08
VoxInstruct LLM Vocos Token 2024.08
Emo-DPO LLM HiFi-GAN Token 2024.09
FireRedTTS LLM + Conditional Flow Matching BigVGAN-v2 Token 2024.09
CoFi-Speech LLM BigVGAN Token 2024.09
Takin LLM HiFi-Codec Token 2024.09
HALL-E LLM EnCodec Token 2024.10

Abbreviations: Z(ero-)S(hot), Pit(ch), Ene(rgy)=Volume=Loudness, Spe(ed)=Duration, Pro(sody), Tim(bre), Emo(tion), Env(ironment), Des(cription). Timbre involves gender and age. MelS and LinS represent Mel Spectrogram and Linear Spectrogram respectively.

💾 Datasets

A summary of open-source datasets for controllable TTS:

Dataset Hours #Speakers Labels (Pit. Ene. Spe. Age Gen. Emo. Emp. Acc. Top. Des. Env. Dia.) Lang Release Time
Taskmaster-1 / / en 2019.09
Libri-light 60,000 9,722 en 2019.12
AISHELL-3 85 218 zh 2020.10
ESD 29 10 en,zh 2021.05
GigaSpeech 10,000 / en 2021.06
WenetSpeech 10,000 / zh 2021.07
PromptSpeech / / en 2022.11
DailyTalk 20 2 en 2023.05
TextrolSpeech 330 1,324 en 2023.08
VoiceLDM / / en 2023.09
VccmDataset 330 1,324 en 2024.06
MSceneSpeech 13 13 zh 2024.07
SpeechCraft 2,391 3,200 en,zh 2024.08

Abbreviations: Pit(ch), Ene(rgy)=Volume=Loudness, Spe(ed)=Duration, Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).

📏 Evaluation Metrics

Common objective and subjective evaluation metrics:

Metric Type Eval Target GT Required
MCD Objective Acoustic similarity
PESQ Objective Perceptual quality
WER Objective Intelligibility
MOS Subjective Preference
CMOS Subjective Preference
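
As an illustration of the WER row above, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. In TTS evaluation the hypothesis usually comes from running an ASR model on the synthesized audio; that step is not shown. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions made for this example, not something prescribed by the survey.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. Published results typically apply a fixed text normalizer before scoring, so numbers from this sketch are only comparable under the same normalization.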

📚 Citations

@article{xie2024controllablespeechsynthesisera,
    title={Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey},
    author={Tianxin Xie and Yan Rong and Pengfei Zhang and Li Liu},
    journal={arXiv preprint arXiv:2412.06602},
    year={2024}
}
