This is an evolving repo for the survey: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey.
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field.
If you find our survey useful for your research, please consider 📚citing📚 our EMNLP 2025 main conference paper.
🎞️ Video introductions for our paper: YouTube (English), Bilibili (Chinese)
💖 Let’s make it better together:
- If you find any mistakes, please don’t hesitate to open an issue.
- If you find this project helpful, please consider giving it a ⭐ on GitHub to stay updated.
- Pull requests are always welcome if you’d like to contribute papers to this repo.
- [2025-09-29] Our paper has been accepted to the EMNLP 2025 Main Conference. We look forward to seeing you in Suzhou, China!
- Follow-up Papers
- Non-autoregressive Controllable TTS
- Autoregressive Controllable TTS
- Datasets
- Evaluation
- Star History
- Kojima, Atsushi, Yusuke Fujita, Hao Shi, Tomoya Mizumoto, Mengjie Zhao, and Yui Sudo. "Conversation Context-aware Direct Preference Optimization for Style-Controlled Speech Synthesis." APSIPA ASC 2025. [2025.12]
- Okamoto, Umi, Sei Ueno, and Akinobu Lee. "Face-conditioned Large-scale Text-to-Speech via Speaker Embedding Prediction from Facial Images." APSIPA ASC 2025. [2025.12]
- Yin, Kang, Chunyu Qiang, Sirui Zhao, Xiaopeng Wang, Yuzhe Liang, Pengfei Cai, Tong Xu, Chen Zhang, and Enhong Chen. "DMP-TTS: Disentangled Multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance." (2025). Demo [2025.12]
- Wang, Cong, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, and Ya Li. "RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS." arXiv preprint arXiv:2512.04552 (2025). [2025.12]
- R.-G. Bolborici and A. Neacşu, "Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance," 2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Cluj-Napoca, Romania, 2025, pp. 86-91. Demo [2025.11]
- Qiang, Chunyu, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang et al. "InstructAudio: Unified speech and music generation with natural language instruction." arXiv preprint arXiv:2511.18487 (2025). Demo [2025.11]
- Seung-Bin Kim, Jun-Hyeok Cha, Hyung-Seok Oh, Heejin Choi, and Seong-Whan Lee. 2025. FillerSpeech: Towards Human-Like Text-to-Speech Synthesis with Filler Insertion and Filler Style Control. EMNLP 2025. Demo [2025.11]
- Yejin Jeon, Youngjae Kim, Jihyun Lee, Hyounghun Kim, and Gary Lee. 2025. Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis. EMNLP 2025 Findings. [2025.11]
- Yifu Chen, Shengpeng Ji, Ziqing Wang, Hanting Wang, and Zhou Zhao. 2025. InteractSpeech: A Speech Dialogue Interaction Corpus for Spoken Dialogue Model. EMNLP 2025 Findings. Demo [2025.11]
- Shehzeen Samarah Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Roy Fejgin, Mikyas T. Desta, Rafael Valle, and Jason Li. 2025. Koel-TTS: Enhancing LLM-based Speech Generation with Preference Alignment and Classifier Free Guidance. EMNLP 2025. Demo [2025.11]
- Zhenqi Jia, Rui Liu, Berrak Sisman, and Haizhou Li. 2025. Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis. EMNLP 2025. Demo & Code [2025.11]
- Jianxing Yu, Gou Zihao, Chen Li, Zhisheng Wang, Peiji Yang, Wenqing Chen, and Jian Yin. 2025. Eliciting Implicit Acoustic Styles from Open-domain Instructions to Facilitate Fine-grained Controllable Generation of Speech. EMNLP 2025. Demo [2025.11]
- Anuj Diwan, Zhisheng Zheng, David Harwath, and Eunsol Choi. 2025. Scaling Rich Style-Prompted Text-to-Speech Datasets. EMNLP 2025. Dataset & Model & Code [2025.11]
- Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, and David Harwath. 2025. VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing. EMNLP 2025. Demo [2025.11]
- Zhao, Yiwen, Jiatong Shi, Jinchuan Tian, Yuxun Tang, Jiarui Hai, Jionghao Han, and Shinji Watanabe. "Adapting Speech Language Model to Singing Voice Synthesis." In AI for Music Workshop. Demo [2025.11]
- Yan, Chao, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Fei Tian et al. "Step-Audio-EditX Technical Report." arXiv preprint arXiv:2511.03601 (2025). Code [2025.11]
- Tu, Wenming, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen, and Zilong Zheng. "UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models." arXiv preprint arXiv:2510.22588 (2025). Demo, Code [2025.10]
- Lou, Haowei, Hye-Young Paik, Wen Hu, and Lina Yao. "ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation." arXiv preprint arXiv:2510.18308 (2025). Code [2025.10]
- Peng, Yizhou, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, and Bin Ma. "Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models." arXiv preprint arXiv:2510.13293 (2025).
- Li, Haoxun, Yu Liu, Yuqing Sun, Hanlei Shi, Leyuan Qu, and Taihao Li. "EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS." arXiv preprint arXiv:2510.05758 (2025). Demo [2025.10]
- Wang, Yue, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao et al. "BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs." arXiv preprint arXiv:2509.26514 (2025). Demo & Code [2025.09]
- Zhang, Ziyu, Hanzhao Li, Jingbin Hu, Wenhao Li, and Lei Xie. "HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis." arXiv preprint arXiv:2509.25842 (2025). Demo [2025.09]
- Wang, Tianrui, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang et al. "Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis." arXiv preprint arXiv:2509.24629 (2025). Demo [2025.09]
- Wang, Sirui, Andong Chen, and Tiejun Zhao. "Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation." arXiv preprint arXiv:2509.20378 (2025). [2025.09]
- Liu, Min, JingJing Yin, Xiang Zhang, Siyu Hao, Yanni Hu, Bin Lin, Yuan Feng, Hongbin Zhou, and Jianhao Ye. "Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook." arXiv preprint arXiv:2509.17516 (2025). Demo [2025.09]
- Lu, Ye-Xin, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, and Zhen-Hua Ling. "DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis." arXiv preprint arXiv:2509.14684 (2025). Demo [2025.09]
- Tian, Fengping, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li et al. "Marco-Voice Technical Report." arXiv preprint arXiv:2508.02038 (2025). Demo & Code [2025.08]
- Zhang, Xueyao, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, and Zhizheng Wu. "Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning." arXiv preprint arXiv:2508.16332 (2025). Demo [2025.08]
- Park, Joonyong, and Kenichi Nakamura. "EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens." arXiv preprint arXiv:2508.11273 (2025). [2025.08]
- Bauer, Judith, Frank Zalkow, Meinard Müller, and Christian Dittmar. "Explicit Emphasis Control in Text-to-Speech Synthesis." In Proc. SSW 2025, pp. 21-27. 2025. Demo
- Lemerle, Théodor, Nicolas Obin, and Axel Roebel. "Lina-Style: Word-Level Style Control in TTS via Interleaved Synthetic Data." In Proc. SSW 2025, pp. 35-39. 2025. [2025.08]
- Zhu, Boyu, Cheng Gong, Muyang Wu, Ruihao Jing, Fan Liu, Xiaolei Zhang, Chi Zhang, and Xuelong Li. "M³PDB: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation." arXiv preprint arXiv:2508.09702 (2025). Dataset [2025.08]
- Xie, Tianxin, Shan Yang, Chenxing Li, Dong Yu, and Li Liu. "EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering." arXiv preprint arXiv:2508.03543 (2025). Demo [2025.08]
- Wu, Zhuojun, Dong Liu, Juan Liu, Yechen Wang, Linxi Li, Liwei Jin, Hui Bu, Pengyuan Zhang, and Ming Li. "SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus in Mandarin for LLM-Based Speech Synthesis." ACM Multimedia, 2025. Dataset [2025.07]
- Niu, Rui, Weihao Wu, Jie Chen, Long Ma, and Zhiyong Wu. "A Multi-Stage Framework for Multimodal Controllable Speech Synthesis." arXiv preprint arXiv:2506.20945 (2025). Demo [2025.06]
- Zhou, Siyi, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. "IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech." arXiv preprint arXiv:2506.21619 (2025). Demo, Code [2025.06]
- Rong, Yan, Jinting Wang, Guangzhi Lei, Shan Yang, and Li Liu. "AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation." arXiv preprint arXiv:2505.22053 (2025). Demo [2025.05]
- Rong, Yan, Shan Yang, Guangzhi Lei, and Li Liu. "Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation." arXiv preprint arXiv:2504.11002 (2025). Demo [2025.04]
Below are representative non-autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.
NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.
- ProEmo, Zero-shot (✗), Controllability (Pitch, Energy, Emotion, Description), Transformer, HiFi-GAN, MelS, 2025.01, Code
- DrawSpeech, Zero-shot (✗), Controllability (Energy, Prosody), Diffusion, HiFi-GAN, MelS, 2025.01, Demo, Code
- DiffStyleTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre), Transformer + Diffusion, HiFi-GAN, MelS, 2025.01, Demo
- HED, Zero-shot (✓), Controllability (Emotion), Flow-based Diffusion, Vocos, MelS, 2024.12, Demo
- EmoDubber, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, Flow-based Vocoder, MelS, 2024.12, Demo
- EmoSphere++, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, BigVGAN, MelS, 2024.11, Demo, Code
- MS²KU-VTTS, Zero-shot (✗), Controllability (Environment, Description), Diffusion, BigVGAN, MelS, 2024.10
- NanoVoice, Zero-shot (✓), Controllability (Timbre), Diffusion, BigVGAN, MelS, 2024.09
- NansyTTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Transformer, NANSY++, MelS, 2024.09, Demo
- StyleTTS-ZS, Zero-shot (✓), Controllability (Timbre), Flow-based Diffusion + GAN, Mel-based Decoder, MelS, 2024.09, Demo
- E1 TTS, Zero-shot (✓), Controllability (Timbre), DiT + Flow, BigVGAN, Token + MelS, 2024.09, Demo
- SimpleSpeech 2, Zero-shot (✓), Controllability (Speed, Timbre), Flow-based DiT, SQ Codec, Token, 2024.08, Demo, Code
- CCSP, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2024.07, Demo
- ArtSpeech, Zero-shot (✓), Controllability (Timbre), RNN + CNN, HiFi-GAN, MelS, 2024.07, Demo, Code
- DEX-TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, HiFi-GAN, MelS, 2024.06, Code
- MobileSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, Vocos, Token, 2024.06, Demo
- E2 TTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, BigVGAN, MelS, 2024.06, Demo, Code (unofficial)
- DiTTo-TTS, Zero-shot (✓), Controllability (Speed, Timbre), DiT + VAE, BigVGAN, MelS, 2024.06, Demo
- SimpleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Diffusion, SQ Codec, Token, 2024.06, Demo, Code
- AST-LDM, Zero-shot (✗), Controllability (Timbre, Environment, Description), Diffusion + VAE, HiFi-GAN, MelS, 2024.06, Demo
- ControlSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, FACodec, Token, 2024.06, Demo, Code
- InstructTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, HiFi-GAN, Token, 2024.05, Demo
- NaturalSpeech 3, Zero-shot (✓), Controllability (Speed, Prosody, Timbre), Transformer + Diffusion, FACodec, Token, 2024.04, Demo
- FlashSpeech, Zero-shot (✓), Controllability (Timbre), Latent Consistency Model, EnCodec, Token, 2024.04, Demo, Code
- Audiobox, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Environment, Description), Transformer + Flow, EnCodec, MelS, 2023.12, Demo
- HierSpeech++, Zero-shot (✓), Controllability (Timbre), Transformer + VAE + Flow, BigVGAN, MelS, 2023.11, Demo, Code
- E3 TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, Not required, Waveform, 2023.11, Demo
- P-Flow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo, Code (unofficial)
- SpeechFlow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo
- PromptTTS++, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, BigVGAN, MelS, 2023.09, Demo, Code
- DuIAN-E, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, HiFi-GAN, MelS, 2023.09, Demo
- VoiceLDM, Zero-shot (✗), Controllability (Pitch, Prosody, Timbre, Emotion, Environment, Description), Diffusion, HiFi-GAN, MelS, 2023.09, Demo, Code
- PromptTTS 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Description), Diffusion, RVQ-based Codec, Latent Feature, 2023.09, Demo
- MegaTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.07, Demo, Code (unofficial)
- VoiceBox, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.06, Demo, Code (unofficial)
- StyleTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Flow-based Diffusion + GAN, HiFi-GAN / iSTFTNet, MelS, 2023.06, Demo, Code
- PromptStyle, Zero-shot (✓), Controllability (Pitch, Prosody, Timbre, Emotion, Description), VITS + Flow, HiFi-GAN, MelS, 2023.05, Demo
- NaturalSpeech 2, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2023.04, Demo, Code (unofficial)
- Grad-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Score-based Diffusion, HiFi-GAN, MelS, 2022.11, Demo
- PromptTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), BERT + Transformer, HiFi-GAN, MelS, 2022.11, Demo
- CLONE, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, WaveNet, MelS + LinS, 2022.07, Demo
- Cauliflow, Zero-shot (✗), Controllability (Speed, Prosody), BERT + Flow, UP WaveNet, MelS, 2022.06
- GenerSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2022.05, Demo
- StyleTTS, Zero-shot (✓), Controllability (Timbre), CNN + RNN, HiFi-GAN, MelS, 2022.05, Code
- YourTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, LinS, 2021.12, Demo & Checkpoint
- DelightfulTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, HiFiNet, MelS, 2021.11, Demo
- Meta-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, MelGAN, MelS, 2021.06, Code
- SC-GlowTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2021.06, Demo, Code
- StyleTagging-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Transformer + CNN, HiFi-GAN, MelS, 2021.04, Demo
- Parallel Tacotron, Zero-shot (✗), Controllability (Prosody), Transformer + CNN, WaveRNN, MelS, 2020.10, Demo
- FastPitch, Zero-shot (✗), Controllability (Pitch, Prosody), Transformer, WaveGlow, MelS, 2020.06, Code
- FastSpeech 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody), Transformer, Parallel WaveGAN, MelS, 2020.06, Code (unofficial)
- FastSpeech, Zero-shot (✗), Controllability (Speed, Prosody), Transformer, WaveGlow, MelS, 2019.05, Code (unofficial)
Below are representative autoregressive controllable TTS methods. Each entry follows the same format as above: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.
- EmoVoice, Zero-shot (✗), Controllability (Emotion, Description), Decoder-only Transformer, HiFi-GAN, Token, 2025.04, Demo
- Spark-TTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre), Decoder-only Transformer, BiCodec, Token, 2025.03, Code
- Vevo, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, BigVGAN, Token + MelS, 2025.02, Demo, Code
- Step-Audio, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Flow-based Vocoder, Token, 2025.02, Code
- FleSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Flow-based DiT, WaveGAN, Latent Feature, 2025.01, Demo
- IDEA-TTS, Zero-shot (✓), Controllability (Timbre, Environment), Transformer, Flow-based Vocoder, LinS + MelS, 2024.12, Demo, Code
- KALL-E, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer, WaveVAE, Latent Feature, 2024.12, Demo
- IST-LM, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12
- SLAM-Omni, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12, Demo, Code
- FishSpeech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Firefly-GAN, Token, 2024.11, Code
- HALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.10
- Takin, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
- Emotional Dimension Control, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
- CoFi-Speech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, BigVGAN, Token + MelS, 2024.09, Demo
- FireRedTTS, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer + Flow, BigVGAN-v2, Token + MelS, 2024.09, Demo, Code
- Emo-DPO, Zero-shot (✗), Controllability (Emotion), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.09, Demo
- VoxInstruct, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Vocos, Token, 2024.08, Demo, Code
- MELLE, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, MelS, 2024.07, Demo
- CosyVoice, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token, 2024.07, Demo, Code
- XTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN-based Vocoder, Token + MelS, 2024.06, Demo, Code
- VoiceCraft, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, Token, 2024.06, Code
- Seed-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + DiT, Unknown Vocoder, Latent Feature, 2024.06, Demo
- VALL-E 2, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo, Code (unofficial 1), Code (unofficial 2)
- VALL-E R, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo
- ARDiT, Zero-shot (✓), Controllability (Speed, Timbre), Decoder-only DiT, BigVGAN, MelS, 2024.06, Demo
- RALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2024.05, Demo
- CLaM-TTS, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, BigVGAN, Token + MelS, 2024.04, Demo
- BaseTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Speechcode Decoder, Token, 2024.02, Demo
- ELLA-V, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.01, Demo
- UniAudio, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Decoder-only Transformer, UniAudio Codec, Token, 2023.10, Demo, Code
- Salle, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, EnCodec, Token, 2023.08, Demo
- SC VALL-E, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, EnCodec, Token, 2023.07, Demo, Code
- MegaTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.06, Demo
- TorToise, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + Diffusion, UnivNet, MelS, 2023.05, Code
- Make-a-voice, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, Unit-based Vocoder, Token, 2023.05, Demo
- VALL-E X, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.03, Demo, Code (unofficial)
- SpearTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2023.02, Demo, Code (unofficial)
- VALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.01, Demo, Code (unofficial 1), Code (unofficial 2)
- MsEmoTTS, Zero-shot (✓), Controllability (Pitch, Prosody, Emotion), CNN + RNN, WaveRNN, MelS, 2022.01, Demo
- Flowtron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, WaveGlow, MelS, 2020.07, Demo, Code
- DurIAN, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, MB-WaveRNN, MelS, 2019.09, Demo, Code (unofficial)
- VAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), VAE, WaveNet, MelS, 2019.02, Code (unofficial 1), Code (unofficial 2)
- GMVAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Description), VAE, WaveRNN, MelS, 2018.12, Demo, Code (unofficial)
- GST-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), CNN + RNN, Griffin-Lim, LinS, 2018.03, Demo, Code (unofficial)
- Prosody-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), RNN, WaveNet, MelS, 2018.03, Demo
A summary of open-source datasets for controllable TTS:
| Dataset | Hours | #Speakers | Pit. | Ene. | Spe. | Age | Gen. | Emo. | Emp. | Acc. | Top. | Des. | Dia. | Lang | Release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SpeechCraft | 2,391 | 3,200 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en,zh | 2024 | ||
| Parler-TTS | 50,000 | / | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||
| MSceneSpeech | 13 | 13 | ✓ | zh | 2024 | ||||||||||
| VccmDataset | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||
| CLESC | <1 | / | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||||
| TextrolSpeech | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2023 | |||||
| DailyTalk | 20 | 2 | ✓ | ✓ | ✓ | en | 2023 | ||||||||
| MagicData-RAMC | 180 | 663 | ✓ | ✓ | zh | 2022 | |||||||||
| PromptSpeech | / | / | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2022 | ||||||
| WenetSpeech | 10,000 | / | ✓ | zh | 2021 | ||||||||||
| GigaSpeech | 10,000 | / | ✓ | en | 2021 | ||||||||||
| ESD | 29 | 10 | ✓ | en,zh | 2021 | ||||||||||
| CommonVoice | 2,500 | 50,000 | ✓ | ✓ | ✓ | multi | 2020 | ||||||||
| AISHELL-3 | 85 | 218 | ✓ | ✓ | ✓ | zh | 2020 | ||||||||
| Taskmaster-1 | / | / | ✓ | en | 2019 | ||||||||||
| CMU-MOSEI | 65 | 1,000 | ✓ | en | 2018 | ||||||||||
| RAVDESS | / | 24 | ✓ | ✓ | en | 2018 | |||||||||
| RECOLA | 3.8 | 46 | ✓ | fr | 2013 | ||||||||||
| IEMOCAP | 12 | 10 | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2008 |
Abbreviations: Pit(ch), Ene(rgy, i.e., volume), Spe(ed), Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).
| Metric | Type | Eval Target | GT Required |
|---|---|---|---|
| Mel-Cepstral Distortion (MCD) | Objective | Acoustic similarity | ✓ |
| Frequency Domain Score Difference (FDSD) | Objective | Acoustic similarity | ✓ |
| Word Error Rate (WER) | Objective | Intelligibility | ✓ |
| Cosine Similarity | Objective | Speaker similarity | ✓ |
| Perceptual Evaluation of Speech Quality (PESQ) | Objective | Perceptual quality | ✓ |
| Signal-to-Noise Ratio (SNR) | Objective | Perceptual quality | ✓ |
| Mean Opinion Score (MOS) | Subjective | Preference | |
| Comparison Mean Opinion Score (CMOS) | Subjective | Preference | |
| AB Test | Subjective | Preference | |
| ABX Test | Subjective | Perceptual similarity | ✓ |

GT: Ground truth.
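Two of the objective metrics above are simple enough to sketch directly. Below is a minimal NumPy sketch (not from the survey; function names are ours) of frame-wise MCD, assuming the reference and synthesized mel-cepstra are already time-aligned (e.g., via DTW) and the energy coefficient c0 has been dropped, plus cosine similarity between speaker embeddings:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Mean frame-wise Mel-Cepstral Distortion (MCD) in dB.

    c_ref, c_syn: (T, D) arrays of time-aligned mel-cepstral
    coefficients (c0, the energy term, excluded beforehand).
    """
    diff = c_ref - c_syn  # (T, D) per-frame coefficient differences
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def speaker_cosine_similarity(e1, e2):
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

In practice the mel-cepstra would come from a feature extractor (e.g., WORLD or librosa MFCCs) and the embeddings from a pretrained speaker verification model; lower MCD and higher cosine similarity indicate closer acoustic and speaker match, respectively.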