This is an evolving repo for the survey: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey.
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field.
If you find our survey useful for your research, please consider 📚citing📚 our EMNLP 2025 main conference paper.
🎞️ Video introductions for our paper: YouTube (English), Bilibili (Chinese)
💖 Let’s make it better together:
- If you find any mistakes, please don’t hesitate to open an issue.
- If you find this project helpful, please consider giving it a ⭐ on GitHub to stay updated.
- Pull requests are always welcome if you’d like to contribute papers to this repo.
- [2025-09-29] Our paper has been accepted to the EMNLP 2025 Main Conference. We look forward to seeing you in Suzhou, China!
- Follow-up Papers
- Non-autoregressive Controllable TTS
- Autoregressive Controllable TTS
- Datasets
- Evaluation
- Star History
- Kojima, Atsushi, Yusuke Fujita, Hao Shi, Tomoya Mizumoto, Mengjie Zhao, and Yui Sudo. "Conversation Context-aware Direct Preference Optimization for Style-Controlled Speech Synthesis." APSIPA ASC 2025. [2025.12]
- Okamoto, Umi, Sei Ueno, and Akinobu Lee. "Face-conditioned Large-scale Text-to-Speech via Speaker Embedding Prediction from Facial Images." APSIPA ASC 2025. [2025.12]
- Yin, Kang, Chunyu Qiang, Sirui Zhao, Xiaopeng Wang, Yuzhe Liang, Pengfei Cai, Tong Xu, Chen Zhang, and Enhong Chen. "DMP-TTS: Disentangled Multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance." (2025). Demo [2025.12]
- Wang, Cong, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, and Ya Li. "RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS." arXiv preprint arXiv:2512.04552 (2025). [2025.12]
- R.-G. Bolborici and A. Neacşu, "Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance," 2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Cluj-Napoca, Romania, 2025, pp. 86-91. Demo [2025.11]
- Qiang, Chunyu, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang et al. "InstructAudio: Unified speech and music generation with natural language instruction." arXiv preprint arXiv:2511.18487 (2025). Demo [2025.11]
- Seung-Bin Kim, Jun-Hyeok Cha, Hyung-Seok Oh, Heejin Choi, and Seong-Whan Lee. 2025. FillerSpeech: Towards Human-Like Text-to-Speech Synthesis with Filler Insertion and Filler Style Control. EMNLP 2025. Demo [2025.11]
- Yejin Jeon, Youngjae Kim, Jihyun Lee, Hyounghun Kim, and Gary Lee. 2025. Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis. EMNLP 2025 Findings. [2025.11]
- Yifu Chen, Shengpeng Ji, Ziqing Wang, Hanting Wang, and Zhou Zhao. 2025. InteractSpeech: A Speech Dialogue Interaction Corpus for Spoken Dialogue Model. EMNLP 2025 Findings. Demo [2025.11]
- Shehzeen Samarah Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Roy Fejgin, Mikyas T. Desta, Rafael Valle, and Jason Li. 2025. Koel-TTS: Enhancing LLM-based Speech Generation with Preference Alignment and Classifier Free Guidance. EMNLP 2025. Demo [2025.11]
- Zhenqi Jia, Rui Liu, Berrak Sisman, and Haizhou Li. 2025. Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis. EMNLP 2025. Demo & Code [2025.11]
- Jianxing Yu, Gou Zihao, Chen Li, Zhisheng Wang, Peiji Yang, Wenqing Chen, and Jian Yin. 2025. Eliciting Implicit Acoustic Styles from Open-domain Instructions to Facilitate Fine-grained Controllable Generation of Speech. EMNLP 2025. Demo [2025.11]
- Anuj Diwan, Zhisheng Zheng, David Harwath, and Eunsol Choi. 2025. Scaling Rich Style-Prompted Text-to-Speech Datasets. EMNLP 2025. Dataset & Model & Code [2025.11]
- Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, and David Harwath. 2025. VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing. EMNLP 2025. Demo [2025.11]
- Zhao, Yiwen, Jiatong Shi, Jinchuan Tian, Yuxun Tang, Jiarui Hai, Jionghao Han, and Shinji Watanabe. "Adapting Speech Language Model to Singing Voice Synthesis." In AI for Music Workshop. Demo [2025.11]
- Yan, Chao, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Fei Tian et al. "Step-Audio-EditX Technical Report." arXiv preprint arXiv:2511.03601 (2025). Code [2025.11]
- Tu, Wenming, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen, and Zilong Zheng. "UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models." arXiv preprint arXiv:2510.22588 (2025). Demo, Code [2025.10]
- Lou, Haowei, Hye-Young Paik, Wen Hu, and Lina Yao. "ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation." arXiv preprint arXiv:2510.18308 (2025). Code [2025.10]
- Peng, Yizhou, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, and Bin Ma. "Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models." arXiv preprint arXiv:2510.13293 (2025).
- Li, Haoxun, Yu Liu, Yuqing Sun, Hanlei Shi, Leyuan Qu, and Taihao Li. "EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS." arXiv preprint arXiv:2510.05758 (2025). Demo [2025.10]
- Wang, Yue, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao et al. "BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs." arXiv preprint arXiv:2509.26514 (2025). Demo & Code [2025.09]
- Zhang, Ziyu, Hanzhao Li, Jingbin Hu, Wenhao Li, and Lei Xie. "HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis." arXiv preprint arXiv:2509.25842 (2025). Demo [2025.09]
- Wang, Tianrui, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang et al. "Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis." arXiv preprint arXiv:2509.24629 (2025). Demo [2025.09]
- Wang, Sirui, Andong Chen, and Tiejun Zhao. "Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation." arXiv preprint arXiv:2509.20378 (2025). [2025.09]
- Liu, Min, JingJing Yin, Xiang Zhang, Siyu Hao, Yanni Hu, Bin Lin, Yuan Feng, Hongbin Zhou, and Jianhao Ye. "Audiobook-CC: Controllable Long-context Speech Generation for Multicast Audiobook." arXiv preprint arXiv:2509.17516 (2025). Demo [2025.09]
- Lu, Ye-Xin, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, and Zhen-Hua Ling. "DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis." arXiv preprint arXiv:2509.14684 (2025). Demo [2025.09]
- Tian, Fengping, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li et al. "Marco-Voice Technical Report." arXiv preprint arXiv:2508.02038 (2025). Demo & Code [2025.08]
- Zhang, Xueyao, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, and Zhizheng Wu. "Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning." arXiv preprint arXiv:2508.16332 (2025). Demo [2025.08]
- Park, Joonyong, and Kenichi Nakamura. "EmoSSLSphere: Multilingual Emotional Speech Synthesis with Spherical Vectors and Discrete Speech Tokens." arXiv preprint arXiv:2508.11273 (2025). [2025.08]
- Bauer, Judith, Frank Zalkow, Meinard Müller, and Christian Dittmar. "Explicit Emphasis Control in Text-to-Speech Synthesis." In Proc. SSW 2025, pp. 21-27. 2025. Demo
- Lemerle, Théodor, Nicolas Obin, and Axel Roebel. "Lina-Style: Word-Level Style Control in TTS via Interleaved Synthetic Data." In Proc. SSW 2025, pp. 35-39. 2025. [2025.08]
- Zhu, Boyu, Cheng Gong, Muyang Wu, Ruihao Jing, Fan Liu, Xiaolei Zhang, Chi Zhang, and Xuelong Li. "M³PDB: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation." arXiv preprint arXiv:2508.09702 (2025). Dataset [2025.08]
- Xie, Tianxin, Shan Yang, Chenxing Li, Dong Yu, and Li Liu. "EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering." arXiv preprint arXiv:2508.03543 (2025). Demo [2025.08]
- Wu, Zhuojun, Dong Liu, Juan Liu, Yechen Wang, Linxi Li, Liwei Jin, Hui Bu, Pengyuan Zhang, and Ming Li. "SMIIP-NV: A Multi-Annotation Non-Verbal Expressive Speech Corpus in Mandarin for LLM-Based Speech Synthesis." ACM Multimedia, 2025. Dataset [2025.07]
- Niu, Rui, Weihao Wu, Jie Chen, Long Ma, and Zhiyong Wu. "A Multi-Stage Framework for Multimodal Controllable Speech Synthesis." arXiv preprint arXiv:2506.20945 (2025). Demo [2025.06]
- Zhou, Siyi, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. "IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech." arXiv preprint arXiv:2506.21619 (2025). Demo, Code [2025.06]
- Rong, Yan, Jinting Wang, Guangzhi Lei, Shan Yang, and Li Liu. "AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation." arXiv preprint arXiv:2505.22053 (2025). Demo [2025.05]
- Rong, Yan, Shan Yang, Guangzhi Lei, and Li Liu. "Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation." arXiv preprint arXiv:2504.11002 (2025). Demo [2025.04]
Below are representative non-autoregressive controllable TTS methods. Each entry follows this format: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.
NOTE: MelS and LinS represent Mel Spectrogram and Linear Spectrogram, respectively. Among today’s TTS systems, MelS, latent features (from VAEs, diffusion models, and other flow-based methods), and various types of discrete tokens are the most commonly used acoustic representations.
- ProEmo, Zero-shot (✗), Controllability (Pitch, Energy, Emotion, Description), Transformer, HiFi-GAN, MelS, 2025.01, Code
- DrawSpeech, Zero-shot (✗), Controllability (Energy, Prosody), Diffusion, HiFi-GAN, MelS, 2025.01, Demo, Code
- DiffStyleTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre), Transformer + Diffusion, HiFi-GAN, MelS, 2025.01, Demo
- HED, Zero-shot (✓), Controllability (Emotion), Flow-based Diffusion, Vocos, MelS, 2024.12, Demo
- EmoDubber, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, Flow-based Vocoder, MelS, 2024.12, Demo
- EmoSphere++, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Transformer + Flow, BigVGAN, MelS, 2024.11, Demo, Code
- MS²KU-VTTS, Zero-shot (✗), Controllability (Environment, Description), Diffusion, BigVGAN, MelS, 2024.10
- NanoVoice, Zero-shot (✓), Controllability (Timbre), Diffusion, BigVGAN, MelS, 2024.09
- NansyTTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Transformer, NANSY++, MelS, 2024.09, Demo
- StyleTTS-ZS, Zero-shot (✓), Controllability (Timbre), Flow-based Diffusion + GAN, Mel-based Decoder, MelS, 2024.09, Demo
- E1 TTS, Zero-shot (✓), Controllability (Timbre), DiT + Flow, BigVGAN, Token + MelS, 2024.09, Demo
- SimpleSpeech 2, Zero-shot (✓), Controllability (Speed, Timbre), Flow-based DiT, SQ Codec, Token, 2024.08, Demo, Code
- CCSP, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2024.07, Demo
- ArtSpeech, Zero-shot (✓), Controllability (Timbre), RNN + CNN, HiFi-GAN, MelS, 2024.07, Demo, Code
- DEX-TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, HiFi-GAN, MelS, 2024.06, Code
- MobileSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, Vocos, Token, 2024.06, Demo
- E2 TTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, BigVGAN, MelS, 2024.06, Demo, Code (unofficial)
- DiTTo-TTS, Zero-shot (✓), Controllability (Speed, Timbre), DiT + VAE, BigVGAN, MelS, 2024.06, Demo
- SimpleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Diffusion, SQ Codec, Token, 2024.06, Demo, Code
- AST-LDM, Zero-shot (✗), Controllability (Timbre, Environment, Description), Diffusion + VAE, HiFi-GAN, MelS, 2024.06, Demo
- ControlSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, FACodec, Token, 2024.06, Demo, Code
- InstructTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, HiFi-GAN, Token, 2024.05, Demo
- NaturalSpeech 3, Zero-shot (✓), Controllability (Speed, Prosody, Timbre), Transformer + Diffusion, FACodec, Token, 2024.04, Demo
- FlashSpeech, Zero-shot (✓), Controllability (Timbre), Latent Consistency Model, EnCodec, Token, 2024.04, Demo, Code
- Audiobox, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Environment, Description), Transformer + Flow, EnCodec, MelS, 2023.12, Demo
- HierSpeech++, Zero-shot (✓), Controllability (Timbre), Transformer + VAE + Flow, BigVGAN, MelS, 2023.11, Demo, Code
- E3 TTS, Zero-shot (✓), Controllability (Timbre), Diffusion, Not required, Waveform, 2023.11, Demo
- P-Flow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo, Code (unofficial)
- SpeechFlow, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.10, Demo
- PromptTTS++, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Transformer + Diffusion, BigVGAN, MelS, 2023.09, Demo, Code
- DuIAN-E, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, HiFi-GAN, MelS, 2023.09, Demo
- VoiceLDM, Zero-shot (✗), Controllability (Pitch, Prosody, Timbre, Emotion, Environment, Description), Diffusion, HiFi-GAN, MelS, 2023.09, Demo, Code
- PromptTTS 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Description), Diffusion, RVQ-based Codec, Latent Feature, 2023.09, Demo
- MegaTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.07, Demo, Code (unofficial)
- VoiceBox, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2023.06, Demo, Code (unofficial)
- StyleTTS 2, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Flow-based Diffusion + GAN, HiFi-GAN / iSTFTNet, MelS, 2023.06, Demo, Code
- PromptStyle, Zero-shot (✓), Controllability (Pitch, Prosody, Timbre, Emotion, Description), VITS + Flow, HiFi-GAN, MelS, 2023.05, Demo
- NaturalSpeech 2, Zero-shot (✓), Controllability (Timbre), Diffusion, RVQ-based Codec, Token, 2023.04, Demo, Code (unofficial)
- Grad-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Score-based Diffusion, HiFi-GAN, MelS, 2022.11, Demo
- PromptTTS, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), BERT + Transformer, HiFi-GAN, MelS, 2022.11, Demo
- CLONE, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, WaveNet, MelS + LinS, 2022.07, Demo
- Cauliflow, Zero-shot (✗), Controllability (Speed, Prosody), BERT + Flow, UP WaveNet, MelS, 2022.06
- GenerSpeech, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2022.05, Demo
- StyleTTS, Zero-shot (✓), Controllability (Timbre), CNN + RNN, HiFi-GAN, MelS, 2022.05, Code
- YourTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, LinS, 2021.12, Demo & Checkpoint
- DelightfulTTS, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), Transformer + CNN, HiFiNet, MelS, 2021.11, Demo
- Meta-StyleSpeech, Zero-shot (✓), Controllability (Timbre), Transformer, MelGAN, MelS, 2021.06, Code
- SC-GlowTTS, Zero-shot (✓), Controllability (Timbre), Transformer + Flow, HiFi-GAN, MelS, 2021.06, Demo, Code
- StyleTagging-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Transformer + CNN, HiFi-GAN, MelS, 2021.04, Demo
- Parallel Tacotron, Zero-shot (✗), Controllability (Prosody), Transformer + CNN, WaveRNN, MelS, 2020.10, Demo
- FastPitch, Zero-shot (✗), Controllability (Pitch, Prosody), Transformer, WaveGlow, MelS, 2020.06, Code
- FastSpeech 2, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody), Transformer, Parallel WaveGAN, MelS, 2020.06, Code (unofficial)
- FastSpeech, Zero-shot (✗), Controllability (Speed, Prosody), Transformer, WaveGlow, MelS, 2019.05, Code (unofficial)
Below are representative autoregressive controllable TTS methods. Each entry follows the same format as above: method name, zero-shot capability, controllability, acoustic model, vocoder, acoustic feature, release date, and code/demo.
- EmoVoice, Zero-shot (✗), Controllability (Emotion, Description), Decoder-only Transformer, HiFi-GAN, Token, 2025.04, Demo
- Spark-TTS, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre), Decoder-only Transformer, BiCodec, Token, 2025.03, Code
- Vevo, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, BigVGAN, Token + MelS, 2025.02, Demo, Code
- Step-Audio, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Flow-based Vocoder, Token, 2025.02, Code
- FleSpeech, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Flow-based DiT, WaveGAN, Latent Feature, 2025.01, Demo
- IDEA-TTS, Zero-shot (✓), Controllability (Timbre, Environment), Transformer, Flow-based Vocoder, LinS + MelS, 2024.12, Demo, Code
- KALL-E, Zero-shot (✓), Controllability (Prosody, Timbre, Emotion), Decoder-only Transformer, WaveVAE, Latent Feature, 2024.12, Demo
- IST-LM, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12
- SLAM-Omni, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.12, Demo, Code
- FishSpeech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Firefly-GAN, Token, 2024.11, Code
- HALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.10
- Takin, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
- Emotional Dimension Control, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + Flow, HiFi-GAN, Token + MelS, 2024.09, Demo
- CoFi-Speech, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, BigVGAN, Token + MelS, 2024.09, Demo
- FireRedTTS, Zero-shot (✓), Controllability (Prosody, Timbre), Decoder-only Transformer + Flow, BigVGAN-v2, Token + MelS, 2024.09, Demo, Code
- Emo-DPO, Zero-shot (✗), Controllability (Emotion), Decoder-only Transformer, HiFi-GAN, Token + MelS, 2024.09, Demo
- VoxInstruct, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, Vocos, Token, 2024.08, Demo, Code
- MELLE, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, MelS, 2024.07, Demo
- CosyVoice, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer + Flow, HiFi-GAN, Token, 2024.07, Demo, Code
- XTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN-based Vocoder, Token + MelS, 2024.06, Demo, Code
- VoiceCraft, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, HiFi-GAN, Token, 2024.06, Code
- Seed-TTS, Zero-shot (✓), Controllability (Timbre, Emotion), Decoder-only Transformer + DiT, Unknown Vocoder, Latent Feature, 2024.06, Demo
- VALL-E 2, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo, Code (unofficial 1), Code (unofficial 2)
- VALL-E R, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Vocos, Token, 2024.06, Demo
- ARDiT, Zero-shot (✓), Controllability (Speed, Timbre), Decoder-only DiT, BigVGAN, MelS, 2024.06, Demo
- RALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2024.05, Demo
- CLaM-TTS, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, BigVGAN, Token + MelS, 2024.04, Demo
- BaseTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, Speechcode Decoder, Token, 2024.02, Demo
- ELLA-V, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2024.01, Demo
- UniAudio, Zero-shot (✓), Controllability (Pitch, Speed, Prosody, Timbre, Description), Decoder-only Transformer, UniAudio Codec, Token, 2023.10, Demo, Code
- Salle, Zero-shot (✗), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion, Description), Decoder-only Transformer, EnCodec, Token, 2023.08, Demo
- SC VALL-E, Zero-shot (✓), Controllability (Pitch, Energy, Speed, Prosody, Timbre, Emotion), Decoder-only Transformer, EnCodec, Token, 2023.07, Demo, Code
- MegaTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + GAN, HiFi-GAN, MelS, 2023.06, Demo
- TorToise, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer + Diffusion, UnivNet, MelS, 2023.05, Code
- Make-a-voice, Zero-shot (✓), Controllability (Timbre), Encoder-decoder Transformer, Unit-based Vocoder, Token, 2023.05, Demo
- VALL-E X, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.03, Demo, Code (unofficial)
- SpearTTS, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, SoundStream, Token, 2023.02, Demo, Code (unofficial)
- VALL-E, Zero-shot (✓), Controllability (Timbre), Decoder-only Transformer, EnCodec, Token, 2023.01, Demo, Code (unofficial 1), Code (unofficial 2)
- MsEmoTTS, Zero-shot (✓), Controllability (Pitch, Prosody, Emotion), CNN + RNN, WaveRNN, MelS, 2022.01, Demo
- Flowtron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, WaveGlow, MelS, 2020.07, Demo, Code
- DurIAN, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), CNN + RNN, MB-WaveRNN, MelS, 2019.09, Demo, Code (unofficial)
- VAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody), VAE, WaveNet, MelS, 2019.02, Code (unofficial 1), Code (unofficial 2)
- GMVAE-Tacotron, Zero-shot (✗), Controllability (Pitch, Speed, Prosody, Description), VAE, WaveRNN, MelS, 2018.12, Demo, Code (unofficial)
- GST-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), CNN + RNN, Griffin-Lim, LinS, 2018.03, Demo, Code (unofficial)
- Prosody-Tacotron, Zero-shot (✗), Controllability (Pitch, Prosody), RNN, WaveNet, MelS, 2018.03, Demo
A summary of open-source datasets for controllable TTS:
| Dataset | Hours | #Speakers | Pit. | Ene. | Spe. | Age | Gen. | Emo. | Emp. | Acc. | Top. | Des. | Dia. | Lang | Release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SpeechCraft | 2,391 | 3,200 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en,zh | 2024 | ||
| Parler-TTS | 50,000 | / | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||
| MSceneSpeech | 13 | 13 | ✓ | zh | 2024 | ||||||||||
| VccmDataset | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||
| CLESC | <1 | / | ✓ | ✓ | ✓ | ✓ | en | 2024 | |||||||
| TextrolSpeech | 330 | 1,324 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2023 | |||||
| DailyTalk | 20 | 2 | ✓ | ✓ | ✓ | en | 2023 | ||||||||
| MagicData-RAMC | 180 | 663 | ✓ | ✓ | zh | 2022 | |||||||||
| PromptSpeech | / | / | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2022 | ||||||
| WenetSpeech | 10,000 | / | ✓ | zh | 2021 | ||||||||||
| GigaSpeech | 10,000 | / | ✓ | en | 2021 | ||||||||||
| ESD | 29 | 10 | ✓ | en,zh | 2021 | ||||||||||
| CommonVoice | 2,500 | 50,000 | ✓ | ✓ | ✓ | multi | 2020 | ||||||||
| AISHELL-3 | 85 | 218 | ✓ | ✓ | ✓ | zh | 2020 | ||||||||
| Taskmaster-1 | / | / | ✓ | en | 2019 | ||||||||||
| CMU-MOSEI | 65 | 1,000 | ✓ | en | 2018 | ||||||||||
| RAVDESS | / | 24 | ✓ | ✓ | en | 2018 | |||||||||
| RECOLA | 3.8 | 46 | ✓ | fr | 2013 | ||||||||||
| IEMOCAP | 12 | 10 | ✓ | ✓ | ✓ | ✓ | ✓ | en | 2008 |
Abbreviations: Pit(ch), Ene(rgy, i.e., volume), Spe(ed), Gen(der), Emo(tion), Emp(hasis), Acc(ent), Top(ic), Des(cription), Env(ironment), Dia(logue).
| Metric | Type | Eval Target | GT Required |
|---|---|---|---|
| Mel-Cepstral Distortion (MCD) | Objective | Acoustic similarity | ✓ |
| Frequency Domain Score Difference (FDSD) | Objective | Acoustic similarity | ✓ |
| Word Error Rate (WER) | Objective | Intelligibility | ✓ |
| Cosine Similarity | Objective | Speaker similarity | ✓ |
| Perceptual Evaluation of Speech Quality (PESQ) | Objective | Perceptual quality | ✓ |
| Signal-to-Noise Ratio (SNR) | Objective | Perceptual quality | ✓ |
| Mean Opinion Score (MOS) | Subjective | Preference | |
| Comparison Mean Opinion Score (CMOS) | Subjective | Preference | |
| AB Test | Subjective | Preference | |
| ABX Test | Subjective | Perceptual similarity | ✓ |

GT: Ground truth.
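Two of the objective metrics above are simple enough to sketch directly. Below is a minimal NumPy sketch (not from the survey; function names are ours) of frame-wise MCD, assuming the reference and synthesized mel-cepstra are already time-aligned (e.g., via DTW) and the energy coefficient c0 has been dropped, plus cosine similarity between speaker embeddings:

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Mean frame-wise Mel-Cepstral Distortion (MCD) in dB.

    c_ref, c_syn: (T, D) arrays of time-aligned mel-cepstral
    coefficients (c0, the energy term, excluded beforehand).
    """
    diff = c_ref - c_syn  # (T, D) per-frame coefficient differences
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def speaker_cosine_similarity(e1, e2):
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

In practice the mel-cepstra would come from a feature extractor (e.g., WORLD or librosa MFCCs) and the embeddings from a pretrained speaker verification model; lower MCD and higher cosine similarity indicate closer acoustic and speaker match, respectively.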