📄 Paper  |  🤗 HuggingFace  |  🤖 ModelScope  |  🛠️ Audio.Z.AI
GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture: an LLM first generates a speech token sequence, and a Flow model then converts the tokens into high-quality audio waveforms. By introducing a multi-reward reinforcement learning framework, GLM-TTS generates more expressive and emotional speech, significantly improving on the expressiveness of traditional TTS systems.
- [2025.12.11] 🎉 The project is officially open-sourced, featuring inference scripts and a series of model weights.
- [2025.12.17] The GLM-TTS Technical Report is available on arXiv: 2512.14291.
- [Coming Soon] 2D Vocos vocoder update in progress.
- [Coming Soon] Model weights optimized via reinforcement learning.
- Zero-shot Voice Cloning: Clones any speaker's voice from just 3-10 seconds of prompt audio
- RL-enhanced Emotion Control: Achieves more natural emotional expression and prosody control through a multi-reward reinforcement learning framework
- Streaming Inference: Supports real-time streaming audio generation, suitable for interactive applications
- High-quality Synthesis: Generates natural and expressive speech with quality comparable to commercial systems
- Multi-language Support: Primarily supports Chinese, with support for mixed Chinese-English text
- Phoneme-level Modeling: Supports phoneme-level control of text-to-speech conversion
- Flexible Inference: Supports multiple sampling strategies and inference modes
Make sure you are using Python 3.10-3.12.
# Clone repository
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
# Install dependencies
pip install -r requirements.txt
# Install reinforcement learning related dependencies (optional)
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation
# Download wavlm_large_finetune.pth and place it in the grpo/ckpt directory

We support downloading the complete model weights (including Tokenizer, LLM, Flow, Vocoder, and Frontend) from HuggingFace or ModelScope.
# Create model directory
mkdir -p ckpt
# Option 1: Download from HuggingFace
pip install -U huggingface_hub
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt
# Option 2: Download from ModelScope
pip install -U modelscope
modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt

# Run inference
python glmtts_inference.py \
--data=example_zh \
--exp_name=_test \
--use_cache
# Add --phoneme to enable phoneme capabilities.

# Or run the provided script
bash glmtts_inference.sh

# Launch the Gradio interactive demo
python -m tools.gradio_app

GLM-TTS adopts a two-stage design: in the first stage, a large language model (LLM) based on the Llama architecture converts the input text into a sequence of speech tokens; in the second stage, a Flow Matching model converts the token sequence into a high-quality mel-spectrogram, from which a vocoder generates the final audio waveform. The system supports zero-shot voice cloning by extracting speaker features from prompt audio, with no fine-tuning for specific speakers.
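Conceptually, inference is a frontend pass followed by the two model stages. The sketch below is a minimal illustration of that data flow; all class and method names in it are hypothetical stand-ins rather than the repository's actual API (see `glmtts_inference.py` for the real entry point).

```python
# Conceptual sketch of the two-stage pipeline; every name below is a
# hypothetical stand-in, not the repository's actual API.
import torch


def synthesize(text: str, prompt_wav: torch.Tensor, frontend, llm, flow, vocoder) -> torch.Tensor:
    # Frontend: extract conditioning signals from the prompt audio.
    prompt_tokens = frontend.extract_speech_tokens(prompt_wav)      # discrete speech tokens
    spk_embedding = frontend.extract_speaker_embedding(prompt_wav)  # speaker identity vector

    # Stage 1: the Llama-based LLM autoregressively generates speech tokens
    # conditioned on the text and the prompt tokens (zero-shot cloning).
    speech_tokens = llm.generate(text=text, prompt_tokens=prompt_tokens)

    # Stage 2: Flow Matching turns the tokens into a mel-spectrogram, and the
    # vocoder renders the final waveform.
    mel = flow.tokens_to_mel(speech_tokens, spk_embedding=spk_embedding)
    return vocoder(mel)
```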
For scenarios demanding high pronunciation accuracy, such as educational assessments and audiobooks, GLM-TTS introduces the Phoneme-in mechanism to resolve pronunciation ambiguity in polyphonic characters (e.g., 行, which can be read as xíng or háng) and rare characters. This mechanism supports hybrid "Phoneme + Text" input, enabling precise, targeted control over the pronunciation of specific words.
- Hybrid Training: During training, random G2P (Grapheme-to-Phoneme) conversion is applied to parts of the text. This strategy compels the model to adapt to hybrid input sequences, preserving its ability to understand pure text while enhancing generalization to phoneme inputs.
- Targeted Inference: Inference follows a `G2P -> Table Lookup Replacement -> Hybrid Input` workflow:
  - Global Conversion: Obtain the complete phoneme sequence for the input text.
  - Dynamic Replacement: Using a "Dynamic Controllable Dictionary," automatically identify polyphonic or rare characters and replace them with the specified target phonemes.
  - Hybrid Generation: Feed the combination of replaced phonemes and original text into GLM-TTS as a hybrid input. This ensures precise pronunciation control for specific words while maintaining natural prosody.
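As a minimal illustration of the dynamic-replacement step, the sketch below pins the pronunciation of one polyphonic character. The dictionary format and the `<phoneme>` wrapper are hypothetical; the real replacement tables live in `configs/G2P_replace_dict.jsonl` and the conversion logic in `utils/glm_g2p.py`.

```python
# Illustrative sketch of the table-lookup replacement; the dictionary
# format and the <phoneme> wrapper below are hypothetical.

# Dynamic controllable dictionary: characters whose pronunciation is pinned.
REPLACE_DICT = {"行": "hang2"}  # force the "háng" reading, e.g. in 银行 (bank)


def to_hybrid_input(text: str) -> str:
    """Replace listed characters with target phonemes; keep the rest as text."""
    pieces = []
    for ch in text:
        if ch in REPLACE_DICT:
            # Wrap the phoneme so the model can tell it apart from raw text.
            pieces.append(f"<phoneme>{REPLACE_DICT[ch]}</phoneme>")
        else:
            pieces.append(ch)
    return "".join(pieces)


print(to_hybrid_input("我去银行取钱"))
# 我去银<phoneme>hang2</phoneme>取钱
```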
To address the flat emotional expression of traditional TTS, we introduce a multi-reward reinforcement learning framework. The framework evaluates generated speech through multiple reward functions (including similarity, CER, emotion, and laughter rewards) and uses the GRPO (Group Relative Policy Optimization) algorithm to optimize the LLM's generation policy. Specifically:
- Multi-reward Design: Several reward functions evaluate the generated speech along different dimensions, including sound quality, speaker similarity, and emotional expression
- Reward Server: A distributed reward server computes the reward functions in parallel
- Policy Optimization: The GRPO algorithm optimizes the LLM's generation policy based on the reward signals, enhancing the emotional expressiveness of the speech
- Token-level Rewards: Fine-grained token-level reward allocation provides more precise optimization signals
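The group-relative part of GRPO needs no learned value function: rewards for a group of candidates sampled from the same prompt are standardized against each other. Below is a minimal sketch of that advantage computation, with a placeholder weighted sum standing in for the actual reward functions in `grpo/reward_func.py`; the weights and reward values are illustrative only.

```python
import numpy as np


def combined_reward(sim: float, cer: float, emotion: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of per-sample rewards; CER enters negatively (lower is
    better). The weights here are placeholders, not the training values."""
    w_sim, w_cer, w_emo = weights
    return w_sim * sim - w_cer * cer + w_emo * emotion


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: standardize rewards across the candidates
    sampled for one prompt, so no learned value function is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: 4 candidate utterances sampled for the same prompt.
rewards = np.array([combined_reward(0.78, 0.010, 0.6),
                    combined_reward(0.74, 0.031, 0.4),
                    combined_reward(0.80, 0.008, 0.7),
                    combined_reward(0.71, 0.052, 0.3)])
print(grpo_advantages(rewards))  # positive for above-average samples
```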
Through RL optimization, GLM-TTS_RL reduces CER from 1.03 to 0.89 relative to the base model while maintaining high speaker similarity, achieving better sound quality and expressiveness.
- LLM module
  - File Location: `llm/glmtts.py`
  - Function: Text-to-speech model based on the Llama architecture, responsible for converting input text into speech token sequences
  - Supported Modes: Pretrained (PRETRAIN), Fine-tuning (SFT), and LoRA
- Flow module
  - File Location: `flow/` directory
  - Core Files: `dit.py`, `flow.py`, `modules.py`
  - Function: Converts the token sequences generated by the LLM into high-quality mel-spectrograms
- Frontend module
  - File Location: `cosyvoice/cli/frontend.py`
  - Function: Preprocessing of text and speech, including text normalization, phoneme conversion, speech token extraction, and speaker embedding extraction (a rough sketch of its outputs follows)
  - Features: Supports mixed Chinese and English text
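To make the frontend's role concrete, here is a hypothetical sketch of the artifacts it hands to the two model stages. The field and function names are illustrative only; the actual interface in `cosyvoice/cli/frontend.py` may differ.

```python
# Hypothetical sketch of what the frontend produces for the two model stages;
# the actual interface in cosyvoice/cli/frontend.py may differ.
from dataclasses import dataclass


@dataclass
class FrontendOutput:
    normalized_text: str              # after text normalization (numbers, dates, ...)
    phonemes: list[str]               # optional G2P output for hybrid input
    prompt_speech_tokens: list[int]   # discrete tokens extracted from the prompt audio
    speaker_embedding: list[float]    # e.g. produced by the campplus.onnx model


def run_frontend(text: str, prompt_wav_path: str) -> FrontendOutput:
    """Placeholder: normalization, G2P, speech-token and embedding extraction."""
    raise NotImplementedError
```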
- GRPO module
  - File Location: `grpo/` directory
  - Core Files:
    - `grpo_utils.py`: GRPO algorithm implementation and batch inference
    - `reward_func.py`: Multi-reward function implementation
    - `reward_server.py`: Distributed reward server
  - Function: Optimizes the emotional expressiveness of the TTS system through multi-reward reinforcement learning
Evaluated on the seed-tts-eval zh test set. To maintain consistency with the original evaluation, inference was performed without the `--phoneme` flag.

CER: Character Error Rate (%, lower is better); SIM: Speaker Similarity (%, higher is better). A small reference implementation of CER follows the table.
| Model | CER | SIM | Open-source |
|---|---|---|---|
| MegaTTS3 | 1.52 | 79.0 | ❌ No |
| DiTAR | 1.02 | 75.3 | ❌ No |
| CosyVoice3 | 1.12 | 78.1 | ❌ No |
| Seed-TTS | 1.12 | 79.6 | ❌ No |
| MiniMax | 0.83 | 78.3 | ❌ No |
| CosyVoice2 | 1.38 | 75.7 | ✅ Yes |
| F5-TTS | 1.53 | 76.0 | ✅ Yes |
| FireRedTTS-2 | 1.14 | 73.6 | ✅ Yes |
| IndexTTS2 | 1.03 | 76.5 | ✅ Yes |
| VibeVoice | 1.16 | 74.4 | ✅ Yes |
| HiggsAudio-v2 | 1.50 | 74.0 | ✅ Yes |
| VoxCPM | 0.93 | 77.2 | ✅ Yes |
| GLM-TTS (Ours) | 1.03 | 76.1 | ✅ Yes |
| GLM-TTS_RL (Ours) | 0.89 | 76.4 | ✅ Yes |
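For reference, CER is the character-level edit distance between the ASR transcript of the synthesized audio and the target text, divided by the reference length (reported above as a percentage). A small self-contained implementation:

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)


print(f"{cer('今天天气很好', '今天天汽很好'):.3f}")  # one substitution -> 0.167
```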
GLM-TTS/
├── glmtts_inference.py          # Main inference script, containing the complete inference process
├── glmtts_inference.sh          # Pre-trained model inference script
├── configs/                     # Configuration files directory
│   ├── spk_prompt_dict.yaml     # Speaker prompt dictionary
│   ├── lora_adapter_configV3.1.json  # LoRA adapter configuration
│   ├── G2P_able_1word.json      # Single-character phoneme conversion configuration
│   ├── G2P_all_phonemes.json    # Full phoneme list
│   ├── G2P_replace_dict.jsonl   # Phoneme replacement dictionary
│   └── custom_replace.jsonl     # Custom replacement rules
├── cosyvoice/                   # CosyVoice module
│   ├── cli/
│   │   └── frontend.py          # Text and speech frontend processing
│   └── utils/                   # Utility functions
├── examples/                    # Example data
│   ├── *.jsonl                  # Example jsonl files
│   └── prompt/                  # Prompt audio directory
│       ├── *.wav                # Prompt audio (for research use only)
│       └── LICENSE              # Audio file license
├── flow/                        # Flow model
│   ├── dit.py                   # Diffusion Transformer implementation
│   ├── flow.py                  # Streaming Flow model
│   └── modules.py               # Flow model basic modules
├── grpo/                        # Reinforcement learning module
│   ├── grpo_utils.py            # GRPO algorithm implementation
│   ├── reward_func.py           # Multi-reward functions
│   ├── reward_server.py         # Distributed reward server
│   ├── train_ds_grpo.py         # GRPO training script
│   └── data/                    # Training data and configuration
├── llm/                         # Large language model
│   └── glmtts.py                # GLM-TTS LLM implementation
├── frontend/                    # Frontend model files
│   ├── campplus.onnx            # Speaker embedding model
│   └── cosyvoice_frontend.yaml  # Frontend configuration
├── tools/                       # Tool scripts
│   ├── gradio_app.py            # Gradio interactive interface
│   ├── ffmpeg_speech_control.py # Audio processing tool
│   └── flow_reconstruct.py      # Audio reconstruction
└── utils/                       # Common utilities
    ├── tts_model_util.py        # TTS model utilities
    ├── yaml_util.py             # YAML configuration loading utility
    ├── audio.py                 # Audio processing utility
    ├── seed_util.py             # Random seed utility
    ├── block_mask_util.py       # Block mask utility
    ├── vocos_util.py            # Vocos vocoder utility
    ├── hift_util.py             # HiFT vocoder utility
    ├── whisper_models/          # Whisper model components
    └── glm_g2p.py               # Text-to-phoneme conversion
We thank the following open-source projects for their support:
- CosyVoice - Frontend processing framework and high-quality vocoder
- Llama - Base language model architecture
- Vocos - High-quality vocoder
- GRPO-Zero - Inspiration for the reinforcement learning implementation
If you find GLM-TTS useful for your research, please cite our technical report:
@misc{cui2025glmttstechnicalreport,
title={GLM-TTS Technical Report},
author={Jiayan Cui and Zhihan Yang and Naihan Li and Jiankun Tian and Xingyu Ma and Yi Zhang and Guangyu Chen and Runxuan Yang and Yuqing Cheng and Yizhi Zhou and Guochen Yu and Xiaotao Gu and Jie Tang},
year={2025},
eprint={2512.14291},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2512.14291},
}