X-Voice is a flow-matching-based multilingual text-to-speech system that enables one speaker to speak 30 languages.
- 2026/04/30: X-Voice codebase, model, demo, Hugging Face Space, dataset, and benchmark are released.
- 2026/05/08: X-Voice paper publicly released on arXiv.
# Create a conda env with python_version>=3.10
conda create -n x-voice python=3.11
conda activate x-voice
# Install FFmpeg if you haven't yet
conda install ffmpeg

NVIDIA GPU
# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

AMD GPU
# Install pytorch with your ROCm version (Linux only), e.g.
pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2

Intel GPU
# Install pytorch with your XPU version, e.g.
pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu

Apple Silicon
# Install the stable pytorch, e.g.
pip install torch torchaudio
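
If you are unsure which of the PyTorch install variants above applies to your machine, a rough heuristic can narrow it down. The sketch below is not part of the X-Voice repository: detect_backend is a hypothetical helper, and it only checks whether the vendor tools nvidia-smi or rocminfo are on PATH, or whether the machine reports itself as an arm64 Mac. It does not detect Intel GPUs and does not tell you which CUDA or ROCm version to pick.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the accelerator family from tools on PATH.
# This is a heuristic only; a driver tool being present does not
# guarantee a working GPU.
detect_backend() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "cuda"            # NVIDIA driver utilities found
    elif command -v rocminfo >/dev/null 2>&1; then
        echo "rocm"            # AMD ROCm utilities found
    elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
        echo "apple-silicon"   # arm64 macOS
    else
        echo "unknown"         # includes Intel GPU and CPU-only machines
    fi
}

echo "Detected backend: $(detect_backend)"
```

Use the printed value only as a hint for which of the install commands above to run.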
git clone https://github.com/sunnyxrxrx/X-Voice.git
cd X-Voice
pip install -e .

Check your ESpeak-ng installation:
espeak-ng --version
If not found, run bash src/x_voice/prepare_ipa.sh first.
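
The two external tools named in the installation steps, FFmpeg and espeak-ng, can be checked in one pass. The following is a minimal sketch, not a script shipped with X-Voice: check_tool is a hypothetical helper name, and it only tests that each command is on PATH, not that the installed version is suitable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# Count how many of the external prerequisites are missing.
missing=0
for tool in ffmpeg espeak-ng; do
    check_tool "$tool" || missing=$((missing + 1))
done

if [ "$missing" -gt 0 ]; then
    echo "$missing prerequisite(s) missing; see the installation steps above."
fi
```

If espeak-ng is reported missing, the prepare_ipa.sh script mentioned above is the documented way to set it up.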
- To achieve the desired performance, take a moment to read the detailed guidance.
x-voice_infer-gradio --host 0.0.0.0 --port 7860

# X-Voice Stage1
python -m x_voice.infer.infer_cli_stage1 -c src/x_voice/infer/examples/basic/basic_stage1.toml
# X-Voice Stage2
python -m x_voice.infer.infer_cli_stage2 -c src/x_voice/infer/examples/basic/basic_stage2.toml

Refer to the training guidance for best practices.
Refer to the speaking rate predictor guidance for the multilingual speaking rate predictor used in X-Voice.
Refer to the evaluation guidance for benchmark and metric scripts.
X-Voice/
├── ckpts/ # checkpoints
├── data/ # datasets and processed data
├── src/
│ ├── rate_pred/ # speaking rate predictor
│ ├── third_party/
│ │ └── BigVGAN/ # BigVGAN submodule
│ └── x_voice/ # main X-Voice package
└── pyproject.toml # package definition and dependencies
Use pre-commit to ensure code quality:
pip install pre-commit
pre-commit install
pre-commit run --all-files

- F5-TTS for its brilliant work and as the foundation of this codebase
- Cross-Lingual F5-TTS 2 for its supervised fine-tuning strategy with synthetic audio prompts
- Cross-Lingual F5-TTS for its speaking rate predictor
- NLLB for translation in the Gradio demo
- torchdiffeq as the ODE solver; Vocos and BigVGAN as vocoders
- FunASR, faster-whisper, UniSpeech, SpeechMOS as evaluation tools
- MAVL for Japanese syllable counting
If you find our work useful, please cite as:
@article{xu2026xvoiceenablingspeak30,
title={X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning},
author={Rixi Xu and Qingyu Liu and Haitao Li and Yushen Chen and Zhikang Niu and Yunting Yang and Jian Zhao and Ke Li and Berrak Sisman and Qinyuan Cheng and Xipeng Qiu and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2605.05611},
year={2026},
}
Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC because of the training data, the X-Voice Dataset.