High-quality Text-to-Speech system for Turkic languages, with a primary focus on Kazakh.
Turkic-TTS delivers natural-sounding speech synthesis powered by state-of-the-art neural architectures. Built on Grad-TTS with emotional voice synthesis capabilities, this system supports multiple speakers and emotions, making it ideal for applications ranging from voice assistants to audiobook narration.
- Multi-speaker support - Generate speech with different voice profiles (Female and Male speakers)
- Emotional synthesis - Control emotions: neutral, happy, sad, angry, scared, surprised
- High-quality output - Neural vocoder (HiFi-GAN) produces 22.05 kHz audio
- Flexible architecture - Based on Grad-TTS with diffusion probabilistic modeling
- Turkic language support - Optimized for Kazakh with IPA phonetic conversion for multiple Turkic languages
- Fast inference - Adjustable timesteps for speed-quality trade-off
The system consists of three main components:
- Text Encoder - Converts text to phoneme embeddings with speaker/emotion conditioning
- Diffusion Decoder - Grad-TTS based mel-spectrogram generator with emotion control
- Neural Vocoder - HiFi-GAN converts mel-spectrograms to high-fidelity audio
The model also relies on:
- Monotonic Alignment Search (MAS) for text-audio alignment
- Classifier-free guidance for enhanced emotion control
- Exponential Moving Average (EMA) training for stability
- Support for both emotional and neutral speech synthesis
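As a rough illustration of classifier-free guidance, the sketch below blends a conditional and an unconditional prediction using a guidance scale. It shows the generic formulation only and is not necessarily the exact rule used in this repository; the function name `guided_prediction` and the toy values are hypothetical.

```python
def guided_prediction(cond, uncond, scale):
    """Generic classifier-free guidance blend (illustrative only).

    cond / uncond: predictions with and without emotion conditioning.
    scale: 0 returns the unconditional prediction, 1 the conditional one,
    and larger values push further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy values, not real model outputs.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
print(guided_prediction(cond, uncond, 2.0))
```

A larger scale strengthens the influence of the emotion condition, usually at some cost in naturalness.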
- Python 3.9 or higher
- PyTorch 1.10+
- CUDA (optional, for GPU acceleration)
- Clone the repository
git clone https://github.com/stukenov/turkic-tts.git
cd turkic-tts
- Install dependencies
pip install -r requirements.txt
- Build monotonic alignment module
cd model/monotonic_align
python setup.py build_ext --inplace
cd ../..
Pre-trained models are available on HuggingFace:
- Main repository: stukenov/turkic-tts-models
Download the models and place them in the appropriate directories:
- TTS model checkpoint → pt_10000/
- HiFi-GAN vocoder → pre_trained_3/
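Before running inference, it can help to confirm that the downloaded files ended up where the scripts expect them. The sketch below is a hypothetical helper, not part of the repository; the checkpoint file name is taken from the inference command in this README, and the contents of `pre_trained_3/` are not checked because they are not documented here.

```python
from pathlib import Path

# Locations taken from the download instructions and the inference command.
EXPECTED_PATHS = ("pt_10000/EMA_grad_10000.pt", "pre_trained_3")


def missing_model_paths(root="."):
    """Return the expected model paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_model_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All model paths found.")
```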
Create a text file with your input (e.g., filelists/my_text.txt):
Сәлем, қалайсың?|0|1
Format: text|emotion_id|speaker_id
Emotion IDs:
- 0: Angry
- 1: Fear
- 2: Happy
- 3: Neutral
- 4: Sad
- 5: Surprised
Speaker IDs:
- 0: M1 (Male 1)
- 1: F1 (Female 1)
- 2: M2 (Male 2)
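To avoid looking up numeric IDs by hand, input lines can be generated from names. The helper below is a minimal sketch that assumes only the `text|emotion_id|speaker_id` format and the ID tables listed above; `make_inference_line` is a hypothetical name, not a function from the repository.

```python
# ID tables copied from the lists above.
EMOTION_IDS = {"angry": 0, "fear": 1, "happy": 2, "neutral": 3, "sad": 4, "surprised": 5}
SPEAKER_IDS = {"M1": 0, "F1": 1, "M2": 2}


def make_inference_line(text, emotion, speaker):
    """Format one input line as text|emotion_id|speaker_id."""
    if "|" in text or "\n" in text:
        raise ValueError("text must not contain '|' or line breaks")
    return f"{text}|{EMOTION_IDS[emotion.lower()]}|{SPEAKER_IDS[speaker.upper()]}"


# Neutral speech from the F1 speaker.
example = make_inference_line("Сәлем, қалайсың?", "neutral", "F1")
```

Write the resulting lines to the input file with `encoding="utf-8"` so that Kazakh characters are preserved.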
Run inference:
python inference_EMA.py \
-c configs/train_grad.json \
-m pt_10000/EMA_grad_10000.pt \
-t 10 \
-g 100 \
-f filelists/my_text.txt \
-r output_audio/
Parameters:
- -c: Configuration file
- -m: Model checkpoint path
- -t: Number of diffusion timesteps (higher = better quality, slower)
- -g: Classifier-free guidance level (recommended: 100)
- -f: Input text file
- -r: Output directory for audio files
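The command line can also be assembled from Python, for example when sweeping the number of timesteps. This is a small sketch that only wraps the documented flags; `build_inference_command` is a hypothetical helper, and it assumes the script is run from the repository root.

```python
import shlex


def build_inference_command(config, checkpoint, filelist, out_dir,
                            timesteps=10, guidance=100):
    """Assemble the documented inference_EMA.py call as an argument list."""
    return [
        "python", "inference_EMA.py",
        "-c", config,
        "-m", checkpoint,
        "-t", str(timesteps),
        "-g", str(guidance),
        "-f", filelist,
        "-r", out_dir,
    ]


cmd = build_inference_command(
    "configs/train_grad.json",
    "pt_10000/EMA_grad_10000.pt",
    "filelists/my_text.txt",
    "output_audio/",
)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.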
# Coming soon: Python API example
- Organize your dataset with audio files and transcriptions
- Create filelists in the format:
audio_path|speaker_id|emotion_id|text
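A quick way to validate a training filelist is to parse each line into typed fields. The sketch below assumes only the four-field format shown above; `FilelistEntry`, `parse_training_line`, and the sample path are hypothetical.

```python
from typing import NamedTuple


class FilelistEntry(NamedTuple):
    audio_path: str
    speaker_id: int
    emotion_id: int
    text: str


def parse_training_line(line):
    """Split one audio_path|speaker_id|emotion_id|text line into fields."""
    # maxsplit=3 keeps any further '|' characters inside the text field.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    audio_path, speaker_id, emotion_id, text = parts
    return FilelistEntry(audio_path, int(speaker_id), int(emotion_id), text)


# Hypothetical sample line.
entry = parse_training_line("wavs/sample_0001.wav|1|3|Сәлем\n")
```

Note that the field order here differs from the inference input format, where the text comes first.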
python data_preparation.py -d /path/to/your/dataset
CUDA_VISIBLE_DEVICES=0 python train_EMA.py \
-c configs/train_grad.json \
-m logs/train_logs
The training script includes:
- Exponential Moving Average (EMA) for model stability
- Duration prediction with Monotonic Alignment Search
- Multi-speaker and emotion embedding
- TensorBoard logging
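For readers unfamiliar with EMA, the update rule can be sketched in a few lines. This is the generic technique on toy scalar values, not the code from `train_EMA.py`; the function name and the decay values are assumptions.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return new EMA weights: decay * ema + (1 - decay) * current."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * value
        for name, value in model_weights.items()
    }


# Toy scalar "weights"; a real model would hold tensors.
ema = {"w": 1.0}
ema = ema_update(ema, {"w": 0.0}, decay=0.5)
```

The averaged weights change more slowly than the raw weights, which is why the EMA checkpoint is typically the one used for inference.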
The model has been trained on high-quality Kazakh speech data with:
- Multiple speakers and emotional expressions
- 80-dimensional mel-spectrograms
- 22.05 kHz sampling rate
- Diffusion-based generation with controllable quality
Key configuration options in configs/train_grad.json:
- Model architecture: encoder layers, channels, attention heads
- Training parameters: learning rate, batch size, optimization
- Data processing: sampling rate, hop length, mel bins
- Speaker/emotion settings: number of speakers, emotions, embedding dimensions
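The data processing settings determine how mel-spectrogram frames map to audio time. As a hedged sketch: the 22.05 kHz sampling rate comes from this README, while the hop length of 256 used below is an assumed value that should be checked against `configs/train_grad.json`.

```python
def mel_frames_to_seconds(n_frames, hop_length, sample_rate=22050):
    """Approximate audio duration covered by n_frames mel frames."""
    return n_frames * hop_length / sample_rate


# hop_length=256 is an assumed value; check configs/train_grad.json.
print(round(mel_frames_to_seconds(862, hop_length=256), 2))
```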
turkic-tts/
├── model/                  # Core model architectures
│   ├── tts.py              # Main Grad-TTS model
│   ├── diffusion.py        # Diffusion decoder
│   ├── text_encoder.py     # Text encoding module
│   └── monotonic_align/    # MAS alignment
├── text/                   # Text processing
│   ├── cleaners.py         # Text normalization
│   └── symbols.py          # Phoneme symbols
├── configs/                # Configuration files
├── inference_EMA.py        # Inference script
├── train_EMA.py            # Training script
├── models.py               # HiFi-GAN vocoder
├── ipa_convert.py          # IPA phonetic conversion
└── requirements.txt        # Dependencies
While optimized for Kazakh, the IPA conversion module includes support for:
- Kazakh
- Turkish
- Kyrgyz
- Uzbek
- Azerbaijani
- Turkmen
- Tatar
- Bashkir
- Sakha (Yakut)
- Uyghur
Contributions are welcome! Whether it's:
- Bug fixes
- New features
- Documentation improvements
- Training data contributions
- Support for additional Turkic languages
Please feel free to submit issues and pull requests.
If you use this code or models in your research, please cite:
@software{turkic_tts_2024,
author = {Tukenov, Saken},
title = {Turkic-TTS: High-Quality Text-to-Speech for Turkic Languages},
year = {2024},
url = {https://github.com/stukenov/turkic-tts}
}
This project is licensed under the MIT License - see the LICENSE file for details.
This work builds upon several excellent open-source projects:
- Grad-TTS - Diffusion probabilistic TTS
- HiFi-GAN - Neural vocoder
- KazEmoTTS - Kazakh emotional TTS dataset
For questions, suggestions, or collaborations:
- GitHub: @stukenov
- Issues: Project Issues
Star ⭐ this repository if you find it useful!