stukenov/turkic-tts


Turkic-TTS 🎙️

License: MIT | Python 3.9+ | PyTorch

High-quality Text-to-Speech system for Turkic languages, with a primary focus on Kazakh.

Turkic-TTS synthesizes natural-sounding speech with a Grad-TTS-based model extended for emotional voice synthesis. It supports multiple speakers and emotions, which suits applications ranging from voice assistants to audiobook narration.

✨ Features

  • 🗣️ Multi-speaker support - Generate speech with different voice profiles (Female and Male speakers)
  • 😊 Emotional synthesis - Control emotions: neutral, happy, sad, angry, scared, surprised
  • 🎯 High-quality output - Neural vocoder (HiFi-GAN) produces 22.05 kHz audio
  • 🔧 Flexible architecture - Based on Grad-TTS with diffusion probabilistic modeling
  • 🌍 Turkic language support - Optimized for Kazakh with IPA phonetic conversion for multiple Turkic languages
  • ⚡ Fast inference - Adjustable timesteps for speed-quality trade-off

🏗️ Architecture Overview

The system consists of three main components:

  1. Text Encoder - Converts text to phoneme embeddings with speaker/emotion conditioning
  2. Diffusion Decoder - Grad-TTS based mel-spectrogram generator with emotion control
  3. Neural Vocoder - HiFi-GAN converts mel-spectrograms to high-fidelity audio

Key Technical Features:

  • Monotonic Alignment Search (MAS) for text-audio alignment
  • Classifier-free guidance for enhanced emotion control
  • Exponential Moving Average (EMA) training for stability
  • Support for both emotional and neutral speech synthesis
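
To make the alignment step more concrete, here is a minimal pure-Python sketch of Monotonic Alignment Search as it is generally described for Grad-TTS-style models. It is not the compiled module in model/monotonic_align; the function names and the toy score matrix are illustrative assumptions.

```python
import math


def monotonic_alignment_search(log_p):
    """Return, for each mel frame, the index of the text token it aligns to.

    log_p[i][j] is the score (for example a log-likelihood) of aligning text
    token i with mel frame j. The path starts at token 0, ends at the last
    token, and either stays on the same token or moves to the next one at
    every frame.
    """
    n_tokens, n_frames = len(log_p), len(log_p[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # q[i][j]: best total score of a monotonic path ending at (token i, frame j).
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_p[i][j]

    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        # Step back one token when forced to (i == j) or when that scores higher.
        if i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


def durations_from_path(path, n_tokens):
    """Count how many mel frames each text token received."""
    durations = [0] * n_tokens
    for token in path:
        durations[token] += 1
    return durations


# Toy example: 3 tokens, 5 frames, with one clearly preferred alignment.
scores = [
    [0.0, 0.0, -9.0, -9.0, -9.0],
    [-9.0, -9.0, 0.0, -9.0, -9.0],
    [-9.0, -9.0, -9.0, 0.0, 0.0],
]
alignment = monotonic_alignment_search(scores)
print(alignment, durations_from_path(alignment, 3))
```

The per-token frame counts recovered this way are the kind of targets a duration predictor is usually trained on.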

📦 Installation

Prerequisites

  • Python 3.9 or higher
  • PyTorch 1.10+
  • CUDA (optional, for GPU acceleration)

Setup

  1. Clone the repository
git clone https://github.com/stukenov/turkic-tts.git
cd turkic-tts
  2. Install dependencies
pip install -r requirements.txt
  3. Build monotonic alignment module
cd model/monotonic_align
python setup.py build_ext --inplace
cd ../..

🚀 Quick Start

Download Pre-trained Models

Pre-trained models are available on Hugging Face. Download the models and place them in the appropriate directories:

  • TTS model checkpoint → pt_10000/
  • HiFi-GAN vocoder → pre_trained_3/

Inference

Create a text file with your input (e.g., filelists/my_text.txt):

Сәлем, қалайсың?|0|1

Format: text|emotion_id|speaker_id

Emotion IDs:

  • 0: Angry
  • 1: Fear
  • 2: Happy
  • 3: Neutral
  • 4: Sad
  • 5: Surprised

Speaker IDs:

  • 0: M1 (Male 1)
  • 1: F1 (Female 1)
  • 2: M2 (Male 2)
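
As a convenience, input lines can be generated from names instead of raw IDs. The sketch below only encodes the text|emotion_id|speaker_id format and the ID tables listed above; the helper names are hypothetical and are not part of the repository.

```python
# IDs copied from the emotion and speaker lists above.
EMOTION_IDS = {"angry": 0, "fear": 1, "happy": 2, "neutral": 3, "sad": 4, "surprised": 5}
SPEAKER_IDS = {"M1": 0, "F1": 1, "M2": 2}


def make_inference_line(text, emotion, speaker):
    """Format one line as text|emotion_id|speaker_id."""
    if "|" in text or "\n" in text:
        raise ValueError("text must not contain '|' or a newline")
    return f"{text}|{EMOTION_IDS[emotion]}|{SPEAKER_IDS[speaker]}"


def parse_inference_line(line):
    """Split a line back into (text, emotion_id, speaker_id)."""
    text, emotion_id, speaker_id = line.rstrip("\n").rsplit("|", 2)
    return text, int(emotion_id), int(speaker_id)


def write_filelist(path, rows):
    """Write (text, emotion, speaker) tuples to a UTF-8 filelist."""
    with open(path, "w", encoding="utf-8") as f:
        for text, emotion, speaker in rows:
            f.write(make_inference_line(text, emotion, speaker) + "\n")


line = make_inference_line("Сәлем, қалайсың?", "angry", "F1")
print(line)
```

Writing the file as UTF-8 matters here because Kazakh text uses Cyrillic letters outside ASCII.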

Run inference:

python inference_EMA.py \
  -c configs/train_grad.json \
  -m pt_10000/EMA_grad_10000.pt \
  -t 10 \
  -g 100 \
  -f filelists/my_text.txt \
  -r output_audio/

Parameters:

  • -c: Configuration file
  • -m: Model checkpoint path
  • -t: Number of diffusion timesteps (higher = better quality, slower)
  • -g: Classifier-free guidance level (recommended: 100)
  • -f: Input text file
  • -r: Output directory for audio files
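
To compare several timestep settings, the command can be assembled from Python. This is a sketch that uses only the flags documented above; the helper names and the per-setting output directory layout are assumptions, and whether inference_EMA.py creates missing output directories is not confirmed here.

```python
import subprocess
import sys


def build_inference_command(config, checkpoint, filelist, out_dir,
                            timesteps=10, guidance=100):
    """Return the argument list for one inference_EMA.py run."""
    return [
        sys.executable, "inference_EMA.py",
        "-c", config,
        "-m", checkpoint,
        "-t", str(timesteps),
        "-g", str(guidance),
        "-f", filelist,
        "-r", out_dir,
    ]


def sweep_timesteps(values, out_root="output_audio", **fixed):
    """Build one command per timestep count, each with its own output directory."""
    return [
        build_inference_command(out_dir=f"{out_root}/t{t}", timesteps=t, **fixed)
        for t in values
    ]


def run_all(commands):
    """Run the commands one after another from the repository root."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = sweep_timesteps(
    [10, 50],
    config="configs/train_grad.json",
    checkpoint="pt_10000/EMA_grad_10000.pt",
    filelist="filelists/my_text.txt",
)
for cmd in commands:
    print(" ".join(cmd))
# Call run_all(commands) to actually synthesize the audio.
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.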

Example Usage

# Coming soon: Python API example

🏋️ Training

Data Preparation

  1. Organize your dataset with audio files and transcriptions
  2. Create filelists in the format: audio_path|speaker_id|emotion_id|text
python data_preparation.py -d /path/to/your/dataset
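
Note that the training filelist puts the IDs before the text and lists the speaker first, unlike the inference input. The following sketch builds and checks lines in the audio_path|speaker_id|emotion_id|text format stated above; the helper names and the example path are hypothetical, and it is not confirmed here that data_preparation.py emits exactly these lines.

```python
def make_training_line(audio_path, speaker_id, emotion_id, text):
    """Format one line as audio_path|speaker_id|emotion_id|text."""
    if "|" in audio_path:
        raise ValueError("audio_path must not contain '|'")
    return f"{audio_path}|{int(speaker_id)}|{int(emotion_id)}|{text}"


def parse_training_line(line):
    """Split a line into (audio_path, speaker_id, emotion_id, text)."""
    audio_path, speaker_id, emotion_id, text = line.rstrip("\n").split("|", 3)
    return audio_path, int(speaker_id), int(emotion_id), text


def find_bad_lines(lines):
    """Return the 1-based numbers of lines that do not match the format."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            parse_training_line(line)
        except ValueError:
            bad.append(number)
    return bad


# "wavs/0001.wav" is a made-up path used only for illustration.
example = make_training_line("wavs/0001.wav", 1, 3, "Сәлем, қалайсың?")
print(example)
```

Running find_bad_lines over a filelist before training can catch missing fields or non-numeric IDs early.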

Training the Model

CUDA_VISIBLE_DEVICES=0 python train_EMA.py \
  -c configs/train_grad.json \
  -m logs/train_logs

The training script includes:

  • Exponential Moving Average (EMA) for model stability
  • Duration prediction with Monotonic Alignment Search
  • Multi-speaker and emotion embedding
  • TensorBoard logging
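
The idea behind the EMA weights can be shown with plain floats standing in for tensors. This is a generic sketch of exponential moving averaging, not the code in train_EMA.py; the class name and the decay value are assumptions.

```python
class EMA:
    """Keep a smoothed copy of named parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)  # starts as a copy of the current weights

    def update(self, params):
        """Move each shadow value a small step toward the current value."""
        for name, value in params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * value
            )


# Toy run with decay 0.9 so the effect is visible after two steps.
ema = EMA({"w": 0.0}, decay=0.9)
ema.update({"w": 1.0})
first = ema.shadow["w"]
ema.update({"w": 1.0})
second = ema.shadow["w"]
print(first, second)
```

The smoothed copy changes more slowly than the raw weights, which is why checkpoints such as EMA_grad_10000.pt are typically the ones used for inference.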

📊 Model Performance

The model has been trained on high-quality Kazakh speech data with:

  • Multiple speakers and emotional expressions
  • 80-dimensional mel-spectrograms
  • 22.05 kHz sampling rate
  • Diffusion-based generation with controllable quality
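
For a rough sense of scale, the sketch below estimates the mel-spectrogram shape for a clip of a given length. The sampling rate and mel dimension come from the list above; the hop length of 256 is an assumption (a common choice in 22.05 kHz HiFi-GAN setups) and should be checked against configs/train_grad.json. The exact frame count also depends on the padding convention.

```python
import math

SAMPLE_RATE = 22050  # from the list above
N_MELS = 80          # from the list above
HOP_LENGTH = 256     # assumption, not confirmed for this repository


def mel_shape(duration_seconds, hop_length=HOP_LENGTH):
    """Approximate (n_mels, n_frames) for a clip of the given duration."""
    n_samples = round(duration_seconds * SAMPLE_RATE)
    n_frames = math.ceil(n_samples / hop_length)
    return N_MELS, n_frames


print(mel_shape(1.0))
print(SAMPLE_RATE / HOP_LENGTH)  # frames per second under the assumed hop length
```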

🔧 Configuration

Key configuration options in configs/train_grad.json:

  • Model architecture: encoder layers, channels, attention heads
  • Training parameters: learning rate, batch size, optimization
  • Data processing: sampling rate, hop length, mel bins
  • Speaker/emotion settings: number of speakers, emotions, embedding dimensions
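
Since the exact key names are not listed here, one way to explore the file is to flatten it and print every setting. This is a generic sketch; the keys in the toy dictionary are hypothetical and are not taken from configs/train_grad.json.

```python
import json


def flatten(config, prefix=""):
    """Turn nested dictionaries into a flat {"section.key": value} mapping."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def load_flat_config(path):
    """Read a JSON config file and flatten it."""
    with open(path, encoding="utf-8") as f:
        return flatten(json.load(f))


# Hypothetical structure used only to demonstrate the helper.
toy = {"data": {"sampling_rate": 22050, "n_mels": 80}, "train": {"batch_size": 16}}
for name, value in sorted(flatten(toy).items()):
    print(f"{name} = {value}")
# For the real file: load_flat_config("configs/train_grad.json")
```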

📁 Project Structure

turkic-tts/
├── model/               # Core model architectures
│   ├── tts.py          # Main Grad-TTS model
│   ├── diffusion.py    # Diffusion decoder
│   ├── text_encoder.py # Text encoding module
│   └── monotonic_align/ # MAS alignment
├── text/               # Text processing
│   ├── cleaners.py     # Text normalization
│   └── symbols.py      # Phoneme symbols
├── configs/            # Configuration files
├── inference_EMA.py    # Inference script
├── train_EMA.py        # Training script
├── models.py           # HiFi-GAN vocoder
├── ipa_convert.py      # IPA phonetic conversion
└── requirements.txt    # Dependencies

🌍 Supported Languages

While optimized for Kazakh, the IPA conversion module includes support for:

  • Kazakh
  • Turkish
  • Kyrgyz
  • Uzbek
  • Azerbaijani
  • Turkmen
  • Tatar
  • Bashkir
  • Sakha (Yakut)
  • Uyghur

🤝 Contributing

Contributions are welcome, including:

  • Bug fixes
  • New features
  • Documentation improvements
  • Training data contributions
  • Support for additional Turkic languages

Please feel free to submit issues and pull requests.

📝 Citation

If you use this code or models in your research, please cite:

@software{turkic_tts_2024,
  author = {Tukenov, Saken},
  title = {Turkic-TTS: High-Quality Text-to-Speech for Turkic Languages},
  year = {2024},
  url = {https://github.com/stukenov/turkic-tts}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This work builds upon several excellent open-source projects:

📧 Contact

For questions, suggestions, or collaborations:


Star ⭐ this repository if you find it useful!
