KTTRCDL/UMETTS

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts

  • Coming soon: Our team is actively working on the latest code updates to provide better performance and functionality. Stay tuned as the new version will be released soon!

Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Junyao Chen, Xiaomao Fan, Xiaojiang Peng, Alexander G. Hauptmann

Domo Apache-2 License

Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) Subsequently, EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
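
To make the contrastive alignment idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss between a batch of emotion-prompt embeddings and a batch of embeddings from another modality. It is not the EP-Align implementation from this repository: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration, and the code uses only the Python standard library so that it runs anywhere.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(prompt_emb, other_emb, temperature=0.07):
    """Average of prompt-to-other and other-to-prompt cross-entropy losses.

    Row i of prompt_emb and row i of other_emb are assumed to be a matched pair.
    The temperature is an assumed value, not one taken from this repository.
    """
    p = [l2_normalize(v) for v in prompt_emb]
    o = [l2_normalize(v) for v in other_emb]
    logits = [[sum(a * b for a, b in zip(pv, ov)) / temperature for ov in o] for pv in p]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy data: matched pairs point in similar directions, shuffled pairs do not.
prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(prompts, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_contrastive_loss(prompts, [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned loss: {aligned:.4f}, shuffled loss: {shuffled:.4f}")
```

The loss is small when each prompt embedding is closest to its own partner and large when the pairs are mismatched, which is the property that pulls matched emotional features together across modalities during training.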

Pre-requisites

  1. NVIDIA GPU
  2. Python >= 3.7
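
A quick way to check both prerequisites before installing anything is sketched below. It is a convenience script, not part of this repository: the helper names are hypothetical, and looking for `nvidia-smi` on the PATH is only a rough signal that an NVIDIA driver is installed; it does not prove that a deep learning framework can use the GPU.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the pre-requisites


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def nvidia_driver_visible():
    """Return True if the nvidia-smi executable can be found on the PATH."""
    return shutil.which("nvidia-smi") is not None


print("Python >= 3.7:", python_ok())
print("nvidia-smi found:", nvidia_driver_visible())
```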

Setup

  1. Clone this repository
    # SSH
    git clone --recursive git@github.com:KTTRCDL/UMETTS.git
    
    # HTTPS
    git clone --recursive https://github.com/KTTRCDL/UMETTS.git
  2. Install the Python requirements. Please refer to requirements.txt for the complete list of dependencies.
    # requirements.txt
    pip install -r requirements.txt
    # CLIP
    pip install EPAlign/CLIP
  3. Download datasets
  4. Preprocess the datasets
  5. Download the Pretrained Models

Train & Finetune

Emotion Prompt Alignment Module (EP-Align)

  1. Train the model

    Follow the Jupyter notebooks EPAlign/script/EPAlign_prompt_audio_finetune.ipynb, EPAlign/script/EPAlign_prompt_vision_finetune.ipynb, and EPAlign/script/EPAlign_prompt_fuse_finetune.ipynb.

  2. Extract the aligned emotional features

    Follow the Jupyter notebook EPAlign/script/extract_emofeature.ipynb.

Emotion Embedding-Induced TTS (EMI-TTS)

Train the model

# Variant VITS
# Build the Cython version of Monotonic Alignment Search
cd EMITTS/VITS/model/monotonic_align
python setup.py build_ext --inplace
cd ../..
# path/to/json is the config file, e.g. config/esd_en_e5.json
python train.py -c path/to/json -m esd_en

# Variant FastSpeech2
cd EMITTS/FastSpeech2
# First update the path settings in EMITTS/FastSpeech2/config/config.py
python -m src.scripts.train

# Variant Tacotron2
cd EMITTS/Tacotron2
python train.py --output_directory=ckpt --log_directory=logs --multi_speaker --multi_emotion --emotion_feature

Inference

Emotion Prompt Alignment Module (EP-Align)

Follow the Jupyter notebook EPAlign/script/EPAlign_inference.ipynb.

Emotion Embedding-Induced TTS (EMI-TTS)

You should train the model to get the checkpoint files and extract the aligned emotional features before inference.

NEW

We have limited resources for maintaining and updating the code. However, we are aware of the feedback regarding the checkpoints used in the inference code, and we are offering published checkpoints for three different variant models of EMI-TTS.

If you would like to obtain these checkpoints, please click HERE and include the message "Request for EMI-TTS published checkpoints" when requesting access to the shared Google Drive.

Note: After downloading the checkpoints to your project, please modify the relevant path/to/checkpoint in the inference code to ensure proper execution.
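
Because a wrong checkpoint path is an easy mistake to make at this step, it can help to validate the path before handing it to the inference code. The sketch below shows one way to do that with `pathlib`. The helper `resolve_checkpoint` and the file name `demo_checkpoint.pth` are hypothetical and are not part of this repository; the demo uses an empty temporary file in place of a real checkpoint.

```python
import tempfile
from pathlib import Path


def resolve_checkpoint(path_to_checkpoint):
    """Return the absolute checkpoint path, or fail early with a clear error."""
    ckpt = Path(path_to_checkpoint).expanduser()
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt}")
    return ckpt.resolve()


# Demo with a throwaway file standing in for a downloaded checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    fake_ckpt = Path(tmp) / "demo_checkpoint.pth"  # hypothetical file name
    fake_ckpt.write_bytes(b"")
    found = resolve_checkpoint(fake_ckpt).name

    try:
        resolve_checkpoint(Path(tmp) / "missing.pth")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print("found:", found, "| missing path detected:", missing_detected)
```

In practice you would call the helper on the value you substitute for path/to/checkpoint, so that a typo surfaces as a clear error before the model is loaded.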

Acknowledgement

This repository is based on CLIP, VITS, Tacotron2, and FastSpeech2, and references Pytorch-DDP and emospeech. We would like to thank the authors of these works for publicly releasing their code.

Citation

If you find this work useful, please consider citing our paper:

@INPROCEEDINGS{10889012,
  author={Li, Xiang and Cheng, Zhi-Qi and He, Jun-Yan and Chen, Junyao and Fan, Xiaomao and Peng, Xiaojiang and Hauptmann, Alexander G.},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Human computer interaction;Visualization;Codes;Accuracy;Contrastive learning;Signal processing;Reproducibility of results;Acoustics;Text to speech;Speech processing;Emotional Text-to-Speech;Multimodal Synthesis;Contrastive Learning;Expressive Speech Synthesis;Human-Computer Interaction},
  doi={10.1109/ICASSP49660.2025.10889012}}
