- Coming soon: Our team is actively working on the latest code updates to provide better performance and functionality. Stay tuned as the new version will be released soon!
Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Junyao Chen, Xiaomao Fan, Xiaojiang Peng, Alexander G. Hauptmann
Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) Subsequently, EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
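To make the contrastive alignment idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is an illustration only, not the EP-Align implementation: the function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are all assumptions chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(prompt_embs, emotion_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (prompt_embs[i], emotion_embs[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature is an
    assumed value for this sketch.
    """
    prompts = [l2_normalize(v) for v in prompt_embs]
    emotions = [l2_normalize(v) for v in emotion_embs]
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(p, e)) / temperature for e in emotions]
        for p in prompts
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch: matched pairs should give a much lower loss than mismatched pairs.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing a loss of this shape pulls each prompt embedding toward its paired emotion embedding and pushes it away from the other items in the batch, which is the general mechanism behind contrastive alignment across modalities.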
- NVIDIA GPU
- Python >= 3.7
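The two prerequisites above can be checked quickly before installing anything. The snippet below is a small sketch using only the standard library; the helper name `check_prerequisites` is hypothetical, and looking for `nvidia-smi` on the `PATH` is only a rough proxy for a working NVIDIA GPU setup.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version listed in the prerequisites


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict describing whether the basic prerequisites look satisfied."""
    return {
        # sys.version_info compares element-wise against a plain tuple.
        "python_ok": sys.version_info >= min_python,
        # Assumption: an installed NVIDIA driver exposes the nvidia-smi tool.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


status = check_prerequisites()
for name, ok in status.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```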
- Clone this repository
# SSH
git clone --recursive git@github.com:KTTRCDL/UMETTS.git
# HTTPS
git clone --recursive https://github.com/KTTRCDL/UMETTS.git
- Install the Python requirements. Please refer to requirements.txt for the complete list of dependencies.
# requirements.txt
pip install -r requirements.txt
# CLIP
pip install EPAlign/CLIP
- Download datasets
- Download and extract the Emotion Speech Dataset (ESD) following the instructions in the official repository Emotional-Speech-Data
- Download and extract the Real-world Expression Database (RAF-DB) following the instructions on the official website Real-world Affective Faces Database
- Download and extract the Multimodal EmotionLines Dataset (MELD)
- Preprocess the datasets
- ESD: follow the jupyter notebook preprocess/ESD.ipynb
- Download the Pretrained Models
- Waveglow for Tacotron2 variant
- Train the model
Follow the Jupyter notebooks EPAlign/script/EPAlign_prompt_audio_finetune.ipynb, EPAlign/script/EPAlign_prompt_vision_finetune.ipynb, and EPAlign/script/EPAlign_prompt_fuse_finetune.ipynb
- Extract the aligned emotional features
Follow the Jupyter notebook EPAlign/script/extract_emofeature.ipynb
Train the model
# Variant VITS
# Cython-version Monotonic Alignment Search
cd EMITTS/VITS/model/monotonic_align
python setup.py build_ext --inplace
cd ../..
# path/to/json, e.g. config/esd_en_e5.json
python train.py -c path/to/json -m esd_en
# Variant FastSpeech2
cd EMITTS/FastSpeech2
# need to change some path config in EMITTS/FastSpeech2/config/config.py file
python -m src.scripts.train
# Variant Tacotron2
cd EMITTS/Tacotron2
python train.py --output_directory=ckpt --log_directory=logs --multi_speaker --multi_emotion --emotion_feature
Follow the Jupyter notebook EPAlign/script/EPAlign_inference.ipynb
You should train the model to get the checkpoint files and extract the aligned emotional features before inference.
- For VITS variant, follow the jupyter notebook EMITTS/VITS/VITS_variant_inference.ipynb
- For FastSpeech2 variant, follow the jupyter notebook EMITTS/FastSpeech2/src/scripts/FastSpeech2_variant_inference.ipynb
- For Tacotron2 variant, follow the jupyter notebook EMITTS/Tacotron2/Tacotron2_variant_inference.ipynb
We have limited resources for maintaining and updating the code. However, we are aware of the feedback regarding the checkpoints used in the inference code, and we are offering published checkpoints for three different variant models of EMI-TTS.
If you would like to obtain these checkpoints, please click HERE and include the message "Request for EMI-TTS published checkpoints" when accessing the shared Google Drive.
Note: After downloading the checkpoints to your project, please modify the relevant path/to/checkpoint in the inference code to ensure proper execution.
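Before opening the inference notebooks, it can help to confirm that every checkpoint path you configured actually points to a file. The sketch below assumes nothing about the real checkpoint filenames: the `path/to/checkpoint` strings are placeholders to replace with your own locations, and the helper name `missing_checkpoints` is hypothetical.

```python
from pathlib import Path


def missing_checkpoints(checkpoints):
    """Return the subset of {variant: path} whose path is not an existing file."""
    return {
        variant: path
        for variant, path in checkpoints.items()
        if not Path(path).is_file()
    }


# Placeholder paths: replace each with the location where you saved the checkpoint.
CHECKPOINTS = {
    "VITS": "path/to/checkpoint",
    "FastSpeech2": "path/to/checkpoint",
    "Tacotron2": "path/to/checkpoint",
}

for variant, path in missing_checkpoints(CHECKPOINTS).items():
    print(f"[{variant}] checkpoint not found at: {path}")
```

If the loop prints nothing, all configured paths resolve to files and the same values can be used for path/to/checkpoint in the inference code.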
This repository is based on CLIP, VITS, Tacotron2, and FastSpeech2, and references Pytorch-DDP and emospeech. We would like to thank the authors of these works for publicly releasing their code.
If you find this work useful, please consider citing our paper:
@INPROCEEDINGS{10889012,
author={Li, Xiang and Cheng, Zhi-Qi and He, Jun-Yan and Chen, Junyao and Fan, Xiaomao and Peng, Xiaojiang and Hauptmann, Alexander G.},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Human computer interaction;Visualization;Codes;Accuracy;Contrastive learning;Signal processing;Reproducibility of results;Acoustics;Text to speech;Speech processing;Emotional Text-to-Speech;Multimodal Synthesis;Contrastive Learning;Expressive Speech Synthesis;Human-Computer Interaction},
doi={10.1109/ICASSP49660.2025.10889012}}