HemoGAT: Heterogeneous multi-modal emotion recognition with cross-modal transformer and graph attention network
Please press the ⭐ button and/or cite our papers if you find this repository helpful.
Abstract • Install • How to run • References • Citation • Contact
Multi-modal speech emotion recognition (SER) is promising, but fusing diverse information streams remains challenging. Sophisticated architectures are required to synergistically combine the modeling of structural relationships across modalities with fine-grained, feature-level interactions. To address this, we introduce HemoGAT, a novel heterogeneous multi-modal SER architecture integrating a cross-modal transformer (CMT) and a graph attention network. HemoGAT employs a dual-stream architecture with two core modules: a heterogeneous multi-modal graph attention network (HM-GAT), which models complex structural and contextual dependencies using a graph of deep embeddings, and a CMT, which enables fine-grained feature fusion through bidirectional cross-attention. This design captures both high-level relationships and immediate inter-modal influences. HemoGAT achieves a 0.29% accuracy improvement over the previous best on the IEMOCAP dataset and obtains highly competitive results on the MELD dataset, demonstrating its effectiveness relative to existing methods. Comprehensive ablation studies evaluate the impact of the Top-K algorithm for heterogeneous graph construction, compare uni-modal and multi-modal fusion strategies, assess the contributions of the HM-GAT and CMT modules, and analyze the effect of GAT layer depth.
Index Terms: Heterogeneous graph construction, Graph attention network, Cross-modal transformer, Feature fusion, Multi-modal speech emotion recognition.
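As a quick orientation before setup, the snippet below gives a minimal sketch of the bidirectional cross-attention idea behind the CMT module, assuming PyTorch. The module name (`CrossModalBlock`), embedding dimension, and head count are illustrative placeholders, not the actual HemoGAT implementation.

```python
# Illustrative sketch only -- not the official HemoGAT CMT implementation.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Bidirectional cross-attention: text attends to audio and vice versa."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)

    def forward(self, text_emb, audio_emb):
        # Text queries attend over audio keys/values, and vice versa.
        t2a, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)
        a2t, _ = self.audio_to_text(audio_emb, text_emb, text_emb)
        # Residual connections keep each stream's original information.
        return self.norm_t(text_emb + t2a), self.norm_a(audio_emb + a2t)

# Example: fuse a batch of 4 utterances with 50 text tokens and 120 audio frames.
text = torch.randn(4, 50, 768)
audio = torch.randn(4, 120, 768)
fused_text, fused_audio = CrossModalBlock()(text, audio)
```

Clone this repository: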
git clone https://github.com/nhut-ngnn/HemoGAT.git
Navigate to the project directory and create a Conda environment:
cd HemoGAT
conda create --name hemogat python=3.8
conda activate hemogat
pip install -r requirements.txt
This section provides step-by-step instructions to extract features, train, and predict using HemoGAT.
The pre-extracted data samples, pretrained models, and configuration files are available at:
Download and extract the resources to your workspace before proceeding.
To extract text and audio embeddings using BERT and wav2vec2, run:
python feature_extract/BERT_wav2vec2.py
By default, this will load your dataset, extract BERT-based text embeddings and wav2vec2-based audio embeddings, and save them into a feature directory.
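The exact pipeline lives in feature_extract/BERT_wav2vec2.py; as a rough illustration, embeddings of this kind are typically obtained with Hugging Face transformers and torchaudio as sketched below. The checkpoint names (bert-base-uncased, facebook/wav2vec2-base-960h) and mean pooling are assumptions, not necessarily what the script uses.

```python
# Illustrative sketch only -- see feature_extract/BERT_wav2vec2.py for the exact pipeline.
import torch
import torchaudio
from transformers import BertTokenizer, BertModel, Wav2Vec2FeatureExtractor, Wav2Vec2Model

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def extract(text, wav_path):
    # Text embedding: mean-pool BERT's last hidden state over tokens.
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    text_emb = bert(**tokens).last_hidden_state.mean(dim=1)            # (1, 768)

    # Audio embedding: resample to 16 kHz mono, then mean-pool wav2vec2 frames.
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)
    inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    audio_emb = wav2vec2(inputs.input_values).last_hidden_state.mean(dim=1)  # (1, 768)
    return text_emb, audio_emb
```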
To train HemoGAT on your extracted features, use:
python main.py --data_dir Path/to/feature/folder --dataset MELD --num_classes 7 --k_text 2 --k_audio 8
Arguments:
--data_dir: Path to the extracted feature folder.
--dataset: Dataset to train on (e.g., MELD, IEMOCAP).
--num_classes: Number of emotion classes.
--k_text: Number of neighbors for the text graph.
--k_audio: Number of neighbors for the audio graph (see the graph-construction sketch below).
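For intuition, here is a minimal sketch of how --k_text and --k_audio could drive a Top-K (kNN) graph construction over node embeddings, assuming PyTorch and cosine similarity. The function name and edge layout are illustrative, not the exact HemoGAT graph builder.

```python
# Illustrative sketch only -- an assumed kNN graph builder, not the exact HemoGAT code.
import torch
import torch.nn.functional as F

def topk_edges(embeddings, k):
    """Connect each node to its k most cosine-similar neighbors."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T
    sim.fill_diagonal_(-float("inf"))            # exclude self-similarity
    neighbors = sim.topk(k, dim=-1).indices      # (N, k)
    src = torch.arange(embeddings.size(0)).repeat_interleave(k)
    dst = neighbors.reshape(-1)
    return torch.stack([src, dst])               # (2, N * k) edge index

# e.g. --k_text 2 and --k_audio 8 applied to a batch of 32 node embeddings
text_edges = topk_edges(torch.randn(32, 768), k=2)
audio_edges = topk_edges(torch.randn(32, 768), k=8)
```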
To predict using the trained HemoGAT model, run:
python predict.py --data_dir feature --dataset MELD --num_classes 7 --k_text 2 --k_audio 8
This will load your model checkpoint and output predicted emotion labels along with evaluation metrics.
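For reference, a minimal sketch of how predicted labels might be scored, assuming scikit-learn; the exact metrics printed by predict.py may differ.

```python
# Illustrative sketch only -- assumed evaluation metrics, not necessarily those printed by predict.py.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 3]   # ground-truth emotion labels
y_pred = [0, 1, 2, 1, 3]   # labels returned by the model

print("Accuracy   :", accuracy_score(y_true, y_pred))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```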
[1] Nhat Truong Pham, "SERVER: Multi-modal Speech Emotion Recognition using Transformer-based and Vision-based Embeddings," ICIIT, 2023. Available: https://github.com/nhattruongpham/mmser.git
[2] Mustaqeem Khan, "MemoCMT: Cross-Modal Transformer-Based Multimodal Emotion Recognition System," Scientific Reports, 2025. Available: https://github.com/tpnam0901/MemoCMT
[3] Nhut Minh Nguyen, "FleSER: Multi-modal Emotion Recognition via Dynamic Fuzzy Membership and Attention Fusion," 2025. Available: https://github.com/aita-lab/FleSER
If you use this code or part of it, please cite the following papers:
Citation information will be updated soon.
For any information, please contact the main author:
Nhut Minh Nguyen at FPT University, Vietnam
Email: minhnhut.ngnn@gmail.com
ORCID: https://orcid.org/0009-0003-1281-5346
GitHub: https://github.com/nhut-ngnn/