
CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms

Official code repository for the manuscript "CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms", accepted at the 25th Asia-Pacific Network Operations and Management Symposium (APNOMS 2025).

Please press the ⭐ button and/or cite the paper if you find this repository helpful.

Built with Python, PyTorch, and CUDA.

Abstract

Multimodal Speech Emotion Recognition (SER) offers significant advantages over unimodal approaches by integrating diverse information streams such as audio and text. However, effectively fusing these heterogeneous modalities remains a significant challenge. We propose CemoBAM, a novel dual-stream architecture that synergistically combines a Cross-modal Heterogeneous Graph Attention Network (CH-GAT) and a Cross-modal Convolutional Block Attention Mechanism (xCBAM). In the CemoBAM architecture, the CH-GAT constructs a heterogeneous graph that models intra- and inter-modal relationships, employing multi-head attention to capture fine-grained dependencies across audio and text feature embeddings. The xCBAM enhances feature refinement through a cross-modal transformer with a modified 1D-CBAM, employing bidirectional cross-attention and channel-spatial attention to emphasize emotionally salient features. The CemoBAM architecture surpasses previous state-of-the-art methods by 0.32% on the IEMOCAP dataset and by 3.25% on the ESD dataset. Comprehensive ablation studies validate the impact of Top-K graph construction parameters, fusion strategies, and the complementary contributions of both modules. The results highlight CemoBAM's robustness and potential for advancing multimodal SER applications.

Index Terms: Multimodal emotion recognition, Speech emotion recognition, Cross-modal heterogeneous graph attention, Cross-modal convolutional block attention mechanism, Feature fusion.
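For a concrete picture of the channel-spatial attention used inside xCBAM, below is a minimal PyTorch sketch of a 1D CBAM-style block. It is an illustrative reimplementation under standard CBAM assumptions (shared-MLP channel attention followed by convolutional spatial attention), not the exact module from this repository; all class and parameter names are ours.

# Minimal 1D CBAM-style block: channel attention, then spatial attention.
# Illustrative sketch only; names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class ChannelAttention1D(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Shared MLP applied to average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, L)
        scale = torch.sigmoid(self.mlp(x.mean(dim=2)) + self.mlp(x.amax(dim=2)))
        return x * scale.unsqueeze(2)

class SpatialAttention1D(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, L)
        # Channel-wise average and max maps weight each time step.
        stats = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))

class CBAM1D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention1D(channels)
        self.spatial = SpatialAttention1D()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(x))

if __name__ == "__main__":
    fused = torch.randn(4, 256, 50)      # e.g. fused audio-text features
    print(CBAM1D(256)(fused).shape)      # torch.Size([4, 256, 50])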

Install

Clone this repository

git clone https://github.com/nhut-ngnn/CemoBAM.git

Create a Conda Environment and Install Requirements

Navigate to the project directory and create a Conda environment:

cd CemoBAM
conda create --name cemobam python=3.8
conda activate cemobam

Install Dependencies

pip install -r requirements.txt

Dataset

CemoBAM is evaluated on two widely used multimodal emotion recognition datasets:

🔹 IEMOCAP (Interactive Emotional Dyadic Motion Capture)

  • Modality: Audio + Text
  • Classes: angry, happy, sad, neutral (4 classes)
  • Sessions: 5
  • Official Website: https://sail.usc.edu/iemocap/
  • Note: We use Wav2Vec2.0 for audio and BERT for text feature extraction (see the extraction sketch at the end of this section).

🔹 ESD (Emotional Speech Dataset)

  • Modality: Audio + Text
  • Note: We use Wav2Vec2.0 for audio and BERT for text feature extraction, as reflected in the provided feature files.
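For reference, the sketch below shows one way to extract utterance-level BERT and Wav2Vec2.0 embeddings with Hugging Face transformers. The specific checkpoints (bert-base-uncased, facebook/wav2vec2-base-960h) and mean pooling are assumptions for illustration and may differ from the exact pipeline used to build the provided features.

# Illustrative feature extraction; checkpoints and pooling are assumptions.
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer, Wav2Vec2Model, Wav2Vec2Processor

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

@torch.no_grad()
def extract_text_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = bert(**inputs).last_hidden_state            # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)                 # mean-pool to (768,)

@torch.no_grad()
def extract_audio_embedding(wav_path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(wav_path)
    mono = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = processor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    hidden = wav2vec(inputs.input_values).last_hidden_state  # (1, T', 768)
    return hidden.mean(dim=1).squeeze(0)                     # mean-pool to (768,)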

Usage

Preprocessed Features

We provide .pkl files with BERT and Wav2Vec2.0 embeddings for each dataset:

feature/
├── IEMOCAP_BERT_WAV2VEC_train.pkl
├── IEMOCAP_BERT_WAV2VEC_val.pkl
├── IEMOCAP_BERT_WAV2VEC_test.pkl
├── ESD_BERT_WAV2VEC_train.pkl
├── ESD_BERT_WAV2VEC_val.pkl
├── ESD_BERT_WAV2VEC_test.pkl
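The exact layout of the pickled objects depends on the preprocessing script, so it is worth inspecting one file before wiring it into a data loader. A minimal sketch that assumes nothing about the internal structure:

# Inspect a provided feature file; adapt once you see its actual structure.
import pickle

with open("feature/IEMOCAP_BERT_WAV2VEC_train.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))   # e.g. audio embeddings, text embeddings, labels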

Train & Evaluate

Run a grid search over different k_text and k_audio values for graph construction:

bash selected_topK.sh
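For intuition about what k_text and k_audio control, the sketch below builds Top-K edges from cosine similarity between node embeddings. The details (cosine similarity, COO edge indices, separate intra- and inter-modal edge sets) are illustrative assumptions, not necessarily the repository's exact graph builder.

# Illustrative Top-K edge construction for a heterogeneous audio-text graph.
import torch
import torch.nn.functional as F

def topk_edges(src: torch.Tensor, dst: torch.Tensor, k: int) -> torch.Tensor:
    # Connect each node in `src` to its k most cosine-similar nodes in `dst`;
    # returns a (2, num_src * k) edge index. Intra-modal calls may include
    # self-loops; drop them if undesired.
    sim = F.normalize(src, dim=1) @ F.normalize(dst, dim=1).T   # (N_src, N_dst)
    nbrs = sim.topk(k, dim=1).indices                           # (N_src, k)
    rows = torch.arange(src.size(0)).repeat_interleave(k)
    return torch.stack([rows, nbrs.reshape(-1)])

audio = torch.randn(32, 768)   # audio node embeddings (e.g. Wav2Vec2.0)
text = torch.randn(32, 768)    # text node embeddings (e.g. BERT)

intra_audio = topk_edges(audio, audio, k=5)   # audio-audio edges (k_audio)
intra_text = topk_edges(text, text, k=5)      # text-text edges (k_text)
inter = topk_edges(audio, text, k=5)          # audio-text (cross-modal) edges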

Single Run Example

To run CemoBAM with a specific configuration:

bash run_training.sh

Evaluation

Evaluate saved models using:

bash run_eval.sh
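run_eval.sh is the supported entry point. If you only need to score predictions exported from it, a minimal scikit-learn snippet for accuracy and macro F1 (metrics commonly reported for SER) looks like the following; the label arrays are placeholders.

# Score predictions; y_true / y_pred are placeholders for real outputs.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 3, 1]      # ground-truth emotion labels
y_pred = [0, 1, 2, 3, 2]      # model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))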

References

[1] Nhat Truong Pham, "SERVER: Multi-modal Speech Emotion Recognition using Transformer-based and Vision-based Embeddings," ICIIT, 2023. Available at: https://github.com/nhattruongpham/mmser.git

[2] Mustaqeem Khan, "MemoCMT: Cross-Modal Transformer-Based Multimodal Emotion Recognition System," Scientific Reports, 2025. Available at: https://github.com/tpnam0901/MemoCMT

[3] Nhut Minh Nguyen, "HemoGAT: Heterogeneous Multi-modal Emotion Recognition with Cross-modal Transformer and Graph Attention Network," 2025. Available at: https://github.com/nhut-ngnn/HemoGAT

Citation

If you use this code or part of it, please cite the following paper:

Nhut Minh Nguyen, Thu Thuy Le, Thanh Trung Nguyen, Duc Tai Phan, Khoa Anh Tran and Duc Ngoc Minh Dang, “CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms”, The 25th Asia-Pacific Network Operations and Management Symposium Conference (APNOMS 2025), Kaohsiung City, Taiwan, Sep 22–24, 2025.

Contact

For any inquiries, please contact the main author:

Nhut Minh Nguyen at FPT University, Vietnam

Email: minhnhut.ngnn@gmail.com
ORCID: https://orcid.org/0009-0003-1281-5346
GitHub: https://github.com/nhut-ngnn/
