CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms
Please press the ⭐ button and/or cite our papers if you find them helpful.
Multimodal Speech Emotion Recognition (SER) offers significant advantages over unimodal approaches by integrating diverse information streams such as audio and text. However, effectively fusing these heterogeneous modalities remains challenging. We propose CemoBAM, a novel dual-stream architecture that synergistically combines a Cross-modal Heterogeneous Graph Attention Network (CH-GAT) and a Cross-modal Convolutional Block Attention Mechanism (xCBAM). In the CemoBAM architecture, the CH-GAT constructs a heterogeneous graph that models intra- and inter-modal relationships, employing multi-head attention to capture fine-grained dependencies across audio and text feature embeddings. The xCBAM refines features through a cross-modal transformer with a modified 1D-CBAM, employing bidirectional cross-attention and channel-spatial attention to emphasize emotionally salient features. CemoBAM surpasses previous state-of-the-art methods by 0.32% on the IEMOCAP dataset and 3.25% on the ESD dataset. Comprehensive ablation studies validate the impact of Top-K graph construction parameters, fusion strategies, and the complementary contributions of both modules. The results highlight CemoBAM's robustness and potential for advancing multimodal SER applications.
Index Terms: Multimodal emotion recognition, Speech emotion recognition, Cross-modal heterogeneous graph attention, Cross-modal convolutional block attention mechanism, Feature fusion.
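Since the abstract above is the only architectural description in this README, the snippet below gives a rough, self-contained illustration of the 1D channel-and-temporal attention idea behind xCBAM. It is a minimal PyTorch sketch under assumed tensor shapes and layer sizes, not the repository's actual implementation.

```python
# Illustrative sketch only: a 1D channel + temporal attention block in the spirit of
# CBAM, applied to sequence features of shape (batch, channels, time).
# Layer sizes and module names are assumptions, not the authors' exact code.
import torch
import torch.nn as nn


class ChannelAttention1D(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                        # x: (B, C, T)
        avg = self.mlp(x.mean(dim=2))            # pooled over time: (B, C)
        mx = self.mlp(x.amax(dim=2))             # max-pooled over time: (B, C)
        scale = torch.sigmoid(avg + mx).unsqueeze(2)
        return x * scale


class TemporalAttention1D(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                        # x: (B, C, T)
        avg = x.mean(dim=1, keepdim=True)        # (B, 1, T)
        mx = x.amax(dim=1, keepdim=True)         # (B, 1, T)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class CBAM1D(nn.Module):
    """Channel attention followed by temporal attention, as in the original CBAM."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention1D(channels)
        self.temporal = TemporalAttention1D()

    def forward(self, x):
        return self.temporal(self.channel(x))


# e.g. refined = CBAM1D(channels=768)(torch.randn(4, 768, 50))
```

Following CBAM, channel attention is applied before temporal attention, and both rescale the input features rather than replace them.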
Clone the repository:
git clone https://github.com/nhut-ngnn/CemoBAM.git
Navigate to the project directory and create a Conda environment:
cd CemoBAM
conda create --name cemobam python=3.8
conda activate cemobam
pip install -r requirements.txt
CemoBAM is evaluated on two widely-used multimodal emotion recognition datasets:
IEMOCAP
- Modality: Audio + Text
- Classes: angry, happy, sad, neutral (4 classes)
- Sessions: 5
- Official Website: https://sail.usc.edu/iemocap/
- Note: We use Wav2Vec2.0 for audio and BERT for text feature extraction (see the sketch after the dataset descriptions).
ESD
- Modality: Audio + Text
- Languages: English and Mandarin
- Classes: neutral, angry, happy, sad, surprise (5 classes)
- Official GitHub: https://github.com/HLTSingapore/ESD
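As a rough illustration of the feature-extraction note above, the sketch below pulls utterance-level embeddings with Hugging Face Transformers. The checkpoints (`bert-base-uncased`, `facebook/wav2vec2-base-960h`) and the mean pooling are assumptions for illustration; the released `.pkl` files may have been produced with different settings.

```python
# Illustrative sketch only: extracting BERT and Wav2Vec2.0 utterance embeddings.
# Checkpoints and pooling are assumptions, not necessarily the repo's pipeline.
import torch
import torchaudio
from transformers import BertTokenizer, BertModel, Wav2Vec2FeatureExtractor, Wav2Vec2Model

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
w2v_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()


@torch.no_grad()
def extract(text: str, wav_path: str):
    # Text: mean-pool BERT's last hidden states over tokens.
    tokens = bert_tok(text, return_tensors="pt", truncation=True)
    text_emb = bert(**tokens).last_hidden_state.mean(dim=1)          # (1, 768)

    # Audio: mix to mono, resample to 16 kHz, mean-pool Wav2Vec2.0 hidden states.
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16_000)
    inputs = w2v_fe(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    audio_emb = w2v(**inputs).last_hidden_state.mean(dim=1)          # (1, 768)
    return text_emb, audio_emb
```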
We provide `.pkl` files with BERT and Wav2Vec2.0 embeddings for each dataset:
feature/
├── IEMOCAP_BERT_WAV2VEC_train.pkl
├── IEMOCAP_BERT_WAV2VEC_val.pkl
├── IEMOCAP_BERT_WAV2VEC_test.pkl
├── ESD_BERT_WAV2VEC_train.pkl
├── ESD_BERT_WAV2VEC_val.pkl
├── ESD_BERT_WAV2VEC_test.pkl
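A minimal sketch for loading and inspecting one of these files is shown below; the internal layout of the pickled objects is not documented here, so the key inspection is intentionally generic.

```python
# Illustrative sketch only: open a provided feature file and inspect its structure.
# The exact key/field layout inside the .pkl is an assumption; adjust after inspection.
import pickle

with open("feature/IEMOCAP_BERT_WAV2VEC_train.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
# If the object is dict-like, its keys reveal which fields (embeddings, labels, ...) are stored.
if isinstance(data, dict):
    print(list(data.keys()))
```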
Run a grid search over different `k_text` and `k_audio` values for graph construction:
bash selected_topK.sh
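For intuition on what `k_text` and `k_audio` control, the sketch below builds Top-K edges from pairwise cosine similarity, which is the generic idea behind similarity-based graph construction. Function and variable names are illustrative assumptions, not the repository's actual code.

```python
# Illustrative sketch only: connect each node to its k most similar nodes.
# Names are hypothetical; the repo's graph construction may differ in detail.
import torch
import torch.nn.functional as F


def topk_edges(features: torch.Tensor, k: int) -> torch.Tensor:
    """Return a (2, N*k) edge index linking each node to its k nearest neighbors."""
    normed = F.normalize(features, dim=1)
    sim = normed @ normed.T                      # (N, N) cosine similarities
    sim.fill_diagonal_(-float("inf"))            # exclude self-loops
    neighbors = sim.topk(k, dim=1).indices       # (N, k)
    src = torch.arange(features.size(0)).repeat_interleave(k)
    return torch.stack([src, neighbors.reshape(-1)])


# e.g. text_edges = topk_edges(text_embeddings, k=5)
```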
To run CemoBAM with a specific configuration:
bash run_training.sh
Evaluate saved models using:
bash run_eval.sh
[1] Nhat Truong Pham, "SERVER: Multi-modal Speech Emotion Recognition using Transformer-based and Vision-based Embeddings," ICIIT, 2023. Available: https://github.com/nhattruongpham/mmser.git
[2] Mustaqeem Khan, "MemoCMT: Cross-Modal Transformer-Based Multimodal Emotion Recognition System," Scientific Reports, 2025. Available: https://github.com/tpnam0901/MemoCMT
[3] Nhut Minh Nguyen, "HemoGAT: Heterogeneous Multi-modal Emotion Recognition with Cross-modal Transformer and Graph Attention Network," 2025. Available: https://github.com/nhut-ngnn/HemoGAT
If you use this code or any part of it, please cite the following paper:
Nhut Minh Nguyen, Thu Thuy Le, Thanh Trung Nguyen, Duc Tai Phan, Khoa Anh Tran and Duc Ngoc Minh Dang, “CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms”, The 25th Asia-Pacific Network Operations and Management Symposium Conference (APNOMS 2025), Kaohsiung City, Taiwan, Sep 22–24, 2025.
For any inquiries, please contact the main author:
Nhut Minh Nguyen at FPT University, Vietnam
Email: minhnhut.ngnn@gmail.com
ORCID: https://orcid.org/0009-0003-1281-5346
GitHub: https://github.com/nhut-ngnn/