🌟 Official repository for the paper "Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models"
[📖 Paper] [🤗 Dataset] [🤗 Model]
Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability.
To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality, large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR). Building upon them, we develop a unified and interpretable FER foundation model, UniFER-7B, which outperforms many open-source and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
Our curated datasets consist of four widely used FER datasets: RAF-DB, FERPlus, AffectNet, and SFEW 2.0. Please download the corresponding images from their official websites before use.
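As an illustration of how a conventional FER sample (an image plus a class label) might be recast in a VQA format, consider the sketch below. The field names (`image`, `question`, `answer`), the prompt wording, and the helper `to_vqa_record` are hypothetical; the actual schema of the released JSON files may differ. The label list follows the seven basic expressions used in RAF-DB.

```python
import json

# Illustrative label set (the seven basic expressions of RAF-DB).
EMOTIONS = ["surprise", "fear", "disgust", "happiness", "sadness", "anger", "neutral"]


def to_vqa_record(image_path: str, label: str, emotions: list[str] = EMOTIONS) -> dict:
    """Build one question-answer record from an image path and its class label."""
    if label not in emotions:
        raise ValueError(f"Unknown label: {label!r}")
    options = ", ".join(emotions)
    question = (
        "What is the facial expression of the person in the image? "
        f"Choose one of: {options}."
    )
    # Hypothetical schema: the released files may use different keys.
    return {"image": image_path, "question": question, "answer": label}


# Example: the image path is relative to wherever the downloaded images live.
record = to_vqa_record("images/example_0001.jpg", "happiness")
print(json.dumps(record, indent=2))
```

Framing each sample as a question with an explicit option list lets a generalist MLLM be evaluated without any task-specific classification head.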
Clone the repository:
git clone https://github.com/zfkarl/UniFER.git
cd UniFER
Create a conda environment:
conda create -n r1-v python=3.11
conda activate r1-v
Please follow the official instructions here to install both PyTorch and additional dependencies.
The four subsets of FERBench are stored in the following JSON files:
eval_rafdb/data/rafdb_qa.json
eval_ferplus/data/ferplus_qa.json
eval_affectnet/data/affectnet_qa.json
eval_sfew_2.0/data/sfew_2.0_qa.json
Download our dataset, and put the JSON file UniFER_CoT_230K.json in:
data/UniFER_CoT_230K.json
Download our dataset, and put the JSON file UniFER_RLVR_360K.json in:
data/UniFER_RLVR_360K.json
Stage 1, cold-start SFT (run from the repository root):
cd train_unifer/src/scripts
bash run_sft_fer.sh
Stage 2, RLVR with GRPO (run from the repository root):
cd train_unifer/src/scripts
bash run_grpo_vllm.sh
After the two-stage post-training above, the resulting model, UniFER-7B, can be used for inference and evaluation. You may rename the output directory Qwen2.5-VL-7B-FER-GRPO-VLLM-8GPU to UniFER-7B for inference, or directly download our provided checkpoints.
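Before launching training or inference, it may help to confirm that the JSON files listed above are in place. The following is a minimal sketch: the helper `find_missing_files` is hypothetical and not part of the repository, and the paths are copied from this README under the assumption that they are relative to the repository root.

```python
from pathlib import Path

# Paths as listed in this README, assumed relative to the repository root.
EXPECTED_FILES = [
    "eval_rafdb/data/rafdb_qa.json",
    "eval_ferplus/data/ferplus_qa.json",
    "eval_affectnet/data/affectnet_qa.json",
    "eval_sfew_2.0/data/sfew_2.0_qa.json",
    "data/UniFER_CoT_230K.json",
    "data/UniFER_RLVR_360K.json",
]


def find_missing_files(repo_root: str, expected: list[str] = EXPECTED_FILES) -> list[str]:
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected JSON files were found.")
```

Running the script from the repository root prints any files that still need to be downloaded or moved into place.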
On RAF-DB:
cd eval_rafdb/code
python infer_unifer.py
python eval_unifer.py
On FERPlus:
cd eval_ferplus/code
python infer_unifer.py
python eval_unifer.py
On AffectNet:
cd eval_affectnet/code
python infer_unifer.py
python eval_unifer.py
On SFEW 2.0:
cd eval_sfew_2.0/code
python infer_unifer.py
python eval_unifer.py
Overall Performance:
cd eval_total/code
python eval_unifer.py
We would like to thank R1-V and video-r1, which served as the foundations for our repository.
If you find UniFER useful for your research and applications, please kindly cite using this BibTeX:
@misc{zhang2025rethinkingfacialexpressionrecognition,
title={Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond},
author={Fan Zhang and Haoxuan Li and Shengju Qian and Xin Wang and Zheng Lian and Hao Wu and Zhihong Zhu and Yuan Gao and Qiankun Li and Yefeng Zheng and Zhouchen Lin and Pheng-Ann Heng},
year={2025},
eprint={2511.00389},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.00389},
}
🔥 Please contact fzhang@link.cuhk.edu.hk if you would like to contribute to the leaderboard or have any questions.