
MGRR-Net: Multi-level Graph Relational Reasoning Network for Facial Action Unit Detection

Published: 29 March 2024

Abstract

The Facial Action Coding System (FACS) encodes the action units (AUs) in facial images, which has attracted extensive research attention due to its wide use in facial expression analysis. Many methods that perform well on automatic facial action unit (AU) detection primarily focus on modeling various AU relations between corresponding local muscle areas or mining global attention-aware facial features; however, they neglect the dynamic interactions among local-global features. We argue that encoding AU features from just one perspective may not capture the rich contextual information between regional and global face features, nor the detailed variability across AUs, because of the diversity in expression and individual characteristics. In this article, we propose a novel Multi-level Graph Relational Reasoning Network (termed MGRR-Net) for facial AU detection. Each layer of MGRR-Net performs multi-level (i.e., region-level, pixel-wise, and channel-wise) feature learning. On the one hand, region-level feature learning from local face patch features via a graph neural network can encode the correlation across different AUs. On the other hand, pixel-wise and channel-wise feature learning via graph attention networks (GATs) enhances the discrimination ability of AU features by adaptively recalibrating the feature responses of pixels and channels from global face features. A hierarchical fusion strategy combines features from the three levels with gated fusion cells to improve AU discriminative ability. Extensive experiments on the DISFA and BP4D AU datasets show that the proposed approach achieves superior performance to the state-of-the-art methods.

1 Introduction

Facial action units (AUs) are defined as a set of facial muscle movements that correspond to a displayed expression according to the Facial Action Coding System (FACS) [8]. As a fundamental research problem, AU detection is beneficial to facial expression analysis [26, 65, 67] and has wide potential applications in diagnosing mental health issues [40, 48], improving e-learning experiences [37], detecting deception [22], etc. However, AU detection is challenging because of the difficulty in identifying the subtle facial changes caused by AUs and individual physiology. Some earlier studies [25, 56] design hand-crafted features to represent different local facial regions related to AUs, according to the corresponding movements of facial muscles. However, hand-crafted shallow features are not discriminative enough to represent the rich facial morphology. Hence, deep learning–based AU detection methods that rely on global and local facial features have been studied to enhance the feature representation of each AU.
Several recent works [29, 36, 41, 45] aim to enhance the corresponding AU feature representation by combining the affected features in a deep global face feature map. For instance, LP-Net [36] uses an LSTM model [13] to combine patch features from an equal-partition grid over a global Convolutional Neural Network (CNN) feature map. ARL [45] directly learns spatial attention from the global CNN features of independent AU branches, as shown in Figure 1(a). Reference [32] represents each AU feature directly from a shared full-face feature via multiple independent fully connected layers and models the relationships among all AUs in a graph. However, these methods struggle to accurately localize the muscle areas corresponding to AUs, which introduces potential interference from irrelevant regions. Such issues have been addressed by extracting AU-related features from regions of interest (ROIs) centered around the associated facial landmarks [43, 44, 71], which provide more precise muscle locations for AUs and lead to better AU detection performance. For example, JAA [43] and J\(\rm \hat{A}\)ANet [44] propose attention-based deep models that adaptively select the highly contributing neighboring pixels of an initially predefined muscle region for joint AU detection and face alignment, as shown in Figure 1(b). However, these local attention-based methods emphasize learning the appearance representation of each facial region based on detected landmarks while ignoring some intrinsic dependencies between different facial muscles. For example, AU2 (“Outer Brow Raiser”) and AU7 (“Lid Tightener”) are usually activated simultaneously in a scared face, and AU6 (“Cheek Raiser”) and AU12 (“Lip Corner Puller”) usually co-occur in a smiling face. To this end, some methods [6, 30, 35, 69] try to utilize prior knowledge of AU correlation by defining a fixed graph that represents the statistical AU correlations. For instance, Reference [30] constructs a predefined graph for each face based on the AU co-occurrences to explicitly model the relationships between AU regions and enhance their semantic representations. However, it is difficult to effectively capture the dynamic relationships between AUs and the distinctions of related AUs with a single predefined graph, due to the complexity of AU activation and the diversity across different subjects. Recent works [49, 50, 51] attempt to exploit an adaptive graph to model the uncertain relationships between AUs. For instance, Reference [50] emphasizes important local facial regions based on a probabilistic graph and obtains better facial appearance features via Long Short-Term Memory (LSTM) [11]. However, these approaches still enhance the semantic AU representations only from the perspective of better regional feature representation, neglecting the modeling of the distinctive local and global features of each AU.
Fig. 1.
Fig. 1. Comparisons between the proposed method and two state-of-the-art methods in AU feature learning and the corresponding visualized activation maps for AU10 (Upper Lip Raiser/Levator labii superioris). (a) ARL [45] performs global feature learning, (b) J\(\rm \hat{A}\)ANet [44] learns from predefined local regions based on the landmarks, and (c) multi-level feature learning from both local regions and global face regions (best viewed in color).
The key issue of facial AU detection lies in obtaining a better facial appearance representation by improving the discriminative ability of local AU features and of global features from the whole face. On the one hand, region-level dynamic AU relevance mining based on facial landmarks accurately locates the corresponding muscles and flexibly models the relevance among muscle regions. This differs from existing methods that focus on extracting features for a single AU region [43, 44] or rely on a predefined fixed graph representing prior knowledge [19]. Although there have been many methods [30, 49, 50, 51] on modeling relationships between AU regions, this issue still needs to be addressed effectively. On the other hand, due to the differences in expressions, postures, and individuals, fully learning the responses of the target AU in the global face can better capture the contextual differences between different AUs and complement more semantic details from the global face. For instance, References [43, 44] simply concatenate the global features extracted from the whole face via CNNs with all local AU features as input to the final classifier. However, it is difficult for all these methods to learn the sensitivity of the target AU within the global face and to supplement enough semantic details from the global face representation across different expressions, postures, and individuals. To the best of our knowledge, how to better capture the global response to each AU remains unexplored in existing works [19, 30, 32, 44].
Motivated by the above insights, we propose a novel technique for facial AU detection called MGRR-Net. Our main innovations lie in three aspects, as shown in Figure 1(c). First, we introduce a dynamic graph to model and reason about the relationship between a target AU and other AUs; the region-level AU features (as nodes) can accurately locate the corresponding muscles. Second, we supplement each AU with different levels (channel- and pixel-level) of attention-aware details from global features, which greatly improves the distinction between AUs. Finally, we iteratively refine the AU features with the proposed multi-level local-global relational reasoning layer, which makes them more robust and more interpretable. Different from existing GNN-based approaches [19, 30, 32, 35, 49, 51] that utilize complex GCNs [18] to enhance the distinguishability of AUs by constructing AU relationships, we supplement each AU with attention-aware details from global features at different levels (channel and pixel), which achieves the same purpose with a basic GNN and alleviates the over-smoothing issue to some extent. In particular, we extract the global features by multi-layer CNNs and precise AU region features based on the detected facial landmarks, which serve as the inputs of each multi-level relational reasoning layer. A simple region-level AU graph is constructed in which AU regions serve as nodes and an adjacency matrix, initialized by prior knowledge and iteratively updated, serves as the edges representing the possibility of AU co-occurrence (co-activated or non-activated). We further learn channel- and pixel-wise semantic relations for different AUs simultaneously by processing them in two separate, efficient, and effective multi-head graph attention networks (MH-GATs) [58]; through this, we model the complementary channel- and pixel-level global details. After these local and global relation-oriented modules, a hierarchical gated fusion strategy helps to select the more useful information for the final AU representation of each individual.
The contributions of this work are as follows:
We propose a novel end-to-end iterative reasoning and training scheme for facial AU detection, which leverages the complementary multi-level local-global feature relationships to improve the robustness and discrimination for AU detection;
We construct a region-level AU graph with the prior knowledge initialization and dynamically reason the correlated relationship of individual AUs, thereby improving the robustness of AU detection;
We propose a GAT-based model to improve the discrimination of each local AU patch by supplementing multiple levels of global features;
The proposed MGRR-Net outperforms the state-of-the-art approaches for AU detection on two widely used benchmarks, i.e., BP4D and DISFA, without any external data or pre-trained models.

2 Related Work

2.1 Facial AU Detection

Automatic AU detection has been studied for decades, and many methods [23, 33, 36, 44, 45, 66, 71] have been proposed. Some works [23, 29, 36, 41, 45] predict the activation state of each AU by directly extracting global face features via CNNs. For instance, References [23, 45] proposed sequential or parallel channel and spatial attention learning mechanisms to explore the attention-aware global representation of each face. Although progress has been achieved with such global representations, performance remains constrained by the coarse granularity of the features. Most existing approaches for facial AU detection learn features from local patches [20, 33, 36, 43, 44, 71]. However, some early works [53, 70] require the patch locations to be predefined. For instance, Reference [15] proposed to use domain knowledge and facial geometry to pre-select a relevant image region (as a patch) for a particular AU and feed it to a convolutional and bi-directional Long Short-Term Memory (LSTM) [11] neural network. Reference [43] proposed an end-to-end deep learning framework for joint AU detection and face alignment, which uses the detected landmarks to locate specific AU regions. However, all the above methods focus only on independent regions without considering the correlations among different AU areas that could reinforce and diversify each other. Recent works focus on capturing the relations among AUs for local feature enhancement, which can improve robustness compared to single-patch features or global face features. Reference [19] incorporated the AU knowledge graph as extra guidance for enhancing facial region representation. Reference [30] applied a graph convolutional network (GCN) from the spectral perspective for AU relation modeling, which also needed an additional AU correlation reference extracted from EAC-Net [21]. However, these methods need prior knowledge of co-occurrence probability in different datasets to construct a fixed relation matrix instead of dynamically updating it for different expressions and individuals. Reference [10] proposed a complex skip-BiLSTM to mine the potential mutual assistance and exclusion relationships between AU branches, with simple complementary global information. Reference [51] proposed a performance-driven Monte Carlo Markov Chain to generate graphs from the global face, which, however, also captures some irrelevant regions that hurt performance. Moreover, these approaches usually ignore or simply fuse the local and global information for each AU without considering the relative importance of features. Recently, Reference [32] learned a unique AU graph to explicitly describe the relationships between AUs, where each AU is simply represented from the same full-face representation via a fully connected layer and global average pooling. Although this method exploits the global face features to some extent, it relies on a strong global feature extraction backbone and lacks accurate localization of local muscle areas and discriminative feature representation via local-global interaction.

2.2 Graph Neural Network

Integrating graphs with deep neural networks has recently become an emerging topic in deep learning research. GCNs have been widely used in many applications such as human action recognition [62], emotion recognition [52], social relationship understanding [60], and object parsing [27]. Reference [19] proposed to apply a gated graph neural network (GGNN) with the guidance of an AU knowledge graph for facial AU detection. Reference [35] embedded the relations among AUs through a predefined GCN to enhance the local semantic representation. However, these AU detection methods require a fixed graph predefined from the statistics of each dataset when applying GGNN or GCN. References [49, 51] applied an adaptive graph to model the relationships between AUs based on global features, ignoring local-global interactions. Recently, the multi-head graph attention network (MH-GAT) [58] leveraged masked self-attention layers to operate on graph-structured data with high computational efficiency.
As far as we know, no prior work attempts to obtain better feature representations through multiple interactions between local AU regions and the global face, which we believe is an important cue to boost facial AU detection performance with more fine-grained information and higher diversity of expressions. To this end, our proposed MGRR-Net automatically models the relevance among the facial AU regions by a dynamic matrix as a graph and supplements each AU patch with multiple levels of global features to improve the variability. Multiple layers of iterative refinement significantly improve the AU discrimination ability. Our MGRR-Net has wide potential applications in diagnosing mental health issues [40, 48], improving e-learning experiences [37], detecting deception [22], etc. For example, in our future work, we will apply MGRR-Net to automatically estimate facial palsy severity for patients, following Reference [9]. This will be helpful for the diagnosis and treatment of people who have facial palsy across the world.

3 Approach

As shown in Figure 2, the proposed approach consists of two core modules in each relational reasoning layer, i.e., region-level local feature learning with relational modeling, and global feature learning with channel- and pixel-level attention. A hierarchical gated fusion network is designed to combine multi-level local and global features as the new target AU feature. Finally, after multiple layers of iterative refinement and updating, the AU features are fed into a multi-branch classification network for AU detection. For clarity, the main notations and their definitions throughout the article are shown in Table 1.
Table 1.
Notation | Definition
I | a facial image
S | a set of detected landmarks
\(\rm {O\_G}\) | the original global feature
m | the number of detected landmarks
V | a set of calculated patch features
\(v_i\) | the feature of the ith patch
n | the number of calculated patches corresponding to AUs
\(\rm {D}\_\rm {G}\) | a fully connected graph for AU relationship construction
A | a learnable adjacency matrix
\(a_i\) | the activation status of the ith AU
\(P_{ij}\) | the coefficient between the ith and jth AU
\(P, C\) | a set of pixel- and channel-level features
\(\rm {P}\_\rm {G}\) | the pixel-level attention-aware global feature
\(\rm {C}\_\rm {G}\) | the channel-level attention-aware global feature
L | the number of parallel attention layers
K | the number of relational reasoning layers
\(\bar{v}_i^{k}\) | the feature of the ith AU patch after the kth reasoning layer
\(\rm {GFC}\) | a gated fusion cell
\((x_i,y_i)\) | the ground-truth coordinate of the ith facial landmark
\((\hat{x}_i,\hat{y}_i)\) | the predicted coordinate of the ith facial landmark
\(d_o\) | the ground-truth inter-ocular distance
\(p_i\) | the ground-truth occurrence probability of the ith AU
\(\hat{p}_i\) | the predicted occurrence probability of the ith AU
Table 1. Main Notations and Their Definitions
Fig. 2.
Fig. 2. The overall architecture of the proposed MGRR-Net for facial AU detection. Given one face image, the region-level features of local AU patches are extracted based on the detected landmarks from an efficient landmark localization network. The original global feature is extracted from the same shared stem network. Then the region-level GNN initialized with prior knowledge is applied to encode the correlation between different AU patches. Two separate MH-GATs are adopted to get two levels of global attention-aware features to supplement each AU. Finally, multiple levels of local-global features are fused by a hierarchical gated fusion strategy and refined by multiple iterations (best viewed in color).

3.1 Global and Local Features Extraction

Given a face image I, we adapt a stem network from the widely used multi-branch network [43] to extract the original global feature \(\rm O\_G\) and further obtain the AU regions based on the detected landmarks. Different from Reference [21], our stem network contains a face alignment module for automatic face landmark detection, facilitating end-to-end training of our method. All branches share the stem network to reduce the training cost and the complexity of network training. In particular, a hierarchical and multi-scale region learning module in the stem network extracts features from each local patch at different scales, thus obtaining multi-scale representations. A series of landmarks \(S = \lbrace s_1,s_2,\ldots ,s_m\rbrace\) of length m is detected by an efficient face alignment module similar to Reference [44], including three convolutional blocks connected to a max-pooling layer. According to the detected landmarks, local patches are calculated, and their features \(V=\lbrace v_1,v_2,\ldots ,v_n\rbrace\) are learned via the stem network, where n is the number of selected AU patches. For simplicity, we do not repeat the detailed structure of the stem network here.
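For illustration, the following is a minimal sketch of how landmark-centered region features could be pooled from the shared stem feature map. It assumes a hypothetical helper that converts landmarks into AU region centers and uses simple average pooling, whereas the actual stem applies hierarchical multi-scale region learning [43, 44]; the function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def extract_au_patch_features(feat_map, au_centers, patch_size=6):
    """Pool one feature vector per AU region from the shared stem feature map.

    feat_map:   (B, C, H, W) original global feature O_G from the stem network.
    au_centers: (B, n, 2) AU region centers (x, y) in feature-map coordinates,
                derived from the detected landmarks (hypothetical helper output).
    Returns:    (B, n, C) region-level features V = {v_1, ..., v_n}.
    """
    B, C, H, W = feat_map.shape
    n = au_centers.shape[1]
    half = patch_size // 2
    feats = []
    for b in range(B):
        per_image = []
        for i in range(n):
            cx = int(au_centers[b, i, 0].round().clamp(0, W - 1))
            cy = int(au_centers[b, i, 1].round().clamp(0, H - 1))
            x0, x1 = max(cx - half, 0), min(cx + half + 1, W)
            y0, y1 = max(cy - half, 0), min(cy + half + 1, H)
            patch = feat_map[b:b + 1, :, y0:y1, x0:x1]                    # (1, C, h, w)
            per_image.append(F.adaptive_avg_pool2d(patch, 1).flatten())   # (C,)
        feats.append(torch.stack(per_image))
    return torch.stack(feats)                                             # (B, n, C)
```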

3.2 Multi-level Relational Reasoning Layer

After obtaining the original global feature \(\rm O\_G\) of a face and the local region features \(V=\lbrace v_1,v_2,\ldots ,v_n\rbrace\) of the AUs, a multi-layer, multi-level relational reasoning model is introduced to automatically explore the relationships among individual local facial regions and to supplement two levels of global information. Figure 2 shows the detailed structure of the first multi-level relational reasoning layer.

3.2.1 Region-level Local Feature Relational Modeling.

Different from the predefined fixed AU relationship graph in Reference [19], we construct a fully connected graph \(\rm D\_G\) for all AUs, where the region-level features \(V=\lbrace v_1,v_2,\ldots ,v_n\rbrace\) constitute the nodes, and a learnable adjacency matrix A constitutes the edges at each layer, representing the possibility of AU co-occurrence (co-activated or non-activated). In this scheme, unlike References [19, 30, 35], AUs with no or low co-occurrence in the training set are not completely ignored. During training, we utilize prior knowledge to initialize A to assist and constrain model learning. Specifically, the dynamic graph \(\rm D\_G\) comprises nodes (the local region features \(V=\lbrace v_1,v_2,\ldots ,v_n\rbrace\)) and edges (the relationship matrix A among AUs). Following Reference [35], we calculate the relationship coefficients between AUs from the datasets to initialize the adjacency matrix A (Figure 5(a) shows the predefined AU correlations on BP4D). The statistical prior knowledge serves as the initial relationship, allowing the suppression of edges with low correlation and speeding up relationship learning. The relationship coefficient \(A_{ij}\) between the ith and jth AU can be formulated as:
\begin{align} P_{ij} = \frac{1}{2} \big(P(a_i=1 \mid a_j=1) + P(a_i=0 \mid a_j=0)\big), \end{align}
(1)
\begin{align} A_{ij} = |(P_{ij} - 0.5) \times 2|, \end{align}
(2)
where \(a_i = 1\) denotes that the ith AU is activated and 0 otherwise, and \(|\cdot |\) is the absolute value function. From Equations (1) and (2), \(P(a_i=1 \mid a_j=1) = 0.5\) means that, when the jth AU is activated, the probability of occurrence of the ith AU equals that of non-occurrence. This indicates that the activation of the jth AU provides no useful information for the ith AU, and therefore no edge is connected.
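As a concrete illustration of Equations (1) and (2), the following sketch computes the prior initialization of A from binary AU occurrence labels of the training partition; the labels array and function name are hypothetical.

```python
import numpy as np

def init_au_adjacency(labels):
    """Initialize the adjacency matrix A from AU co-occurrence statistics.

    labels: (N, n) binary matrix, labels[s, i] = 1 if the ith AU is activated
            in the sth training frame.
    Returns A of shape (n, n) following Equations (1) and (2).
    """
    N, n = labels.shape
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pos_j = labels[:, j] == 1
            neg_j = labels[:, j] == 0
            # Conditional probabilities; guard against AUs that are never (de)activated.
            p_pos = labels[pos_j, i].mean() if pos_j.any() else 0.5
            p_neg = (1 - labels[neg_j, i]).mean() if neg_j.any() else 0.5
            P_ij = 0.5 * (p_pos + p_neg)            # Equation (1)
            A[i, j] = abs((P_ij - 0.5) * 2.0)       # Equation (2)
    return A
```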

3.2.2 Attention-aware Global Features Learning.

We argue that complementary global features can improve the discrimination between AUs, which also alleviates the over-smoothing issue in graph neural networks for local relationship modeling. To this end, we employ two separate high-efficiency GAT models [58] to learn channel- and pixel-level attention-aware global features from the original deep visual features, handling expression and subject diversities. Specifically, we reshape the original global feature \(\rm O\_G \in \mathbb {R}^{(c,w,h)}\) into a set of channel-level features \(\lbrace C_1, \ldots ,C_c\rbrace ,C_i\in \mathbb {R}^{w*h}\). Similarly, by reshaping the spatial dimensions and keeping the channel dimension of \(\rm O\_G\) after a convolution layer (used to reduce the number of parameters), we get a set of pixel-level features \(\lbrace P_1, \ldots ,P_{w^{\prime }*h^{\prime }}\rbrace ,P_i\in \mathbb {R}^{c}\). The attention coefficient \(\alpha _{ij}\) between channel- or pixel-level features is calculated in the GAT and can be formulated as follows (taking the channel-level attention-aware features as an example):
\begin{equation} \alpha _{ij}= \frac{ {\rm exp}(U_q C_i(U_k C_j)^T / \sqrt {D})}{{\rm \sum _{o\in \Omega _i} exp}(U_q C_i(U_o C_o)^T / \sqrt {D})}, \end{equation}
(3)
where \(U_q,U_k,U_o\) are mapping parameters from \(w*h\) to D, and \(\Omega _i\) denotes the neighborhood of \(C_i\). \(\sqrt {D}\) acts as a normalization factor. Following References [57, 58], we also employ a multi-head dot product with L parallel attention layers to improve computational efficiency. The overall working flow is formulated as:
\begin{equation} \begin{aligned}\bar{C}_i = {\rm ReLU}\left({\rm \sum \nolimits _{o\in \Omega _i}} U_c||_l^L(\alpha _{io}^l * C_i)\right), \\ \alpha _{ij}^l= \frac{ {\rm exp}(U^{\prime }_q C_i(U^{\prime }_k C_j)^T / \sqrt {d})}{{\rm \sum \nolimits _{o\in \Omega _i} exp}(U^{\prime }_q C_i(U^{\prime }_o C_o)^T / \sqrt {d})}, \end{aligned} \end{equation}
(4)
where \(U_c\) is the mapping parameter, \(U^{\prime }_q,U^{\prime }_k,U^{\prime }_o\) map the feature dimension to \(1/L\) of the original, \(||\) denotes concatenation, and d equals \(D/L\). Finally, the new channel-level attention-aware global feature \(\rm C\_G\) = \(\lbrace \bar{C}_i\rbrace\) is reshaped to the same dimension as \(\rm O\_G\). Applying the same process to the pixel-level features \(\lbrace P_1, \ldots ,P_{w^{\prime }*h^{\prime }}\rbrace\), we obtain the final pixel-level attention-aware global feature \(\rm P\_G\) after a deconvolution layer following the multi-head GAT (MH-GAT).
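The channel-level branch can be sketched as a standard multi-head dot-product graph attention over channel tokens, in the spirit of Equations (3) and (4); a fully connected neighborhood \(\Omega_i\) and the usual value aggregation are assumed, and the module and dimension names below are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

class ChannelMHGAT(nn.Module):
    """Multi-head dot-product attention over channel tokens (a sketch of Eqs. (3)-(4)).

    Each of the c channels of O_G is treated as a node with a (w*h)-dim feature;
    the neighborhood of every node is assumed to be all channels."""
    def __init__(self, in_dim, hidden_dim=1024, heads=8):
        super().__init__()
        assert hidden_dim % heads == 0
        self.heads, self.d = heads, hidden_dim // heads
        self.q = nn.Linear(in_dim, hidden_dim, bias=False)    # U'_q
        self.k = nn.Linear(in_dim, hidden_dim, bias=False)    # U'_k
        self.v = nn.Linear(in_dim, hidden_dim, bias=False)    # U'_o
        self.out = nn.Linear(hidden_dim, in_dim, bias=False)  # U_c
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (B, c, w*h) channel tokens
        B, c, _ = x.shape
        q = self.q(x).view(B, c, self.heads, self.d).transpose(1, 2)  # (B, L, c, d)
        k = self.k(x).view(B, c, self.heads, self.d).transpose(1, 2)
        v = self.v(x).view(B, c, self.heads, self.d).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, c, -1)    # concatenate the L heads
        return self.act(self.out(out))          # channel-level attention-aware C_G
```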

3.2.3 Hierarchical Fusion and Iteration.

We iteratively refine the ith target AU feature through the proposed multi-level relational reasoning layer K times, which gathers correlated local and regional information and provides rich global details in each layer. The process can be formulated as:
\begin{equation} \bar{v}_i^{k} = W_i^k v_i^k + \sum \nolimits _{j}^{n}\left(A^k_{ij} W_j^k v_j\right), \end{equation}
(5)
where \(W^k\) is the mapping parameter and \(A^k_{ij}\) is the learnable correlation coefficient between AU\(_i\) and AU\(_j\) at the kth layer. We then use a hierarchical fusion strategy with a gated fusion cell (GFC) to complement the global multi-level information for each updated AU feature at the kth layer as follows:
\begin{equation} \bar{v}_i^{k+1} = {\rm GFC}(\bar{v}_i^{k},{\rm GFC}({\rm O\_G}^k, {\rm GFC}({\rm C\_G}^k,{\rm P\_G}^k))). \end{equation}
(6)
We define the operation of \({\rm GFC}\) as follows:
\begin{align} {\rm GFC}({\rm C\_G}^k, {\rm P\_G}^k) = \beta \odot \Vert W_C^k {\rm C\_G}^k\Vert _2 + (1-\beta) \odot \Vert W_P^k {\rm P\_G}^k\Vert _2, \end{align}
(7)
\begin{align} \beta = {\rm {\sigma }}({W_C^{k^{\prime }}} {\rm C\_G}^k+{{W^{k^{\prime }}_P}} {\rm P\_G}^k), \end{align}
(8)
where \(\sigma\) is the sigmoid function, and \(\Vert \cdot \Vert\) denotes the \(l_2\)-normalization. \({W^{k^{\prime }}_*}\) and \({W^k_*}\) denote the Conv2D operation.
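A condensed sketch of one reasoning layer, covering the region-level update of Equation (5) and the hierarchical gated fusion of Equations (6)-(8), is given below. For brevity it treats all features as vectors, replaces the 1x1 Conv2D mappings with linear layers, shares a single node mapping instead of the per-AU \(W_i^k\), and initializes the adjacency with an identity placeholder rather than the prior of Equations (1)-(2); all module names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionCell(nn.Module):
    """Gated fusion cell (GFC): a sketch of Eqs. (7)-(8), with linear maps
    standing in for the paper's 1x1 Conv2D operations."""
    def __init__(self, dim):
        super().__init__()
        self.w_a, self.w_b = nn.Linear(dim, dim), nn.Linear(dim, dim)   # W_C^k,  W_P^k
        self.g_a, self.g_b = nn.Linear(dim, dim), nn.Linear(dim, dim)   # W_C^k', W_P^k'

    def forward(self, a, b):
        beta = torch.sigmoid(self.g_a(a) + self.g_b(b))                 # Eq. (8)
        return beta * F.normalize(self.w_a(a), dim=-1) + \
               (1 - beta) * F.normalize(self.w_b(b), dim=-1)            # Eq. (7)

class ReasoningLayer(nn.Module):
    """One multi-level relational reasoning layer: region-level graph update
    (Eq. (5)) followed by hierarchical gated fusion (Eq. (6))."""
    def __init__(self, n_aus, dim):
        super().__init__()
        # Learnable adjacency; identity is a placeholder for the prior of Eqs. (1)-(2).
        self.A = nn.Parameter(torch.eye(n_aus))
        self.w_node = nn.Linear(dim, dim)      # shared stand-in for the per-AU W_i^k
        self.gfc_cp = GatedFusionCell(dim)     # fuses C_G and P_G
        self.gfc_o = GatedFusionCell(dim)      # fuses O_G with that result
        self.gfc_local = GatedFusionCell(dim)  # fuses each AU feature with the global mix

    def forward(self, V, O_G, C_G, P_G):       # V: (B, n, dim); globals: (B, dim)
        mapped = self.w_node(V)                                   # W v for every node
        msg = torch.einsum('ij,bjd->bid', self.A, mapped)         # sum_j A_ij W v_j
        V_bar = mapped + msg                                      # Eq. (5)
        g = self.gfc_o(O_G, self.gfc_cp(C_G, P_G))                # inner fusions of Eq. (6)
        return self.gfc_local(V_bar, g.unsqueeze(1).expand_as(V_bar))
```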

3.3 Joint Learning

A multi-label binary classifier is used to classify the AU activation state, which adopts a weighted multi-label cross-entropy loss function (denoted as CE in Figure 2) as follows:
\begin{equation} \mathcal {L}_{au} = - \frac{1}{n} \sum _{i=1}^{n} w_i [p_i {\rm log} \hat{p_i} + (1-p_i) {\rm log} (1-\hat{p}_i)], \end{equation}
(9)
where \(p_i\) and \(\hat{p}_i\) denote the ground-truth and predicted occurrence probability of the ith AU, respectively, and \(w_i\) is the data balance weight used in Reference [43]. Furthermore, we also minimize an AU category classification loss \(\mathcal {L}_{int}\) that integrates the information of all AUs, including the refined AU features and the face alignment features; it is computed similarly to \(\mathcal {L}_{au}\).
We jointly integrate face alignment and facial AU recognition into an end-to-end learning model. The face alignment loss is defined as:
\begin{equation} \mathcal {L}_{align} = \frac{1}{2d_o^2} \sum _{i=1}^{m} [(x_i-\hat{x_i})^2+(y_i-\hat{y_i})^2], \end{equation}
(10)
where \((x_i,y_i)\) and \((\hat{x}_i,\hat{y}_i)\) denote the ground-truth coordinate and corresponding predicted coordinate of the ith facial landmark, and \(d_o\) is the ground-truth inter-ocular distance for normalization [44]. Finally, the joint loss of our MGRR-Net is defined as:
\begin{equation} \mathcal {L} = (\mathcal {L}_{au} + \mathcal {L}_{int})+ \lambda \mathcal {L}_{align}, \end{equation}
(11)
where \(\lambda\) is a tuning parameter for balancing.
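The loss terms can be sketched directly from Equations (9)-(11); the balance weights \(w_i\) follow Reference [43], \(\mathcal{L}_{int}\) is computed analogously to \(\mathcal{L}_{au}\) from the integrated features, and the function names below are illustrative.

```python
import torch

def weighted_multilabel_ce(p_hat, p, w, eps=1e-8):
    """Weighted multi-label cross-entropy of Equation (9), averaged over the
    batch and the n AUs. p_hat, p: (B, n) predicted / ground-truth AU occurrence
    probabilities; w: (n,) per-AU balance weights as in Reference [43]."""
    loss = -(p * torch.log(p_hat + eps) + (1 - p) * torch.log(1 - p_hat + eps))
    return (loss * w).mean()

def alignment_loss(pred_lmk, gt_lmk, d_o):
    """Face alignment loss of Equation (10); landmarks are (B, m, 2) and d_o is
    the ground-truth inter-ocular distance."""
    return ((pred_lmk - gt_lmk) ** 2).sum(dim=(1, 2)).div(2 * d_o ** 2).mean()

def joint_loss(l_au, l_int, l_align, lam=0.5):
    """Joint objective of Equation (11)."""
    return (l_au + l_int) + lam * l_align
```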

4 Experiments

In this section, we conduct extensive experiments to evaluate the proposed MGRR-Net. We first introduce the datasets and training strategy. Then, MGRR-Net is compared quantitatively with state-of-the-art AU detection approaches. Finally, we qualitatively analyze the results in detail.

4.1 Dataset

We provide evaluations on the popular BP4D [68] and DISFA [34] datasets.
BP4D is a spontaneous facial AU database containing 328 facial videos from 41 participants (23 females and 18 males) who were involved in 8 sessions. Similar to References [20, 44, 45], we consider 12 AUs and 140k valid frames with labels.
DISFA consists of 27 participants (12 females and 15 males). Each participant has a video of 4,845 frames. We limit the number of AUs to 8, similar to References [20, 44]. Following References [43, 44], frames in DISFA with AU intensity labels higher than two are considered positive samples. Compared to BP4D, the experimental protocol and lighting conditions make DISFA a more challenging dataset.
During training, each frame of BP4D and DISFA is annotated with 49 landmarks detected and calculated by SDM [61]. Following the experiment setting of References [43, 44], we evaluated the model using the 3-fold subject-exclusive cross-validation protocol.
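A subject-exclusive 3-fold split can be reproduced, for example, with a grouped K-fold over subject identifiers; this is only one way to obtain such folds and is not the exact partition released with References [43, 44].

```python
from sklearn.model_selection import GroupKFold

def subject_exclusive_folds(frame_ids, subject_ids, n_splits=3):
    """Yield (train_idx, test_idx) index arrays such that no subject appears in
    both partitions, mirroring a 3-fold subject-exclusive protocol."""
    gkf = GroupKFold(n_splits=n_splits)
    # Only the grouping by subject matters for forming the split.
    for train_idx, test_idx in gkf.split(frame_ids, groups=subject_ids):
        yield train_idx, test_idx
```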

4.2 Training Strategy

Our model is trained on a single NVIDIA RTX 2080Ti with 11 GB memory. The whole network is trained with the default initializer of PyTorch [39] using the SGD solver with a Nesterov momentum of 0.9 and a weight decay of 0.0005. The learning rate is set to 0.01 initially, with a decay rate of 0.5 every two epochs. The maximum epoch number is set to 15. During training, aligned faces are randomly cropped into \(176 \times 176\) and horizontally flipped. For the face alignment network and stem network, we set the general parameters to the same values as in Reference [44]. The iteration layer number K is set to 2 except where otherwise noted. The dimensionality of \(O\_G\) is \((64,44,44)\) and D is 1,024. We employ L = 8 parallel attention layers in the GATs. All the mapping Conv2D operations use \(1 \times 1\) convolutional filters with a stride of 1 and a padding of 1. We use a \(3 \times 3\) Conv2D operation with a stride of 2 and a padding of 1 before learning the channel-level feature to reduce the number of parameters. \(\lambda\) is empirically set to 0.5 for the joint optimization of face alignment and facial AU detection on both benchmarks. Following the settings in References [21, 44, 72], our MGRR-Net initializes its parameters from the model well-trained on BP4D when training on DISFA. This initialization greatly alleviates the poor-performance issue on DISFA caused by the limited data volume and AU category imbalance. Compared to J\(\rm \hat{A}\)ANet [44], which takes 26.6 ms per image for a forward pass, our model takes just 16.5 ms on an RTX 2080Ti GPU, thanks to the multi-head operation of the efficient MH-GATs and the optimization of the model. The training time is approximately 1.5 hours per epoch. In addition, we average the predicted probabilities from the local information and the integrated information as the final predicted activation probability for each AU, rather than simply using the integrated information of all the AUs.
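The reported optimizer settings correspond to a standard PyTorch configuration, sketched below for reference; the function name is illustrative.

```python
import torch

def build_optimizer(model):
    """SGD with Nesterov momentum 0.9, weight decay 5e-4, initial lr 0.01,
    halved every two epochs, as reported above."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                                nesterov=True, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
    return optimizer, scheduler
```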

4.3 Evaluation Metrics

For all methods, the frame-based F1-score (F1-frame, %) is reported, which is the harmonic mean of the Precision \(\rm P\) and Recall \(\rm R\), calculated as \(\rm F1=2P*R/(P + R)\). To conduct a more comprehensive comparison with other methods, we also evaluate the performance with AUC (%), referring to the area under the ROC curve, and accuracy (%). In addition, the average results over all AUs (denoted as Avg.) are computed with “%” omitted.
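For clarity, the per-AU F1-frame computation can be written as the following sketch over binary frame-level predictions; the helper name is illustrative.

```python
import numpy as np

def f1_frame_per_au(y_true, y_pred):
    """Per-AU F1-frame: F1 = 2PR / (P + R) over binary frame-level predictions.
    y_true, y_pred: (N, n) binary arrays; returns an (n,) array of F1 scores (%)."""
    tp = ((y_pred == 1) & (y_true == 1)).sum(axis=0)
    fp = ((y_pred == 1) & (y_true == 0)).sum(axis=0)
    fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return 100 * 2 * precision * recall / np.maximum(precision + recall, 1e-8)
```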

4.4 Comparison with State-of-the-art Methods

We compare our proposed MGRR-Net with several frame-based AU detection baselines and the latest state-of-the-art methods, including Deep Structure Inference Network (DSIN) [4], Joint AU Detection and Face Alignment (JAA) [43], Multi-Label Co-Regularization (MLCR) [35], Local relationship learning with Person-specific shape regularization (LP-Net) [36], Attention and Relation Learning (ARL) [45], Semantic Relationships Embedded Representation Learning (SRERL) [19], Joint AU detection and face alignment via Adaptive Attention Network (J\(\rm \hat{A}\)ANet) [44], Data-Aware Relation Graph Convolutional Neural network (DAR-GCN) [17], Dual-channel Graph Convolutional Neural Network (JAA-DGCN) [16], a semi-supervised method that Contrastively Learns Person-independent representations (CLP) [24], and a Multiview Mixed Attention-based Network (MMA-Net) [42]. To ensure reliable and fair comparisons, we directly use the reported results of these methods. Note that the best and second-best results are shown in bold and underlined, respectively. The experimental results of our MGRR-Net are shown with a grey background.
For a more comprehensive comparison, we also present methods (marked with \(*\)) [2, 3, 5, 14, 49, 51, 55, 59] that use additional data, such as ImageNet [7] and VGGFace2 [1], to pre-train their complex feature extraction stem networks, such as ResNet [12]. According to References [14, 36], a pre-trained feature extractor improves the average F1-score by at least 1.2% on BP4D. Because our stem network consists of only a few simple convolutional layers, pre-training it on additional datasets would not be comparable to pre-training deeper feature extraction networks, such as ResNet50 [12], ResNet101 [12], and Swin Transformer-base [31]. We therefore group these methods separately to facilitate comparison with our proposed MGRR-Net. Notably, our results remain superior, affirming the efficacy of the proposed learning methodology. For a fair comparison, we exclude methods that require additional modality inputs as well as non-frame-based models [28, 47, 54, 63, 64].

4.4.1 Quantitative Comparison on DISFA.

We compare our proposed method with its counterparts in Tables 2 and 3. Our MGRR-Net outperforms all its competitors by impressive margins. Compared with the existing end-to-end feature learning and multi-label classification methods DSIN [4] and ARL [45], our MGRR-Net shows significant improvements on all AUs. These results demonstrate the effectiveness of accurate muscle region localization for AU detection. Although ARL [45] also performs sequential multiple attention explorations on global features, we believe that the sequential mechanism may reduce the diversity of the different attention-aware features and slow down training. J\(\rm \hat{A}\)ANet is the latest state-of-the-art method that also integrates AU detection and face alignment into an end-to-end multi-label multi-branch network. Compared with the baseline J\(\rm \hat{A}\)ANet [44], our MGRR-Net increases the average F1-frame and average accuracy scores by large margins of 4.7% and 1.2%, respectively, and shows clear improvements for most annotated AU categories. The main reason is that J\(\rm \hat{A}\)ANet [44] completely ignores the correlation between branches and the individual modeling of each AU. Compared with JAA-DGCN [16], which also applies a graph relationship model, our MGRR-Net still performs better on most metrics, because we model local relationships while supplementing a variety of information from the global face. Moreover, compared with the latest state-of-the-art MMA-Net [42], MGRR-Net achieves a 2.2% lead in the average F1-frame metric. In addition, compared with the current state-of-the-art AU detection methods based on pre-trained models, such as UGN-B [49], HMP-PS [51], DML [59], PIAP [55], and Bio-AU [5], we also achieve the best performance in terms of the average F1-frame.
Table 2.
Method | AU1 | AU2 | AU4 | AU6 | AU9 | AU12 | AU25 | AU26 | Avg.
DSIN [4] | 42.4 | 39.0 | 68.4 | 28.6 | 46.8 | 70.8 | 90.4 | 42.2 | 53.6
JAA [43] | 43.7 | 46.2 | 56.0 | 41.4 | 44.7 | 69.6 | 88.3 | 58.4 | 56.0
LP-Net [36] | 29.9 | 24.7 | 72.7 | 46.8 | 49.6 | 72.9 | 93.8 | 65.0 | 56.9
ARL [45] | 43.9 | 42.1 | 63.6 | 41.8 | 40.0 | 76.2 | 95.2 | 66.8 | 58.7
SRERL [19] | 45.7 | 47.8 | 59.6 | 47.1 | 45.6 | 73.5 | 84.3 | 43.6 | 55.9
J\(\rm \hat{A}\)ANet [44] | 62.4 | 60.7 | 67.1 | 41.1 | 45.1 | 73.5 | 90.9 | 67.4 | 63.5
JAA-DGCN [16] | 61.8 | 51.7 | 64.5 | 46.0 | 54.2 | 63.6 | 85.5 | 69.4 | 62.0
CLP\(^{\dagger }\) [24] | 42.4 | 38.7 | 63.5 | 59.7 | 38.9 | 73.0 | 85.0 | 58.1 | 57.4
MMA-Net [42] | 63.8 | 54.8 | 73.6 | 39.2 | 61.5 | 73.1 | 92.3 | 70.5 | 66.0
MGRR-Net | 61.3 | 62.9 | 75.8 | 48.7 | 53.8 | 75.5 | 94.3 | 73.1 | 68.2
UGN-B\(^{*}\) [49] | 43.3 | 48.1 | 63.4 | 49.5 | 48.2 | 72.9 | 90.8 | 59.0 | 60.0
HMP-PS\(^{*}\) [51] | 21.8 | 48.5 | 53.6 | 56.0 | 58.7 | 57.4 | 55.9 | 56.9 | 61.0
DML\(^{*}\) [59] | 62.9 | 65.8 | 71.3 | 51.4 | 45.9 | 76.0 | 92.1 | 50.2 | 64.4
PIAP\(^{*}\) [55] | 50.2 | 51.8 | 71.9 | 50.6 | 54.5 | 79.7 | 94.1 | 57.2 | 63.8
TransAU\(^{*}\) [14] | 46.1 | 48.6 | 72.8 | 56.7 | 50.0 | 72.1 | 90.8 | 55.4 | 61.5
Bio-AU\(^{*}\) [5] | 41.5 | 44.9 | 60.3 | 51.5 | 50.3 | 70.4 | 91.3 | 55.3 | 58.2
MGRR-Net | 61.3 | 62.9 | 75.8 | 48.7 | 53.8 | 75.5 | 94.3 | 73.1 | 68.2
Table 2. Comparisons of AU Recognition for 8 AUs on DISFA in Terms of F1-frame Score (in %)
CLP\(^{\dagger }\) is a semi-supervised method. \(^{*}\) means the method employed a model pre-trained on an additional dataset, such as ImageNet [7] or VGGFace2 [1].
Table 3.
AU | Accuracy |  |  |  |  |  | AUC |  |  |  |
 | JAA [43] | ARL [45] | J\(\rm \hat{A}\)ANet | MMA-Net [42] | UGN-B\(^{*}\) [49] | MGRR-Net | DRML [72] | SRERL [19] | DML\(^{*}\) [59] | DAR-GCN [17] | MGRR-Net
1 | 93.4 | 92.1 | 97.0 | 96.8 | 95.1 | 96.8 | 53.3 | 76.2 | 90.5 | 84.5 | 89.5
2 | 96.1 | 92.7 | 97.3 | 96.5 | 93.2 | 97.4 | 53.2 | 80.9 | 92.7 | 92.5 | 93.0
4 | 86.9 | 88.5 | 88.0 | 91.6 | 88.5 | 92.7 | 60.0 | 79.1 | 93.8 | 72.2 | 93.6
6 | 91.4 | 91.6 | 92.1 | 91.5 | 93.2 | 92.1 | 54.9 | 80.4 | 90.3 | 48.3 | 91.1
9 | 95.8 | 95.9 | 95.6 | 96.5 | 96.8 | 96.9 | 51.5 | 76.5 | 84.4 | 78.3 | 91.9
12 | 91.2 | 93.9 | 92.3 | 92.3 | 93.4 | 93.4 | 54.6 | 87.9 | 95.7 | 37.8 | 95.9
25 | 93.4 | 97.3 | 94.9 | 95.5 | 94.8 | 96.8 | 45.6 | 90.9 | 98.2 | 50.3 | 99.0
26 | 93.2 | 94.3 | 94.8 | 95.0 | 93.8 | 95.6 | 45.3 | 73.4 | 87.4 | 74.3 | 94.4
Avg. | 92.7 | 93.3 | 94.0 | 94.5 | 93.4 | 95.2 | 52.3 | 80.7 | 91.6 | 67.3 | 93.6
Table 3. Comparisons of AU Recognition for 8 AUs on DISFA in Terms of Accuracy and AUC (in %)
\(^{*}\) means the method employed a model pre-trained on an additional dataset.
Furthermore, the Accuracy and AUC evaluations provide further evidence of the effectiveness of our method compared to other state-of-the-art methods. In particular, our MGRR-Net obtains a significant improvement in average Accuracy, i.e., 95.2% vs. 94.5%, compared with MMA-Net [42]. On the AUC metric, our MGRR-Net also achieves higher results on most AUs and improves the average by 2.0% compared to DML\(^{*}\) [59].

4.4.2 Quantitative Comparison on BP4D.

Tables 4 and 5 show the AU detection results of different methods in terms of F1-frame, Accuracy, and AUC on the BP4D dataset, where the methods on the left of Table 4 use feature extractors without pre-training and the methods marked with * rely on pre-trained feature extractors (our method is trained on BP4D only). Compared with the multi-branch combination-based J\(\rm \hat{A}\)ANet [44], the average F1-frame and average accuracy scores of MGRR-Net are 1.3% and 1.1% higher, respectively. Furthermore, compared with the latest graph-based relational modeling method SRERL [19], MGRR-Net increases the average F1-frame and average AUC by large margins of 0.8% and 8.3%. This is mainly because the proposed method models the semantic relationships among AUs while also gaining complementary features from multiple global perspectives to increase the distinguishability of each AU. In addition, our MGRR-Net achieves the best or second-best AU detection performance in terms of F1-frame, Accuracy, and AUC for most of the 12 AUs annotated in BP4D compared with the state-of-the-art methods. For example, compared with the latest method MMA-Net [42], which simultaneously models deep feature learning and the structured AU relationships in a unified framework, ours outperforms it by 0.3% in terms of the average F1-frame. In addition, compared with the advanced models pre-trained with additional data (marked with \(*\) in Tables 4 and 5), our MGRR-Net remains highly competitive.
Table 4.
AU | MLCR [35] | JAA [43] | LP-Net [36] | ARL [45] | SRERL [19] | J\(\rm \hat{A}\)ANet [44] | CLP [24] | MMA-Net [42] | Ours | R-CNN\(^{*}\) [33] | UGN-B\(^{*}\) [49] | HMP-PS\(^{*}\) [51] | DML\(^{*}\) [59] | TransAU\(^{*}\) [14] | Bio-AU\(^{*}\) [5] | Ours
1 | 42.4 | 47.2 | 43.3 | 45.8 | 46.9 | 53.8 | 47.7 | 52.5 | [52.6] | 50.2 | 54.2 | 53.1 | 52.6 | 51.7 | 57.4 | [52.6]
2 | 36.9 | 44.0 | 38.0 | 39.8 | 45.3 | 47.8 | 50.9 | 50.9 | [47.9] | 43.7 | 46.4 | 46.1 | 44.9 | 49.3 | 52.6 | [47.9]
4 | 48.1 | 54.9 | 54.2 | 55.1 | 55.6 | 58.2 | 49.5 | 58.3 | [57.3] | 57.0 | 56.8 | 56.0 | 56.2 | 61.0 | 64.6 | [57.3]
6 | 77.5 | 77.5 | 77.1 | 75.7 | 77.1 | 78.5 | 75.8 | 76.3 | [78.5] | 78.5 | 76.2 | 76.5 | 79.8 | 77.8 | 79.3 | [78.5]
7 | 77.6 | 74.6 | 76.7 | 77.2 | 78.4 | 75.8 | 78.7 | 75.7 | [77.6] | 78.5 | 76.7 | 76.9 | 80.4 | 79.5 | 81.5 | [77.6]
10 | 83.6 | 84.0 | 83.8 | 82.3 | 83.5 | 82.7 | 80.2 | 83.8 | [84.9] | 82.6 | 82.4 | 82.1 | 85.2 | 82.9 | 82.7 | [84.9]
12 | 85.8 | 86.5 | 87.2 | 86.6 | 87.6 | 88.2 | 84.1 | 87.9 | [88.4] | 87.0 | 86.1 | 86.4 | 88.3 | 86.3 | 85.6 | [88.4]
14 | 61.0 | 61.9 | 63.6 | 58.8 | 63.9 | 63.7 | 67.1 | 63.8 | [67.8] | 67.7 | 64.7 | 64.8 | 65.6 | 67.6 | 67.8 | [67.8]
15 | 43.7 | 43.6 | 45.3 | 47.6 | 52.2 | 43.3 | 52.0 | 48.7 | [47.6] | 49.1 | 51.2 | 51.5 | 51.7 | 51.9 | 47.3 | [47.6]
17 | 63.2 | 60.3 | 60.5 | 62.1 | 63.9 | 61.8 | 62.7 | 61.7 | [63.3] | 62.4 | 63.1 | 63.0 | 59.4 | 63.0 | 58.0 | [63.3]
23 | 42.1 | 42.7 | 48.1 | 47.4 | 47.1 | 45.6 | 45.7 | 46.5 | [47.4] | 50.4 | 48.5 | 49.9 | 47.3 | 43.7 | 47.0 | [47.4]
24 | 55.6 | 41.9 | 54.2 | 55.4 | 53.3 | 49.9 | 54.8 | 54.4 | [51.3] | 49.3 | 53.6 | 54.5 | 49.2 | 56.3 | 44.9 | [51.3]
Avg. | 59.8 | 60.0 | 61.0 | 61.1 | 62.9 | 62.4 | 62.4 | 63.4 | [63.7] | 62.6 | 63.3 | 63.4 | 63.4 | 64.2 | 64.1 | [63.7]
Table 4. Comparisons with State-of-the-art Methods for 12 AUs on BP4D in Terms of F1-frame (in %)
\(^{*}\) means the method employed a model pre-trained on an additional dataset.
Table 5.
AU | Accuracy |  |  |  |  | AUC |  |  |
 | UGN-B\(^{*}\) [49] | JAA [43] | ARL [45] | J\(\rm \hat{A}\)ANet [44] | MGRR-Net | DRML [72] | SRERL [19] | DML\(^{*}\) [59] | MGRR-Net
1 | 78.6 | 74.7 | 73.9 | 75.2 | 78.7 | 55.7 | 67.6 | 78.5 | 78.1
2 | 80.2 | 80.8 | 76.7 | 80.2 | 82.1 | 54.5 | 70.0 | 75.9 | 77.2
4 | 80.0 | 80.4 | 80.9 | 82.9 | 81.6 | 58.8 | 73.4 | 84.4 | 83.8
6 | 76.6 | 78.9 | 78.2 | 79.8 | 78.7 | 56.6 | 78.4 | 88.6 | 88.4
7 | 72.3 | 71.0 | 74.4 | 72.3 | 73.7 | 61.0 | 76.1 | 84.8 | 82.3
10 | 77.8 | 80.2 | 79.1 | 78.2 | 81.2 | 53.6 | 80.0 | 87.3 | 86.3
12 | 84.2 | 85.4 | 85.5 | 86.6 | 86.9 | 60.8 | 85.9 | 93.9 | 93.6
14 | 63.8 | 64.8 | 62.8 | 65.1 | 67.0 | 57.0 | 64.4 | 71.8 | 72.9
15 | 84.0 | 83.1 | 84.7 | 81.0 | 84.2 | 56.2 | 75.1 | 80.7 | 80.8
17 | 72.8 | 73.5 | 74.1 | 72.8 | 72.2 | 50.0 | 71.7 | 75.0 | 78.2
23 | 82.8 | 82.3 | 82.9 | 82.9 | 84.1 | 53.9 | 71.6 | 78.7 | 79.3
24 | 86.4 | 85.4 | 85.7 | 86.3 | 86.0 | 53.9 | 74.6 | 84.3 | 87.8
Avg. | 78.2 | 78.4 | 78.2 | 78.6 | 79.7 | 56.0 | 74.1 | 82.0 | 82.4
Table 5. Comparisons with State-of-the-art Methods for 12 AUs on BP4D in Terms of Accuracy and AUC, Respectively (in %)
\(^{*}\) means the method employed a model pre-trained on an additional dataset, such as ImageNet [7], so we do not compare with it directly.
The experimental results of MGRR-Net demonstrate its effectiveness in improving AU detection accuracy on DISFA and BP4D, as well as good robustness and generalization ability. Note that the main reason why some AUs are clearly less accurate than others is data imbalance; as shown in Figure 3, this phenomenon exists in all existing methods [5, 14, 24, 42, 44, 55]. In BP4D, where the data distribution is relatively reasonable, the result distributions of the methods are close. In DISFA, where the data distribution is more extreme, the result distribution of our MGRR-Net behaves better, i.e., lower variance and no outliers. We infer that two aspects promote this improvement. On the one hand, we use the weighted multi-label cross-entropy loss function of Equation (9) to alleviate the data imbalance problem to a certain extent. On the other hand, our multi-level fused representation complements each AU representation, as well as combines it with other AU areas, to further improve AU classification.
Fig. 3.
Fig. 3. Box plots of the distribution of performances on all AU categories (the labeled values are medians). (a) On the DISFA 3-fold test set and (b) on the BP4D 3-fold test set.

4.5 Ablation Studies

We perform detailed ablation studies on DISFA to investigate the effectiveness of each part of our proposed MGRR-Net. Due to space limitations, we do not show the ablation results for BP4D, but they are consistent with those on DISFA. To assess the effect of different components, we run the experiments with the same parameter settings (e.g., layer number K = 2) for variations of the proposed network in Table 6.
Table 6.
Method | 1 | 2 | 3 | 4 | 5 | 6 | MGRR-Net
D_G |  | \(\surd\) | \(\surd\) | \(\surd\) | \(\surd\) | \(\surd\) | \(\surd\)
O_G |  |  | \(\surd\) |  | \(\surd\) | \(\surd\) | \(\surd\)
C_G |  |  |  | \(\surd\) | \(\surd\) | - | \(\surd\)
P_G |  |  |  | \(\surd\) |  | \(\surd\) | \(\surd\)
AU 1 | 47.1 | 52.5 | 58.4 | 60.0 | 65.4 | 61.0 | [61.3]
AU 2 | 61.1 | 58.1 | 63.0 | 65.7 | 64.5 | 67.3 | [62.9]
AU 4 | 66.3 | 73.3 | 70.9 | 67.4 | 72.5 | 76.8 | [75.8]
AU 6 | 44.7 | 44.4 | 46.2 | 43.8 | 42.6 | 40.9 | [48.7]
AU 9 | 52.2 | 52.5 | 47.7 | 57.1 | 52.9 | 58.0 | [53.8]
AU 12 | 74.9 | 73.2 | 72.1 | 75.4 | 75.3 | 74.8 | [75.5]
AU 25 | 92.2 | 94.7 | 93.4 | 93.3 | 94.3 | 93.7 | [94.3]
AU 26 | 66.2 | 71.2 | 71.8 | 64.7 | 71.4 | 65.8 | [73.1]
Avg. | 63.1 | 65.0 | 65.4 | 65.9 | 67.4 | 67.3 | [68.2]
Table 6. Effectiveness of Key Components of MGRR-Net Evaluated on DISFA in Terms of F1-frame Score (in %)

4.5.1 Effects of Region-level Dynamic Graph.

In Table 6, we can see that learning with the dynamic graph initialized by prior knowledge (indicated by D_G) outperforms the baseline, improving the average F1-frame from 63.1% to 65.0%, indicating that the dynamic graph can gather richer features from other correlated AU regions to improve robustness. Furthermore, to remove the effect of prior-knowledge initialization, we randomly initialize the dynamic graph, which decreases the F1-frame to 64.7%. These observations suggest that the relationship reasoning in the dynamic graph can significantly boost the performance of AU detection, while prior knowledge makes a clear contribution but is not the dominant factor.

4.5.2 Effects of Multi-level Global Features.

We test the contributions of the important global feature components of the model in Table 6, namely, the original global feature (O_G) from the stem network, the channel-level global feature (C_G) from the channel-level MH-GAT, and the pixel-level global feature (P_G) from the pixel-level MH-GAT. After supplementing each target AU with O_G, the average F1-frame score improves from 65.0% to 65.4%, demonstrating the effectiveness of global detail supplementation. The fusion of channel- and pixel-level global features (C_G and P_G) results in a 0.9% increase, indicating that they make the AUs more discriminative than only using the original global features. Comparing the results of the fifth test (with C_G) and the sixth test (with P_G) in Table 6 with the third test, either the channel-level or the pixel-level global feature boosts the performance by roughly the same amount. This suggests that supplementing and training different levels of global features for each AU branch provides more global details for detecting AUs under different expressions and individuals.
Finally, the hierarchical gated fusion of multi-level global and local features leads to a significant performance improvement to 68.2% in terms of F1-frame score. It validates that the dynamic relationship of multiple related face regions provides more robustness, while the supplementation of multi-level global features makes the AU more discriminative.

4.5.3 Effects of Layer Number.

We evaluate the impact of the layer number of our proposed iterative reasoning network. As shown in Table 7, MGRR-Net achieves averaged F1-frame scores of 67.4%, 68.2%, and 66.6% on DISFA when the reasoning layer number K is set to 1, 2, and 3, respectively. The averaged F1-frame scores on the BP4D dataset are 63.5%, 63.7%, and 63.1%, respectively. The model achieves the best performance when K = 2 and overfits when K \(\gt\) 2. Thus, the optimal number of layers is 2 for our MGRR-Net on both the DISFA and BP4D datasets.
Table 7.
Layers | AU1 | AU2 | AU4 | AU6 | AU9 | AU12 | AU25 | AU26 | Avg.
K = 1 | 64.5 | 58.3 | 74.9 | 46.1 | 54.4 | 75.4 | 92.3 | 73.1 | 67.4
K = 2 | 61.3 | 62.9 | 75.8 | 48.7 | 53.8 | 75.5 | 94.3 | 73.1 | 68.2
K = 3 | 65.5 | 67.0 | 77.6 | 40.0 | 44.9 | 75.1 | 94.0 | 68.8 | 66.6
Table 7. Performance Comparison of MGRR-Net with Different Iteration Step Number \(\rm K\) on DISFA in Terms of F1-frame Score (in %)

4.5.4 Results for Face Alignment.

We jointly incorporate the face alignment network into our MGRR-Net via auxiliary training, which provides effective muscle regions corresponding to AUs based on the detected landmarks. Table 8 shows the mean error results of our MGRR-Net and the baseline method J\(\rm \hat{A}\)ANet [44] on DISFA and BP4D. We also compare with state-of-the-art face alignment methods that have released trained models, including MCL [46] and JAA [43]. Our MGRR-Net achieves competitive mean errors of 3.95 and 4.01 on DISFA and BP4D, respectively. This indicates that, with face alignment performance comparable to J\(\rm \hat{A}\)ANet, our MGRR-Net achieves better AU detection accuracy.
Table 8.
Datasets | MCL | JAA | J\(\rm \hat{A}\)ANet | MGRR-Net
DISFA | 7.15 | 6.30 | 4.02 | 3.95
BP4D | 7.20 | 6.38 | 3.80 | 4.01
Table 8. Mean Error (%) Results of Different Face Alignment Models on DISFA and BP4D (Lower Is Better)

4.6 Visualization of Results

To better understand the effectiveness of our proposed model, we visualize the learned class activation maps of MGRR-Net corresponding to different AUs in terms of different expressions, postures, and individuals, as shown in Figure 4. Three examples are from DISFA and three are from BP4D (two failure cases with abnormal offsets are shown at the bottom of Figure 4), covering different genders and poses with different AU categories. Through the learning of MGRR-Net, not only are the relevant AU regions accurately located, but positive correlations with other AU areas are also established and further details of the global face are supplemented. The different activation maps of the same AU on different individuals show that our MGRR-Net can dynamically adjust according to differences in expression, posture, and individual. Some activation maps are inconsistent with the predefined AU areas, which may be caused by reduced sensitivity to the predefined target areas after the introduction of multi-level global supplementation. In addition, as shown in Figure 5, we further visualize the learned relevance matrix (marked as (b)) and the predefined AU correlations (marked as (a)) of the individual corresponding to the first row of Figure 4 on BP4D. The predefined correlation matrices roughly estimate the co-occurrence relevance between different AUs by counting the dependence of positive and negative samples. They over-emphasize the target AU and a few other AUs, while the remaining AU regions are completely ignored due to bias in the data statistics. In the learned correlation matrix, the target AU and the relevant AUs are highlighted without completely discarding information from other branches, which is beneficial for increasing the distinguishability between AUs. Furthermore, the supplementation of global features from multiple perspectives allows different AUs to access considerable information outside the defined areas, as shown in Figure 4, which helps the model adapt to different individuals and their expressions.
Fig. 4.
Fig. 4. Class activation maps that show the discriminative regions for different AUs in terms of different expressions and individuals on DISFA and BP4D datasets. We show the region center positions defined by the detected landmarks for the corresponding AUs. Abnormally shifted AU activation maps are marked with red boxes.
Fig. 5.
Fig. 5. Visualizations of the predefined AU correlation (a) and the learned relevance matrix (b) for the individual on BP4D. The corresponding class activation maps are shown in the first row of Figure 4.

5 Conclusion

In this article, we have proposed a novel multi-level graph relational reasoning network (termed MGRR-Net) for facial AU detection. Each layer of MGRR-Net can encode the dynamic relationships among AUs via a region-level relationship graph and multiple complementary levels of global information covering expression and subject diversities. The multi-layer iterative feature refinement finally obtains robust and discriminative features for each AU. Extensive experimental evaluations on DISFA and BP4D show that our MGRR-Net outperforms state-of-the-art AU detection methods with impressive margins.
In our future work, we will introduce pre-trained models to improve the feature extraction capability of the stem network, and we would like to investigate the deployment of facial AU detection in real applications, such as automatically estimating facial palsy severity for patients. This will be helpful for the diagnosis and treatment of people who have facial palsy across the world. In collaboration with medical professionals, we will collect and annotate facial palsy datasets, such as Reference [38], to further validate the transferability and effectiveness of the proposed model.

References

[1]
Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. 2018. VGGFace2: A dataset for recognising faces across pose and age. In IEEE FG. 67–74.
[2]
Yingjie Chen, Diqi Chen, Tao Wang, Yizhou Wang, and Yun Liang. 2022. Causal intervention for subject-deconfounded facial action unit recognition. In AAAI, Vol. 36. 374–382.
[3]
Yuedong Chen, Guoxian Song, Zhiwen Shao, Jianfei Cai, Tat-Jen Cham, and Jianmin Zheng. 2022. GeoConv: Geodesic guided convolution for facial action unit recognition. Pattern Recognit. 122 (2022), 108355.
[4]
Ciprian Corneanu, Meysam Madadi, and Sergio Escalera. 2018. Deep structure inference network for facial action unit recognition. In ECCV. 298–313.
[5]
Zijun Cui, Chenyi Kuang, Tian Gao, Kartik Talamadupula, and Qiang Ji. 2023. Biomechanics-guided facial action unit detection through force modeling. In CVPR. 8694–8703.
[6]
Zijun Cui, Tengfei Song, Yuru Wang, and Qiang Ji. 2020. Knowledge augmented deep neural networks for joint facial expression and action unit recognition. NeurIPS 33 (2020), 14338–14349.
[7]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE CVPR. 248–255.
[8]
Paul Ekman and Erika L. Rosenberg. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the Facial Action Coding System (FACS). Oxford University Press.
[9]
Xuri Ge, Joemon M. Jose, Pengcheng Wang, Arunachalam Iyer, Xiao Liu, and Hu Han. 2023. ALGRNet: Multi-relational adaptive facial action unit modelling for face representation and relevant recognitions. IEEE Trans. Biom. Behav. Ident. Sci. (2023).
[10]
Xuri Ge, Pengcheng Wan, Hu Han, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, and Xiao Liu. 2021. Local global relational network for facial action units recognition. In IEEE FG. IEEE, 01–08.
[11]
Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 5-6 (2005), 602–610.
[12]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE CVPR. 770–778.
[13]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735–1780.
[14]
Geethu Miriam Jacob and Bjorn Stenger. 2021. Facial action unit detection with transformers. In IEEE CVPR. 7680–7689.
[15]
Shashank Jaiswal and Michel Valstar. 2016. Deep learning the dynamic appearance and shape of facial action units. In IEEE WACV. 1–8.
[16]
Xibin Jia, Shaowu Xu, Yuhan Zhou, Luo Wang, and Weiting Li. 2023. A novel dual-channel graph convolutional neural network for facial action unit recognition. Pattern Recog. Lett. 166 (2023), 61–68.
[17]
Xibin Jia, Yuhan Zhou, Weiting Li, Jinghua Li, and Baocai Yin. 2022. Data-aware relation learning-based graph convolution neural network for facial action unit recognition. Pattern Recog. Lett. 155 (2022), 100–106.
[18]
Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[19]
Guanbin Li, Xin Zhu, Yirui Zeng, Qing Wang, and Liang Lin. 2019. Semantic relationships guided representation learning for facial action unit recognition. In AAAI. 8594–8601.
[20]
Wei Li, Farnaz Abtahi, and Zhigang Zhu. 2017. Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In IEEE CVPR. 1841–1850.
[21]
Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin. 2018. EAC-Net: Deep nets with enhancing and cropping for facial action unit detection. IEEE Trans. Pattern Anal. Mach. Intell. 40, 11 (2018), 2583–2596.
[22]
Xiaobai Li, Jukka Komulainen, Guoying Zhao, Pong-Chi Yuen, and Matti Pietikäinen. 2016. Generalized face anti-spoofing by detecting pulse from face videos. In IEEE ICPR. 4244–4249.
[23]
Yante Li, Xiaohua Huang, and Guoying Zhao. 2021. Micro-expression action unit detection with spatial and channel attention. Neurocomputing 436 (2021), 221–231.
[24]
Yong Li and Shiguang Shan. 2023. Contrastive learning of person-independent representations for facial action unit detection. IEEE Trans. Image Process. (2023).
[25]
Yongqiang Li, Shangfei Wang, Yongping Zhao, and Qiang Ji. 2013. Simultaneous facial feature tracking and facial expression recognition. IEEE Trans. Image Process. 22, 7 (2013), 2559–2573.
[26]
Yante Li and Guoying Zhao. 2021. Intra-and inter-contrastive learning for micro-expression action unit detection. In ACM ICMI. 702–706.
[27]
Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. 2016. Semantic object parsing with graph lstm. In ECCV. Springer, 125–143.
[28]
Peng Liu, Zheng Zhang, Huiyuan Yang, and Lijun Yin. 2019. Multi-modality empowered network for facial action unit detection. In IEEE WACV. 2175–2184.
[29]
Ping Liu, Joey Tianyi Zhou, Ivor Wai-Hung Tsang, Zibo Meng, Shizhong Han, and Yan Tong. 2014. Feature disentangling machine-a novel approach of feature selection and disentangling in facial expression analysis. In ECCV. 151–166.
[30]
Zhilei Liu, Jiahui Dong, Cuicui Zhang, Longbiao Wang, and Jianwu Dang. 2020. Relation modeling with graph convolutional networks for facial action unit detection. In MMM. Springer, 489–501.
[31]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE ICCV. 10012–10022.
[32]
Cheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. 2022. Learning multi-dimensional edge feature-based AU relation graph for facial action unit recognition. In IJCAI. 1239–1246.
[33]
Chen Ma, Li Chen, and Junhai Yong. 2019. AU R-CNN: Encoding expert prior knowledge into R-CNN for action unit detection. Neurocomputing 355 (2019), 35–47.
[34]
S. Mohammad Mavadati, Mohammad H. Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F. Cohn. 2013. DISFA: A spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 4, 2 (2013), 151–160.
[35]
Xuesong Niu, Hu Han, Shiguang Shan, and Xilin Chen. 2019. Multi-label co-regularization for semi-supervised facial action unit recognition. In NIPS. 909–919.
[36]
Xuesong Niu, Hu Han, Songfan Yang, Yan Huang, and Shiguang Shan. 2019. Local relationship learning with person-specific shape regularization for facial action unit detection. In IEEE CVPR. 11917–11926.
[37]
Xuesong Niu, Hu Han, Jiabei Zeng, Xuran Sun, Shiguang Shan, Yan Huang, Songfan Yang, and Xilin Chen. 2018. Automatic engagement prediction with GAP feature. In ACM ICMI. 599–603.
[38]
Brian F. O’Reilly, John J. Soraghan, Stewart McGrenary, and Shu He. 2010. Objective method of assessing and presenting the House-Brackmann and regional grades of facial palsy by production of a facogram. Otol. Neurotol. 31, 3 (2010), 486–491.
[39]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In NIPS. 8026–8037.
[40]
David R. Rubinow and Robert M. Post. 1992. Impaired recognition of affect in facial expression in depressed patients. Biolog. Psychiat. 31, 9 (1992), 947–953.
[41]
Nishant Sankaran, Deen Dayal Mohan, Nagashri N. Lakshminarayana, Srirangaraj Setlur, and Venu Govindaraju. 2020. Domain adaptive representation learning for facial action unit recognition. Pattern Recog. 102 (2020), 107127.
[42]
Ziqiao Shang, Congju Du, Bingyin Li, Zengqiang Yan, and Li Yu. 2023. MMA-Net: Multi-view mixed attention mechanism for facial action unit detection. Pattern Recog. Lett. (2023).
[43]
Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. 2018. Deep adaptive attention for joint facial action unit detection and face alignment. In ECCV. 705–720.
[44]
Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. 2021. JAA-Net: Joint facial action unit detection and face alignment via adaptive attention. Int. J. Comput. Vis. 129, 2 (2021), 321–340.
[45]
Zhiwen Shao, Zhilei Liu, Jianfei Cai, Yunsheng Wu, and Lizhuang Ma. 2019. Facial action unit detection using attention and relation learning. IEEE Trans. Affect. Comput. (2019).
[46]
Zhiwen Shao, Hengliang Zhu, Xin Tan, Yangyang Hao, and Lizhuang Ma. 2020. Deep multi-center learning for face alignment. Neurocomputing 396 (2020), 477–486.
[47]
Zhiwen Shao, Lixin Zou, Jianfei Cai, Yunsheng Wu, and Lizhuang Ma. 2020. Spatio-temporal relation and attention learning for facial action unit detection. arXiv preprint arXiv:2001.01168 (2020).
[48]
Jingang Shi, Iman Alikhani, Xiaobai Li, Zitong Yu, Tapio Seppänen, and Guoying Zhao. 2019. Atrial fibrillation detection from face videos by fusing subtle variations. IEEE Trans. Circ. Syst. Video Technol. 30, 8 (2019), 2781–2795.
[49]
Tengfei Song, Lisha Chen, Wenming Zheng, and Qiang Ji. 2021. Uncertain graph neural networks for facial action unit detection. In AAAI. 5993–6001.
[50]
Tengfei Song, Zijun Cui, Yuru Wang, Wenming Zheng, and Qiang Ji. 2021. Dynamic probabilistic graph convolution for facial action unit intensity estimation. In IEEE CVPR. 4845–4854.
[51]
Tengfei Song, Zijun Cui, Wenming Zheng, and Qiang Ji. 2021. Hybrid message passing with performance-driven structures for facial action unit detection. In IEEE CVPR. 6267–6276.
[52]
Tengfei Song, Wenming Zheng, Peng Song, and Zhen Cui. 2018. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans. Affect. Comput. 11, 3 (2018), 532–541.
[53]
Sima Taheri, Qiang Qiu, and Rama Chellappa. 2014. Structure-preserving sparse decomposition for facial expression analysis. IEEE Trans. Image Process. 23, 8 (2014), 3590–3603.
[54]
Gauthier Tallec, Arnaud Dapogny, and Kevin Bailly. 2022. Multi-order networks for action unit detection. IEEE Trans. Affect. Comput. (2022).
[55]
Yang Tang, Wangding Zeng, Dafei Zhao, and Honggang Zhang. 2021. PIAP-DF: Pixel-interested and anti person-specific facial action unit detection net with discrete feedback learning. In IEEE ICCV. 12899–12908.
[56]
Yan Tong and Qiang Ji. 2008. Learning Bayesian networks with qualitative constraints. In IEEE CVPR. 1–8.
[57]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
[58]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In ICLR. 1–12.
[59]
Shangfei Wang, Yanan Chang, and Can Wang. 2021. Dual learning for joint facial landmark detection and action unit recognition. IEEE Trans. Affect. Comput. (2021).
[60]
Zhouxia Wang, Tianshui Chen, Jimmy Ren, Weihao Yu, Hui Cheng, and Liang Lin. 2018. Deep reasoning with knowledge graph for social relationship understanding. arXiv preprint arXiv:1807.00504 (2018).
[61]
Xuehan Xiong and Fernando De la Torre. 2013. Supervised descent method and its applications to face alignment. In IEEE CVPR. 532–539.
[62]
Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI.
[63]
Huiyuan Yang, Taoyue Wang, and Lijun Yin. 2020. Adaptive multimodal fusion for facial action units recognition. In ACM MM. 2982–2990.
[64]
Huiyuan Yang, Lijun Yin, Yi Zhou, and Jiuxiang Gu. 2021. Exploiting semantic embedding and visual feature for facial action unit detection. In IEEE CVPR. 10482–10491.
[65]
Mingjing Yu, Huicheng Zheng, Zhifeng Peng, Jiayu Dong, and Heran Du. 2020. Facial expression recognition based on a multi-task global-local network. Pattern Recog. Lett. 131 (2020), 166–171.
[66]
Liangfei Zhang, Ognjen Arandjelovic, and Xiaopeng Hong. 2021. Facial action unit detection with local key facial sub-region based multi-label classification for micro-expression analysis. In ACM MM. 11–18.
[67]
Liangfei Zhang, Xiaopeng Hong, Ognjen Arandjelovic, and Guoying Zhao. 2021. Short and long range relation based spatio-temporal transformer for micro-expression recognition. arXiv preprint arXiv:2112.05851 (2021).
[68]
Xing Zhang, Lijun Yin, Jeffrey F. Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M. Girard. 2014. BP4D-spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image Vis. Comput. 32, 10 (2014), 692–706.
[69]
Yong Zhang, Weiming Dong, Bao-Gang Hu, and Qiang Ji. 2018. Classifier learning with prior probabilities for facial action unit recognition. In IEEE CVPR. 5108–5116.
[70]
Kaili Zhao, Wen-Sheng Chu, Fernando De la Torre, Jeffrey F. Cohn, and Honggang Zhang. 2015. Joint patch and multi-label learning for facial action unit detection. In IEEE CVPR. 2207–2216.
[71]
Kaili Zhao, Wen-Sheng Chu, Fernando De la Torre, Jeffrey F. Cohn, and Honggang Zhang. 2016. Joint patch and multi-label learning for facial action unit and holistic expression recognition. IEEE Trans. Image Process. 25, 8 (2016), 3931–3946.
[72]
Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. 2016. Deep region and multi-label learning for facial action unit detection. In IEEE CVPR. 3391–3399.

      Published In

      ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3 (June 2024), 646 pages. EISSN: 2157-6912. DOI: 10.1145/3613609. Editor: Huan Liu.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 29 March 2024
      Online AM: 09 February 2024
      Accepted: 23 January 2024
      Revised: 03 January 2024
      Received: 18 October 2023
      Published in TIST Volume 15, Issue 3

      Author Tags

      1. Facial action units
      2. graph attention network
      3. local-global interaction
      4. multi-level relational reasoning

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • China Scholarship Council (CSC) from the Ministry of Education of China
