1 Introduction
Facial action units (AUs) are defined as a set of facial muscle movements that correspond to a displayed expression according to the Facial Action Coding System (FACS) [8]. As a fundamental research problem, AU detection is beneficial to facial expression analysis [26, 65, 67] and has wide potential applications in diagnosing mental health issues [40, 48], improving e-learning experiences [37], detecting deception [22], etc. However, AU detection is challenging because of the difficulty of identifying the subtle facial changes caused by AUs and by differences in individual physiology. Some earlier studies [25, 56] design hand-crafted features to represent different local facial regions related to AUs, according to the corresponding movements of facial muscles. However, hand-crafted shallow features are not discriminative enough to represent the rich facial morphology. Hence, deep learning-based AU detection methods that rely on global and local facial features have been studied to enhance the feature representation of each AU.
Several recent works [29, 36, 41, 45] aim to enhance the corresponding AU feature representation by combining the affected features in a deep global face feature map. For instance, LP-Net [36] uses a Long Short-Term Memory (LSTM) model [13] to combine patch features from an equal-partition grid over a global Convolutional Neural Network (CNN) feature map. ARL [45] directly learns spatial attention from the global CNN features of independent AU branches, as shown in Figure 1(a). Reference [32] represents AU features separately from a shared full-face feature via multiple independent fully connected layers and models the relationships among all AUs in a graph. However, these methods struggle to accurately localize the muscle areas corresponding to AUs, which leads to potential interference from irrelevant regions. Such issues have been addressed by extracting AU-related features from regions of interest (ROIs) centered around the associated facial landmarks [43, 44, 71], which provide more precise muscle locations for AUs and lead to better AU detection performance. For example, JAA [43] and J\(\rm \hat{A}\)ANet [44] propose attention-based deep models that adaptively select the highly contributing neighboring pixels of initially predefined muscle regions for joint AU detection and face alignment, as shown in Figure 1(b). However, these local attention-based methods emphasize learning the appearance representation of each facial region based on detected landmarks while ignoring the intrinsic dependencies between different facial muscles. For example, AU2 ("Outer Brow Raiser") and AU7 ("Lid Tightener") are usually activated simultaneously in a scared face, and AU6 ("Cheek Raiser") and AU12 ("Lip Corner Puller") usually co-occur in a smiling face. To this end, some methods [6, 30, 35, 69] try to utilize prior knowledge of AU correlation by defining a fixed graph that represents the statistical AU correlations. For instance, Reference [30] constructs a predefined graph for each face based on the AU co-occurrences to explicitly model the relationships between AU regions and enhance their semantic representations. However, a single predefined graph can hardly capture the dynamic relationships between AUs or distinguish related AUs, owing to the complexity of AU activation and the diversity across subjects. Recent works [49, 50, 51] attempt to exploit adaptive graphs to model the uncertain relationships between AUs. For instance, Reference [50] learns important local facial regions based on a probabilistic graph and obtains better facial appearance features via an LSTM [11]. However, these approaches still enhance the semantic AU representations from the perspective of better regional feature representation, neglecting the modeling of the distinctive local and global features of each AU.
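As a generic illustration of the fixed, statistics-based graphs discussed above, the following sketch builds an AU adjacency matrix from co-occurrence counts in binary training labels. It is a hypothetical example, not the exact construction used in Reference [30] or in our method; the function name and label layout are assumptions.

```python
# Hypothetical sketch: prior AU graph from label co-occurrence statistics.
import numpy as np

def cooccurrence_adjacency(labels: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """labels: (num_samples, num_aus) binary AU activations."""
    counts = labels.T @ labels                      # joint activation counts
    occurrence = np.maximum(labels.sum(axis=0), eps)
    cond_prob = counts / occurrence[None, :]        # P(AU_i = 1 | AU_j = 1)
    np.fill_diagonal(cond_prob, 1.0)                # self-loops
    return cond_prob                                # used as fixed edge weights

# Example with 12 AUs and 1,000 annotated frames (random labels for illustration):
# adj = cooccurrence_adjacency(np.random.randint(0, 2, (1000, 12)))
```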
The key issue in facial AU detection lies in obtaining a better facial appearance representation by improving the discriminative ability of both local AU features and global features from the whole face. On the one hand, region-level dynamic AU relevance mining based on facial landmarks accurately locates the corresponding muscles and flexibly models the relevance among muscle regions. This differs from existing methods that focus on extracting features for a single AU region [43, 44] or rely on a predefined fixed graph representing prior knowledge [19]. Although many methods [30, 49, 50, 51] model relationships between AU regions, this issue has not yet been addressed effectively. On the other hand, because of differences in expressions, postures, and individuals, fully learning the responses of the target AU over the global face can better capture the contextual differences between AUs and supply more semantic details from the global face. For instance, References [43, 44] simply concatenate the global features extracted from the whole face via CNNs with all local AU features as input to the final classifier. However, it is difficult for these methods to learn the sensitivity of the target AU within the global face and to supplement enough semantic details from the global face representation across different expressions, postures, and individuals. To the best of our knowledge, how to better capture the global response of each AU remains unexplored in existing works [19, 30, 32, 44].
Motivated by the above insights, we propose a novel technique for facial AU detection called MGRR-Net. Our main innovations lie in three aspects, as shown in Figure 1(c). First, we introduce a dynamic graph to model and reason about the relationships between a target AU and other AUs, where the region-level AU features (as nodes) accurately locate the corresponding muscles. Second, we supplement each AU with different levels (channel- and pixel-level) of attention-aware details from global features, which greatly improves the distinction between AUs. Finally, we iteratively refine the AU features through the proposed multi-level local-global relational reasoning layer, which makes them more robust and more interpretable. Different from existing GNN-based approaches [19, 30, 32, 35, 49, 51] that utilize complex GCNs [18] to enhance the distinguishability of AUs by constructing AU relationships, we supplement each AU with different perspectives (channel- and pixel-level) of attention-aware details from global features, which achieves the same purpose with a basic GNN and alleviates the over-smoothing issue to a certain extent. In particular, we extract the global features with multi-layer CNNs and precise AU region features based on the detected facial landmarks, which serve as the inputs of each multi-level relational reasoning layer. A simple region-level AU graph is constructed to represent the relationships among AU regions (as nodes) via an adjacency matrix (as edges) that is initialized with prior knowledge and iteratively updated. We learn channel- and pixel-wise semantic relations for different AUs simultaneously by processing them in two separate, efficient and effective multi-head graph attention networks (MH-GATs) [58]; through this, we model the complementary channel- and pixel-level global details. After these local and global relation-oriented modules, a hierarchical gated fusion strategy selects the most useful information for the final AU representation of each individual.
The contributions of this work are as follows:
— We propose a novel end-to-end iterative reasoning and training scheme for facial AU detection, which leverages complementary multi-level local-global feature relationships to improve the robustness and discrimination of AU detection;
— We construct a region-level AU graph initialized with prior knowledge and dynamically reason about the correlations among individual AUs, thereby improving the robustness of AU detection;
— We propose a GAT-based model to improve the discrimination of each local AU patch by supplementing multiple levels of global features;
— The proposed MGRR-Net outperforms state-of-the-art approaches for AU detection on two widely used benchmarks, i.e., BP4D and DISFA, without any external data or pre-trained models.