0% found this document useful (0 votes)
61 views15 pages

Face Recog Nation

This document summarizes a research paper that proposes a joint framework to simultaneously solve the problems of face alignment (localizing landmarks on 2D face images) and 3D face reconstruction from a single image. The method uses two sets of regressors that are applied iteratively - one set updates the 2D landmark positions based on texture features, while the other set refines the reconstructed 3D face shape using the estimated landmarks as clues. This joint approach can fully automatically generate pose-normalized 3D face shapes and localize both visible and invisible landmarks on faces with arbitrary poses and expressions, outperforming previous methods that handled these tasks separately.

Uploaded by

Andrew Taghulihi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views15 pages

Face Recog Nation

This document summarizes a research paper that proposes a joint framework to simultaneously solve the problems of face alignment (localizing landmarks on 2D face images) and 3D face reconstruction from a single image. The method uses two sets of regressors that are applied iteratively - one set updates the 2D landmark positions based on texture features, while the other set refines the reconstructed 3D face shape using the estimated landmarks as clues. This joint approach can fully automatically generate pose-normalized 3D face shapes and localize both visible and invisible landmarks on faces with arbitrary poses and expressions, outperforming previous methods that handled these tasks separately.

Uploaded by

Andrew Taghulihi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

See

discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/319035816

Joint Face Alignment and 3D Face Reconstruction


with Application to Face Recognition

Article · August 2017

CITATIONS READS

0 14

4 authors, including:

Liu Feng Qijun Zhao


Sichuan University Sichuan University
10 PUBLICATIONS 50 CITATIONS 31 PUBLICATIONS 141 CITATIONS

SEE PROFILE SEE PROFILE

Xiaoming Liu
Michigan State University
132 PUBLICATIONS 2,186 CITATIONS

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Face recognition View project

medical image analysis View project

All content following this page was uploaded by Xiaoming Liu on 26 August 2017.

The user has requested enhancement of the downloaded file.


1

Joint Face Alignment and 3D Face


Reconstruction with Application to Face
Recognition
Feng Liu, Qijun Zhao, Member, IEEE, Xiaoming Liu, Member, IEEE and Dan Zeng

Abstract—Face alignment and 3D face reconstruction are traditionally accomplished as separated tasks. By exploring the strong
correlation between 2D landmarks and 3D shapes, in contrast, we propose a joint face alignment and 3D face reconstruction method to
simultaneously solve these two problems for 2D face images of arbitrary poses and expressions. This method, based on a summation
arXiv:1708.02734v1 [cs.CV] 9 Aug 2017

model of 3D face shapes and cascaded regression in 2D and 3D face shape spaces, iteratively and alternately applies two cascaded
regressors, one for updating 2D landmarks and the other for 3D face shape.The 3D face shape and the landmarks are correlated via a
3D-to-2D mapping matrix, which is updated in each iteration to refine the location and visibility of 2D landmarks. Unlike existing
methods, the proposed method can fully automatically generate both pose-and-expression-normalized (PEN) and expressive 3D face
shapes and localize both visible and invisible 2D landmarks. Based on the PEN 3D face shapes, we devise a method to enhance face
recognition accuracy across poses and expressions. Extensive experiments show that the proposed method can achieve the
state-of-the-art accuracy in both face alignment and 3D face reconstruction, and benefit face recognition owing to its reconstructed
PEN 3D face shapes.

Index Terms—3D face reconstruction; face alignment; cascaded regression; pose and expression normalization; face recognition.

1 I NTRODUCTION

T HREE-dimensional (3D) face models have recently been


employed to assist pose or expression invariant face
recognition and achieve state-of-the-art performance [1], [2],
[3]. A crucial step in these 3D face assisted face recognition
methods is to reconstruct the 3D face model from a two-
dimensional (2D) face image. Besides its applications in face
recognition, 3D face reconstruction is also useful in other
face-related tasks, e.g., facial expression analysis [4], [5] and
facial animation [6], [7]. While many 3D face reconstruction
methods are available, they mostly require landmarks on the
face image as input, and are difficult to handle large-pose
faces that have invisible landmarks due to self-occlusion.
Existing studies tackle the problems of facial landmark
localization (or face alignment) and 3D face reconstruction
separately. However, these two problems are chicken-and-
egg problems. On one hand, 2D face images are projections
of 3D faces onto the 2D plane. Given a 3D face and a 3D-to- Fig. 1. We view 2D landmarks as being generated from a 3D face
2D mapping function, it is easy to compute the visibility and through 3D expression (fE ) and pose (fP ) deformation, and camera
projection (fC ). While conventional face alignment and 3D face recon-
position of 2D landmarks. On the other hand, the landmarks struction are two separated tasks and the latter requires the former as the
provide rich information about facial geometry, which is input, this paper performs these two tasks jointly, i.e., reconstructing
the basis of 3D face reconstruction. Figure 1 illustrates the a 3D face and estimating visible/invisible landmarks (green/red points)
correlation between 2D landmarks and 3D faces. That is, from a 2D face image with arbitrary poses and expressions.
the visibility and position of landmarks in the projected 2D
image are determined by three factors: the 3D face shape,
the deformation due to expression and pose, and the camera projection parameters. Given such a clear correlation between
2D landmarks and 3D shape, it is evident that they should ideally
• Feng Liu, Dan Zeng and Qijun Zhao are with the National Key Labora- be solved jointly, instead of separately as in prior works - indeed
tory of Fundamental Science on Synthetic Vision, College of Computer this is the core of this work.
Science, Sichuan University, Chengdu, Sichuan 610065, P. R. China. Motivated by the aforementioned observation, this paper
Qijun Zhao is the corresponding author, reachable at qjzhao@scu.edu.cn.
• Xiaoming Liu is with the Dept. of Computer Science and Engineering, proposes a unified framework to simultaneously solve the
Michigan State University, East Lansing, MI 48824, U.S.A. two problems of face alignment and 3D face shape recon-
• Manuscript received June 25, 2017. struction. Two sets of regressors are jointly learned from a
training set of pairing annotated 2D face images and 3D face
2

shapes. Based on the texture features around landmarks on to regularize the landmark locations, but it employs dis-
a 2D face image, one set of regressors (called landmark re- criminative local texture models. Regression based meth-
gressors) gradually move the landmarks towards their true ods [18], [19], [20], [21] have been recently proposed to
positions. By utilizing the landmarks on the 2D face image directly estimate landmark locations by applying cascaded
as clues, the other set of regressors (called shape regressors) regressors to an input 2D face image. These methods mostly
gradually improve the reconstructed 3D face shape. These do not consider the visibility of facial landmarks under
two sets of regressors are alternately and iteratively applied. different view angles. Consequently, their performance de-
Specifically, in each iteration, adjustment to the landmarks grades substantially for non-frontal faces, and their detected
is firstly estimated via the landmark regressors, and this landmarks could be ambiguous because the anatomically
landmark adjustment is also used to estimate 3D face shape correct landmarks might be invisible due to self-occlusion
adjustment via the shape regressors. The 3D-to-2D mapping (see Fig. 1).
is then computed based on the adjusted 3D face shape and A few methods focused on large-pose face alignment,
2D landmarks, and it further refines the 2D landmarks. which can be roughly divided into two categories: multi-
A preliminary version of this work was published in the view based and 3D model based. Multi-view based meth-
14th European Conference on Computer Vision (ECCV2016) ods [22], [23] define different sets of landmarks as templates,
[8]. We further extend the work from three aspects. (i) We one for each view range. Given an input image, they fit the
explicitly reconstruct expression deformation of 3D faces, multi-view templates to it and choose the best fitted one as
so that both PEN (pose and expression normalized) and the final result. These methods are usually complicated to
expressive 3D face shapes can be reconstructed. (ii) We apply, and can not detect invisible self-occluded landmarks.
present in detail the application of the proposed method to 3D model based methods, in contrast, can better handle self-
face recognition. (iii) We carry out more extensive evaluation occluded landmarks with the assistance of 3D face models.
with comparison to state-of-the-art methods. In summary, Their basic idea is to fit a 3D face model to the input
this paper makes the following contributions. image to recover the 3D landmark locations. Most of these
methods [10], [24], [25], [26], [27] use 3D morphable models
• We present a novel cascaded coupled-regressor
(3DMM) [28] – either a simplified one with a sparse set of
based method for joint face alignment and 3D face
landmarks [10], [25] or a relatively dense one [24]. They
reconstruction from a single 2D image of arbitrary
estimate the 3DMM parameters by using cascaded regres-
pose and expression.
sors with texture features as the input. In [25], the visibility
• By integrating 3D shape information, the proposed
of landmarks is explicitly computed, and the method can
method can more accurately locate landmarks on
cope with face images of yaw angles ranging from −90◦
images of arbitrary view angles in [−90◦ , 90◦ ].
to 90◦ , whereas the method in [24] does not work properly
• We explicitly deal with expression deformation of
for faces of yaw angles beyond 60◦ . In [29], Tulyakov and
3D faces, so that both PEN and expressive 3D face
Sebe propose to directly estimate the 3D landmark locations
shapes can be reconstructed at a high accuracy.
via texture-feature-based regressors for faces of yaw angles
• We propose a 3D-enhanced approach to improve face
upto 50◦ .
recognition accuracy on off-angle and expressive face
These existing 3D model based methods establish regres-
images based on the reconstructed PEN 3D faces.
sions between 2D image features and 3D landmark locations
• We achieve state-of-the-art 3D face reconstruction
(or indirectly, 3DMM parameters). While our proposed ap-
and face alignment performance on BU3DFE [5],
proach is also based on 3D model, unlike existing methods,
AFLW [9], and AFLW2000 3D [10] databases. We
it carries out regressions both on 2D images and in the 3D
investigate the other-race effect on 3D face recon-
space. Regressions on 2D images predict 2D landmarks,
struction of the proposed method on FRGC v2.0
while regressions in the 3D space predict 3D landmarks
database [11]. We demonstrate the effectiveness of
coordinates. By integrating both regressions, our proposed
our proposed 3D-enhanced face recognition method
method can more accurately estimate or localize landmarks,
in improving state-of-the-art deep learning based
and better handle self-occluded landmarks. It thus works
face matchers on Multi-PIE database [12].
well for images of arbitrary view angles in [−90◦ , 90◦ ].
The rest of this paper is organized as follows. Section
2 briefly reviews related work in the literature. Section 3
introduces in detail the proposed joint face alignment and 2.2 3D Face Reconstruction
3D face reconstruction method. Section 4 shows its applica- Estimating the 3D face geometry from a single 2D image
tion to face recognition. Section 5 reports the experimental is an ill-posed problem. Existing methods, such as Shape
results. Section 6 concludes the paper. from Shading (SFS) and 3DMM, thus heavily depend on
priors or constraints. SFS based methods [30], [31] usually
utilize an average 3D face model as a reference, and assume
2 P RIOR W ORK
the Lambertian lighting model for the 3D face surface. One
2.1 Face Alignment limitation of SFS methods lies in its assumed connection
Classical face alignment methods, including Active Shape between 2D texture clues and 3D shape, which is too weak
Model (ASM) [13], [14] or Active Appearance Model to discriminate among different individuals. 3DMM [1],
(AAM) [15], [16], search for landmarks based on global [28], [32], [33] establishes statistical parametric models for
shape models and generative texture models. Constrained both texture and shape, and represents a 3D face as a linear
Local Model (CLM) [17] also utilizes global shape models combination of basis shapes and textures. To recover the
3

3D face from a 2D image, 3DMM-based methods estimate


the combination coefficients by minimizing the discrepancy
between the input 2D face image and the one rendered from
the reconstructed 3D face. They can better cope with 2D face
images of varying illuminations and poses. However, they
still suffer from invisible facial landmarks when the input
face has large pose angles. To deal with extreme poses, Lee
et al. [34], Qu et al. [35] and Liu et al. [36] propose to discard
the self-occluded landmarks or treat them as missing data.
All the aforementioned 3D face reconstruction meth- Fig. 2. A 3D face shape of a subject (S ) is represented as summation
ods require landmarks as input. Consequently, they either of the mean pose-and-expression-normalized (PEN) 3D face shape (S̄ ),
manually mark the landmarks, or employ standalone face the difference between the subject’s PEN 3D face shape and the mean
PEN 3D face shape (∆SId ), and the expression deformation (∆SExp ).
alignment methods to automatically locate the landmarks.
Very recently, Tran et al. [37] propose a convolutional neural
network (CNN) based method to estimate discriminative [47] propose to directly transform a non-frontal face image
3DMM parameters directly from single 2D images with- into frontal one by Learning a Displacement Field network
out requirement of input landmarks. Yet, existing methods (LDF-Net). LDF-Net achieves state-of-the-art performance
always generate 3D faces that have the same pose and for face recognition across poses on Multi-PIE, especially
expression as the input image, which may not be desired in at large poses. To summarize, all these existing methods
face recognition due to the challenge of matching 3D faces carry out pose and expression normalization on 2D face
with expressions [38]. In this paper, we improve 3D face images and utilize merely 2D features for recognition. In this
reconstruction by (i) integrating the face alignment step into paper, on the contrary, we will generate pose and expression
the 3D face reconstruction procedure, and (ii) reconstructing normalized 3D faces from the input 2D face images, and use
both expressive and PEN 3D faces, which is believed to be these 3D faces to improve the unconstrained face recogni-
useful for face recognition. tion accuracy.

2.3 Unconstrained Face Recognition 3 P ROPOSED M ETHOD


Face recognition has been developed rapidly during the In this section, we introduce the proposed joint face align-
past decade, especially since the emergence of deep learning ment and 3D face reconstruction method in detail. We start
techniques. Automated methods [39], [40], [41] even beat by defining the 3D face model with separable identity and
humans at face recognition accuracy on the labelled faces in expression components, and based on this model formulat-
the wild (LFW) benchmark database. Yet, it is still very chal- ing the problem to be solved in this paper. We then give the
lenging to recognize faces in unconstrained 2D images with overall procedure of the proposed method. Afterwards, the
large pose angles or intensive expressions [42], [43], [44]. preparation of training data is presented, followed by the
Potential reasons for degraded face recognition accuracy on detailed introduction of key steps in the proposed method,
off-angle and expressive face images include (i) off-angle including learning 2D landmark and 3D shape regressors,
faces usually have less discriminative texture information and estimating 3D-to-2D mapping and landmark visibility.
for identification than frontal faces, resulting in small inter-
class differences, (ii) cross-view face images (e.g., frontal 3.1 Problem Formulation
and profile face images) may have very limited features We denote an n-vertex frontal pose 3D face shape of one
in common, leading to large intra-class differences, and subject as
(iii) pose and expression variations could cause substantial  
x1 x2 ··· xn
deformation to face images.
S =  y1 y2 ··· yn  ∈ R3×n , (1)
Existing methods recognize off-angle and expressive
z1 z2 ··· zn
faces either by extracting invariant features or by normaliz-
ing out the deformation caused by pose or expression. Yi et and represent it as a summation of three components:
al. [45] fitted a 3D face mesh to the input arbitrary view face S = SId + ∆SExp = S̄ + ∆SId + ∆SExp , (2)
images, and extracted pose-invariant features based on the
3D face mesh that was adaptively deformed to the input face where S̄ is the mean of frontal pose and neutral expression
images. In DeepFace [46], the input face images were first 3D face shapes, termed pose-and-expression-normalized
aligned to the frontal view with assistance of a generic 3D (PEN) 3D face shapes in this paper, ∆SId is the difference
face model, and then recognized based on a deep network. between the subject’s PEN 3D face shape (denoted as SId )
Zhu et al. [3] proposed to generate frontal and neutral face and S̄ , and ∆SExp is the expression-induced deformation in
images from the input images by using a method based on S with respect to SId . See Fig. 2.
3DMM [28] and deep convolutional neural networks. Very We use SL to denote a subset of S with columns corre-
recently, generative adversarial networks (GAN) have been sponding to l landmarks. The projections of these landmarks
explored by Tran et al. [44] for unconstrained face recog- onto a 2D face image I of the subject with arbitrary view are
nition. They devised a novel network, namely DR-GAN, represented by
which can simultaneously synthesize frontal face images
 
u1 u2 ··· ul
and learn pose-invariant feature representations. Hu et al. U= = fC ◦fP (SL ) ∈ R2×l , (3)
v1 v2 ··· vl
4

Fig. 3. Flowchart of the proposed joint face alignment and 3D face reconstruction method.

where fC and fP are, respectively, camera projection and Updating 3D face shape In this step, the above-obtained
pose-caused deformation. In this paper, we employ a 3D- landmark location adjustment is used to estimate the ad-
to-2D mapping matrix M ≈ fC ◦ fP to approximate the justment of the 3D face shape ∆Sk , which consists of two
composite effect of pose induced deformation and camera components, ∆SkId and ∆SkExp . Specifically, a regression
projection. k
function RS models the correlation between the landmark
Given an input 2D face image I, our goal is to location adjustment ∆Uk and the expected adjustment
simultaneously localize its landmarks U and estimate ∆SkId and ∆SkExp , i.e.,
its PEN 3D face shape SId and expression deforma-
tion ∆SExp . Note that, in some context, we also write ∆Sk = [∆SkId ; ∆SkExp ] = RSk (∆Uk ). (5)
the 3D face shape and the landmarks as column vec- The 3D shape can be then updated by S = S k k−1
+ ∆SkId
+
tors: S = (x1 , y1 , z1 , x2 , y2 , z2 , · · · , xn , yn , zn )T , and U = ∆SkExp . The method for learning these shape regressors will
(u1 , v1 , u2 , v2 , · · · , ul , vl )T , where ‘T’ is transpose operator. be given in Sec. 3.5.
Refining landmarks Once a new estimate of the 3D
3.2 The Overall Procedure shape is obtained, the landmarks can be further refined
Figure 3 shows the flowchart of the proposed method. For accordingly. For this purpose, the 3D-to-2D mapping matrix
the input 2D face image I, its 3D face shape S is initialized is needed. Hence, we estimate Mk based on Sk and Ûk . The
as the mean PEN 3D shape of training faces (i.e., S0 = S̄). Its refined landmarks Uk can be then obtained by projecting
landmarks U are initialized by placing the mean landmarks Sk onto the image via Mk according to Eq. (3). During this
of training frontal and neutral faces into the face region process, the visibility of the landmarks is also re-computed.
specified by a bounding box in I via similarity transforms. U Details about this step will be given in Sec. 3.6.
and S are iteratively updated by applying a series of regres-
sors. Each iteration contains three main steps: (i) updating 3.3 Training Data Preparation
landmarks, (ii) updating 3D face shape, and (iii) refining
Before we provide the details about the three steps, we first
landmarks.
Updating landmarks This step updates the landmarks’ introduce the training data needed for learning the land-
marks and 3D shape regressors, which will also facilitate the
locations from Uk−1 to Ûk based on the texture features
understanding of our learning algorithms. Since the purpose
in the input 2D image. This is similar to the conventional
of these regressors is to gradually adjust the estimated
cascaded regressor based 2D face alignment [18]. The ad-
landmarks and 3D shape towards their ground truth values,
justment to the landmarks’ locations in k th iteration, ∆Uk
we need a sufficient number of triplet data {(Ii , S∗i , U∗i )|i =
is determined by the local texture feature around Uk−1 via
1, 2, · · · , N }, where S∗i and U∗i are, respectively, the ground
a regressor,
truth 3D shape and landmarks for the image Ii , and N is
∆Uk = RU k
(h(I, Uk−1 )), (4)
the total number of training samples. All the 3D face shapes
where h(I, U) denotes the texture feature extracted around have been established dense correspondences among their
k
the landmarks U in the image I, and RU is a regression vertices; in other words, they have the same number of
function. The landmarks can then be updated by Ûk = vertices, and vertices of the same index in the 3D face shapes
Uk−1 + ∆Uk . The method for learning these landmark have the same semantic meaning. Here, each of the ground
regressors will be introduced in Sec. 3.4. truth 3D face shapes includes two parts, the PEN 3D face
5

shape S∗Id and its expression shape S∗Exp = S̄ + ∆S∗Exp ,


i.e., S∗ = [S∗Id ; S∗Exp ]. Moreover, both visible and invisible
landmarks in Ii have been annotated and included in U∗i .
For invisible landmarks, the annotated positions should be
anatomically correct positions (for example the red points in
Fig. 1).
Obviously, to enable the regressors to cope with expres-
sion and pose variations, the training data should contain
2D face images of varying expressions and poses. It is,
however, difficult to find in the public domain such data sets
of 3D face shapes and corresponding annotated 2D images
with various expressions/poses. Thus, we construct two sets
of training data by ourselves: one based on BU3DFE [5], and
the other based on 300W-LP [10], [48].
BU3DFE database contains 3D face scans of 56 females
and 44 males, acquired in neutral plus six basic expressions
(happiness, disgust, fear, anger, surprise and sadness). All
basic expressions are acquired at four levels of intensity.
These 3D face scans have been manually annotated with 84
landmarks (83 landmarks provided by the database plus one
nose tip marked by ourselves). For each of the 100 subjects,
we select the scans of neutral and the level-one intensity of
the rest six expressions as the ground truth 3D face shapes.
From each of the chosen seven scans of a subject, 19 face
Fig. 4. Example 2D face images with annotated landmarks (1st and 4th
images are rendered at different poses (−90◦ to 90◦ yaw rows), their 3D face shapes (2nd and 5th rows) and expression shapes
with a 10◦ interval) with landmark locations recorded. As (3rd and 6th rows) from the BU3DFE database. Seven expressions are
a result, each subject has 133 images of different poses and shown: angry (AN), disgust (DI), fear (FE), happy (HA), neutral (NE),
sad (SA), and surprise (SU). The 3D face shapes corresponding to the
expressions. We use the method in [49] to establish dense neutral expression are their PEN 3D face shapes, which are highlighted
correspondence of the 3D face scans of n = 5, 996 vertices. with blue boxes.
With the registered 3D face scans, we compute the mean
PEN 3D face shape by averaging all the subjects’ PEN 3D
face shapes, which are defined by their 3D face scans of
frontal pose and neutral expression. All the 2D face images
of one subject share the same PEN 3D face shape of the
subject, while their expression shapes can be obtained by
first subtracting from their corresponding 3D face scans
their PEN 3D face shape and then adding the mean PEN
3D face shape.
300W-LP database [10] is created based on 300W [48]
database, which is an integration of multiple face alignment Fig. 5. Example data of four subjects in the 300W-LP database. From
benchmark datasets (i.e., AFW [22], LFPW [50], HELEN [51], left to right: 2D face images with annotated landmarks, PEN 3D face
IBUG [48] and XM2VTS [52]). It includes 122,450 in-the-wild shapes, and expression shapes.
face images of a large variety of poses and expressions.
For each image, its corresponding registered PEN 3D face
shape and expression shape are estimated by using the as the landmark regressors, and learn them by fulfilling the
method in [3] based on BFM [53] and FaceWarehouse [54]. following optimization:
The obtained 3D face shapes have n = 53, 215 vertices.
Figure 5 shows some example 2D face images and corre- N
X  
sponding PEN 3D face shapes and expression shapes in our
k
RU = arg min k U∗i − Uk−1
i
k
− RU (h(Ii , Uk−1
i )) k22 ,
k
RU
constructed training datasets. i=1
(6)
which has a closed-form least-square solution. Note that
other regression schemes, such as CNN [26], can be easily
3.4 Learning Landmark Regressors
adopted in our framework.
According to Eq. (4), landmark regressors estimate the ad- We use 128-dim SIFT descriptors [55] as the local feature.
justment to Uk−1 such that the updated landmarks Uk are The feature vector of h is a concatenation of the SIFT
closer to their true positions. In the training phase, the true descriptors at all the l landmarks, i.e., a 128l-dim vector.
positions and visibility of the landmarks are given by the If a landmark is invisible, no feature will be extracted, and
ground truth U∗ . Therefore, the objective of the landmark its corresponding entries of h will be zero. It is worth men-
k
regressors RU is to better predict the difference between tioning that the regressors estimate the semantic positions
U k−1
and U∗ . In this paper, we employ linear regressors of all landmarks including invisible landmarks.
6

3.5 Learning 3D Shape Regressors Algorithm 1 Cascaded Coupled-Regressor Learning.


k
The landmark adjustment ∆U is also used as the input Input: Training data {(Ii , S∗i , U∗i )|i = 1, 2, · · · , N }, initial
to the 3D shape regressor RS k k
. The objective of RS is to shape S0i & landmarks U0i .
 k K
compute an update to the initially estimated 3D shape Sk−1 Output: Cascaded coupled-regressors RU , RSk k=1 .
in the k th iteration to minimize the difference between the 1: for k = 1, ..., K do
k
updated 3D shape and the ground truth. Using similar 2: Estimate RU via Eq. (6), and compute landmark ad-
linear regressors, the 3D shape regressors can be learned justment ∆Uki via Eq. (4);
by solving the following optimization via least squares: 3: Update landmarks Ûki for all images: Ûki = Uk−1 i +
N
∆Uki ;
k
Estimate RS via Eq. (7), and compute shape adjust-
 
4:
X
RSk = arg min k (S∗i − Sk−1
i ) − RSk ∆Uki k22 , (7) k
k
RS i=1
ment ∆Si via Eq. (5);
5: Update 3D face Ski : Ski = Sk−1i + ∆Ski ;
with its closed-form solution as 6: Estimate the 3D-to-2D mapping matrix Mki via Eq.
RSk = ∆Sk (∆Uk )T (∆Uk (∆Uk )T )−1 , (8) (9);
7: Compute the refined landmarks Uki via Eq. (3) and
where ∆Sk = S∗ − Sk−1 and ∆Uk are, respectively, the 3D their visibility via Eq. (10).
shape and landmark adjustment. S and U denote, respec- 8: end for
tively, the ensemble of 3D face shapes and 2D landmarks of
all training samples with each column corresponding to one
sample.
Since S ∈ R6n×N (recall that S has two parts, PEN shape
and expression deformation) and U ∈ R2l×N , it can be
mathematically shown that N should be larger than 2l so
that ∆Uk (∆Uk )T is invertible. Fortunately, since the set of
used landmarks is usually sparse, this requirement can be
easily satisfied in real-world applications.

3.6 Estimating 3D-to-2D Mapping and Landmark Visi-


bility
In order to refine the landmarks with the updated 3D
face shape, we have to project the 3D shape to the 2D Fig. 6. Block diagram of the proposed 3D-enhanced face recognition
image with a 3D-to-2D mapping matrix. In this paper, we method.
dynamically estimate the mapping matrix based on Sk and
Ûk . As discussed earlier in Sec. 3.1, the mapping matrix is
4 A PPLICATION TO FACE R ECOGNITION
a composite effect of pose induced deformation and camera
projection. Here, we assume a weak perspective projection In this section we explore the reconstructed 3D faces to
for the camera projection as in prior work [25], [56]. As a improve face recognition accuracy on off-angle and expres-
result, the mapping matrix Mk is represented by a 2 × 4 sive face images. The basic idea is to utilize the additional
matrix, and can be estimated as a least squares solution to feature provided by the reconstructed PEN 3D face shapes
the following fitting problem: and fuse it with conventional 2D face matchers. Figure 6
shows the 3D-enhanced face recognition method employed
Mk = arg min k Û k − Mk × SLk k22 . (9) in this paper. As can be seen, 3D face reconstruction meth-
Mk
ods are applied to both gallery and probe face images to
Once a new mapping matrix is computed, the landmarks generate PEN 3D face shapes. The iterative closest point
can be further refined as U k = Mk × SL k
. (ICP) algorithm [57] is applied to match the reconstructed
The visibility of the landmarks can be then computed normalized 3D face shapes. It aligns the 3D shapes re-
based on the mapping matrix M using the method in [25]. constructed from probe and gallery images, and computes
Suppose the average surface normal around a landmark in the distances between them, which are then converted to
the 3D face shape S is →

n . Its visibility v can be measured by similarity scores via subtracting them from the maximum
1
  
M1 M2
 distance. These scores are finally normalized to the range of
v= 1 + sgn → −
n · × , (10) [0, 1] via min-max normalization, and fused with the scores
2 kM1 k kM2 k
of the the conventional 2D face matcher on the gallery and
where sgn() is the sign function, ‘·’ means dot product and probe face images (which are within [0, 1] also) by using
‘×’ cross-product, and M1 and M2 are the left-most three a sum rule. The recognition result for a probe is defined
elements at the first and second row of the mapping matrix as the subject whose gallery sample has the highest match
M. This basically rotates the surface normal and validates if score with it. Note that we employ the ICP-based 3D face
it points toward the camera or not. shape matcher and the sum fusion rule for simplicity sake.
The whole process of learning the cascaded coupled Other more elaborated 3D face matchers and fusion rules
landmark and 3D shape regressors is summarized in Al- can be also applied with our proposed method. Thanks to
gorithm 1. the additional discriminative feature in PEN 3D face shapes
7

Fig. 7. Reconstruction results for a BU3DFE subject at nine different pose angles. First row: The input images. Second, forth, sixth, eighth and
tenth rows: The reconstructed 3D face shapes by [3], [37], [36], [8] and the proposed method. Third, fifth, seventh, ninth and eleventh rows: Their
corresponding NPDE maps. The colormap goes from dark blue to dark red (corresponding to an error between 0 and 5). The numbers under each
of the error maps represent mean and standard deviation values (in %).
8

TABLE 1
3D face reconstruction accuracy (MAE) of the proposed method and state-of-the-art methods at different yaw poses on the BU3DFE database.

Method ±90◦ ±80◦ ±70◦ ±60◦ ±50◦ ±40◦ ±30◦ ±20◦ ±10◦ 0◦ Avg.

Zhu et al. [3] - - - - - 2.73 2.74 2.56 2.32 2.22 2.51


Tran et al. [37] - - - - - 2.26 2.19 2.16 2.08 2.06 2.15
Liu et al. [36] 1.95 1.91 1.95 1.96 1.97 1.97 1.96 1.98 2.01 2.03 1.97
Liu et al. [8] 1.92 1.89 1.90 1.93 1.95 1.93 1.93 1.95 1.98 2.01 1.94
Proposed 1.85 1.83 1.83 1.83 1.86 1.89 1.90 1.91 1.90 1.91 1.87

and its robustness to pose and expression variations, the


accuracy of conventional 2D face matchers on off-angle and
expressive face images can be effectively improved after
fusion with the PEN 3D face shape based matcher. In the
next Section, we will experimentally demonstrate this.

5 E XPERIMENTS
We conduct three sets of experiments to evaluate the pro-
posed method in 3D face reconstruction, face alignment, and
benefits to face recognition.
Fig. 8. 3D face reconstruction accuracy (MAE) of the proposed method,
[36] and [3] under different expressions. i.e., angry (AN), disgust (DI),
fear (FE), happy (HA), neutral (NE), sad (SA) and surprise (SU).
5.1 3D Face Reconstruction Accuracy
To evaluate the 3D shape reconstruction accuracy, a 10-
fold cross validation is applied to split the BU3DFE data
into training and testing subsets, resulting in 11,970 train-
ing samples and 1,330 testing samples. We compare the
proposed method with its preliminary version in [8] and
another three state-of-the-art methods in [36], [3] and [37].
The methods in [8] and [37] reconstruct PEN 3D face shapes
only, while the methods in [36] and [3] reconstruct 3D face
shapes that have the same pose and expression as the input
images. Moreover, the method in [36] requires that visible
landmarks are available together with the input images. In
the following experiments, we use the visible landmarks Fig. 9. PEN 3D face reconstruction accuracy (MAE) of the proposed
method, [8] and [37] under different expressions. i.e., angry (AN), dis-
projected from ground truth 3D face shapes for the method gust (DI), fear (FE), happy (HA), neutral (NE), sad (SA) and surprise
in [36]. For the methods of [3] and [37], we use the imple- (SU).
mentation provided by the authors. In the implementation,
these two methods are based on the 68 landmarks that are
detected by using the method in [58]. As a result, they can sample, and zj∗ and ẑj are the ground truth and recon-
not be applied to face images of extreme pose angles (i.e., structed depth values at the j th vertex.
beyond 40 degrees). Reconstruction accuracy across poses Table 1 shows
Two metrics are used to evaluate the 3D face shape the average MAE of the proposed method under different
reconstruction accuracy: Mean Absolute Error (MAE) and pose angles of the input 2D images. For a fair compari-
Normalized Per-vertex Depth Error (NPDE). MAE is de- son with the counterpart methods, we only compute the
fined as [59] reconstruction error of neutral testing images. To compute
NT MAE, the reconstructed 3D faces should be first aligned
1 X
MAE = (kS∗i − Ŝi k/n), (11) to the ground truth. Since the results of [8], [36] and our
NT i=1 proposed method already have the same number of vertices
as the ground truth, we employ Procrustes alignment for
where NT is the total number of testing samples, S∗i and Ŝi these methods as being suggested by [60]. For the results
are the ground truth and reconstructed 3D face shape of the of [3] and [37], however, the number of vertices is different
ith testing sample. from the ground truth. Hence, we align them by using rigid
NPDE measures the depth error at the j th vertex in a iterative closest point method as [37] does. It can be seen
testing sample as [30] from Table 1 that the average MAE of the proposed method
NPDE(xj , yj ) = |zj∗ − ẑj | / (zmax
 ∗ ∗
− zmin ), (12) is lower than that of counterpart methods. Moreover, as
the pose angle becomes large, the error does not increase
∗ ∗
where zmax and zmin are the maximum and minimum substantially. This proves the effectiveness of the proposed
depth values in the ground truth 3D shape of the testing method in handling arbitrary view face images. Figure 7
9

Fig. 10. Reconstruction results for a BU3DFE subject under seven different expressions. The first row shows the input images. In the red box, we
show the reconstructed 3D shapes that have the same expression as the input images, using the methods of [36], [3] and the proposed method.
In the blue box, we show the reconstructed PEN 3D shapes obtained by [8], [37] and the proposed method. The NPDE maps of these results are
shown below the reconstructed 3D faces, which go from dark blue to dark red (corresponding to an error between 0 and 5). The numbers under
each of the error maps represent mean and standard deviation values (in %).
10

TABLE 2
Number and percentage of subjects of different genders and races in
the FRGC v2.0 database.

Asian Black Hispanic White Unknown Total


55 2 5 134 6 202
Female
(11.8%) (0.4%) (1.1%) (28.8%) (1.3%) (43.3%)
57 4 8 185 10 264
Male
(12.2%) (0.9%) (1.7%) (39.7%) (2.1%) (56.7%)
112 6 13 319 16 466
Total
(24.0%) (1.3%) (2.8%) (68.5%) (3.4%) (100%)

Fig. 11. 3D face reconstruction accuracy (MAE) of the proposed method


shows the reconstruction results of one subject. across different ethnic groups.
Reconstruction accuracy across expressions Figure 8
shows the average MAE of the proposed method and the
methods in [36] and [3] across expressions based on their worse accuracy on Black. These results reveal the variations
reconstructed 3D face shapes that have the same pose in the facial shapes of people from different races. Further-
and expression as the input images. The proposed method more, by combining the training data of Asian and White
overwhelms its counterpart method for all different expres- (Setting III), comparable reconstruction accuracy is achieved
sions. Moreover, as expressions change, the maximum MAE for both Asian and White, which is also comparable to those
variance of the methods in [3] and [36] are 15.7% and in Setting I and Setting II. This proves the capacity of the
17.9%, whereas that of the proposed method is 3.4%. This proposed method in handling the 3D facial shape variations
proves the superior robustness of the proposed method to among people from different ethnic groups.
expression variations.
Figure 9 compares the average MAE of the proposed 5.2 Face Alignment Accuracy
method and the methods in [8] and [37] across expressions
based on their reconstructed PEN 3D face shapes. Again, In face alignment accuracy evaluation, several state-of-the-
the proposed method shows superior performance both in art face alignment methods are considered for comparison
terms of MAE under all different expressions and robust- to the proposed method, including RCPR [62], ESR [19],
ness across expressions. We believe that the improvement SDM [18], 3DDFA and 3DDFA+SDM [10]. The dataset con-
achieved by the proposed method is owed to its explicit structed from 300W-LP is used for training, the AFLW [9]
modelling of expression deformation. Figure 10 shows the and AFLW2000-3D [10] are used for testing. AFLW contains
reconstruction results for a subject under seven expressions. 25,993 in-the-wild faces with large-pose variations (yaw
Reconstruction accuracy across races It is well known from −90◦ to 90◦ ). Each image is annotated with up to
that people from different races (e.g., Asian and Western 21 visible landmarks. For a fair comparison to [10], we
people) show different characteristics in their facial shapes. use the same 21,080 samples as our testing set, and divide
Such other-race effect has been reported in the face recogni- the testing set into three subsets according to the absolute
tion literature [61]. In this experiment, we study the impact yaw angles of the testing images: [0◦ , 30◦ ), [30◦ , 60◦ ) and
of races on the 3D face reconstruction accuracy based on [60◦ , 90◦ ]. The resulting three subsets have 11,596, 5,457
the FRGC v2.0 database [11]. FRGC v2.0 contains 3D face and 4,027 samples, respectively. AFLW2000-3D contains the
models and 2D face images of 466 subjects who are from dif- ground truth 3D faces and the corresponding 68 landmarks
ferent ethnic groups (see Table 2). Since these face data have of the first 2,000 AFLW samples. There are 1,306 samples
no variation in expression, the expression shape component in [0◦ , 30◦ ), 462 samples in [30◦ , 60◦ ) and 232 samples in
in our proposed model is set to zero. We use the method [60◦ , 90◦ ]. The bounding boxes provided by AFLW are used
in [49] to establish dense correspondence of the 3D faces of in the AFLW testing experiment, while the ground truth
n = 5, 996 vertices. We conduct three series of experiments: bounding boxes enclosing all 68 landmarks are used for the
(i) training using 100 Asian samples (denoted as Setting AFLW2000-3D testing experiment.
I), (ii) training using 100 White samples (denoted as Setting Normalized Mean Error (NME) [25] is employed to mea-
II), and (iii) training using 100 Asian and 100 White samples sure the face alignment accuracy. It is defined as the mean
(denoted as Setting III). The testing set contains the samples of the normalized estimation error of visible landmarks for
of the remaining subjects in FRGC v2.0, including 12 Asian, all testing samples:
6 Black, 13 Hispanic, 19 White and 16 Unknown races.
 
NT l
1 X 1 1 X
Figure 11 compares the 3D face reconstruction accuracy NME =  vij ||(ûij , v̂ij ) − (u∗ij , vij

)||,
(MAE) across different ethnic groups. Not surprisingly, NT i=1 di Niv j=1
training for one ethnic group can yield better accuracy on (13)
the same ethnic testing samples. As for the other-race effect, where di is the square root of the face bounding box area
the model trained on White achieves comparable accuracy of the ith testing sample, Niv is the number of visible
on White and Hispanic, but much worse accuracy on the landmarks in it, (u∗ij , vij

) and (ûij , v̂ij ) are, respectively, the
other races (and worst on Asian). On the other hand, the ground truth and estimated coordinates of its j th landmark.
model trained on Asian performs much worse on all the Table 3 provides the face alignment accuracy of different
other races compared with on its own race, and obtains the methods on the AFLW and AFLW2000-3D datasets. As can
11

TABLE 3
The face alignment accuracy (NME, %) of the proposed method and existing state-of-the-art methods on AFLW and AFLW2000-3D databases.

AFLW Database (21 points) AFLW2000-3D Database (68 points)


Method [0◦ , 30◦ ) [30◦ , 60◦ ) [60◦ , 90◦ ] Mean Std [0◦ , 30◦ ) [30◦ , 60◦ ) [60◦ , 90◦ ] Mean Std
RCPR [62] 5.43 6.58 11.53 7.85 3.24 4.26 5.96 13.18 7.80 4.74
ESR [19] 5.66 7.12 11.94 8.24 3.29 4.60 6.70 12.67 7.99 4.19
SDM [18] 4.75 5.55 9.34 6.55 2.45 3.67 4.94 9.76 6.12 3.21
3DDFA [10] 5.00 5.06 6.74 5.60 0.99 3.78 4.54 7.93 5.42 2.21
3DDFA+SDM [10] 4.75 4.83 6.38 5.32 0.92 3.43 4.24 7.17 4.94 1.97
Proposed 3.75 4.33 5.39 4.49 0.83 3.25 3.95 6.42 4.61 1.78

Fig. 12. The 68 landmarks detected by the proposed method for images in AFLW. Green/red points denote visible/invisible landmarks.

be seen, the proposed method achieves the best accuracy in [49] to establish dense correspondence with n = 5, 996
among all the considered methods for all poses and on vertices.
both datasets. In order to assess the robustness of different CMU Multi-PIE is a widely used benchmark database
methods to pose variations, we also report the standard for evaluating face recognition accuracy under pose, illumi-
deviations of the NME of different methods in Table 3. The nation and expression variations. It contains 2D face images
results again demonstrate the superiority of the proposed of 337 subjects collected under various views, expressions
method over the counterpart methods. Figure 12 shows the and lighting conditions. Here, we consider pose and expres-
landmarks detected by proposed method on some example sion variations, and conduct two experiments. In the first
images in AFLW. experiment, following the setting of [3], [69], probe images
consist of the images of all the 337 subjects at 12 poses (±90◦ ,
±75◦ , ±60◦ , ±45◦ , ±30◦ , ±15◦ ) with neutral expression and
5.3 Benefits to Face Recognition
frontal illumination. In the second experiment, instead of
While there are many recent face alignment and reconstruc- the neutral expression images, all the images with smile,
tion work [63], [64], [65], [66], [67], few work takes one step surprise, squint, disgust and scream expressions at the 12
further to evaluate the contribution of alignment or recon- poses and under frontal illumination are used as probe
struction to subsequent tasks. In contrast, we quantitatively images. This protocols is a extended and modified version
evaluate the effect of the reconstructed pose-expression- of [4] and [3] by using more large pose images (±60◦ ,
normalized (PEN) 3D face shapes on face recognition by ±75◦ , ±90◦ ). In both experiments, the frontal images of the
performing direct 3D to 3D shape matching and fuse it subjects captured in the first session are used as gallery.
with conventional 2D face recognition. Refer to Sec. 4 for And four state-of-the-art deep learning based (DL-based)
details of the PEN 3D face shapes enhanced face recognition face matchers are used as baseline 2D face matchers, i.e.,
method. VGG [70], Lightened CNN [71], CenterLoss [72] and LDF-
In this evaluation, we use the BU3DFE (13,300 images of Net [47]. The first three matchers are publicly available. We
100 subjects; refer to Sec. 3.3) and MICC [68] databases as evaluate them with all the 337 subjects in Multi-PIE. The last
training data, and the CMU Multi-PIE database [12] as test matcher, LDF-Net, is a latest matcher specially designed for
data. MICC contains 3D face scans and video clips (indoor, pose-invariant face recognition. It uses the first 229 subjects
outdoor and cooperative head rotations environments) of for training and the remaining 108 subjects for testing. Since
53 subjects. We randomly select face images with different it is not publicly available, we request the match scores
poses from the cooperative environment videos, resulting in from the authors, and fuse our 3D shape match scores with
11,788 images of 53 subjects and their corresponding neutral theirs. Note that given the good performance of LDF-Net,
3D face shapes (whose expression shape components are we assign higher weight (i.e., 0.7) to it, whereas the weight
thus set to zero). The 3D faces are processed by the method for all the other three baseline matchers is set as 0.5.
12

TABLE 4
Recognition accuracy in the first experiment on Multi-PIE by the four state-of-the-art DL-based face matchers before (indicated by suffix “2D”) and
after (indicated by suffix “Fusion”) the enhancement by our proposed method. Avg. is the average accuracy.

Method ±90◦ ±75◦ ±60◦ ±45◦ ±30◦ ±15◦ Avg.

VGG-2D 36.2% 66.9% 83.5% 93.8% 97.7% 98.6% 79.5%


Lightened CNN-2D 7.5% 31.5% 78.6% 96.3% 99.1% 99.8% 68.8%
CenterLoss-2D 48.2% 72.7% 92.6% 98.8% 99.6% 99.7% 85.3%
LDF-Net-2D 65.3% 86.2% 93.7% 98.4% 98.9% 98.6% 90.2%
VGG-Fusion 52.6% 75.2% 90.5% 96.8% 98.5% 99.4% 85.5%
Lightened CNN-Fusion 23.6% 45.3% 84.6% 97.6% 99.6% 99.9% 75.1%
CenterLoss-Fusion 63.7% 76.7% 92.5% 97.8% 98.4% 98.7% 88.0%
LDF-Net-Fusion 70.4% 87.6% 93.4% 98.1% 97.9% 97.7% 90.9%

TABLE 5
Recognition accuracy of the CenterLoss matcher in the second experiment on Multi-PIE. The results shown in brackets are obtained by using the
original CenterLoss matcher without enhancement by our reconstructed 3D shapes.

Pose \ Expression Smile Surprise Squint Disgust Scream Avg.


±90◦ 51.4%(36.9%) 46.1%(35.7%) 58.8%(38.7%) 42.0%(24.9%) 63.6%(52.4%) 52.4%(37.7%)
±75◦ 73.1%(67.0%) 56.6%(53.0%) 72.6%(67.8%) 52.5%(43.4%) 75.1%(71.6%) 66.0%(60.4%)
±60◦ 88.6%(89.8%) 80.2%(80.7%) 91.6%(88.2%) 74.6%(69.8%) 91.8%(92.7%) 85.4%(84.2%)
±45◦ 95.9%(97.6%) 89.4%(95.1%) 95.6%(97.8%) 86.7%(83.5%) 97.3%(98.7%) 93.0%(94.5%)
±30◦ 97.8%(99.1%) 93.1%(97.0%) 96.8%(99.3%) 90.4%(91.5%) 98.5%(99.8%) 95.3%(97.3%)
±15◦ 98.5%(99.6%) 95.6%(97.3%) 97.5%(100%) 92.6%(93.5%) 98.1%(99.2%) 96.5%(97.9%)
Avg. 84.2%(81.7%) 76.8%(76.5%) 85.5%(82.0%) 73.1%(67.8%) 87.4%(85.7%) 81.4%(78.7%)

Table 4 reports the rank-1 recognition accuracy of the


baseline face matchers in the first experiment. According to
the results in Table 4, the baseline matchers are all further
improved with our proposed method. Specifically, VGG and
Lightened CNN are consistently improved across different
pose angles when fused with 3D cues, while CenterLoss
gains substantial improvement at large pose angles (15.5%
at ±90◦ and 4.0% at ±75◦ ). Even for the best LDF-Net
method, the recognition accuracy is improved by 5.1% at
±90◦ and 1.4% at ±75◦ . For all the baseline matchers,
the larger the yaw angle is, the more evident the accuracy
improvement. This proves the effectiveness of the proposed Fig. 13. The reconstruction (a) and alignment (b) objective function
method in dealing with pose variations, as well as in re- values of the proposed method as iteration proceeds, when trained on
the BU3DFE database.
constructing individual 3D face shapes with discriminative
details that are beneficial to face recognition.
Given its best performance among the three publicly training the proposed method on the BU3DFE database. We
available baseline matchers, we employ the CenterLoss conduct ten-fold cross-validation experiments, and compute
matcher in the second experiment. The results are shown the objective function values through ten iterations. The
in Table 5. As can be seen, the compound impact of pose average results are shown in Fig. 13. It can be seen that both
and expression variations makes the face recognition more optimization processes converge in about five iterations.
challenging, resulting in obviously lower accuracy com- Hence, in our experiments, we set the number of iterations
pared with the results in Table 4. Yet, our proposed method as K = 5.
still improves the overall accuracy of the baseline matcher,
especially for the probe face images of large pose or disgust 5.5 Computational Complexity
expression. We believe that such performance gain in recog- According to our experiments on a PC with i7-4790 CPU and
nizing non-frontal and expressive faces is owing to the capa- 32 GB memory, the Matlab implementation of the proposed
bility of the proposed method in providing complementary method runs at ∼ 26 FPS (K = 5 and n = 5, 996). This
pose-and-expression-invariant discriminative features in 3D indicates that the proposed method can detect landmarks
face shape space. and reconstruct 3D face shape in real time. We also report
the consumed time of every step in Table 6, and comparison
5.4 Convergence with other existing methods in Table 7.
The proposed method has two alternate optimization pro-
cesses, one in 2D space for face alignment and the other in 6 C ONCLUSION
3D space for 3D shape reconstruction. We experimentally In this paper, we present a novel regression based method
investigate the convergence of these two processes when for joint face alignment and 3D face reconstruction from sin-
13

TABLE 6 [12] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-


The time efficiency (in milliseconds or ms) of the proposed method. pie,” IVC, vol. 28, no. 5, pp. 807–813, 2010.
[13] T. F. Cootes and A. Lanitis, “Active shape models: Evaluation of a
multi-resolution method for improving image search,” in BMVC,
Updating Updating Refining
Step Total 1994, pp. 327–338.
landmarks shape landmarks [14] D. Cristinacce and T. F. Cootes, “Boosted regression active shape
Time (ms) 14.9 15.3 8.7 38.9 models.” in BMVC, 2007, pp. 1–10.
[15] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance
models,” TPAMI, no. 6, pp. 681–685, 2001.
TABLE 7 [16] I. Matthews and S. Baker, “Active appearance models revisited,”
Efficiency comparison of different reconstruction methods. For the IJCV, vol. 60, no. 2, pp. 135–164, 2004.
methods [3], [36] , and [37], stand-alone landmark detection is required, [17] D. Cristinacce and T. Cootes, “Automatic feature localisation with
but the time here does not include the landmark detection time. constrained local models,” Pattern Recognition, vol. 41, no. 10, pp.
3054–3067, 2008.
Method [3] [37] [36] [8] Proposed [18] X. Xiong and F. De la Torre, “Supervised descent method and its
applications to face alignment,” in CVPR, 2013, pp. 532–539.
Time (ms) 56.3 88 12.6 32.8 38.9 [19] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit
shape regression,” IJCV, vol. 107, no. 2, pp. 177–190, 2014.
[20] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via
gle 2D images of arbitrary poses and expressions. It utilizes regressing local binary features,” in CVPR, 2014, pp. 1685–1692.
[21] S. Zhu, C. Li, C. C. Loy, and X. Tang, “Face alignment by coarse-
landmarks on a 2D face image as clues for reconstructing to-fine shape searching,” in CVPR, 2015, pp. 4998–5006.
3D shapes, and uses the reconstructed 3D shapes to refine [22] X. Zhu and D. Ramanan, “Face detection, pose estimation, and
landmarks. By alternately applying cascaded landmark re- landmark localization in the wild,” in CVPR, 2012, pp. 2879–2886.
gressors and 3D shape regressors, the proposed method can [23] X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas, “Pose-free
facial landmark fitting via optimized part mixtures and cascaded
effectively accomplish the two tasks simultaneously in real deformable shape model,” in ICCV, 2013, pp. 1944–1951.
time. Unlike existing 3D face reconstruction methods, the [24] L. A. Jeni, J. F. Cohn, and T. Kanade, “Dense 3D face alignment
proposed method does not require additional face align- from 2D videos in real-time,” in FG, vol. 1, 2015, pp. 1–8.
ment methods, but can fully automatically reconstruct both [25] A. Jourabloo and X. Liu, “Pose-invariant 3D face alignment,” in
ICCV, 2015, pp. 3694–3702.
pose-and-expression-normalized and expressive 3D shapes
[26] A. Jourabloo and X. Liu, “Large-pose face alignment via CNN-
from a single face image of arbitrary poses and expres- based dense 3D model fitting,” in CVPR, 2016, pp. 4188–4196.
sions. Compared with existing face alignment methods, [27] A. Jourabloo and X. Liu, “Pose-invariant face alignment via CNN-
the proposed method can effectively handle invisible and based dense 3D model fitting,” IJCV, in press, 2017.
[28] V. Blanz and T. Vetter, “A morphable model for the synthesis of
expression-deformed landmarks with the assistance of 3D
3D faces,” in SIGGRAPH, 1999, pp. 187–194.
face models. Extensive experiments with comparison to [29] S. Tulyakov and N. Sebe, “Regressing a 3D face shape from a single
state-of-the-art methods demonstrate the effectiveness and image,” in ICCV, 2015, pp. 3748–3755.
superiority of the proposed method in both face alignment [30] I. Kemelmacher-Shlizerman and R. Basri, “3D face reconstruction
from a single image using a single reference face shape,” TPAMI,
and 3D face shape reconstruction, and in facilitating cross-
vol. 33, no. 2, pp. 394–405, 2011.
view and cross-expression face recognition as well. [31] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz,
“Total moving face reconstruction,” in ECCV, 2014, pp. 796–812.
[32] S. Romdhani and T. Vetter, “Estimating 3D shape and texture using
REFERENCES

[1] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," TPAMI, vol. 25, no. 9, pp. 1063–1074, 2003.
[2] H. Han and A. K. Jain, "3D face texture modeling from uncalibrated frontal and profile images," in BTAS, 2012, pp. 223–230.
[3] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li, "High-fidelity pose and expression normalization for face recognition in the wild," in CVPR, 2015, pp. 787–796.
[4] B. Chu, S. Romdhani, and L. Chen, "3D-aided face recognition robust to expression and pose variations," in CVPR, 2014, pp. 1907–1914.
[5] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, "A 3D facial expression database for facial behavior research," in FG, 2006, pp. 211–216.
[6] C. Cao, Y. Weng, S. Lin, and K. Zhou, "3D shape regression for real-time facial animation," TOG, vol. 32, no. 4, p. 41, 2013.
[7] C. Cao, H. Wu, Y. Weng, T. Shao, and K. Zhou, "Real-time facial animation with image-based dynamic avatars," TOG, vol. 35, no. 4, pp. 126:1–126:12, 2016.
[8] F. Liu, D. Zeng, Q. Zhao, and X. Liu, "Joint face alignment and 3D face reconstruction," in ECCV, 2016, pp. 545–560.
[9] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in ICCVW, 2011, pp. 2144–2151.
[10] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Li, "Face alignment across large poses: A 3D solution," in CVPR, 2016, pp. 146–155.
[11] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, "Overview of the face recognition grand challenge," in CVPR, vol. 1, 2005, pp. 947–954.
[17] D. Cristinacce and T. Cootes, "Automatic feature localisation with constrained local models," Pattern Recognition, vol. 41, no. 10, pp. 3054–3067, 2008.
[18] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in CVPR, 2013, pp. 532–539.
[19] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," IJCV, vol. 107, no. 2, pp. 177–190, 2014.
[20] S. Ren, X. Cao, Y. Wei, and J. Sun, "Face alignment at 3000 fps via regressing local binary features," in CVPR, 2014, pp. 1685–1692.
[21] S. Zhu, C. Li, C. C. Loy, and X. Tang, "Face alignment by coarse-to-fine shape searching," in CVPR, 2015, pp. 4998–5006.
[22] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in CVPR, 2012, pp. 2879–2886.
[23] X. Yu, J. Huang, S. Zhang, W. Yan, and D. N. Metaxas, "Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model," in ICCV, 2013, pp. 1944–1951.
[24] L. A. Jeni, J. F. Cohn, and T. Kanade, "Dense 3D face alignment from 2D videos in real-time," in FG, vol. 1, 2015, pp. 1–8.
[25] A. Jourabloo and X. Liu, "Pose-invariant 3D face alignment," in ICCV, 2015, pp. 3694–3702.
[26] A. Jourabloo and X. Liu, "Large-pose face alignment via CNN-based dense 3D model fitting," in CVPR, 2016, pp. 4188–4196.
[27] A. Jourabloo and X. Liu, "Pose-invariant face alignment via CNN-based dense 3D model fitting," IJCV, in press, 2017.
[28] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in SIGGRAPH, 1999, pp. 187–194.
[29] S. Tulyakov and N. Sebe, "Regressing a 3D face shape from a single image," in ICCV, 2015, pp. 3748–3755.
[30] I. Kemelmacher-Shlizerman and R. Basri, "3D face reconstruction from a single image using a single reference face shape," TPAMI, vol. 33, no. 2, pp. 394–405, 2011.
[31] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz, "Total moving face reconstruction," in ECCV, 2014, pp. 796–812.
[32] S. Romdhani and T. Vetter, "Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior," in CVPR, vol. 2, 2005, pp. 986–993.
[33] G. Hu, F. Yan, J. Kittler, W. Christmas, C. H. Chan, Z. Feng, and P. Huber, "Efficient 3D morphable face model fitting," PR, vol. 67, pp. 366–379, 2017.
[34] Y. J. Lee, S. J. Lee, K. R. Park, J. Jo, and J. Kim, "Single view-based 3D face reconstruction robust to self-occlusion," EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, pp. 1–20, 2012.
[35] C. Qu, E. Monari, T. Schuchert, and J. Beyerer, "Fast, robust and automatic 3D face model reconstruction from videos," in AVSS, 2014, pp. 113–118.
[36] F. Liu, D. Zeng, J. Li, and Q. Zhao, "Cascaded regressor based 3D face reconstruction from a single arbitrary view image," arXiv:1509.06161, 2015.
[37] A. T. Tran, T. Hassner, I. Masi, and G. Medioni, "Regressing robust and discriminative 3D morphable models with a very deep neural network," in CVPR, in press, 2017.
[38] H. Drira, B. Ben Amor, A. Srivastava, M. Daoudi, and R. Slama, "3D face recognition under expressions, occlusions, and pose variations," TPAMI, vol. 35, no. 9, pp. 2270–2283, 2013.
[39] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in CVPR, 2015, pp. 815–823.
[40] Y. Sun, D. Liang, X. Wang, and X. Tang, "DeepID3: Face recognition with very deep neural networks," arXiv:1502.00873, 2015.
[41] E. Zhou, Z. Cao, and Q. Yin, "Naive-deep face recognition: Touching the limit of LFW benchmark or not?" arXiv:1501.04690, 2015.
[42] C. A. Corneanu, M. Oliu, J. F. Cohn, and S. Escalera, "Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: History, trends, and affect-related applications," TPAMI, vol. 38, no. 8, pp. 1–1, 2016.
[43] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs, "Frontal to profile face verification in the wild," in WACV, 2016, pp. 1–9.
[44] L. Tran, X. Yin, and X. Liu, "Disentangled representation learning GAN for pose-invariant face recognition," in CVPR, in press, 2017.
[45] D. Yi, Z. Lei, and S. Z. Li, "Towards pose robust face recognition," in CVPR, 2013, pp. 3539–3545.
[46] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in CVPR, 2014, pp. 1701–1708.
[47] L. Hu, M. Kan, S. Shan, X. Song, and X. Chen, "LDF-Net: Learning a displacement field network for face recognition across pose," in FG, in press, 2017.
[48] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "300 faces in-the-wild challenge: The first facial landmark localization challenge," in ICCVW, 2013, pp. 397–403.
[49] T. Bolkart and S. Wuhrer, "3D faces in motion: Fully automatic registration and statistical analysis," CVIU, vol. 131, pp. 100–115, 2015.
[50] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, "Localizing parts of faces using a consensus of exemplars," in CVPR, 2011, pp. 545–552.
[51] E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin, "Extensive facial landmark localization with coarse-to-fine convolutional network cascade," in ICCVW, 2013, pp. 386–391.
[52] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in AVBPA, vol. 964, 1999, pp. 965–966.
[53] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, "A 3D face model for pose and illumination invariant face recognition," in AVSS, 2009, pp. 296–301.
[54] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou, "FaceWarehouse: A 3D facial expression database for visual computing," TVCG, vol. 20, no. 3, pp. 413–425, 2014.
[55] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91–110, 2004.
[56] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis, "3D shape estimation from 2D landmarks: A convex relaxation approach," in CVPR, 2015, pp. 4447–4455.
[57] Y. Chen and G. Medioni, "Object modeling by registration of multiple range images," IVC, pp. 2724–2729, 1991.
[58] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in CVPR, 2014, pp. 1867–1874.
[59] Z. Lei, Q. Bai, R. He, and S. Z. Li, "Face shape recovery from a single image using CCA mapping between tensor spaces," in CVPR, 2008, pp. 1–7.
[60] A. Bas, W. A. Smith, T. Bolkart, and S. Wuhrer, "Fitting a 3D morphable model to edges: A comparison between hard and soft correspondences," in ACCV, 2016, pp. 377–391.
[61] P. J. Phillips, F. Jiang, A. Narvekar, J. Ayyad, and A. J. O'Toole, "An other-race effect for face recognition algorithms," ACM Trans. on Applied Perception (TAP), vol. 8, no. 2, p. 14, 2011.
[62] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, "Robust face landmark estimation under occlusion," in ICCV, 2013, pp. 1513–1520.
[63] R. A. Güler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos, "DenseReg: Fully convolutional dense shape regression in-the-wild," in CVPR, in press, 2017.
[64] O. Tuzel, T. K. Marks, and S. Tambe, "Robust face alignment using a mixture of invariant experts," in ECCV, 2016, pp. 825–841.
[65] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas, "A recurrent encoder-decoder network for sequential face alignment," in ECCV, 2016, pp. 38–56.
[66] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway, "A 3D morphable model learnt from 10,000 faces," in CVPR, 2016, pp. 5543–5552.
[67] E. Richardson, M. Sela, R. Or-El, and R. Kimmel, "Learning detailed face reconstruction from a single image," in press, 2017.
[68] A. D. Bagdanov, A. Del Bimbo, and I. Masi, "The Florence 2D/3D hybrid face dataset," in Workshop on Human Gesture and Behavior Understanding, ACM, 2011, pp. 79–80.
[69] Z. Zhu, P. Luo, X. Wang, and X. Tang, "Multi-view perceptron: A deep model for learning face identity and view representations," in NIPS, 2014, pp. 217–225.
[70] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in BMVC, vol. 1, no. 3, 2015, p. 6.
[71] X. Wu, R. He, Z. Sun, and T. Tan, "A light CNN for deep face representation with noisy labels," arXiv:1511.02683, 2015.
[72] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in ECCV, 2016, pp. 499–515.

Feng Liu received the M.Sc. degree in computer science in 2014. He is currently a Ph.D. candidate at the National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, China. His main research interests are computer vision and pattern recognition, specifically face modeling and 2D and 3D face recognition. He is a student member of the IEEE.

Qijun Zhao obtained the B.Sc. and M.Sc. degrees, both from Shanghai Jiao Tong University, and the Ph.D. degree from the Hong Kong Polytechnic University. He worked as a post-doc researcher in the Pattern Recognition and Image Processing lab at Michigan State University from 2010 to 2012. He is currently an associate professor in the College of Computer Science at Sichuan University. His research interests lie in biometrics, particularly face recognition, fingerprint recognition, and affective computing. Dr. Zhao has published more than 50 papers in academic journals and conferences, and has participated in many research projects either as principal investigator or as a primary researcher. He is a program committee co-chair of the 2016 Chinese Conference on Biometric Recognition and the 2018 IEEE International Conference on Identity, Security and Behavior Analysis.

Xiaoming Liu is an Assistant Professor in the Department of Computer Science and Engineering of Michigan State University. He received the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University in 2004. Before joining MSU in Fall 2012, he was a research scientist at General Electric (GE) Global Research. His research interests include computer vision, machine learning, and biometrics. As a co-author, he is a recipient of the Best Industry Related Paper Award runner-up at ICPR 2014, Best Student Paper Awards at WACV 2012 and 2014, and the Best Poster Award at BMVC 2015. He has been an Area Chair for numerous conferences, including FG, ICPR, WACV, ICIP, and CVPR. He is the program chair of WACV 2018 and an Associate Editor of the Neurocomputing journal. He has authored more than 100 scientific publications and has filed 22 U.S. patents.

Dan Zeng received the B.Sc. degree from the College of Computer Science, Sichuan University, China, in 2013. Since 2012, she has participated in the '3+2+3' successive graduate, postgraduate, and doctoral program of Sichuan University. Currently, she is studying in the SCS group, uTwente, Netherlands, as a visiting Ph.D. student. Her research area is biometrics, especially the low-resolution and pose problems in face recognition.
