
Learning an Animatable Detailed 3D Face Model from In-The-Wild Images
YAO FENG∗, Max Planck Institute for Intelligent Systems and Max Planck ETH Center for Learning Systems, Germany
HAIWEN FENG∗ , Max Planck Institute for Intelligent Systems, Germany
MICHAEL J. BLACK, Max Planck Institute for Intelligent Systems, Germany
TIMO BOLKART, Max Planck Institute for Intelligent Systems, Germany
While current monocular 3D face reconstruction methods can recover fine geometric details, they suffer several limitations. Some methods produce faces that cannot be realistically animated because they do not model how wrinkles vary with expression. Other methods are trained on high-quality face scans and do not generalize well to in-the-wild images. We present the first approach that regresses 3D face shape and animatable details that are specific to an individual but change with expression. Our model, DECA (Detailed Expression Capture and Animation), is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. To enable this, we introduce a novel detail-consistency loss that disentangles person-specific details from expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged. DECA is learned from in-the-wild images with no paired 3D supervision and achieves state-of-the-art shape reconstruction accuracy on two benchmarks. Qualitative results on in-the-wild data demonstrate DECA's robustness and its ability to disentangle identity- and expression-dependent details, enabling animation of reconstructed faces. The model and code are publicly available at https://deca.is.tue.mpg.de.
CCS Concepts: • Computing methodologies → Mesh models.
Additional Key Words and Phrases: Detailed face model, 3D face reconstruction, facial animation, detail disentanglement
ACM Reference Format:
Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. 2021. Learning an Animatable Detailed 3D Face Model from In-The-Wild Images. ACM Trans. Graph. 40, 4, Article 88 (August 2021), 13 pages. https://doi.org/10.1145/3450626.3459936

Fig. 1. DECA. Example images (row 1), the regressed coarse shape (row 2), detail shape (row 3) and reposed coarse shape (row 4), and reposed with person-specific details (row 5) where the source expression is extracted by DECA from the faces in the corresponding colored boxes (row 6). DECA is robust to in-the-wild variations and captures person-specific details as well as expression-dependent wrinkles that appear in regions like the forehead and mouth. Our novelty is that this detailed shape can be reposed (animated) such that the wrinkles are specific to the source shape and target expression. Images are taken from Pexels [2021] (row 1; col. 5), Flickr [2021] (bottom left) © Gage Skidmore, Chicago [Ma et al. 2015] (bottom right), and from NoW [Sanyal et al. 2019] (remaining images).

∗ Both authors contributed equally to the paper.

Authors' addresses: Yao Feng, Max Planck Institute for Intelligent Systems, Tübingen, and Max Planck ETH Center for Learning Systems, Tübingen, Germany, yfeng@tuebingen.mpg.de; Haiwen Feng, Max Planck Institute for Intelligent Systems, Tübingen, Germany, hfeng@tuebingen.mpg.de; Michael J. Black, Max Planck Institute for Intelligent Systems, Tübingen, Germany, black@tuebingen.mpg.de; Timo Bolkart, Max Planck Institute for Intelligent Systems, Tübingen, Germany, tbolkart@tuebingen.mpg.de.

1 INTRODUCTION
Two decades have passed since the seminal work of Vetter and Blanz [1998] that first showed how to reconstruct 3D facial geometry from a single image. Since then, 3D face reconstruction methods have rapidly advanced (for a comprehensive overview see [Morales et al. 2021; Zollhöfer et al. 2018]), enabling applications such as 3D avatar creation for VR/AR [Hu et al. 2017], video editing [Kim et al. 2018a; Thies et al. 2016], image synthesis [Ghosh et al. 2020; Tewari et al. 2020], face recognition [Blanz et al. 2002; Romdhani et al. 2002], virtual make-up [Scherbaum et al. 2011], or speech-driven facial animation [Cudeiro et al. 2019; Karras et al. 2017; Richard et al. 2021]. To make the problem tractable, most existing methods incorporate prior knowledge about geometry or appearance by leveraging pre-computed 3D face models [Brunton et al. 2014; Egger et al. 2020]. These models reconstruct the coarse face shape but are unable to capture geometric details such as expression-dependent wrinkles, which are essential for realism and support analysis of human emotion.
Several methods recover detailed facial geometry [Abrevaya et al. 2020; Cao et al. 2015; Chen et al. 2019; Guo et al. 2018; Richardson et al. 2017; Tran et al. 2018, 2019], however, they require high-quality training scans [Cao et al. 2015; Chen et al. 2019] or lack robustness to occlusions [Abrevaya et al. 2020; Guo et al. 2018; Richardson et al. 2017]. None of these explore how the recovered wrinkles change with varying expressions. Previous methods that learn expression-dependent detail models [Bickel et al. 2008; Chaudhuri et al. 2020; Yang et al. 2020] either use detailed 3D scans as training data and, hence, do not generalize to unconstrained images [Yang et al. 2020], or model expression-dependent details as part of the appearance map rather than the geometry [Chaudhuri et al. 2020], preventing realistic mesh relighting.

We introduce DECA (Detailed Expression Capture and Animation), which learns an animatable displacement model from in-the-wild images without 2D-to-3D supervision. In contrast to prior work, these animatable expression-dependent wrinkles are specific to an individual and are regressed from a single image. Specifically, DECA jointly learns 1) a geometric detail model that generates a UV displacement map from a low-dimensional representation that consists of subject-specific detail parameters and expression parameters, and 2) a regressor that predicts subject-specific detail, albedo, shape, expression, pose, and lighting parameters from an image. The detail model builds upon FLAME's [Li et al. 2017] coarse geometry, and we formulate the displacements as a function of subject-specific detail parameters and FLAME's jaw pose and expression parameters.

This enables important applications such as easy avatar creation from a single image. While previous methods can capture detailed geometry in the image, most applications require a face that can be animated. For this, it is not sufficient to recover accurate geometry in the input image. Rather, we must be able to animate that detailed geometry and, more specifically, the details should be person specific.

To gain control over expression-dependent wrinkles of the reconstructed face, while preserving person-specific details (i.e. moles, pores, eyebrows, and expression-independent wrinkles), the person-specific details and expression-dependent wrinkles must be disentangled. Our key contribution is a novel detail consistency loss that enforces this disentanglement. During training, if we are given two images of the same person with different expressions, we observe that their 3D face shape and their person-specific details are the same in both images, but the expression and the intensity of the wrinkles differ with expression. We exploit this observation during training by swapping the detail codes between different images of the same identity and enforcing the newly rendered results to look similar to the original input images. Once trained, DECA reconstructs a detailed 3D face from a single image (Fig. 1, third row) in real time (about 120fps on a Nvidia Quadro RTX 5000), and is able to animate the reconstruction with realistic adaptive expression wrinkles (Fig. 1, fifth row).

In summary, our main contributions are: 1) The first approach to learn an animatable displacement model from in-the-wild images that can synthesize plausible geometric details by varying expression parameters. 2) A novel detail consistency loss that disentangles identity-dependent and expression-dependent facial details. 3) Reconstruction of geometric details that is, unlike most competing methods, robust to common occlusions, wide pose variation, and illumination variation. This is enabled by our low-dimensional detail representation, the detail disentanglement, and training from a large dataset of in-the-wild images. 4) State-of-the-art shape reconstruction accuracy on two different benchmarks. 5) The code and model are available for research purposes at https://deca.is.tue.mpg.de.

2 RELATED WORK
The reconstruction of 3D faces from visual input has received significant attention over the last decades after the pioneering work of Parke [1974], the first method to reconstruct 3D faces from multi-view images. While a large body of related work aims to reconstruct 3D faces from various input modalities such as multi-view images [Beeler et al. 2010; Cao et al. 2018a; Pighin et al. 1998], video data [Garrido et al. 2016; Ichim et al. 2015; Jeni et al. 2015; Shi et al. 2014; Suwajanakorn et al. 2014], RGB-D data [Li et al. 2013; Thies et al. 2015; Weise et al. 2011] or subject-specific image collections [Kemelmacher-Shlizerman and Seitz 2011; Roth et al. 2016], our main focus is on methods that use only a single RGB image. For a more comprehensive overview, see Zollhöfer et al. [2018].

Coarse reconstruction: Many monocular 3D face reconstruction methods follow Vetter and Blanz [1998] by estimating coefficients of pre-computed statistical models in an analysis-by-synthesis fashion. Such methods can be categorized into optimization-based [Aldrian and Smith 2013; Bas et al. 2017; Blanz et al. 2002; Blanz and Vetter 1999; Gerig et al. 2018; Romdhani and Vetter 2005; Thies et al. 2016], or learning-based methods [Chang et al. 2018; Deng et al. 2019; Genova et al. 2018; Kim et al. 2018b; Ploumpis et al. 2020; Richardson et al. 2016; Sanyal et al. 2019; Tewari et al. 2017; Tran et al. 2017; Tu et al. 2019]. These methods estimate parameters of a statistical face model with a fixed linear shape space, which captures only low-frequency shape information. This results in overly-smooth reconstructions.

Several works are model-free and directly regress 3D faces (i.e. voxels [Jackson et al. 2017] or meshes [Dou et al. 2017; Feng et al. 2018b; Güler et al. 2017; Wei et al. 2019]) and hence can capture more variation than the model-based methods. However, all these methods require explicit 3D supervision, which is provided either by an optimization-based model fitting [Feng et al. 2018b; Güler et al. 2017; Jackson et al. 2017; Wei et al. 2019] or by synthetic data generated by sampling a statistical face model [Dou et al. 2017] and therefore also only capture coarse shape variations.

Instead of capturing high-frequency geometric details, some methods reconstruct coarse facial geometry along with high-fidelity textures [Gecer et al. 2019; Saito et al. 2017; Slossberg et al. 2018; Yamaguchi et al. 2018]. As this "bakes" shading details into the texture, lighting changes do not affect these details, limiting realism and the range of applications. To enable animation and relighting, DECA captures these details as part of the geometry.
Detail reconstruction: Another body of work aims to reconstruct faces with "mid-frequency" details. Common optimization-based methods fit a statistical face model to images to obtain a coarse shape estimate, followed by a shape from shading (SfS) method to reconstruct facial details from monocular images [Jiang et al. 2018; Li et al. 2018; Riviere et al. 2020], or videos [Garrido et al. 2016; Suwajanakorn et al. 2014]. Unlike DECA, these approaches are slow, the results lack robustness to occlusions, and the coarse model fitting step requires facial landmarks, making them error-prone for large viewing angles and occlusions.

Most regression-based approaches [Cao et al. 2015; Chen et al. 2019; Guo et al. 2018; Lattas et al. 2020; Richardson et al. 2017; Tran et al. 2018] follow a similar approach by first reconstructing the parameters of a statistical face model to obtain a coarse shape, followed by a refinement step to capture localized details. Chen et al. [2019] and Cao et al. [2015] compute local wrinkle statistics from high-resolution scans and leverage these to constrain the fine-scale detail reconstruction from images [Chen et al. 2019] or videos [Cao et al. 2015]. Guo et al. [2018] and Richardson et al. [2017] directly regress per-pixel displacement maps. All these methods only reconstruct fine-scale details in non-occluded regions, causing visible artifacts in the presence of occlusions. Tran et al. [2018] gain robustness to occlusions by applying a face segmentation method [Nirkin et al. 2018] to determine occluded regions, and employ an example-based hole filling approach to deal with the occluded regions. Further, model-free methods exist that directly reconstruct detailed meshes [Sela et al. 2017; Zeng et al. 2019] or surface normals that add detail to coarse reconstructions [Abrevaya et al. 2020; Sengupta et al. 2018]. Tran et al. [2019] and Tewari et al. [2019; 2018] jointly learn a statistical face model and reconstruct 3D faces from images. While offering more flexibility than fixed statistical models, these methods capture limited geometric details compared to other detail reconstruction methods. Lattas et al. [2020] use image translation networks to infer the diffuse normals and specular normals, resulting in realistic rendering. Unlike DECA, none of these detail reconstruction methods offer animatable details after reconstruction.

Animatable detail reconstruction: Most relevant to DECA are methods that reconstruct detailed faces while allowing animation of the result. Existing methods [Bickel et al. 2008; Golovinskiy et al. 2006; Ma et al. 2008; Shin et al. 2014; Yang et al. 2020] learn correlations between wrinkles or attributes like age and gender [Golovinskiy et al. 2006], pose [Bickel et al. 2008] or expression [Shin et al. 2014; Yang et al. 2020] from high-quality 3D face meshes [Bickel et al. 2008]. Fyffe et al. [2014] use optical flow correspondence computed from dynamic video frames to animate static high-resolution scans. In contrast, DECA learns an animatable detail model solely from in-the-wild images without paired 3D training data. While FaceScape [Yang et al. 2020] predicts an animatable 3D face from a single image, the method is not robust to occlusions. This is due to a two-step reconstruction process: first optimize the coarse shape, then predict a displacement map from the texture map extracted with the coarse reconstruction.

Chaudhuri et al. [2020] learn identity and expression corrective blendshapes with dynamic (expression-dependent) albedo maps [Nagano et al. 2018]. They model geometric details as part of the albedo map, and therefore, the shading of these details does not adapt with varying lighting. This results in unrealistic renderings. In contrast, DECA models details as geometric displacements, which look natural when re-lit.

In summary, DECA occupies a unique space. It takes a single image as input and produces person-specific details that can be realistically animated. While some methods produce higher-frequency pixel-aligned details, these are not animatable. Still other methods require high-resolution scans for training. We show that these are not necessary and that animatable details can be learned from 2D images without paired 3D ground truth. This is not just convenient, but means that DECA learns to be robust to a wide variety of real-world variation. We want to emphasize that, while elements of DECA are built on well-understood principles (dating back to Vetter and Blanz), our core contribution is new and essential. The key to making DECA work is the detail consistency loss, which has not appeared previously in the literature.

3 PRELIMINARIES
Geometry prior: FLAME [Li et al. 2017] is a statistical 3D head model that combines separate linear identity shape and expression spaces with linear blend skinning (LBS) and pose-dependent corrective blendshapes to articulate the neck, jaw, and eyeballs. Given parameters of facial identity β ∈ R^|β|, pose θ ∈ R^(3k+3) (with k = 4 joints for neck, jaw, and eyeballs), and expression ψ ∈ R^|ψ|, FLAME outputs a mesh with n = 5023 vertices. The model is defined as

    M(β, θ, ψ) = W(T_P(β, θ, ψ), J(β), θ, 𝒲),    (1)

with the blend skinning function W(T, J, θ, 𝒲) that rotates the vertices in T ∈ R^(3n) around joints J ∈ R^(3k), linearly smoothed by blendweights 𝒲 ∈ R^(k×n). The joint locations J are defined as a function of the identity β. Further,

    T_P(β, θ, ψ) = T̄ + B_S(β; S) + B_P(θ; P) + B_E(ψ; E)    (2)

denotes the mean template T̄ in "zero pose" with added shape blendshapes B_S(β; S): R^|β| → R^(3n), pose correctives B_P(θ; P): R^(3k+3) → R^(3n), and expression blendshapes B_E(ψ; E): R^|ψ| → R^(3n), with the learned identity, pose, and expression bases (i.e. linear subspaces) S, P and E. See [Li et al. 2017] for details.
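To make Eqs. (1) and (2) concrete, the following is a minimal sketch of how a FLAME-like model maps identity and expression coefficients to vertices. It evaluates only the linear blendshape part of Eq. (2); the pose correctives and the blend-skinning step W(·) of Eq. (1) are omitted, and the randomly initialized bases are placeholders, not the released FLAME assets.

```python
import torch

def flame_like_vertices(beta, psi, template, shape_basis, expr_basis):
    """Linear blendshape part of Eq. (2): T + B_S(beta; S) + B_E(psi; E).

    template:    (n, 3) mean shape in "zero pose"
    shape_basis: (n, 3, |beta|) identity basis S
    expr_basis:  (n, 3, |psi|)  expression basis E
    The pose correctives B_P and the skinning W(.) of Eq. (1) are omitted.
    """
    offsets = torch.einsum("nck,k->nc", shape_basis, beta) + \
              torch.einsum("nck,k->nc", expr_basis, psi)
    return template + offsets

# toy example with random placeholder bases (not the released FLAME model)
n = 5023
template = torch.zeros(n, 3)
S = 1e-3 * torch.randn(n, 3, 100)
E = 1e-3 * torch.randn(n, 3, 50)
vertices = flame_like_vertices(torch.randn(100), torch.randn(50), template, S, E)
print(vertices.shape)  # torch.Size([5023, 3])
```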
Appearance model: FLAME does not have an appearance model, hence we convert the Basel Face Model's linear albedo subspace [Paysan et al. 2009] into the FLAME UV layout to make it compatible with FLAME. The appearance model outputs a UV albedo map A(α) ∈ R^(d×d×3) for albedo parameters α ∈ R^|α|.

Camera model: Photographs in existing in-the-wild face datasets are often taken from a distance. We, therefore, use an orthographic camera model c to project the 3D mesh into image space. Face vertices are projected into the image as v = sΠ(M_i) + t, where M_i ∈ R^3 is a vertex in M, Π ∈ R^(2×3) is the orthographic 3D-2D projection matrix, and s ∈ R and t ∈ R^2 denote isotropic scale and 2D translation, respectively. The parameters s and t are summarized as c.
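A minimal sketch of this weak-perspective projection, v = sΠ(M_i) + t; the vertices, scale, and translation below are placeholder values.

```python
import torch

def orthographic_project(vertices, scale, trans):
    """Project 3D vertices to 2D as v = s * Pi(M_i) + t.

    Pi simply drops the z component (orthographic 3D-2D projection).
    vertices: (n, 3), scale: scalar tensor, trans: (2,) -> returns (n, 2).
    """
    return scale * vertices[:, :2] + trans

verts_2d = orthographic_project(torch.randn(5023, 3),
                                torch.tensor(4.5), torch.tensor([0.1, -0.05]))
print(verts_2d.shape)  # torch.Size([5023, 2])
```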

Illumination model: For face reconstruction, the most frequently-employed illumination model is based on Spherical Harmonics (SH) [Ramamoorthi and Hanrahan 2001]. By assuming that the light source is distant and the face's surface reflectance is Lambertian, the shaded face image is computed as:

    B(α, l, N_uv)_{i,j} = A(α)_{i,j} ⊙ Σ_{k=1}^{9} l_k H_k(N_{i,j}),    (3)

where the albedo, A, surface normals, N, and shaded texture, B, are represented in UV coordinates and where B_{i,j} ∈ R^3, A_{i,j} ∈ R^3, and N_{i,j} ∈ R^3 denote pixel (i, j) in the UV coordinate system. The SH basis and coefficients are defined as H_k : R^3 → R and l = [l_1^T, ..., l_9^T]^T, with l_k ∈ R^3, and ⊙ denotes the Hadamard product.
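A small sketch of Eq. (3), shading a UV albedo map with order-2 spherical harmonics of the normals; the constant SH normalization factors are assumed to be folded into the lighting coefficients l_k, and all inputs below are random placeholders.

```python
import torch

def sh_basis(normals):
    """First 9 (order-2) spherical-harmonic basis functions of unit normals.

    normals: (..., 3). Constant factors are assumed folded into l_k.
    """
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    one = torch.ones_like(x)
    return torch.stack([one, x, y, z, x * y, x * z, y * z,
                        x * x - y * y, 3.0 * z * z - 1.0], dim=-1)  # (..., 9)

def shade(albedo_uv, normals_uv, light):
    """Eq. (3): B = A * sum_k l_k H_k(N), all maps in UV space.

    albedo_uv: (d, d, 3), normals_uv: (d, d, 3), light: (9, 3)
    """
    H = sh_basis(normals_uv)                          # (d, d, 9)
    shading = torch.einsum("ijk,kc->ijc", H, light)   # (d, d, 3)
    return albedo_uv * shading

d = 256
normals = torch.nn.functional.normalize(torch.randn(d, d, 3), dim=-1)
shaded = shade(torch.rand(d, d, 3), normals, torch.rand(9, 3))
print(shaded.shape)  # torch.Size([256, 256, 3])
```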
Texture rendering: Given the geometry parameters (β, θ, ψ), albedo (α), lighting (l) and camera information c, we can generate the 2D image I_r by rendering as I_r = R(M, B, c), where R denotes the rendering function.

FLAME is able to generate the face geometry with various poses, shapes and expressions from a low-dimensional latent space. However, the representational power of the model is limited by the low mesh resolution and therefore mid-frequency details are mostly missing from FLAME's surface. The next section introduces our expression-dependent displacement model that augments FLAME with mid-frequency details, and it demonstrates how to reconstruct this geometry from a single image and animate it.

4 METHOD
DECA learns to regress a parameterized face model with geometric detail solely from in-the-wild training images (Fig. 2 left). Once trained, DECA reconstructs the 3D head with detailed face geometry from a single face image, I. The learned parametrization of the reconstructed details enables us to then animate the detail reconstruction by controlling FLAME's expression and jaw pose parameters (Fig. 2, right). This synthesizes new wrinkles while keeping person-specific details unchanged.

Fig. 2. DECA training and animation. During training (left box), DECA estimates parameters to reconstruct face shape for each image with the aid of the shape consistency information (following the blue arrows) and, then, learns an expression-conditioned displacement model by leveraging detail consistency information (following the red arrows) from multiple images of the same individual (see Sec. 4.3 for details). While the analysis-by-synthesis pipeline is, by now, standard, the yellow box region contains our key novelty. This displacement consistency loss is further illustrated in Fig. 3. Once trained, DECA animates a face (right box) by combining the reconstructed source identity's shape, head pose, and detail code, with the reconstructed source expression's jaw pose and expression parameters to obtain an animated coarse shape and an animated displacement map. Finally, DECA outputs an animated detail shape. Images are taken from NoW [Sanyal et al. 2019]. Note that NoW images are not used for training DECA, but are just selected for illustration purposes.

Key idea: The key idea of DECA is grounded in the observation that an individual's face shows different details (i.e. wrinkles), depending on their facial expressions, but that other properties of their shape remain unchanged. Consequently, facial details should be separated into static person-specific details and dynamic expression-dependent details such as wrinkles [Li et al. 2009]. However, disentangling static and dynamic facial details is a non-trivial task. Static facial details are different across people, whereas dynamic expression-dependent facial details even vary for the same person. Thus, DECA learns an expression-conditioned detail model to infer facial details from both the person-specific detail latent space and the expression space.

The main difficulty in learning a detail displacement model is the lack of training data. Prior work uses specialized camera systems to scan people in a controlled environment to obtain detailed facial geometry. However, this approach is expensive and impractical for capturing large numbers of identities with varying expressions and diversity in ethnicity and age. Therefore we propose an approach to learn detail geometry from in-the-wild images.

4.1 Coarse reconstruction
We first learn a coarse reconstruction (i.e. in FLAME's model space) in an analysis-by-synthesis way: given a 2D image I as input, we encode the image into a latent code, decode this to synthesize a 2D image I_r, and minimize the difference between the synthesized image and the input. As shown in Fig. 2, we train an encoder E_c, which consists of a ResNet50 [He et al. 2016] network followed by a fully connected layer, to regress a low-dimensional latent code. This latent code consists of FLAME parameters β, ψ, θ (i.e. representing the coarse geometry), albedo coefficients α, camera c, and lighting parameters l. More specifically, the coarse geometry uses the first 100 FLAME shape parameters (β), 50 expression parameters (ψ), and 50 albedo parameters (α). In total, E_c predicts a 236 dimensional latent code.
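The following is a sketch of such an encoder: a ResNet-50 backbone with a fully connected head regressing the 236-dimensional code. The split of the remaining dimensions (6 for pose, 27 for the nine RGB SH lighting coefficients, 3 for the camera) and their ordering are assumptions chosen so the total reaches 236; the text above only fixes the shape, expression, and albedo sizes.

```python
import torch
import torch.nn as nn
import torchvision

class CoarseEncoder(nn.Module):
    """ResNet-50 + fully connected layer regressing a 236-dim code (sketch)."""

    # one split that sums to 236: shape 100, expression 50, albedo 50,
    # pose 6 (global + jaw, assumed), lighting 9*3, camera 3 (s, tx, ty)
    SIZES = {"beta": 100, "psi": 50, "alpha": 50, "pose": 6, "light": 27, "cam": 3}

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()              # keep the 2048-dim pooled feature
        self.backbone = backbone
        self.head = nn.Linear(2048, sum(self.SIZES.values()))

    def forward(self, images):                   # images: (B, 3, 224, 224)
        code = self.head(self.backbone(images))
        out, i = {}, 0
        for name, size in self.SIZES.items():    # slice the flat code into parts
            out[name] = code[:, i:i + size]
            i += size
        return out

params = CoarseEncoder()(torch.rand(2, 3, 224, 224))
print({k: v.shape for k, v in params.items()})
```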
Given a dataset of 2D face images I_i with multiple images per subject, corresponding identity labels c_i, and 68 2D keypoints k_i per image, the coarse reconstruction branch is trained by minimizing

    L_coarse = L_lmk + L_eye + L_pho + L_id + L_sc + L_reg,    (4)

with landmark loss L_lmk, eye closure loss L_eye, photometric loss L_pho, identity loss L_id, shape consistency loss L_sc and regularization L_reg.

Landmark re-projection loss: The landmark loss measures the difference between ground-truth 2D face landmarks k_i and the corresponding landmarks on the FLAME model's surface M_i ∈ R^3, projected into the image by the estimated camera model. The landmark loss is defined as

    L_lmk = Σ_{i=1}^{68} ∥k_i − sΠ(M_i) + t∥_1.    (5)

Eye closure loss: The eye closure loss computes the relative offset of landmarks k_i and k_j on the upper and lower eyelid, and measures the difference to the offset of the corresponding landmarks on FLAME's surface M_i and M_j projected into the image. Formally, the loss is given as

    L_eye = Σ_{(i,j)∈E} ∥k_i − k_j − sΠ(M_i − M_j)∥_1,    (6)

where E is the set of upper/lower eyelid landmark pairs. While the landmark loss, L_lmk (Eq. 5), penalizes the absolute landmark location differences, L_eye penalizes the relative difference between eyelid landmarks. Because the eye closure loss L_eye is translation invariant, it is less susceptible to a misalignment between the projected 3D face and the image, compared to L_lmk. In contrast, simply increasing the landmark loss for the eye landmarks affects the overall face shape and can lead to unsatisfactory reconstructions. See Fig. 10 for the effect of the eye-closure loss.
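Below is a minimal sketch of Eqs. (5) and (6) operating on already-projected landmarks. The eyelid index pairs stand in for the set E; the exact pairs follow the 68-landmark annotation and are placeholders here.

```python
import torch

def landmark_loss(lmk_gt, lmk_proj):
    """Eq. (5): sum over the 68 landmarks of the L1 distance in image space."""
    return (lmk_gt - lmk_proj).abs().sum()

def eye_closure_loss(lmk_gt, lmk_proj, lid_pairs):
    """Eq. (6): compare upper/lower-eyelid offsets, which are translation invariant.

    lid_pairs: list of (upper_idx, lower_idx) landmark index pairs (the set E).
    """
    up, lo = zip(*lid_pairs)
    up, lo = torch.tensor(up), torch.tensor(lo)
    d_gt = lmk_gt[up] - lmk_gt[lo]      # ground-truth eyelid offsets
    d_pr = lmk_proj[up] - lmk_proj[lo]  # projected FLAME eyelid offsets
    return (d_gt - d_pr).abs().sum()

# toy usage with placeholder eyelid index pairs
pairs = [(37, 41), (38, 40), (43, 47), (44, 46)]
print(landmark_loss(torch.rand(68, 2), torch.rand(68, 2)))
print(eye_closure_loss(torch.rand(68, 2), torch.rand(68, 2), pairs))
```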
Photometric loss: The photometric loss computes the error between the input image I and the rendering I_r as L_pho = ∥V_I ⊙ (I − I_r)∥_{1,1}. Here, V_I is a face mask with value 1 in the face skin region, and value 0 elsewhere, obtained by an existing face segmentation method [Nirkin et al. 2018], and ⊙ denotes the Hadamard product. Computing the error in only the face region provides robustness to common occlusions by e.g. hair, clothes, sunglasses, etc. Without this, the predicted albedo will also consider the color of the occluder, which may be far from skin color, resulting in unnatural rendering (see Fig. 10).

Identity loss: Recent 3D face reconstruction methods demonstrate the effectiveness of utilizing an identity loss to produce more realistic face shapes [Deng et al. 2019; Gecer et al. 2019]. Motivated by this, we also use a pretrained face recognition network [Cao et al. 2018b] to employ an identity loss during training. The face recognition network f outputs feature embeddings of the rendered images and the input image, and the identity loss then measures the cosine similarity between the two embeddings. Formally, the loss is defined as

    L_id = 1 − ( f(I) · f(I_r) ) / ( ∥f(I)∥_2 · ∥f(I_r)∥_2 ).    (7)

By computing the error between embeddings, the loss encourages the rendered image to capture fundamental properties of a person's identity, ensuring that the rendered image looks like the same person as the input subject. Figure 10 shows that the coarse shape results with L_id look more like the input subject than those without.
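A sketch of both losses: the skin-masked photometric L1 term and the cosine identity term of Eq. (7). The embeddings below are random placeholders standing in for the pretrained face recognition network, and mean reduction is used in place of the unnormalized 1,1-norm.

```python
import torch
import torch.nn.functional as F

def photometric_loss(image, render, skin_mask):
    """L_pho = || V_I * (I - I_r) ||, restricted to face-skin pixels.

    image, render: (B, 3, H, W); skin_mask: (B, 1, H, W) with 1 on face skin.
    """
    return (skin_mask * (image - render)).abs().mean()

def identity_loss(feat_input, feat_render):
    """Eq. (7): 1 - cosine similarity between face-recognition embeddings."""
    return 1.0 - F.cosine_similarity(feat_input, feat_render, dim=-1).mean()

print(photometric_loss(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224),
                       torch.ones(2, 1, 224, 224)))
print(identity_loss(torch.randn(2, 512), torch.randn(2, 512)))
```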
Shape consistency loss: Given two images I_i and I_j of the same subject (i.e. c_i = c_j), the coarse encoder E_c should output the same shape parameters (i.e. β_i = β_j). Previous work encourages shape consistency by enforcing the distance between β_i and β_j to be smaller by a margin than the distance to the shape coefficients corresponding to a different subject [Sanyal et al. 2019]. However, choosing this fixed margin is challenging in practice. Instead, we propose a different strategy by replacing β_i with β_j while keeping all other parameters unchanged. Given that β_i and β_j represent the same subject, this new set of parameters must reconstruct I_i well. Formally, we minimize

    L_sc = L_coarse(I_i, R(M(β_j, θ_i, ψ_i), B(α_i, l_i, N_uv,i), c_i)).    (8)

The goal is to make the rendered images look like the real person. If the method has correctly estimated the shape of the face in two images of the same person, then swapping the shape parameters between these images should produce rendered images that are indistinguishable. Thus, we employ the photometric and identity loss on the rendered images from swapped shape parameters.
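A sketch of this shape-swap strategy (Eq. 8): re-render image i with the shape code β taken from another image of the same subject, then apply the coarse losses. The renderer and loss function are placeholders for DECA's differentiable rendering pipeline.

```python
import torch

def shape_consistency_loss(params_i, params_j, image_i, render_fn, coarse_loss_fn):
    """Eq. (8) sketch: re-render image i with the shape code of image j.

    params_*: dicts with keys 'beta', 'psi', 'pose', 'alpha', 'light', 'cam';
    render_fn and coarse_loss_fn are placeholders for the differentiable
    renderer and the coarse losses (photometric + identity in the text).
    """
    swapped = dict(params_i)
    swapped["beta"] = params_j["beta"]   # same subject => same face shape
    render = render_fn(swapped)
    return coarse_loss_fn(image_i, render)

# toy usage with placeholder renderer and loss
dummy = {k: torch.randn(100 if k == "beta" else 10)
         for k in ["beta", "psi", "pose", "alpha", "light", "cam"]}
loss = shape_consistency_loss(dummy, dummy, torch.rand(3, 224, 224),
                              render_fn=lambda p: torch.rand(3, 224, 224),
                              coarse_loss_fn=lambda a, b: (a - b).abs().mean())
print(loss)
```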
Regularization: L_reg regularizes shape, E_β = ∥β∥²_2, expression, E_ψ = ∥ψ∥²_2, and albedo, E_α = ∥α∥²_2.

Fig. 3. Detail consistency loss. DECA uses multiple images of the same person during training to disentangle static person-specific details from expression-dependent details. When properly factored, we should be able to take the detail code from one image of a person and use it to reconstruct another image of that person with a different expression. See Sec. 4.3 for details. Images are taken from NoW [Sanyal et al. 2019]. Note that NoW images are not used for training, but are just selected for illustration purposes.

4.2 Detail reconstruction
The detail reconstruction augments the coarse FLAME geometry with a detailed UV displacement map D ∈ [−0.01, 0.01]^(d×d) (see Fig. 2). Similar to the coarse reconstruction, we train an encoder E_d
(with the same architecture as E_c) to encode I to a 128-dimensional latent code δ, representing subject-specific details. The latent code δ is then concatenated with FLAME's expression ψ and jaw pose parameters θ_jaw, and decoded by F_d to D.

Detail decoder: The detail decoder is defined as

    D = F_d(δ, ψ, θ_jaw),    (9)

where the detail code δ ∈ R^128 controls the static person-specific details. We leverage the expression ψ ∈ R^50 and jaw pose parameters θ_jaw ∈ R^3 from the coarse reconstruction branch to capture the dynamic expression wrinkle details. For rendering, D is converted to a normal map.

Detail rendering: The detail displacement model allows us to generate images with mid-frequency surface details. To reconstruct the detailed geometry M', we convert M and its surface normals N to UV space, denoted as M_uv ∈ R^(d×d×3) and N_uv ∈ R^(d×d×3), and combine them with D as

    M'_uv = M_uv + D ⊙ N_uv.    (10)

By calculating normals N' from M', we obtain the detail rendering I'_r by rendering M with the applied normal map as

    I'_r = R(M, B(α, l, N'), c).    (11)
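The following sketches Eqs. (9) and (10): a decoder that maps the concatenated (δ, ψ, θ_jaw) code to a UV displacement map, and the displacement of the coarse surface along its normals in UV space. The generator architecture (a small transposed-convolution stack) and the tanh-based output scaling are assumptions; the paper does not specify F_d's architecture here.

```python
import torch
import torch.nn as nn

class DetailDecoder(nn.Module):
    """F_d of Eq. (9): (delta, psi, theta_jaw) -> UV displacement map D (sketch)."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128 + 50 + 3, 256 * 4 * 4)
        ups = []
        for c_in, c_out in [(256, 128), (128, 64), (64, 32), (32, 16), (16, 8), (8, 4)]:
            ups += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU()]
        self.ups = nn.Sequential(*ups, nn.Conv2d(4, 1, 3, padding=1), nn.Tanh())
        self.scale = 0.01                      # keeps D within [-0.01, 0.01]

    def forward(self, delta, psi, theta_jaw):
        x = self.fc(torch.cat([delta, psi, theta_jaw], dim=-1)).view(-1, 256, 4, 4)
        return self.scale * self.ups(x)        # (B, 1, 256, 256)

def displace(M_uv, N_uv, D):
    """Eq. (10): M'_uv = M_uv + D * N_uv (displacement along the normals)."""
    return M_uv + D.permute(0, 2, 3, 1) * N_uv  # (B, d, d, 3)

D = DetailDecoder()(torch.randn(1, 128), torch.randn(1, 50), torch.randn(1, 3))
M_detail = displace(torch.rand(1, 256, 256, 3), torch.rand(1, 256, 256, 3), D)
print(D.shape, M_detail.shape)
```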
The detail reconstruction is trained by minimizing

    L_detail = L_phoD + L_mrf + L_sym + L_dc + L_regD,    (12)

with photometric detail loss L_phoD, ID-MRF loss L_mrf, soft symmetry loss L_sym, and detail regularization L_regD. Since our estimated albedo is generated by a linear model with 50 basis vectors, the rendered coarse face image only recovers low frequency information such as skin tone and basic facial attributes. High frequency details in the rendered image result mainly from the displacement map, and hence, since L_detail compares the rendered detailed image with the real image, F_d is forced to model detailed geometric information.
Detail photometric losses: With the applied detail displacement map, the rendered images I'_r contain some geometric details. Equivalent to the coarse rendering, we use a photometric loss L_phoD = ∥V_I ⊙ (I − I'_r)∥_{1,1}, where, recall, V_I is a mask representing the visible skin pixels.

ID-MRF loss: We adopt an Implicit Diversified Markov Random Field (ID-MRF) loss [Wang et al. 2018] to reconstruct geometric details. Given the input image and the detail rendering, the ID-MRF loss extracts feature patches from different layers of a pre-trained network, and then minimizes the difference between corresponding nearest neighbor feature patches from both images. Larsen et al. [2016] and Isola et al. [2017] point out that L1 losses are not able to recover the high frequency information in the data. Consequently, these two methods use a discriminator to obtain high-frequency detail. Unfortunately, this may result in an unstable adversarial training process. Instead, the ID-MRF loss regularizes the generated content to the original input at the local patch level; this encourages DECA to capture high-frequency details.

Following Wang et al. [2018], the loss is computed on layers conv3_2 and conv4_2 of VGG19 [Simonyan and Zisserman 2014] as

    L_mrf = 2 L_M(conv4_2) + L_M(conv3_2),    (13)

where L_M(layer_th) denotes the ID-MRF loss that is employed on the feature patches extracted from I'_r and I with layer layer_th of VGG19. As with the photometric losses, we compute L_mrf only for the face skin region in UV space.

Soft symmetry loss: To add robustness to self-occlusions, we add a soft symmetry loss to regularize non-visible face parts. Specifically, we minimize

    L_sym = ∥V_uv ⊙ (D − flip(D))∥_{1,1},    (14)

where V_uv denotes the face skin mask in UV space, and flip is the horizontal flip operation. Without L_sym, for extreme poses, boundary artifacts become visible in occluded regions (Fig. 9).

Detail regularization: The detail displacements are regularized by L_regD = ∥D∥_{1,1} to reduce noise.
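A minimal sketch of Eq. (14) and the detail regularizer; mean reduction stands in for the unnormalized 1,1-norm, and the mask and displacement map below are placeholders.

```python
import torch

def soft_symmetry_loss(D, skin_mask_uv):
    """Eq. (14): penalize asymmetry of D under a horizontal flip, on skin pixels."""
    return (skin_mask_uv * (D - torch.flip(D, dims=[-1]))).abs().mean()

def detail_regularization(D):
    """L_regD: shrink the displacement magnitudes to reduce noise."""
    return D.abs().mean()

D = 0.01 * torch.randn(1, 1, 256, 256)
print(soft_symmetry_loss(D, torch.ones_like(D)), detail_regularization(D))
```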
4.3 Detail disentanglement
Optimizing L_detail enables us to reconstruct faces with mid-frequency details. Making these detail reconstructions animatable, however, requires us to disentangle person-specific details (i.e. moles, pores, eyebrows, and expression-independent wrinkles) controlled by δ from expression-dependent wrinkles (i.e. wrinkles that change for varying facial expression) controlled by FLAME's expression and jaw pose parameters, ψ and θ_jaw. Our key observation is that the same person in two images should have both similar coarse geometry and personalized details.

Specifically, for the rendered detail image, exchanging the detail codes between two images of the same subject should have no effect on the rendered image. This concept is illustrated in Fig. 3. Here we take the jaw and expression parameters from image i, extract the detail code from image j, and combine these to estimate the wrinkle detail. When we swap detail codes between different images of the same person, the produced results must remain realistic.

Detail consistency loss: Given two images I_i and I_j of the same subject (i.e. c_i = c_j), the loss is defined as

    L_dc = L_detail(I_i, R(M(β_i, θ_i, ψ_i), A(α_i), F_d(δ_j, ψ_i, θ_jaw,i), l_i, c_i)),    (15)

where β_i, θ_i, ψ_i, θ_jaw,i, α_i, l_i, and c_i are the parameters of I_i, while δ_j is the detail code of I_j (see Fig. 3). The detail consistency loss is essential for the disentanglement of identity-dependent and expression-dependent details. Without the detail consistency loss, the person-specific detail code, δ, captures identity and expression dependent details, and therefore, reconstructed details cannot be re-posed by varying the FLAME jaw pose and expression. We show the necessity and effectiveness of L_dc in Sec. 6.3.
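A sketch of the swap in Eq. (15): render image i with its own coarse parameters but the detail code δ of another image of the same subject, then apply the detail losses. The decoder, renderer, and loss below are placeholders for the corresponding DECA components.

```python
import torch

def detail_consistency_loss(image_i, params_i, delta_j, detail_decoder,
                            render_detail_fn, detail_loss_fn):
    """Eq. (15) sketch: render image i with the detail code of image j.

    params_i holds image i's coarse parameters; delta_j is the 128-dim detail
    code of another image of the same subject. All callables are placeholders.
    """
    D = detail_decoder(delta_j, params_i["psi"], params_i["theta_jaw"])
    render = render_detail_fn(params_i, D)   # detailed rendering I'_r
    return detail_loss_fn(image_i, render)   # detail losses of Eq. (12), w/o L_dc

# toy call; only the fields accessed above are provided
loss = detail_consistency_loss(
    torch.rand(3, 224, 224),
    {"psi": torch.randn(1, 50), "theta_jaw": torch.randn(1, 3)},
    torch.randn(1, 128),
    detail_decoder=lambda d, p, j: torch.zeros(1, 1, 256, 256),
    render_detail_fn=lambda params, D: torch.rand(3, 224, 224),
    detail_loss_fn=lambda a, b: (a - b).abs().mean())
print(loss)
```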
5 IMPLEMENTATION DETAILS
Data: We train DECA on three publicly available datasets: VGGFace2 [Cao et al. 2018b], BUPT-Balancedface [Wang et al. 2019] and VoxCeleb2 [Chung et al. 2018a]. VGGFace2 [Cao et al. 2018b] contains images of over 8k subjects, with an average of more than 350 images per subject. BUPT-Balancedface [Wang et al. 2019] offers 7k subjects per ethnicity (i.e. Caucasian, Indian, Asian and African), and VoxCeleb2 [Chung et al. 2018a] contains 145k videos of 6k subjects. In total, DECA is trained on 2 Million images.

All datasets provide an identity label for each image. We use FAN [Bulat and Tzimiropoulos 2017] to predict 68 2D landmarks k_i on each face. To improve the robustness of the predicted landmarks, we run FAN for each image twice with different face crops, and discard all images with non-matching landmarks. See Sup. Mat. for details on data selection and data cleaning.

Implementation details: DECA is implemented in PyTorch [Paszke et al. 2019], using the differentiable rasterizer from Pytorch3D [Ravi et al. 2020] for rendering. We use Adam [Kingma and Ba 2015] as optimizer with a learning rate of 1e−4. The input image size is 224 × 224 and the UV space size is d = 256. See Sup. Mat. for details.
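As a concrete illustration of the landmark-based data cleaning described above, the following is a sketch of an agreement check between two FAN runs on different crops; the tolerance of 2% of the image size is an assumption, since the exact criterion is deferred to the Sup. Mat.

```python
import torch

def landmarks_consistent(lmk_a, lmk_b, image_size, tol=0.02):
    """Keep an image only if two landmark runs on different crops agree.

    lmk_a, lmk_b: (68, 2) predictions mapped back to the original image frame.
    The 2%-of-image-size tolerance is an assumption for illustration.
    """
    err = (lmk_a - lmk_b).norm(dim=-1).mean()
    return bool(err < tol * image_size)

print(landmarks_consistent(torch.rand(68, 2) * 224, torch.rand(68, 2) * 224, 224))
```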
6 EVALUATION

6.1 Qualitative evaluation
Reconstruction: Given a single face image, DECA reconstructs the 3D face shape with mid-frequency geometric details. The second row of Fig. 1 shows that the coarse shape (i.e. in FLAME space) well represents the overall face shape, and the learned DECA detail model reconstructs subject-specific details and wrinkles of the input identity (Fig. 1, row three), while being robust to partial occlusions.

Figure 5 qualitatively compares DECA results with state-of-the-art coarse face reconstruction methods, namely PRNet [Feng et al. 2018b], RingNet [Sanyal et al. 2019], Deng et al. [2019], FML [Tewari et al. 2019] and 3DDFA-V2 [Guo et al. 2020]. Compared to these methods, DECA better reconstructs the overall face shape with details like the nasolabial fold (rows 1, 2, 3, 4, and 6) and forehead wrinkles (row 3). DECA better reconstructs the mouth shape and the eye region than all other methods. DECA further reconstructs a full head while PRNet [Feng et al. 2018b], Deng et al. [2019], FML [Tewari et al. 2019] and 3DDFA-V2 [Guo et al. 2020] reconstruct tightly cropped faces. While RingNet [Sanyal et al. 2019], like DECA, is based on FLAME [Li et al. 2017], DECA better reconstructs the face shape and the facial expression.

Figure 6 compares DECA visually to existing detailed face reconstruction methods, namely Extreme3D [Tran et al. 2018], Cross-modal [Abrevaya et al. 2020], and FaceScape [Yang et al. 2020]. Extreme3D [Tran et al. 2018] and Cross-modal [Abrevaya et al. 2020] reconstruct more details than DECA but at the cost of being less robust to occlusions (rows 1, 2, 3). Unlike DECA, Extreme3D and Cross-modal only reconstruct static details. However, using static details instead of DECA's animatable details leads to visible artifacts when animating the face (see Fig. 4). While FaceScape [Yang et al. 2020] provides animatable details, unlike DECA, the method is trained on high-resolution scans while DECA is solely trained on in-the-wild images. Also, with occlusion, FaceScape produces artifacts (rows 1, 2) or effectively fails (row 3).

In summary, DECA produces high-quality reconstructions, outperforming previous work in terms of robustness, while enabling animation of the detailed reconstruction. To demonstrate the quality of DECA and the robustness to variations in head pose, expression, occlusions, image resolution, lighting conditions, etc., we show results for 200 randomly selected AFLW2000 [Zhu et al. 2015] images in the Sup. Mat. along with more qualitative coarse and detail reconstruction comparisons to the state-of-the-art.

Detail animation: DECA models detail displacements as a function of subject-specific detail parameters δ and FLAME's jaw pose θ_jaw and expression parameters ψ as illustrated in Fig. 2 (right). This formulation allows us to animate detailed facial geometry such that wrinkles are specific to the source shape and expression as shown in Fig. 1. Using static details instead of DECA's animatable details (i.e. by using the reconstructed details as a static displacement map) and animating only the coarse shape by changing the FLAME parameters results in visible artifacts as shown in Fig. 4 (top), while animatable details (middle) look similar to the reference shape (bottom) of the same identity. Figure 7 shows more examples where using static details results in artifacts at the mouth corner or the forehead region, while DECA's animated results look plausible.

6.2 Quantitative evaluation
We compare DECA with publicly available methods, namely 3DDFA-V2 [Guo et al. 2020], Deng et al. [2019], RingNet [Sanyal et al. 2019], PRNet [Feng et al. 2018b], 3DMM-CNN [Tran et al. 2017] and Extreme3D [Tran et al. 2018]. Note that there is no benchmark face dataset with ground truth shape detail. Consequently, our quantitative analysis focuses on the accuracy of the coarse shape. Note that DECA achieves SOTA performance on 3D reconstruction without any paired 3D data in training.

NoW benchmark: The NoW challenge [Sanyal et al. 2019] consists of 2054 face images of 100 subjects, split into a validation set (20 subjects) and a test set (80 subjects), with a reference 3D face scan per subject. The images consist of indoor and outdoor images, neutral expression and expressive face images, partially occluded faces, varying viewing angles ranging from frontal view to profile view, and selfie images. The challenge provides a standard evaluation protocol that measures the distance from all reference scan vertices to the closest point in the reconstructed mesh surface, after rigidly aligning scans and reconstructions. For details, see [NoW challenge 2019].
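A small sketch of this kind of evaluation metric: per reference scan vertex, the distance to the reconstruction. For simplicity it uses the nearest reconstructed vertex; the benchmark itself uses point-to-surface distance after rigid alignment, and both refinements are omitted here.

```python
import torch

def scan_to_mesh_error(scan_vertices, recon_vertices):
    """Approximate scan-to-mesh error: nearest-neighbor vertex distances.

    scan_vertices: (S, 3) reference scan points, recon_vertices: (R, 3).
    Returns a (S,) tensor of distances (point-to-surface and rigid alignment
    are omitted in this sketch).
    """
    d = torch.cdist(scan_vertices, recon_vertices)  # (S, R) pairwise distances
    return d.min(dim=1).values

errs = scan_to_mesh_error(torch.rand(1000, 3), torch.rand(5023, 3))
print(errs.median(), errs.mean(), errs.std())
```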
We found that the tightly cropped face meshes predicted by Deng et al. [2019] are smaller than the NoW reference scans, which would result in a high reconstruction error in the missing region. For a fair comparison to the method of Deng et al. [2019], we use the Basel Face Model (BFM) [Paysan et al. 2009] parameters they output, reconstruct the complete BFM mesh, and get the NoW evaluation for these complete meshes. As shown in Tab. 1 and the cumulative error plot in Figure 8 (left), DECA gives state-of-the-art results on NoW, providing the reconstruction error with the lowest mean, median, and standard deviation.

To quantify the influence of the geometric details, we separately evaluate the coarse and the detail shape (i.e. w/o and w/ details) on the NoW validation set. The reconstruction errors are, median: 1.18/1.19 (coarse/detailed), mean: 1.46/1.47 (coarse/detailed), std: 1.25/1.25 (coarse/detailed). This indicates that while the detail shape improves visual quality when compared to the coarse shape, the quantitative performance is slightly worse.

To test for gender bias in the results, we report errors separately for female (f) and male (m) NoW test subjects. We find that recovered female shapes are slightly more accurate. Reconstruction errors are, median: 1.03/1.16 (f/m), mean: 1.32/1.45 (f/m), and std:
1.16/1.20 (f/m). The cumulative error plots in Fig. 1 of the Sup. Mat. demonstrate that DECA gives state-of-the-art performance for both genders.

Fig. 4. Effect of DECA's animatable details. Given images of source identity I and source expression E (left), DECA reconstructs the detail shapes (middle) and animates the detail shape of I with the expression of E (right, middle). This synthesized DECA expression appears nearly identical to the reconstructed same subject's reference detail shape (right, bottom). Using the reconstructed details of I instead (i.e. static details) and animating the coarse shape only, results in visible artifacts (right, top). See Sec. 6.1 for details. Input images are taken from NoW [Sanyal et al. 2019].

Fig. 5. Comparison to other coarse reconstruction methods, from left to right: PRNet [Feng et al. 2018b], RingNet [Sanyal et al. 2019], Deng et al. [2019], FML [Tewari et al. 2019], 3DDFA-V2 [Guo et al. 2020], DECA (ours). Input images are taken from VoxCeleb2 [Chung et al. 2018b].

Table 1. Reconstruction error on the NoW [Sanyal et al. 2019] benchmark.

Method                          Median (mm)   Mean (mm)   Std (mm)
3DMM-CNN [Tran et al. 2017]         1.84         2.33        2.05
PRNet [Feng et al. 2018b]           1.50         1.98        1.88
Deng et al. [2019]                  1.23         1.54        1.29
RingNet [Sanyal et al. 2019]        1.21         1.54        1.31
3DDFA-V2 [Guo et al. 2020]          1.23         1.57        1.39
MGCNet [Shang et al. 2020]          1.31         1.87        2.63
DECA (ours)                         1.09         1.38        1.18

Feng et al. benchmark: The Feng et al. [2018a] challenge contains 2000 face images of 135 subjects, and a reference 3D face scan for each subject. The benchmark consists of 1344 low-quality (LQ) images extracted from videos, and 656 high-quality (HQ) images taken in controlled scenarios. A protocol similar to NoW is used for evaluation, which measures the distance between all reference scan vertices to the closest points on the reconstructed mesh surface, after rigidly aligning scan and reconstruction. As shown in Tab. 2 and the cumulative error plot in Fig. 8 (middle & right), DECA provides state-of-the-art performance.

Table 2. Feng et al. [2018a] benchmark performance.

Method                          Median (mm)      Mean (mm)       Std (mm)
                                  LQ     HQ        LQ     HQ       LQ     HQ
3DMM-CNN [Tran et al. 2017]      1.88   1.85      2.32   2.29     1.89   1.88
Extreme3D [Tran et al. 2018]     2.40   2.37      3.49   3.58     6.15   6.75
PRNet [Feng et al. 2018b]        1.79   1.59      2.38   2.06     2.19   1.79
RingNet [Sanyal et al. 2019]     1.63   1.59      2.08   2.02     1.79   1.69
3DDFA-V2 [Guo et al. 2020]       1.62   1.49      2.10   1.91     1.87   1.64
DECA (ours)                      1.48   1.45      1.91   1.89     1.66   1.68

6.3 Ablation experiment
Detail consistency loss: To evaluate the importance of our novel detail consistency loss L_dc (Eq. 15), we train DECA with and without L_dc. Figure 9 (left) shows the DECA details for detail code δ_I from
the source identity, and expression ψ_E and jaw pose parameters θ_jaw,E from the source expression. For DECA trained with L_dc (top), wrinkles appear in the forehead as a result of the raised eyebrows of the source expression, while for DECA trained without L_dc (bottom), no such wrinkles appear. This indicates that without L_dc, person-specific details and expression-dependent wrinkles are not well disentangled. See Sup. Mat. for more disentanglement results.

ID-MRF loss: Figure 9 (right) shows the effect of L_mrf on the detail reconstruction. Without L_mrf (middle), wrinkle details (e.g. in the forehead) are not reconstructed, resulting in an overly smooth result. With L_mrf (right), DECA captures the details.

Other losses: We also evaluate the effect of the eye-closure loss L_eye, segmentation on the photometric loss, and the identity loss L_id. Fig. 10 provides a qualitative comparison of the DECA coarse model with/without using these losses. Quantitatively, we also evaluate DECA with and without L_id on the NoW validation set; the former gives a mean error of 1.46mm, while the latter is worse with an error of 1.59mm.

Fig. 6. Comparison to other detailed face reconstruction methods, from left to right: Extreme3D [Tran et al. 2018], FaceScape [Yang et al. 2020], Cross-modal [Abrevaya et al. 2020], DECA (ours). See Sup. Mat. for many more examples. Input images are taken from AFLW2000 [Zhu et al. 2015] (rows 1-3) and VGGFace2 [Cao et al. 2018b] (rows 4-6).

Fig. 7. Effect of DECA's animatable details. Given a single image (left), DECA reconstructs a coarse mesh (second column) and a detailed mesh (third column). Using static details and animating (i.e. reposing) the coarse FLAME shape only (fourth column) results in visible artifacts as highlighted by the red boxes. Instead, reposing with DECA's animatable details (right) results in a more realistic mesh with geometric details. The reposing uses the source expression shown in Fig. 1 (bottom). Input images are taken from NoW [Sanyal et al. 2019] (top), and Pexels [2021] (bottom).

7 LIMITATIONS AND FUTURE WORK
While DECA achieves SOTA results for reconstructed face shape and provides novel animatable details, there are several limitations. First, the rendering quality for DECA detailed meshes is mainly limited by the albedo model, which is derived from BFM. DECA requires an albedo space without baked-in shading, specularities, and shadows in order to disentangle facial albedo from geometric details. Future work should focus on learning a high-quality albedo model with a sufficiently large variety of skin colors, texture details, and no illumination effects. Second, existing methods, like DECA, do not explicitly model facial hair. This pushes skin tone into the lighting model and causes facial hair to be explained by shape deformations. A different approach is needed to properly model this. Third, while robust, our method can still fail due to extreme head pose and lighting. While we are tolerant to common occlusions in existing face datasets (Fig. 6 and examples in Sup. Mat.), we do not address extreme occlusion, e.g. where the hand covers large portions of the face. This suggests the need for more diverse training data.

Further, the training set contains many low-res images, which help with robustness but can introduce noisy details. Existing high-res datasets (e.g. [Karras et al. 2018, 2019]) are less varied, thus training DECA from these datasets results in a model that is less robust to general in-the-wild images, but captures more detail. Additionally, the limited size of high-resolution datasets makes it difficult to disentangle expression- and identity-dependent details. To further research on this topic, we also release a model trained using high-resolution images only (i.e. DECA-HR). Using DECA-HR increases the visual quality and reduces noise in the reconstructed details at the cost of being less robust (i.e. to low image resolutions, extreme head poses, extreme expressions, etc.).

DECA uses a weak perspective camera model. To use DECA to recover head geometry from "selfies", we would need to extend the method to include the focal length. For some applications, the focal length may be directly available from the camera. However, inferring 3D geometry and focal length from a single image under perspective projection for in-the-wild images is unsolved and likely requires explicit supervision during training (cf. [Zhao et al. 2019]).
Finally, in future work, we want to extend the model over time, both for tracking and to learn more personalized models of individuals from video where we could enforce continuity of intrinsic wrinkles over time.

Fig. 8. Quantitative comparison to state-of-the-art on two 3D face reconstruction benchmarks, namely the NoW [Sanyal et al. 2019] challenge (left) and the Feng et al. [2018a] benchmark for low-quality (middle) and high-quality (right) images.

Fig. 9. Ablation experiments. Top: Effects of L_dc on the animation of the source identity with the source expression visualized on a neutral expression template mesh. Without L_dc, no wrinkles appear in the forehead despite the "surprise" source expression. Middle: Effect of L_mrf on the detail reconstruction. Without L_mrf, fewer details are reconstructed. Bottom: Effect of L_sym on the reconstructed details. Without L_sym, boundary artifacts become visible. Input images are taken from NoW [Sanyal et al. 2019] (rows 1 & 4), Chicago [Ma et al. 2015] (row 2), and Pexels [2021] (row 3).

Fig. 10. More ablation experiments. Left: estimated landmarks and reconstructed coarse shape from DECA (first column) and DECA without L_eye (second column), and without L_id (third column). When trained without L_eye, DECA is not able to capture closed-eye expressions. Using L_id helps reconstruct coarse shape. Right: rendered image from DECA and DECA without segmentation. Without using the skin mask in the photometric loss, the estimated result bakes in the color of the occluder (e.g. sunglasses, hats) into the albedo. Input images are taken from NoW [Sanyal et al. 2019].

8 CONCLUSION
We have presented DECA, which enables detailed expression capture and animation from single images by learning an animatable detail model from a dataset of in-the-wild images. In total, DECA is trained from about 2M in-the-wild face images without 2D-to-3D supervision. DECA reaches state-of-the-art shape reconstruction performance enabled by a shape consistency loss. A novel detail consistency loss helps DECA to disentangle expression-dependent wrinkles from person-specific details. The low-dimensional detail latent space makes the fine-scale reconstruction robust to noise and occlusions, and the novel loss leads to disentanglement of identity- and expression-dependent wrinkle details. This enables applications like animation, shape change, wrinkle transfer, etc. DECA is publicly available for research purposes. Due to the reconstruction accuracy, the reliability, and the speed, DECA is useful for applications like face reenactment or virtual avatar creation.

ACKNOWLEDGMENTS
We thank S. Sanyal for providing us the RingNet PyTorch implementation, support with paper writing, and fruitful discussions, M.
Learning an Animatable Detailed 3D Face Model from In-The-Wild Images • 88:11

Kocabas, N. Athanasiou, V. Fernández Abrevaya, and R. Danecek for Models - Past, Present, and Future. ACM Transactions on Graphics (TOG) 39, 5 (2020),
the helpful suggestions, and T. McConnell and S. Sorce for the video 157:1–157:38.
Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. 2018b. Joint 3D Face
voice over. This work was partially supported by the Max Planck Reconstruction and Dense Alignment with Position Map Regression Network. In
ETH Center for Learning Systems. European Conference on Computer Vision (ECCV). 534–551.
Zhen-Hua Feng, Patrik Huber, Josef Kittler, Peter Hancock, Xiao-Jun Wu, Qijun Zhao,
Disclosure: MJB has received research gift funds from Intel, Nvidia, Paul Koppen, and Matthias Rätsch. 2018a. Evaluation of dense 3D reconstruction
Adobe, Facebook, and Amazon. While MJB is a part-time employee from 2D face images in the wild. In International Conference on Automatic Face &
of Amazon, his research was performed solely at, and funded solely Gesture Recognition (FG).
Flickr image. 2021. https://www.flickr.com/photos/gageskidmore/14602415448/.
by, MPI. MJB has financial interests in Amazon, Datagen Technolo- Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul E. Debevec.
gies, and Meshcapade GmbH. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM
Transactions on Graphics (TOG) 34, 1 (2014), 8:1–8:14.
Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick
REFERENCES
Victoria Fernández Abrevaya, Adnane Boukhayma, Philip HS Torr, and Edmond Boyer. 2020. Cross-modal Deep Face Normals with Deactivable Skip Connections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4979–4989.
Oswald Aldrian and William AP Smith. 2013. Inverse Rendering of Faces with a 3D Morphable Model. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 35, 5 (2013), 1080–1093.
Anil Bas, William A. P. Smith, Timo Bolkart, and Stefanie Wuhrer. 2017. Fitting a 3D Morphable Model to Edges: A Comparison Between Hard and Soft Correspondences. In Asian Conference on Computer Vision Workshops. 377–391.
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-quality single-shot capture of facial geometry. ACM Transactions on Graphics (TOG) 29, 4 (2010), 40.
Bernd Bickel, Manuel Lang, Mario Botsch, Miguel A. Otaduy, and Markus H. Gross. 2008. Pose-Space Animation and Transfer of Facial Details. In Eurographics/SIGGRAPH Symposium on Computer Animation (SCA), Markus H. Gross and Doug L. James (Eds.). 57–66.
Volker Blanz, Sami Romdhani, and Thomas Vetter. 2002. Face identification across different poses and illuminations with a 3D morphable model. In International Conference on Automatic Face & Gesture Recognition (FG). 202–207.
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In SIGGRAPH. 187–194.
Alan Brunton, Augusto Salazar, Timo Bolkart, and Stefanie Wuhrer. 2014. Review of statistical shape spaces for 3D data with comparative analysis for human faces. Computer Vision and Image Understanding (CVIU) 128 (2014), 1–17.
Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In IEEE International Conference on Computer Vision (ICCV). 1021–1030.
Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time high-fidelity facial performance capture. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–9.
Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. 2018b. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face & Gesture Recognition (FG). 67–74.
Xuan Cao, Zhang Chen, Anpei Chen, Xin Chen, Shiying Li, and Jingyi Yu. 2018a. Sparse Photometric 3D Face Reconstruction Guided by Morphable Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4635–4644.
Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard Medioni. 2018. ExpNet: Landmark-free, deep, 3D facial expressions. In International Conference on Automatic Face & Gesture Recognition (FG). 122–129.
Bindita Chaudhuri, Noranart Vesdapunt, Linda G. Shapiro, and Baoyuan Wang. 2020. Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting. In European Conference on Computer Vision (ECCV). 142–160.
Anpei Chen, Zhang Chen, Guli Zhang, Kenny Mitchell, and Jingyi Yu. 2019. Photo-Realistic Facial Details Synthesis from Single Image. In IEEE International Conference on Computer Vision (ICCV). 9429–9439.
J. S. Chung, A. Nagrani, and A. Zisserman. 2018a. VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH.
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018b. VoxCeleb2: Deep Speaker Recognition. In Annual Conference of the International Speech Communication Association (Interspeech), B. Yegnanarayana (Ed.). ISCA, 1086–1090.
Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. 2019. Capture, Learning, and Synthesis of 3D Speaking Styles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10101–10111.
Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set. In Computer Vision and Pattern Recognition Workshops. 285–295.
Pengfei Dou, Shishir K Shah, and Ioannis A Kakadiaris. 2017. End-to-end 3D face reconstruction with deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5908–5917.
Bernhard Egger, William A. P. Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhöfer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, Christian Theobalt, Volker Blanz, and Thomas Vetter. 2020. 3D Morphable Face Models - Past, Present, and Future. ACM Transactions on Graphics (TOG) 39, 5 (2020), 157:1–157:38.
Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. 2018b. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network. In European Conference on Computer Vision (ECCV). 534–551.
Zhen-Hua Feng, Patrik Huber, Josef Kittler, Peter Hancock, Xiao-Jun Wu, Qijun Zhao, Paul Koppen, and Matthias Rätsch. 2018a. Evaluation of dense 3D reconstruction from 2D face images in the wild. In International Conference on Automatic Face & Gesture Recognition (FG).
Flickr image. 2021. https://www.flickr.com/photos/gageskidmore/14602415448/.
Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul E. Debevec. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM Transactions on Graphics (TOG) 34, 1 (2014), 8:1–8:14.
Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of personalized 3D face rigs from monocular video. ACM Transactions on Graphics (TOG) 35, 3 (2016), 28.
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. 2019. GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1155–1164.
Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T. Freeman. 2018. Unsupervised Training for 3D Morphable Model Regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 8377–8386.
Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Luthi, Sandro Schönborn, and Thomas Vetter. 2018. Morphable face models - an open framework. In International Conference on Automatic Face & Gesture Recognition (FG). 75–82.
Partha Ghosh, Pravir Singh Gupta, Roy Uziel, Anurag Ranjan, Michael J. Black, and Timo Bolkart. 2020. GIF: Generative Interpretable Faces. In International Conference on 3D Vision (3DV). 868–878.
Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister, Szymon Rusinkiewicz, and Thomas A. Funkhouser. 2006. A statistical model for synthesis of detailed facial geometry. ACM Transactions on Graphics (TOG) 25, 3 (2006), 1025–1034.
Riza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. 2017. DenseReg: Fully convolutional dense shape regression in-the-wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6799–6808.
Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, and Stan Z Li. 2020. Towards Fast, Accurate and Stable 3D Dense Face Alignment. In European Conference on Computer Vision (ECCV). 152–168.
Yudong Guo, Jianfei Cai, Boyi Jiang, Jianmin Zheng, et al. 2018. CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 41, 6 (2018), 1294–1307.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Transactions on Graphics (TOG) 36, 6 (2017), 195:1–195:14.
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics (TOG) 34, 4 (2015), 45.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976.
Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. 2017. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In IEEE International Conference on Computer Vision (ICCV). 1031–1039.
László A Jeni, Jeffrey F Cohn, and Takeo Kanade. 2015. Dense 3D face alignment from 2D videos in real-time. In International Conference on Automatic Face & Gesture Recognition (FG), Vol. 1. 1–8.
Luo Jiang, Juyong Zhang, Bailin Deng, Hao Li, and Ligang Liu. 2018. 3D face reconstruction with geometry details from a single image. Transactions on Image Processing 27, 10 (2018), 4756–4770.
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (Proc. SIGGRAPH) 36, 4 (2017), 94:1–94:12.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).
Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4401–4410.
Ira Kemelmacher-Shlizerman and Steven M Seitz. 2011. Face reconstruction in the wild. In IEEE International Conference on Computer Vision (ICCV). 1746–1753.
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018a. Deep video portraits. ACM Transactions on Graphics (TOG) 37, 4 (2018), 163:1–163:14.
Hyeongwoo Kim, Michael Zollhöfer, Ayush Tewari, Justus Thies, Christian Richardt, and Christian Theobalt. 2018b. InverseFaceNet: Deep Monocular Inverse Face Rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4625–4634.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
Martin Köstinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. 2011. Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization. In IEEE International Conference on Computer Vision Workshops (ICCV-W). 2144–2151.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), Vol. 48. 1558–1566.
Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. 2020. AvatarMe: Realistically Renderable 3D Facial Reconstruction "In-the-Wild". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 757–766.
Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. 2009. Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics (TOG) 28 (2009), 175.
Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime facial animation with on-the-fly correctives. ACM Transactions on Graphics (TOG) 32, 4 (2013), 42–1.
Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 36, 6 (2017), 194:1–194:17.
Yue Li, Liqian Ma, Haoqiang Fan, and Kenny Mitchell. 2018. Feature-preserving detailed 3D face reconstruction from a single image. In European Conference on Visual Media Production. 1–9.
Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink. 2015. The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods 47 (2015), 1122–1135.
Wan-Chun Ma, Andrew Jones, Jen-Yuan Chiang, Tim Hawkins, Sune Frederiksen, Pieter Peers, Marko Vukovic, Ming Ouhyoung, and Paul E. Debevec. 2008. Facial performance synthesis using deformation-driven polynomial displacement maps. ACM Transactions on Graphics (TOG) 27, 5 (2008), 121.
Araceli Morales, Gemma Piella, and Federico M Sukno. 2021. Survey on 3D face reconstruction from uncalibrated images. Computer Science Review 40 (2021), 100400.
Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: real-time avatars using dynamic textures. ACM Transactions on Graphics (TOG) 37, 6 (2018), 258:1–258:12.
Yuval Nirkin, Iacopo Masi, Anh Tran Tuan, Tal Hassner, and Gerard Medioni. 2018. On face segmentation, face swapping, and face perception. In International Conference on Automatic Face & Gesture Recognition (FG). 98–105.
NoW challenge. 2019. https://ringnet.is.tue.mpg.de/challenge.
Frederick Ira Parke. 1974. A parametric model for human faces. Technical Report. University of Utah.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS).
Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. 2009. A 3D face model for pose and illumination invariant face recognition. In International Conference on Advanced Video and Signal Based Surveillance. 296–301.
Pexels. 2021. https://www.pexels.com.
Frédéric Pighin, Jamie Hecker, Dani Lischinski, Richard Szeliski, and David H. Salesin. 1998. Synthesizing Realistic Facial Expressions from Photographs. In SIGGRAPH. 75–84.
Stylianos Ploumpis, Evangelos Ververas, Eimear O’Sullivan, Stylianos Moschoglou, Haoyang Wang, Nick Pears, William Smith, Baris Gecer, and Stefanos P Zafeiriou. 2020. Towards a complete 3D morphable model of the human head. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2020).
R. Ramamoorthi and P. Hanrahan. 2001. An efficient representation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (2001).
Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. PyTorch3D. https://github.com/facebookresearch/pytorch3d.
Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. 2021. MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement. CoRR abs/2104.08223 (2021).
E. Richardson, M. Sela, and R. Kimmel. 2016. 3D Face Reconstruction by Learning from Synthetic Data. In International Conference on 3D Vision (3DV). 460–469.
Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning Detailed Face Reconstruction From a Single Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1259–1268.
Jérémy Riviere, Paulo F. U. Gotardo, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. 2020. Single-shot high-quality facial geometry and skin appearance capture. ACM Transactions on Graphics (Proc. SIGGRAPH) 39, 4 (2020), 81.
Sami Romdhani, Volker Blanz, and Thomas Vetter. 2002. Face identification by fitting a 3D morphable model using linear shape and texture error functions. In European Conference on Computer Vision (ECCV). 3–19.
Sami Romdhani and Thomas Vetter. 2005. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2. 986–993.
Joseph Roth, Yiying Tong, and Xiaoming Liu. 2016. Adaptive 3D face reconstruction from unconstrained photo collections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4197–4206.
Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2017. Photorealistic Facial Texture Inference Using Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5144–5153.
Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael Black. 2019. Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7763–7772.
Kristina Scherbaum, Tobias Ritschel, Matthias Hullin, Thorsten Thormählen, Volker Blanz, and Hans-Peter Seidel. 2011. Computer-suggested facial makeup. Computer Graphics Forum 30, 2 (2011), 485–492.
Matan Sela, Elad Richardson, and Ron Kimmel. 2017. Unrestricted facial geometry reconstruction using image-to-image translation. In IEEE International Conference on Computer Vision (ICCV). 1576–1585.
Soumyadip Sengupta, Angjoo Kanazawa, Carlos D. Castillo, and David W. Jacobs. 2018. SfSNet: Learning Shape, Reflectance and Illuminance of Faces in the Wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6296–6305.
Jiaxiang Shang, Tianwei Shen, Shiwei Li, Lei Zhou, Mingmin Zhen, Tian Fang, and Long Quan. 2020. Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency. In European Conference on Computer Vision (ECCV), Vol. 12360. 53–70.
Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. 2014. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Transactions on Graphics (TOG) 33, 6 (2014), 222.
Il-Kyu Shin, A Cengiz Öztireli, Hyeon-Joong Kim, Thabo Beeler, Markus Gross, and Soo-Mi Choi. 2014. Extraction and transfer of facial expression wrinkles for facial performance enhancement. In Pacific Conference on Computer Graphics and Applications. 113–118.
Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
Ron Slossberg, Gil Shamai, and Ron Kimmel. 2018. High quality facial surface and texture synthesis via generative adversarial networks. In European Conference on Computer Vision Workshops (ECCV-W).
Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M Seitz. 2014. Total moving face reconstruction. In European Conference on Computer Vision (ECCV). 796–812.
Ayush Tewari, Florian Bernard, Pablo Garrido, Gaurav Bharaj, Mohamed Elgharib, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. 2019. FML: Face Model Learning from Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10812–10822.
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. 2020. StyleRig: Rigging StyleGAN for 3D Control Over Portrait Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6141–6150.
Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. 2018. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2549–2559.
Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. 2017. MoFA: Model-Based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In IEEE International Conference on Computer Vision (ICCV). 1274–1283.
Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. 2015. Real-time expression transfer for facial reenactment. ACM Transactions on Graphics (TOG) 34, 6 (2015), 183–1.
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2387–2395.
Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. 2017. Regressing Robust and Discriminative 3D Morphable Models With a Very Deep Neural Network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1599–1608.
Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gérard Medioni. 2018. Extreme 3D face reconstruction: Seeing through occlusions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3935–3944.
Luan Tran, Feng Liu, and Xiaoming Liu. 2019. Towards High-Fidelity Nonlinear 3D Face Morphable Model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1126–1135.
Xiaoguang Tu, Jian Zhao, Zihang Jiang, Yao Luo, Mei Xie, Yang Zhao, Linxiao He, Zheng Ma, and Jiashi Feng. 2019. Joint 3D Face Reconstruction and Dense Face Alignment from A Single Image with 2D-Assisted Self-Supervised Learning. IEEE International Conference on Computer Vision (ICCV) (2019).
Thomas Vetter and Volker Blanz. 1998. Estimating coloured 3D face models from single images: An example based approach. In European Conference on Computer Vision (ECCV). 499–513.
Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. 2019. Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network. In IEEE International Conference on Computer Vision (ICCV).
Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. 2018. Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 331–340.
Huawei Wei, Shuang Liang, and Yichen Wei. 2019. 3D Dense Face Alignment via Graph Convolution Networks. arXiv preprint arXiv:1904.05562 (2019).
Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. 2011. Realtime performance-based facial animation. ACM Transactions on Graphics (Proc. SIGGRAPH) 30, 4 (2011), 77.
Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity Facial Reflectance and Geometry Inference from an Unconstrained Image. ACM Transactions on Graphics (TOG) 37, 4 (2018), 162:1–162:14.
Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. 2020. FaceScape: a Large-scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 601–610.
Xiaoxing Zeng, Xiaojiang Peng, and Yu Qiao. 2019. DF2Net: A Dense-Fine-Finer Network for Detailed 3D Face Reconstruction. In IEEE International Conference on Computer Vision (ICCV).
Yajie Zhao, Zeng Huang, Tianye Li, Weikai Chen, Chloe LeGendre, Xinglei Ren, Ari Shapiro, and Hao Li. 2019. Learning perspective undistortion of portraits. In IEEE International Conference on Computer Vision (ICCV). 7849–7859.
Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. 2015. High-fidelity pose and expression normalization for face recognition in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 787–796.
Michael Zollhöfer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick Pérez, Marc Stamminger, Matthias Nießner, and Christian Theobalt. 2018. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Computer Graphics Forum (Eurographics State of the Art Reports 2018) 37, 2 (2018).
