Supplementary Material for Large Pose 3D Face Reconstruction from a Single
Image via Direct Volumetric CNN Regression
Aaron S. Jackson¹    Adrian Bulat¹    Vasileios Argyriou²    Georgios Tzimiropoulos¹
¹The University of Nottingham, UK    ²Kingston University, UK
¹{aaron.jackson, adrian.bulat, yorgos.tzimiropoulos}@nottingham.ac.uk
²vasileios.argyriou@kingston.ac.uk
[Figures 1 and 2: cumulative error curves plotting % of images (0% to 100%) against NME normalised by outer interocular distance (0 to 0.1), for VRN - Guided (ours), VRN - Multitask (ours), VRN (ours), 3DDFA and EOS λ = 5000.]

Figure 1: NME-based performance on the in-the-wild AFLW2000-3D dataset, where ICP has been used to remove the rigid transformation. The proposed Volumetric Regression Networks are compared against EOS and 3DDFA.

Figure 2: NME-based performance on our large pose and expression renderings of the BU4DFE dataset, where ICP has been used to remove the rigid transformation. The proposed Volumetric Regression Networks are compared against EOS and 3DDFA.
1. Results with ICP Registration

   We present results where ICP has been used not only to find the correspondence between the groundtruth and predicted vertices, but also to remove the rigid transformation between them. We find that this offers a marginal improvement to all methods; however, the relative ranking of the methods remains mostly the same. Results on AFLW2000 [5], BU4DFE [3] and Florence [1] can be seen in Figs. 1, 2 and 3 respectively. Numeric results can be found in Table 1.

Table 1: Reconstruction accuracy on AFLW2000-3D, BU4DFE and Florence in terms of NME, where ICP has been used to remove the rigid transformation. Lower is better.

Method             AFLW2000    BU4DFE    Florence
VRN                  0.0605     0.0514     0.0470
VRN - Multitask      0.0625     0.0533     0.0439
VRN - Guided         0.0543     0.0471     0.0429
3DDFA [5]            0.1012     0.1144     0.0784
EOS [2]              0.0890     0.1456     0.1200
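For concreteness, the sketch below (Python with NumPy/SciPy; not the authors' released evaluation code) shows one way this protocol could be implemented: a simple point-to-point ICP alternating nearest-neighbour correspondence with a Kabsch rigid fit, followed by the NME normalised by the outer interocular distance. The function names and the outer_interocular_dist input are illustrative assumptions.

import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    # Kabsch: least-squares R, t such that dst ≈ src @ R.T + t.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

def icp_align(pred, gt, iters=50, tol=1e-7):
    """Rigidly align predicted vertices (N x 3) to groundtruth vertices
    (M x 3) by iterating nearest-neighbour matching and a rigid fit."""
    tree = cKDTree(gt)
    aligned, prev_err = pred.copy(), np.inf
    for _ in range(iters):
        dist, idx = tree.query(aligned)       # closest groundtruth vertex
        R, t = best_rigid_transform(aligned, gt[idx])
        aligned = aligned @ R.T + t
        err = dist.mean()
        if prev_err - err < tol:              # converged
            break
        prev_err = err
    return aligned

def nme(pred, gt, outer_interocular_dist):
    """Mean distance to the closest groundtruth vertex after ICP,
    normalised by the outer interocular distance (x-axis of Figs. 1-3)."""
    aligned = icp_align(pred, gt)
    dist, _ = cKDTree(gt).query(aligned)
    return dist.mean() / outer_interocular_dist

Note that only rotation and translation are estimated here, since it is only the rigid transformation that is removed before computing the error.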
2. Results on 300VW

   To demonstrate that our method can work in unconstrained environments and on video, we ran our VRN - Guided method on some of the more challenging Category C footage from the 300VW [4] dataset. These videos are usually challenging for at least one of the following reasons: large pose, low quality video, heavy motion blur and occlusion. We produce these results on a frame-by-frame basis; each frame is regressed individually, without tracking. Videos will be made available on our project website and can also be found in the supplementary material.
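As a hypothetical sketch of this per-frame protocol (Python with OpenCV; the vrn_guided callable standing in for the trained network is an assumption, not part of any released code):

import cv2

def reconstruct_video(video_path, vrn_guided):
    """Run the network on every frame of a clip independently."""
    cap = cv2.VideoCapture(video_path)
    volumes = []
    while True:
        ok, frame = cap.read()
        if not ok:                  # end of video
            break
        # No state is carried between frames: no tracking and no
        # initialisation from the previous reconstruction.
        volumes.append(vrn_guided(frame))
    cap.release()
    return volumes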
[Figure 3: cumulative error curve plotting % of images (0% to 100%) against NME normalised by outer interocular distance (0 to 0.1), for VRN - Guided (ours), VRN - Multitask (ours), VRN (ours), 3DDFA and EOS λ = 5000.]

Figure 3: NME-based performance on our large pose renderings of the Florence dataset, where ICP has been used to remove the rigid transformation. The proposed Volumetric Regression Networks are compared against EOS and 3DDFA.

Figure 4: Some failure cases on AFLW2000-3D from our VRN - Guided network. In general, these images are difficult poses not seen during training.
3. Additional qualitative results

   This section provides additional visual results and comparisons. Failure cases are shown in Fig. 4. These are mostly unusual poses which cannot be found in the training set, or are not covered by the augmentation described in Section 3.4 of our paper. In Fig. 5 we show a visual comparison between VRN and VRN - Guided; the differences are quite minor. Finally, in Fig. 6 we show some typical examples from our renderings of BU-4DFE [3] and Florence [1], taken from their respective testing sets.

Figure 5: A visual comparison between VRN and VRN - Guided. The main difference is that the projection of the volume has a better fit around the shape of the face.

Figure 6: Examples of rendered images from (a) BU4DFE (containing large poses and expressions), and (b) Florence (containing large poses) datasets.
References
[1] A. D. Bagdanov, I. Masi, and A. Del Bimbo. The Florence 2D/3D hybrid face dataset. In Proc. of ACM Multimedia Int'l Workshop on Multimedia Access to 3D Human Objects (MA3HO'11). ACM Press, December 2011.
[2] P. Huber, G. Hu, R. Tena, P. Mortazavian, W. P. Koppen, W. Christmas, M. Rätsch, and J. Kittler. A multiresolution 3D morphable face model and fitting framework. In VISAPP, 2016.
[3] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3D dynamic facial expression database. In 8th IEEE International Conference on Automatic Face & Gesture Recognition (FG'08), pages 1–6. IEEE, 2008.
[4] S. Zafeiriou, G. Tzimiropoulos, and M. Pantic. The 300 videos in the wild (300-VW) facial landmark tracking in-the-wild challenge. In ICCV Workshop, 2015.
[5] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In CVPR, 2016.