Figure 2: Geometrical interpretation of the feature space (without normalization) optimized by ArcFace and MagFace. (a) Two-class distributions optimized by ArcFace, where $w$ and $w'$ are the class centers and their decision boundaries $B$ and $B'$ are separated by the additive margin $m$. Circles 1, 2, 3 represent three types of samples of class $w$ with descending qualities. (b) MagFace introduces $m(a_i)$, which dynamically adjusts boundaries based on feature magnitudes and leads to a new feasible region. (c) Effects of $g(a_i)$ and $m(a_i)$. (d) Final feature distributions of our MagFace. Best viewed in color.
work in an unsupervised manner like ours, they require additional computational costs or network blocks, which complicates their usage in conventional face systems.

2.3. Face Clustering

Face clustering exploits unlabeled data by grouping them into pseudo classes. Traditional clustering methods usually work in an unsupervised manner, such as K-means [21], DBSCAN [11] and hierarchical clustering. Several supervised clustering methods based on graph convolutional networks (GCNs) have been proposed recently. For example, L-GCN [43] performs reasoning and infers the likelihood of linkage between pairs in sub-graphs. Yang et al. [46] design two graph convolutional networks, named GCN-V and GCN-E, to estimate the confidence of vertices and the connectivity of edges, respectively. Instead of developing clustering methods, we aim at improving the feature distribution structure for clustering.

3. Methodology

In this section, we first review the definition of ArcFace [9], one of the most popular losses used in face recognition. Based on the analysis of ArcFace, we then derive the objective and prove the key properties of MagFace. In the end, we compare softmax and ArcFace with MagFace from the perspective of feature magnitude.

3.1. ArcFace Revisited

Training loss plays an important role in face representation learning. Among the various choices (see [10] for a recent survey), ArcFace [9] is perhaps the most widely adopted one in both academic and industrial applications due to its ease of implementation and state-of-the-art performance on a number of benchmarks. Suppose that we are given a training batch of $N$ face samples $\{f_i, y_i\}_{i=1}^{N}$ of $n$ identities, where $f_i \in \mathbb{R}^d$ denotes the $d$-dimensional embedding computed from the last fully connected layer of the neural network and $y_i \in \{1, \cdots, n\}$ is its associated class label. ArcFace and other variants improve the conventional softmax loss by optimizing the feature embedding on a hypersphere manifold where the learned face representation is more discriminative. By defining the angle $\theta_j$ between $f_i$ and the $j$-th class center $w_j \in \mathbb{R}^d$ via $w_j^T f_i = \|w_j\| \|f_i\| \cos\theta_j$, the objective of ArcFace [9] can be formulated as

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos\theta_j}}, \qquad (1)$$

where $m > 0$ denotes the additive angular margin and $s$ is the scaling parameter.

Despite its superior performance in enforcing intra-class compactness and inter-class discrepancy, the angular margin penalty $m$ used by ArcFace is quality-agnostic, and the resulting structure of the within-class distribution can be arbitrary in unconstrained scenarios. For example, consider the scenario illustrated in Fig. 2a, where we have face images of the same class at three levels of quality indicated by the circle sizes: the larger the radius, the more uncertain the feature representation and the more difficult the face is to recognize. Because ArcFace employs a uniform margin $m$, each image in one class shares the same decision boundary with respect to the neighboring class, i.e., $B: \cos(\theta + m) = \cos(\theta')$. The three types of samples can therefore stay at arbitrary locations in the feasible region (shaded area in Fig. 2a) without any penalization by the angular margin. This leads to an unstable within-class distribution, e.g., the high-quality face (type 1) may stay along the boundary $B$ while the low-quality ones (types 2 and 3) are closer to the center $w$. This instability can hurt performance on in-the-wild recognition as well as other facial applications such as face clustering. Moreover, hard and noisy samples are over-weighted, as they struggle to stay in the feasible area and the model may overfit to them.
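To make Eq. (1) concrete, the following is a minimal NumPy sketch of the ArcFace objective for one batch. The array names (`feats`, `labels`, `W`) and the hyper-parameter values ($s = 64$, $m = 0.5$, values commonly used for ArcFace) are assumptions of this sketch, not code released with the paper.

```python
import numpy as np

def arcface_loss(feats, labels, W, s=64.0, m=0.5):
    """Minimal sketch of the ArcFace objective in Eq. (1).

    feats:  (N, d) embeddings f_i from the backbone
    labels: (N,)   class indices y_i in [0, n)
    W:      (d, n) class centers w_j, one column per identity
    """
    # cos(theta_j) between each normalized embedding and each normalized center
    f_norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w_norm = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos_theta = np.clip(f_norm @ w_norm, -1.0, 1.0)            # (N, n)

    # add the additive angular margin m only to the target logit
    idx = np.arange(len(labels))
    theta_y = np.arccos(cos_theta[idx, labels])
    logits = s * cos_theta
    logits[idx, labels] = s * np.cos(theta_y + m)

    # cross-entropy over the scaled logits
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()
```

In actual training code the margin is applied inside the final classification layer; at test time only the backbone embedding $f_i$ is kept.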
Table 2: Verification accuracy (%) on difficult benchmarks. “*” indicates the result quoted from the original paper.

… 1e-5 and 1e-4 as shown in Tab. 2.

Our implemented ArcFace is on par with the original paper, e.g., our TARs at FAR=1e-4 differ from the authors' by −0.11% and +0.14% on IJB-B and IJB-C respectively. Compared to the baselines, our MagFace remains on top at all FAR criteria except for FAR=1e-6 on IJB-B, where the TAR is very sensitive to noise because the number of false positives is tiny. Compared to CosFace, MagFace gains 0.50%, 0.63%, 0.32% on IJB-B at TAR@FAR=1e-6, 1e-5, 1e-4 and 1.30%, 0.99%, 0.25% on IJB-C. Compared to ArcFace, the improvements are 2.23%, 1.38%, 0.24% on IJB-B and 3.61%, 0.98%, 0.07% on IJB-C respectively. This result demonstrates the superiority of MagFace on more challenging benchmarks. It is worth mentioning that when multiple images exist for one identity, the average embedding can be further improved by aggregating features weighted by their magnitudes. For instance, MagFace+ outperforms MagFace by 1.41%/0.98% at FAR=1e-6, 0.48%/0.41% at FAR=1e-5 and 0.18%/0.16% at FAR=1e-4.

4.2. Face Quality Assessment

In this part, we investigate the qualitative and quantitative performance of the pre-trained MagFace model mentioned in Tab. 2 for quality assessment.

Visualization of the mean face. We first sample 100k images from the IJB-C database and divide them into 8 groups based on feature magnitudes. We visualize the mean face of each group in Fig. 4. It can be seen that as the magnitude increases, the corresponding mean face reveals more details. This is because high-quality faces are inclined to be more frontal and distinctive. This implies that the magnitude of the MagFace feature is a good quality indicator.

Figure 5: Distributions of magnitudes on different datasets.

Sample distribution of datasets. Fig. 5 plots the sample histograms of different benchmarks with respect to MagFace magnitudes. We observe that LFW is the least noisy one, where most samples are of large magnitudes. Due to the larger age variation, the distribution of AgeDB-30 is shifted slightly to the left compared to LFW. For CFP-FP, there are two peaks at magnitudes around 28 and 34, corresponding to profile and frontal faces, respectively. Given the large variations in face quality, we can conclude that IJB-C is much more challenging than the other benchmarks. For images with magnitudes $a \approx 15$ (more examples can be found in the supplementary material), there are no faces or only very noisy faces to observe. When feature magnitudes increase from 20 to 40, there is a clear trend that the face changes from profile, blurred and occluded, to more frontal and distinctive. Overall, this figure convinces us that MagFace is an effective tool to rank face images according to their qualities.
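As a rough illustration of how the per-dataset statistics in Fig. 5 can be gathered, the sketch below computes feature magnitudes from un-normalized embeddings, bins them into a histogram, and ranks images by predicted quality. The array `embeddings` and the bin range are placeholders assumed for this sketch.

```python
import numpy as np

def magnitude_histogram(embeddings, bins=np.arange(0, 51)):
    """Histogram of MagFace feature magnitudes for one dataset.

    embeddings: (N, d) un-normalized features from the MagFace backbone;
                the L2 norm of each row is the magnitude a_i used as quality.
    """
    mags = np.linalg.norm(embeddings, axis=1)
    hist, edges = np.histogram(mags, bins=bins)
    return mags, hist, edges

# Example: rank images from highest to lowest predicted quality.
embeddings = np.random.randn(1000, 512).astype(np.float32)  # dummy stand-in
mags, hist, _ = magnitude_histogram(embeddings)
ranking = np.argsort(-mags)
```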
[Figure 6 plots: panels (a) LFW - ArcFace, (b) LFW - MagFace, (c) CFP-FP - ArcFace, (d) CFP-FP - MagFace, (e) AgeDB-30 - ArcFace, (f) AgeDB-30 - MagFace; each panel plots FNMR against the ratio of unconsidered images (%) for MagFace, SER-FIQ, FaceQNet, DUL, Brisque, Niqe and Piqe.]
Figure 6: Face verification performance for the predicted face quality values with two evaluation models (ArcFace and MagFace). The
curves show the effectiveness of rejecting low-quality face images in terms of false non-match rate (FNMR). Best viewed in color.
Baselines. We choose six baselines of three types for quantitative quality evaluation. Brisque [31], Niqe [23] and Piqe [37] are image-based quality metrics. FaceQNet [15] and SER-FIQ [36] are face-based ones. For FaceQNet, we adopt the models released by the authors. For SER-FIQ, we use the “same model” version, which yields the best performance in the paper. Following the authors' setting, we set $m = 100$ to forward each image 100 times with drop-out active in inference. As a related work, we also re-implement the recent DUL [7] method, which estimates uncertainty along with the face feature.

Evaluation metric. Following previous work [12, 36, 4], we evaluate quality assessment on LFW/CFP-FP/AgeDB via error-versus-reject curves, where images with the lowest predicted qualities are left unconsidered and error rates are calculated on the remaining images. An error-versus-reject curve indicates good quality estimation when the verification error decreases consistently as the ratio of unconsidered images increases. To compute the features for verification, we adopt the ArcFace* as well as our MagFace models in Tab. 2.
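To make the protocol concrete, here is a minimal sketch of one point on an error-versus-reject curve. It assumes precomputed genuine/impostor comparison scores and a per-pair quality score, and it fixes the FMR threshold on the impostor scores of the full set; this is an illustration of the metric, not the exact evaluation code of [12, 36, 4].

```python
import numpy as np

def fnmr_after_reject(genuine, impostor, pair_quality, reject_ratio, fmr=1e-3):
    """FNMR at a fixed FMR threshold after dropping the lowest-quality pairs.

    genuine, impostor: 1-D arrays of comparison scores (higher = more similar)
    pair_quality:      quality score per genuine pair (e.g. min of the two
                       image qualities); lowest-quality pairs are rejected
    reject_ratio:      fraction of genuine pairs left unconsidered
    """
    # Threshold chosen so that the false match rate on impostor pairs equals fmr.
    thr = np.quantile(impostor, 1.0 - fmr)

    # Keep only the highest-quality genuine comparisons.
    order = np.argsort(pair_quality)
    n_drop = int(len(genuine) * reject_ratio)
    kept = order[n_drop:]

    # False non-match rate on the remaining genuine pairs.
    return np.mean(genuine[kept] < thr)

# Sweeping reject_ratio from 0 to 0.9 yields one curve of Fig. 6 for one quality method.
```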
Results on face verification. Fig. 6 shows the error-versus-reject curves of the different quality methods in terms of false non-match rate (FNMR) reported at a false match rate (FMR) threshold of 0.001. Overall, we have two high-level observations. 1) The curves on CFP-FP and AgeDB-30 are much smoother than the ones obtained on LFW. This is because CFP-FP and AgeDB-30 consist of faces with larger variations in pose and age; effectively dropping low-quality faces therefore benefits verification performance more on these two benchmarks. 2) No matter whether the feature is computed by ArcFace (left column) or MagFace (right column), the curves corresponding to the MagFace magnitude are consistently the lowest ones across the different benchmarks. This indicates that the MagFace magnitude as a quality measure generalizes well across datasets as well as face features.

We then analyze the quality performance of each type of method. 1) The image-based quality metrics (Brisque [31], Niqe [23], Piqe [37]) lead to relatively higher errors in most cases, as image quality alone is not suitable for generalized face quality estimation. Face-related factors (such as pose, occlusions, and expressions) and model biases are not covered by these algorithms and might play an important role in face quality assessment. 2) The face-based methods (FaceQNet [15] and SER-FIQ [36]) outperform the other baselines in most cases. In particular, SER-FIQ is more effective than FaceQNet in terms of verification error rates. This is due to the fact that SER-FIQ is built on top of the deployed recognition model, so its prediction is more suitable for the verification task. However, SER-FIQ incurs a quadratic computational cost w.r.t. the number of sub-networks $m$ randomly sampled using dropout. On the contrary, the negligible overhead of computing the magnitude makes the proposed MagFace more practical in many real-time scenarios. Moreover, the training of MagFace does not require explicit labels of face quality, which are not only time-consuming but also error-prone to obtain. 3) At last, the uncertainty-based method (DUL) performs well on CFP-FP but yields more verification errors on AgeDB-30 as the proportion of unconsidered images increases. This may indicate that the Gaussian assumption on the data variance in DUL is over-simplified, such that the model cannot generalize well to different kinds of quality factors.

4.3. Face Clustering

In this section, we conduct experiments on face clustering to further investigate the structure of the feature representations learned by MagFace.

Method        Net      IJB-B-512        IJB-B-1024       IJB-B-1845
                       F      NMI       F      NMI       F      NMI
K-means [21]  ArcFace  66.70  88.83     66.82  89.48     66.93  89.88
              MagFace  66.75  88.86     67.33  89.62     67.06  89.96
AHC [17]      ArcFace  69.72  89.61     70.47  90.54     70.66  90.90
              MagFace  70.24  89.99     70.68  90.67     70.98  91.06
DBSCAN [11]   ArcFace  72.72  90.42     72.50  91.15     73.89  91.96
              MagFace  73.13  90.61     72.68  91.30     74.26  92.13
L-GCN [43]    ArcFace  84.92  93.72     83.50  93.78     80.35  92.30
              MagFace  85.27  93.83     83.79  94.10     81.58  92.79

Table 3: F-score (%) and NMI (%) on clustering benchmarks.

… F-measure [3] are employed as the evaluation metrics.

Results. Tab. 3 summarizes the clustering results. We can observe that with stronger clustering methods, from K-means to L-GCN, the overall clustering performance improves. For any combination of clustering method and protocol, MagFace always achieves better performance than ArcFace in terms of both the F-score and NMI metrics. This consistent superiority demonstrates that the MagFace feature is more suitable for clustering. Notice that we keep the same hyper-parameters for clustering; the improvement of using MagFace must therefore come from its better within-class feature distribution, where the high-quality samples around the class center are more likely to be separated across different classes.
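As an illustration of the kind of pipeline behind Tab. 3, the sketch below clusters embeddings with DBSCAN from scikit-learn and scores the result with NMI against the ground-truth identities. The arrays `embeddings` and `labels`, the L2 normalization step, and the `eps`/`min_samples` values are assumptions of this sketch rather than the authors' protocol (the BCubed F-score is omitted for brevity).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import normalize

def cluster_and_score(embeddings, labels, eps=0.7, min_samples=2):
    """Cluster face embeddings and report NMI against ground-truth identities.

    embeddings: (N, d) features from ArcFace or MagFace
    labels:     (N,) ground-truth identity labels
    """
    feats = normalize(embeddings)  # L2-normalize rows (an assumption of this sketch)
    pred = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean").fit_predict(feats)
    return normalized_mutual_info_score(labels, pred)

# Comparing the same call on ArcFace vs. MagFace embeddings with identical
# eps/min_samples mirrors the fixed-hyper-parameter comparison reported in Tab. 3.
```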
We further explore the relationship between feature magnitudes and the confidences of being class centers. Following the idea mentioned in [46], the confidence of be-…
… $\theta_{y_i} + m(a_i) \in [0, \pi/2]$ approaches 1 in most cases.

A.1. Requirements for MagFace

… $\lambda_g g'(a_i)$, and considering the constraints $m'(a_i) \in (0, K]$, $g'(u_a) = 0$ and $\lambda_g \geq -\frac{sK}{g'(l_a)}$, the values of the derivatives at $l_a$ and $u_a$ are …

Table 4: Verification accuracy (%) on CFP-FP and IJB-C with different distributions of margins. Backbone network: ResNet50.

… magnitudes. All the faces are sampled from the IJB-C benchmark. It can be seen that faces with magnitudes around 28 are mostly profile faces, while those around 35 are high-quality and frontal faces. This is consistent with the profile/frontal peaks in the CFP-FP benchmark and indicates that faces with similar magnitudes show similar quality patterns across benchmarks. In real applications, we can set a proper threshold on the magnitude and should be able to filter out similarly low-quality faces, even under various scenarios.

Besides being directly used as qualities, our feature magnitudes can also serve as quality labels for faces, which avoids human labelling costs. These labels are more suitable for recognition, and can therefore be used to boost other quality models.
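Following the thresholding idea above, here is a small sketch that discards likely low-quality faces by a magnitude threshold and exports the raw magnitudes as pseudo quality labels; the threshold value and the label format are illustrative assumptions, not part of the paper.

```python
import numpy as np

def filter_and_label(embeddings, image_paths, mag_threshold=20.0):
    """Drop likely low-quality faces and export magnitudes as quality labels.

    embeddings:    (N, d) un-normalized MagFace features
    image_paths:   list of N image identifiers
    mag_threshold: faces with magnitude below this value are discarded
                   (an illustrative choice; tune it per deployment)
    """
    mags = np.linalg.norm(embeddings, axis=1)
    keep = mags >= mag_threshold

    kept_paths = [p for p, k in zip(image_paths, keep) if k]
    # Pseudo quality labels: one "path magnitude" line per image.
    labels = [f"{p} {m:.2f}" for p, m in zip(image_paths, mags)]
    return kept_paths, labels
```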
E. Mag-CosFace

In the main text, MagFace is built upon the ArcFace loss. In this section, we show that MagFace based on the CosFace loss (denoted as Mag-CosFace) can theoretically achieve the same effects. The Mag-CosFace loss for a sample $i$ is

$$L_i = -\log \frac{e^{s(\cos\theta_{y_i} - m(a_i))}}{e^{s(\cos\theta_{y_i} - m(a_i))} + \sum_{j=1, j \neq y_i}^{n} e^{s\cos\theta_j}} + \lambda_g g(a_i). \qquad (10)$$

Let $A(a_i) = s(\cos\theta_{y_i} - m(a_i))$ and $B = \sum_{j=1, j \neq y_i}^{n} e^{s\cos\theta_j}$, and rewrite the loss as

$$L_i = -\log \frac{e^{A(a_i)}}{e^{A(a_i)} + B} + \lambda_g g(a_i). \qquad (11)$$

E.1. Property of Convergence for Mag-CosFace

Property 5. For $a_i \in [l_a, u_a]$, $L_i$ is a strictly convex function of $a_i$.

Proof. The first and second derivatives of $A(a_i)$ are

$$A'(a_i) = -s\, m'(a_i), \qquad A''(a_i) = -s\, m''(a_i). \qquad (12)$$

As $A''(a_i) \leq 0$, the property can be proved following the argument presented before.

Property 6. A unique optimal solution $a_i^*$ exists in $[l_a, u_a]$.

Proof. We only need to prove

$$\frac{\partial L_i}{\partial a_i}(u_a) = \frac{Bs}{e^{A(u_a)} + B}\, m'(u_a) > 0,$$
$$\frac{\partial L_i}{\partial a_i}(l_a) = \frac{Bs}{e^{A(l_a)} + B}\, m'(l_a) + \lambda_g g'(l_a) < sK + \lambda_g g'(l_a) \leq 0. \qquad (13)$$

It then follows that there is a unique optimum.

E.2. Property of Monotonicity for Mag-CosFace

Property 7. With fixed $f_i$ and $W_j$, $j \in \{1, \cdots, n\}$, $j \neq y_i$, the optimal feature magnitude $a_i^*$ is monotonically increasing as the cosine distance to its class center $W_{y_i}$ decreases.

Proof. The first derivative of $L_i$ is

$$\frac{\partial L_i}{\partial a_i} = -\frac{B}{e^{A(a_i)} + B}\, A'(a_i) + \lambda_g g'(a_i) = \frac{Bs\, m'(a_i)}{e^{s(\cos\theta_{y_i} - m(a_i))} + B} + \lambda_g g'(a_i). \qquad (14)$$

For $\theta_{y_i}^1 < \theta_{y_i}^2$, the core is to prove $\frac{\partial L_i(\theta_{y_i}^1)}{\partial a_i} < \frac{\partial L_i(\theta_{y_i}^2)}{\partial a_i}$, which is obvious as $\cos\theta_{y_i}^1 > \cos\theta_{y_i}^2$. The rest of the proof is the same as that for the original MagFace.

Property 8. With other things fixed, the optimal feature magnitude $a_i^*$ is monotonically increasing with a decreasing $B$ (i.e., increasing inter-class distance).
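For reference, the following is a minimal NumPy sketch of the Mag-CosFace loss in Eq. (10). The concrete linear margin $m(a)$ and the regularizer $g(a)$, together with the numeric bounds and weights, are illustrative choices that merely satisfy the constraints used in the proofs above ($m'(a) \in (0, K]$, $g$ convex with $g'(u_a) = 0$); they are assumptions of this sketch, not necessarily the exact functions used in the paper.

```python
import numpy as np

# Illustrative magnitude range, margin range and weights (assumptions of this sketch).
L_A, U_A = 10.0, 110.0     # lower / upper bound of the feature magnitude a_i
L_M, U_M = 0.35, 0.8       # margin values at l_a and u_a
LAMBDA_G, S = 35.0, 64.0   # regularizer weight and scale

def m_of_a(a):
    """Linear, monotonically increasing margin m(a_i)."""
    return (U_M - L_M) / (U_A - L_A) * (a - L_A) + L_M

def g_of_a(a):
    """Convex regularizer with g'(u_a) = 0, rewarding large magnitudes."""
    return 1.0 / a + a / (U_A ** 2)

def mag_cosface_loss(cos_theta, labels, mags):
    """Eq. (10): CosFace-style margin modulated by the feature magnitude.

    cos_theta: (N, n) cosines between embeddings and all class centers
    labels:    (N,)   target class indices y_i
    mags:      (N,)   feature magnitudes a_i, clipped to [l_a, u_a]
    """
    a = np.clip(mags, L_A, U_A)
    idx = np.arange(len(labels))
    logits = S * cos_theta
    logits[idx, labels] = S * (cos_theta[idx, labels] - m_of_a(a))

    logits -= logits.max(axis=1, keepdims=True)                    # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return np.mean(-log_prob[idx, labels] + LAMBDA_G * g_of_a(a))
```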
F. Authors’ Contributions
Shichao Zhao and Zhida Huang contributed similarly to this work. Besides being involved in discussions, Shichao Zhao mainly conducted the experiments on face clustering, and Zhida Huang implemented the baselines as well as the evaluation metrics for the quality experiments.