
MagFace: A Universal Representation for Face Recognition and Quality Assessment

Qiang Meng, Shichao Zhao, Zhida Huang, Feng Zhou
Algorithm Research, Aibee Inc.
{qmeng,sczhao,zdhuang,fzhou}@aibee.com

arXiv:2103.06627v4 [cs.CV] 26 Jul 2021

Abstract

The performance of a face recognition system degrades when the variability of the acquired faces increases. Prior work alleviates this issue by either monitoring face quality in pre-processing or predicting the data uncertainty along with the face feature. This paper proposes MagFace, a category of losses that learn a universal feature embedding whose magnitude can measure the quality of the given face. Under the new loss, it can be proven that the magnitude of the feature embedding monotonically increases if the subject is more likely to be recognized. In addition, MagFace introduces an adaptive mechanism to learn a well-structured within-class feature distribution by pulling easy samples to class centers while pushing hard samples away. This prevents models from overfitting on noisy low-quality samples and improves face recognition in the wild. Extensive experiments conducted on face recognition, quality assessment as well as clustering demonstrate its superiority over state-of-the-arts. The code is available at https://github.com/IrvingMeng/MagFace.

Figure 1: MagFace learns for (a) in-the-wild faces (b) a universal embedding by pulling the easier samples closer to the class center and pushing them away from the origin O. As shown in our experiments and supported by mathematical proof, the magnitude l before normalization increases along with the feature's cosine similarity to its class center, and therefore reveals the quality of each face. The larger the l, the more likely the sample can be recognized.

1. Introduction

Recognizing faces in the wild is difficult mainly due to the large variability exhibited by face images acquired in unconstrained settings. This variability is associated with the image acquisition conditions (such as illumination, background, blurriness, and low resolution), factors of the face (such as pose, occlusion and expression) or biases of the deployed face recognition system [36]. To cope with these challenges, most face analysis systems for unconstrained environments (e.g., surveillance video) consist of three stages: 1) face acquisition, to select from a set of raw images or capture from a video stream the most suitable face image for recognition purposes; 2) feature extraction, to extract a discriminative representation from each face image; 3) facial application, to match the reference image against a given gallery or to cluster faces into groups of the same person.

To acquire the optimal reference image in the first stage, a technique called face quality assessment [4, 26] is often employed on each detected face. Although the ideal quality score should be indicative of face recognition performance, most early work [1, 2] estimates quality based on human-understandable factors such as luminance, distortion and pose angles, which may not directly favor the face feature learning in the second stage. Alternatively, learning-based methods [4, 15] train quality assessment models with artificially or human-labelled quality values. These methods are error-prone, as there lacks a clear definition of quality and humans may not know the best characteristics for the whole system.

To achieve high end-to-end application performance in the second stage, various metric-learning [27, 30] and classification losses [48, 25, 20, 13, 40, 9, 5] emerged in the past few years. These works learn to represent each face image as a deterministic point embedding in the latent space regardless of the variance inherent in faces. In reality, however, low-quality or large-pose images like Fig. 1a widely exist and their facial features are ambiguous or absent.
Given these challenges, a large shift in the embedded points is inevitable, leading to false recognition. For instance, the performance reported by the prior state-of-the-art [29] on IJB-C is much lower than on LFW. Recently, confidence-aware methods [29, 7] propose to represent each face image as a Gaussian distribution in the latent space, where the mean of the distribution estimates the most likely feature values while the variance shows the uncertainty in the feature values. Despite the performance improvement, these methods seek to separate the face feature learning from the data noise modeling. Therefore, additional network blocks are introduced in the architecture to compute the uncertainty level for each image. This complicates the training procedure and adds computational burden at inference. In addition, the uncertainty measure cannot be directly used in conventional metrics for comparing face features.

This paper proposes MagFace to learn a universal and quality-aware face representation. The design of MagFace follows two principles: 1) given face images of the same subject but at different levels of quality (e.g., Fig. 1a), it seeks to learn a within-class distribution where the high-quality ones stay close to the class center while the low-quality ones are distributed around the boundary; 2) it should pose the minimum cost for changing an existing inference architecture to measure face quality along with the computation of the face feature. To achieve these goals, we choose the magnitude, the property independent of the direction of the feature vector, as the indicator for quality assessment. The core objective of MagFace is to not only enlarge the inter-class distance, but also maintain a cone-like within-class structure like Fig. 1b, where ambiguous samples are pushed away from the class centers and pulled towards the origin. This is realized by adaptively down-weighting ambiguous samples during training and rewarding learned feature vectors with large magnitudes in the MagFace loss. To sum up, MagFace improves previous work in two aspects:

1. For the first time, MagFace explores the complete set of two properties associated with the feature vector, direction and magnitude, in the problem of face recognition, while previous works often neglect the importance of the magnitude by normalizing the feature. With an extensive experimental study and solid mathematical proof, we show that the magnitude can reveal the quality of faces and can be bundled with the characteristics of recognition without any quality labels involved.

2. MagFace explicitly distributes features structurally in the angular direction (as shown in Fig. 1b). By dynamically assigning angular margins based on samples' hardness for recognition, MagFace prevents models from overfitting on noisy and low-quality samples and learns well-structured distributions that are more suitable for recognition and clustering purposes.

2. Related Works

2.1. Face Recognition

Recent years have witnessed the breakthrough of deep convolutional face recognition techniques. A number of successful systems, such as DeepFace [35], DeepID [33] and FaceNet [27], have shown impressive performance on face identification and verification. Apart from large-scale training data and deep network architectures, the major advance comes from the evolution of training losses for CNNs. Most early works rely on metric-learning-based losses, including the contrastive loss [8], triplet loss [27], n-pair loss [30], angular loss [41], etc. Suffering from the combinatorial explosion in the number of face triplets, embedding-based methods are usually inefficient to train on large-scale datasets. Therefore, the main body of research in deep face recognition has focused on devising more efficient and effective classification-based losses. Wen et al. [44] develop a center loss that learns a center for each identity to enhance intra-class compactness. L2-softmax [25] and NormFace [39] study the necessity of the normalization operation and apply an L2 normalization constraint on both features and weights. From then on, several angular-margin-based losses, such as SphereFace [20], AM-Softmax [38], SV-AM-Softmax [42], CosFace [40] and ArcFace [9], progressively push the performance on various benchmarks to new levels. More recently, AdaptiveFace [19], AdaCos [49] and FairLoss [18] introduce adaptive margin strategies to automatically tune hyperparameters and generate more effective supervision during training. Compared to our method, all these works tend to suppress the effect of the magnitude in the loss by normalizing the feature vector.

2.2. Face Quality Assessment

Face image quality is an important factor in enabling high-performance face recognition systems [4]. Traditional methods, such as the ISO/IEC 19794-5 standard [1], the ICAO 9303 standard [2], Brisque [31], Niqe [23] and Piqe [37], describe quality from image-based aspects (e.g., distortion, illumination and occlusion) or subject-based measures (e.g., accessories). Learning-based approaches such as FaceQNet [15] and Best-Rowden [4] regress quality with networks trained on human-assessed or similarity-based labels. However, these quality labels are error-prone, as humans may not know the best characteristics for the recognition system and therefore cannot consider all proper factors. Recently, several uncertainty-based methods have been proposed to express face quality via the uncertainty of the features. SER-FIQ [36] forwards an image through a network with dropout several times and measures face quality by the variation of the extracted features. Confidence-aware face recognition methods [29, 7] propose to represent each face image as a Gaussian distribution in the latent space and learn the uncertainty in the feature values. Although these methods work in an unsupervised manner like ours, they require additional computational costs or network blocks, which complicates their usage in conventional face systems.
[Figure 2 diagrams omitted: panels (a)-(d) show the two-class feasible regions, the magnitude-dependent margins m(a), the effects of g(a) and m(a), and the resulting feature distributions.]
Figure 2: Geometrical interpretation of the feature space (without normalization) optimized by ArcFace and MagFace. (a) Two-class distributions optimized by ArcFace, where w and w′ are the class centers and their decision boundaries B and B′ are separated by the additive margin m. Circles 1, 2, 3 represent three types of samples of class w with descending qualities. (b) MagFace introduces m(ai), which dynamically adjusts boundaries based on feature magnitudes and leads to a new feasible region. (c) Effects of g(ai) and m(ai). (d) Final feature distributions of our MagFace. Best viewed in color.

2.3. Face Clustering

Face clustering exploits unlabeled data by grouping them into pseudo classes. Traditional clustering methods usually work in an unsupervised manner, such as K-means [21], DBSCAN [11] and hierarchical clustering. Several supervised clustering methods based on graph convolutional networks (GCNs) have been proposed recently. For example, L-GCN [43] performs reasoning and infers the likelihood of linkage between pairs in sub-graphs. Yang et al. [46] design two graph convolutional networks, named GCN-V and GCN-E, to estimate the confidence of vertices and the connectivity of edges, respectively. Instead of developing clustering methods, we aim at improving the feature distribution structure for clustering.

3. Methodology

In this section, we first review the definition of ArcFace [9], one of the most popular losses used in face recognition. Based on the analysis of ArcFace, we then derive the objective and prove the key properties of MagFace. In the end, we compare softmax and ArcFace with MagFace from the perspective of feature magnitude.

3.1. ArcFace Revisited

The training loss plays an important role in face representation learning. Among the various choices (see [10] for a recent survey), ArcFace [9] is perhaps the most widely adopted one in both academic and industrial applications due to its ease of implementation and state-of-the-art performance on a number of benchmarks. Suppose that we are given a training batch of N face samples {f_i, y_i}_{i=1}^{N} of n identities, where f_i ∈ R^d denotes the d-dimensional embedding computed from the last fully connected layer of the neural network and y_i ∈ {1, · · · , n} is its associated class label. ArcFace and other variants improve the conventional softmax loss by optimizing the feature embedding on a hypersphere manifold where the learned face representation is more discriminative. By defining the angle θ_j between f_i and the j-th class center w_j ∈ R^d via w_j^T f_i = ||w_j|| ||f_i|| cos θ_j, the objective of ArcFace [9] can be formulated as

L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}},    (1)

where m > 0 denotes the additive angular margin and s is the scaling parameter.

Despite its superior performance in enforcing intra-class compactness and inter-class discrepancy, the angular margin penalty m used by ArcFace is quality-agnostic, and the resulting structure of the within-class distribution can be arbitrary in unconstrained scenarios. For example, let us consider the scenario illustrated in Fig. 2a, where we have face images of the same class at three levels of quality indicated by the circle sizes: the larger the radius, the more uncertain the feature representation and the more difficult the face is to recognize. Because ArcFace employs a uniform margin m, each image in one class shares the same decision boundary, i.e., B: cos(θ + m) = cos(θ′), with respect to the neighbor class. The three types of samples can stay at arbitrary locations in the feasible region (shaded area in Fig. 2a) without any penalization by the angular margin. This leads to an unstable within-class distribution, e.g., a high-quality face (type 1) may stay along the boundary B while low-quality ones (types 2 and 3) are closer to the center w. This instability can hurt performance on in-the-wild recognition as well as other facial applications such as face clustering. Moreover, hard and noisy samples are over-weighted, as they struggle to stay in the feasible area and the models may overfit to them.
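To make Eq. (1) concrete, the following is a minimal NumPy sketch of the ArcFace objective; the batch construction, function names and toy dimensions are our own illustration, not the authors' released code.

```python
import numpy as np

def arcface_loss(feats, weights, labels, s=64.0, m=0.5):
    """ArcFace (Eq. 1): additive angular margin on normalized features.

    feats:   (N, d) face embeddings f_i
    weights: (n, d) class centers w_j
    labels:  (N,)   ground-truth identities y_i
    """
    # cos(theta_j) between each normalized feature and each class center
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)                      # (N, n)
    idx = np.arange(len(labels))
    theta_y = np.arccos(cos[idx, labels])
    # the margin m is added to the target angle only
    logits = s * cos
    logits[idx, labels] = s * np.cos(theta_y + m)
    # softmax cross-entropy over the margin-adjusted logits
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()

# toy usage on random data
rng = np.random.default_rng(0)
feats, weights = rng.normal(size=(8, 16)), rng.normal(size=(4, 16))
labels = rng.integers(0, 4, size=8)
print(arcface_loss(feats, weights, labels))
```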
[Figure 3 scatter plots omitted: panels (a) Softmax, (b) ArcFace, (c) MagFace, each ranging from hard (left) to easy (right) samples.]
Figure 3: Visualization of feature magnitudes and difficulties for recognition. Models are trained on MS1M-V2 [14, 9] and 512 samples from the last iteration are used for visualization. Negative losses are used to reveal the hardness for Softmax, while we use the cosine value of θ (the angle between a feature and its class center) for ArcFace and MagFace.

3.2. MagFace

Based on the above analysis, previous cosine-similarity-based face recognition losses lack a more fine-grained constraint beyond a fixed margin m. This leads to an unstable within-class structure, especially in the unconstrained case (e.g., Fig. 2a) where the variability of each subject's faces is large. To address the aforementioned problem, this section proposes MagFace, a novel framework to encode a quality measure into the face representation. Unlike previous work [29, 7] that calls for an additional uncertainty term, we pursue a minimalist design by optimizing over the magnitude a_i = ||f_i|| without normalizing each feature f_i. Our design has two major advantages: 1) we can keep using the cosine-based metric that has been widely adopted by most existing inference systems; 2) by simultaneously enforcing its direction and magnitude, the learned face representation is more robust to the variability of faces in the wild. To our best understanding, this is the first work to unify the feature magnitude as a quality indicator in face recognition.

Before defining the loss, let us first introduce two auxiliary functions related to a_i: the magnitude-aware angular margin m(a_i) and the regularizer g(a_i). The design of m(a_i) follows a natural intuition: high-quality samples x_i should concentrate in a small region around the cluster center w with high certainty. By assuming a positive correlation between magnitude and quality, we thereby penalize x_i more in terms of m(a_i) if its magnitude a_i is large. For a better understanding, Fig. 2b visualizes the margins m(a_i) corresponding to different magnitude values. In contrast to ArcFace (Fig. 2a), the feasible region defined by m(a_i) has a boundary that shrinks towards the class center w as the feature magnitude grows. Consequently, this boundary pulls the low-quality samples (circles 2 and 3 in Fig. 2c) to the origin, where they have a lower risk of being penalized. However, the structure formed solely by m(a_i) is unstable for high-quality samples like circle 1 in Fig. 2c, as they have large freedom to move inside the feasible region. We therefore introduce the regularizer g(a_i), which rewards samples with large magnitudes. By designing g(a_i) as a monotonically decreasing convex function with respect to a_i, each sample is pushed towards the boundary of the feasible region and the high-quality ones (circle 1) are dragged closer to the class center w, as shown in Fig. 2d. In a nutshell, MagFace extends ArcFace (Eq. 1) with a magnitude-aware margin and a regularizer to enforce higher diversity for inter-class samples and higher similarity for intra-class samples by optimizing:

L_{Mag} = \frac{1}{N}\sum_{i=1}^{N} L_i,  where    (2)

L_i = -\log \frac{e^{s\cos(\theta_{y_i}+m(a_i))}}{e^{s\cos(\theta_{y_i}+m(a_i))} + \sum_{j\neq y_i} e^{s\cos\theta_j}} + \lambda_g g(a_i).

The hyper-parameter λ_g trades off between the classification and regularization losses.

The design of MagFace not only follows intuitive motivations, but also yields results with theoretical guarantees. Assuming the magnitude a_i is bounded in [l_a, u_a], m(a_i) is a strictly increasing convex function, g(a_i) is a strictly decreasing convex function and λ_g is large enough, we can prove (see detailed requirements and proofs in the supplementary) that the following two properties of the MagFace loss always hold when optimizing L_i over a_i:

Property of Convergence. For a_i ∈ [l_a, u_a], L_i is a strictly convex function which has a unique optimal solution a*_i.

Property of Monotonicity. The optimal a*_i is monotonically increasing as the cosine-distance to its class center decreases and the cosine-distances to the other classes increase.

The property of convergence guarantees a unique optimal solution for a_i as well as fast convergence. The property of monotonicity states that the feature magnitude reveals the difficulty of recognition and can therefore be treated as a metric for face quality.
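For illustration, here is a minimal NumPy sketch of Eq. (2), instantiated with the linear m(a_i) and hyperbolic g(a_i) specified in Sec. B.2 of the supplementary; the names and default values (l_m = 0.4, u_m = 0.8, λ_g = 35) mirror the settings reported later, but the released implementation at the repository above should be treated as authoritative.

```python
import numpy as np

def magface_loss(feats, weights, labels, s=64.0,
                 l_a=10.0, u_a=110.0, l_m=0.4, u_m=0.8, lambda_g=35.0):
    """MagFace (Eq. 2): magnitude-aware margin m(a_i) plus regularizer g(a_i).

    Unlike ArcFace, the feature magnitude a_i = ||f_i|| is NOT normalized away:
    it drives a per-sample margin and a reward for large magnitudes.
    """
    a = np.clip(np.linalg.norm(feats, axis=1), l_a, u_a)     # magnitudes a_i
    m_a = (u_m - l_m) / (u_a - l_a) * (a - l_a) + l_m        # linear margin m(a_i)
    g_a = 1.0 / a + a / u_a**2                               # hyperbolic g(a_i)

    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)
    idx = np.arange(len(labels))
    theta_y = np.arccos(cos[idx, labels])

    logits = s * cos
    logits[idx, labels] = s * np.cos(theta_y + m_a)          # per-sample margin
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return (-log_prob[idx, labels] + lambda_g * g_a).mean()
```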
3.3. Analysis on Feature Magnitude

To better understand the effect of the MagFace loss, we conduct experiments on the widely used MS1M-V2 [9] dataset and investigate, for the training examples at convergence, the relation between the feature magnitude and their similarity with the class center, as shown in Fig. 3.

Softmax. The classical softmax-based loss underlies the objective of the pioneering work [35, 34] on deep face recognition. Without an explicit constraint on magnitude, the value of the negative loss for each sample is almost independent of its magnitude, as observed in Fig. 3a. As pointed out in [25, 39], softmax tends to create a radial feature distribution, because the softmax loss acts as a soft version of the max operator and scaling the feature magnitude does not affect the assignment of its class. To eliminate this effect, [25, 39] suggest that using normalized features would benefit the task.

ArcFace. ArcFace can be considered a special case of MagFace with m(a_i) = m and g(a_i) = 0. As shown in Fig. 3b, high-quality samples with large similarity cos(θ) to the class center yield large variation in magnitude. This evidence echoes our motivation regarding the unstable structure defined by a fixed angular margin in ArcFace for easy samples. On the other hand, for low-quality samples that are difficult to recognize (cos(θ) is small), the fixed angular margin requires the magnitude to be large enough to fit inside the feasible region (Fig. 2a). Therefore, there is a decreasing lower bound on feature magnitudes w.r.t. the quality of faces, as indicated by the dashed line in Fig. 3b.

MagFace. In contrast to ArcFace, our MagFace optimizes the feature with an adaptive margin and regularization based on its magnitude. Under this loss, it is clear from Fig. 3c that there is a strong correlation between the feature magnitudes and their cosine similarities with the class center. The examples at the upper-right corner are the highest-quality ones. As the magnitude becomes smaller, the examples deviate more from the class center. This distribution strongly supports the fact that the feature magnitude learned by MagFace is a good metric for face quality.

4. Experiments

In this section, we examine the proposed MagFace on three important face tasks: face recognition, quality assessment and face clustering. Sec. C in the supplementary presents an ablation study on the relationship between margin distributions and recognition performance.

4.1. Face Recognition

Datasets. The original MS-Celeb-1M dataset [14] contains about 10 million images of 100k identities. However, it contains a great many noisy face images. Instead, we employ MS1M-V2 [9] (5.8M images, 85k identities) as our training dataset. For evaluation, we adopt LFW [16], CFP-FP [28], AgeDB-30 [24], CALFW [51], CPLFW [50], IJB-B [45] and IJB-C [22] as the benchmarks. All images are aligned to 112×112 following the setting in ArcFace.

Method               LFW     CFP-FP   AgeDB-30   CALFW   CPLFW
Softmax              99.70   98.20    97.72      95.65   92.02
SV-AM-Softmax [42]   99.50   95.10    95.68      94.38   89.48
SphereFace [20]      99.67   96.84    97.05      95.58   91.27
CosFace [40]         99.78   98.26    98.17      96.18   92.18
ArcFace [9]          99.81   98.40    98.05      95.96   92.72
MagFace              99.83   98.46    98.17      96.15   92.87

Table 1: Verification accuracy (%) on easy benchmarks.

Baselines. We re-implement state-of-the-art baselines including Softmax, SV-AM-Softmax [42], SphereFace [20], CosFace [40] and ArcFace [9]. ResNet100 is used as the backbone. We use the recommended hyperparameters for each model, e.g., s = 64, m = 0.5 for ArcFace.

Training. We train models on 8 1080Tis by stochastic gradient descent. The learning rate is initialized to 0.1 and divided by 10 at epochs 10, 18 and 22, and we stop training at the 25th epoch. The weight decay is set to 5e-4 and the momentum is 0.9. We only augment training samples by random horizontal flipping. For MagFace, we fix the lower and upper bounds of the magnitude as l_a = 10, u_a = 110. m(a_i) is chosen to be a linear function and g(a_i) a hyperbola. For the detailed definitions of m(a_i), g(a_i) and λ_g, please refer to Sec. B.2 in the supplementary. In the end, our mean margin as well as other hyperparameters are all consistent with ArcFace.

Test. During testing, the cosine distance is used as the metric for comparing 512-D features. For evaluations on IJB-B/C, one identity can have multiple images. The common way to represent an identity is to sum the normalized features f_i^{norm} = f_i / ||f_i|| of its images and then normalize the aggregated embedding for comparison, i.e., f = \sum_i f_i^{norm} / ||\sum_i f_i^{norm}||. One benefit of MagFace is that we can assign a quality-aware weight ||f_i|| to each normalized feature f_i^{norm}. Therefore, we further evaluate "MagFace+" in Tab. 2 by computing the identity embedding as f_+ = \sum_i f_i / ||\sum_i f_i||, as sketched below.
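The two template-aggregation rules just described can be sketched as follows (a minimal NumPy illustration; function names are ours):

```python
import numpy as np

def template_plain(feats):
    """f = sum_i f_i^norm / ||sum_i f_i^norm||: every image weighted equally."""
    fn = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    agg = fn.sum(axis=0)
    return agg / np.linalg.norm(agg)

def template_magface_plus(feats):
    """f+ = sum_i f_i / ||sum_i f_i||: summing the raw features implicitly
    weights each image by its quality-aware magnitude ||f_i||."""
    agg = feats.sum(axis=0)
    return agg / np.linalg.norm(agg)
```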
[Figure 4 images omitted: eight mean faces for the magnitude ranges (−∞, 24), [24, 26), [26, 28), [28, 30), [30, 32), [32, 34), [34, 36) and [36, ∞), with group means from 22.84 to 36.55 and group sizes from 1,721 to 20,627 faces.]
Figure 4: Visualization of the mean faces of 100k images sampled from the IJB-C dataset. Each mean face corresponds to a group of faces based on the magnitude level of the features learned by MagFace.

Method               IJB-B (TAR@FAR)            IJB-C (TAR@FAR)
                     1e-6    1e-5    1e-4       1e-6    1e-5    1e-4
VGGFace2* [6]        -       67.10   80.00      -       74.70   84.00
CenterFace* [44]     -       -       -          -       78.10   85.30
CircleLoss* [32]     -       -       -          -       89.60   93.95
ArcFace* [9]         -       -       94.20      -       -       95.60
Softmax              46.73   75.17   90.06      64.07   83.68   92.40
SV-AM-Softmax [42]   29.81   69.25   84.79      63.45   80.30   88.34
SphereFace [20]      39.40   73.58   89.19      68.86   83.33   91.77
CosFace [40]         40.41   89.25   94.01      87.96   92.68   95.56
ArcFace [9]          38.68   88.50   94.09      85.65   92.69   95.74
MagFace              40.91   89.88   94.33      89.26   93.67   95.81
MagFace+             42.32   90.36   94.51      90.24   94.08   95.97

Table 2: Verification accuracy (%) on difficult benchmarks. “*” indicates results quoted from the original papers.

[Figure 5 histograms omitted.]
Figure 5: Distributions of magnitudes on different datasets.
Results on LFW, CFP-FP, AgeDB-30, CALFW and CPLFW. We directly use the aligned images and protocols adopted by ArcFace [9] and present our results in Tab. 1. We note that performance is almost saturated. Compared to CosFace, which is the second best baseline, ArcFace achieves 0.03%, 0.14% and 0.54% improvements on LFW, CFP-FP and CPLFW, while dropping 0.12% and 0.22% on AgeDB-30 and CALFW. MagFace obtains the overall best results and surpasses ArcFace by 0.02%, 0.06%, 0.12%, 0.19% and 0.15% on the five benchmarks respectively.

Results on IJB-B/IJB-C. The IJB-B dataset contains 1,845 subjects with 21.8K still images and 55K frames from 7,011 videos. As the extension of IJB-B, the IJB-C dataset covers about 3,500 identities with a total of 31,334 images and 117,542 unconstrained video frames. In the 1:1 verification, the numbers of positive/negative matches are 10k/8M in IJB-B and 19k/15M in IJB-C. We report the TARs at FAR=1e-6, 1e-5 and 1e-4 in Tab. 2. Our implemented ArcFace is on par with the original paper, e.g., our TARs at FAR=1e-4 differ from the authors' by −0.11% and +0.14% on IJB-B and IJB-C respectively. Compared to the baselines, our MagFace remains on top at all FAR criteria except for FAR=1e-6 on IJB-B, where the TAR is very sensitive to noise because the number of false positives is tiny. Compared to CosFace, MagFace gains 0.50%, 0.63% and 0.32% on IJB-B at TAR@FAR=1e-6, 1e-5 and 1e-4, and 1.30%, 0.99% and 0.25% on IJB-C. Compared to ArcFace, the improvements are 2.23%, 1.38% and 0.24% on IJB-B and 3.61%, 0.98% and 0.07% on IJB-C respectively. This result demonstrates the superiority of MagFace on more challenging benchmarks. It is worth mentioning that when multiple images exist for one identity, the average embedding can be further improved by aggregating features weighted by magnitudes. For instance, MagFace+ outperforms MagFace by 1.41%/0.98% at FAR=1e-6, 0.48%/0.41% at FAR=1e-5 and 0.18%/0.16% at FAR=1e-4 on IJB-B/IJB-C.

4.2. Face Quality Assessment

In this part, we investigate the qualitative and quantitative performance of the pre-trained MagFace model mentioned in Tab. 2 for quality assessment.

Visualization of the mean face. We first sample 100k images from the IJB-C database and divide them into 8 groups based on feature magnitudes. We visualize the mean face of each group in Fig. 4. It can be seen that as the magnitude increases, the corresponding mean face reveals more details. This is because high-quality faces are inclined to be more frontal and distinctive. This implies that the magnitude of the MagFace feature is a good quality indicator.

Sample distribution of datasets. Fig. 5 plots the sample histograms of different benchmarks with respect to MagFace magnitudes. We observe that LFW is the least noisy one, where most samples are of large magnitude. Due to the larger age variation, the distribution of AgeDB-30 shifts slightly left compared to LFW. For CFP-FP, there are two peaks at magnitudes around 28 and 34, corresponding to profile and frontal faces respectively. Given the large variations in face quality, we can conclude that IJB-C is much more challenging than the other benchmarks. For images (more examples can be found in the supplementary) with magnitudes around 15, there are no faces or only very noisy faces to observe. When feature magnitudes increase from 20 to 40, there is a clear trend that the faces change from profile, blurred and occluded to more frontal and distinctive. Overall, this figure convinces us that MagFace is an effective tool for ranking face images according to their quality.
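Note that extracting this quality score requires no extra network pass: it is simply the norm of the embedding before L2 normalization. A minimal sketch (ours; the stand-in features and threshold are illustrative):

```python
import numpy as np

def magface_quality(feats):
    """Quality per image: the magnitude a_i = ||f_i|| of the embedding
    BEFORE L2 normalization (the same features used for recognition)."""
    return np.linalg.norm(feats, axis=1)

feats = np.random.default_rng(0).normal(size=(1000, 512))  # stand-in embeddings
q = magface_quality(feats)
ranked = np.argsort(-q)               # images from highest to lowest quality
kept = np.flatnonzero(q > 20.0)       # or filter by a magnitude threshold
```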
[Figure 6 plots omitted: six error-versus-reject panels showing FNMR vs. the ratio of unconsidered images (0–90%) for MagFace, SER-FIQ, FaceQNet, DUL, Brisque, Niqe and Piqe; panels (a) LFW - ArcFace, (b) LFW - MagFace, (c) CFP-FP - ArcFace, (d) CFP-FP - MagFace, (e) AgeDB-30 - ArcFace, (f) AgeDB-30 - MagFace.]
Figure 6: Face verification performance for the predicted face quality values with two evaluation models (ArcFace and MagFace). The curves show the effectiveness of rejecting low-quality face images in terms of the false non-match rate (FNMR). Best viewed in color.

Baselines. We choose six baselines of three types for quantitative quality evaluation. Brisque [31], Niqe [23] and Piqe [37] are image-based quality metrics. FaceQNet [15] and SER-FIQ [36] are face-based ones. For FaceQNet, we adopt the models released by the authors. For SER-FIQ, we use the "same model" version, which yields the best performance in the paper. Following the authors' setting, we set m = 100, forwarding each image 100 times with dropout active at inference. As a related work, we also re-implement the recent DUL [7] method, which can estimate uncertainty along with the face feature.

Evaluation metric. Following previous work [12, 36, 4], we evaluate quality assessment on LFW/CFP-FP/AgeDB-30 via error-versus-reject curves, where the images with the lowest predicted quality are discarded and error rates are calculated on the remaining images. An error-versus-reject curve indicates good quality estimation when the verification error decreases consistently as the ratio of unconsidered images increases. To compute the features for verification, we adopt the ArcFace* as well as our MagFace models from Tab. 2. A sketch of this protocol is given below.
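The following is a minimal sketch of how such an error-versus-reject curve can be computed; the thresholding logic is our simplification of the protocol in [12], and all names are illustrative.

```python
import numpy as np

def fnmr_at_fmr(scores_genuine, scores_impostor, fmr=1e-3):
    """FNMR at the score threshold that yields the requested FMR."""
    thr = np.quantile(scores_impostor, 1.0 - fmr)   # FMR = P(impostor >= thr)
    return np.mean(scores_genuine < thr)

def error_versus_reject(pairs, sims, genuine, quality, ratios):
    """Drop the lowest-quality images, re-evaluate FNMR on surviving pairs.

    pairs:   (P, 2) image indices per verification pair
    sims:    (P,)   cosine similarities
    genuine: (P,)   bool, same identity or not
    quality: (Q,)   predicted quality per image
    ratios:  fractions of images to reject, e.g. np.linspace(0.0, 0.9, 19)
    """
    fnmrs = []
    for r in ratios:
        keep = quality >= np.quantile(quality, r)   # reject bottom r of images
        ok = keep[pairs[:, 0]] & keep[pairs[:, 1]]  # both images must survive
        fnmrs.append(fnmr_at_fmr(sims[ok & genuine], sims[ok & ~genuine]))
    return fnmrs
```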
Results on face verification. Fig. 6 shows the error-versus-reject curves of different quality methods in terms of the false non-match rate (FNMR) reported at a false match rate (FMR) threshold of 0.001. Overall, we have two high-level observations. 1) The curves on CFP-FP and AgeDB-30 are much smoother than those obtained on LFW. This is because CFP-FP and AgeDB-30 consist of faces with larger variations in pose and age; effectively dropping low-quality faces therefore benefits verification performance more on these two benchmarks. 2) Whether the features are computed by ArcFace (left column) or MagFace (right column), the curves corresponding to MagFace magnitude are consistently the lowest across the different benchmarks. This indicates that the performance of MagFace magnitude as a quality measure generalizes well across datasets as well as face features.

We then analyze the quality performance of each type of method. 1) The image-based quality metrics (Brisque [31], Niqe [23], Piqe [37]) lead to relatively higher errors in most cases, as image quality alone is not suitable for generalized face quality estimation. Factors of the face (such as pose, occlusions and expressions) and model biases are not covered by these algorithms, yet might play an important role in face quality assessment. 2) The face-based methods (FaceQNet [15] and SER-FIQ [36]) outperform the other baselines in most cases. In particular, SER-FIQ is more effective than FaceQNet in terms of verification error rates. This is due to the fact that SER-FIQ is built on top of the deployed recognition model, so its prediction is more suitable for the verification task. However, SER-FIQ incurs a quadratic computational cost w.r.t. the number of sub-networks m randomly sampled using dropout. In contrast, the negligible overhead of computing the magnitude makes the proposed MagFace more practical in many real-time scenarios. Moreover, the training of MagFace does not require explicit labeling of face quality, which is not only time-consuming but also error-prone to obtain. 3) At last, the uncertainty-based method (DUL) performs well on CFP-FP but yields more verification errors on AgeDB-30 as the proportion of unconsidered images increases. This may indicate that the Gaussian assumption on data variance in DUL is over-simplified, such that the model cannot generalize well to different kinds of quality factors.

4.3. Face Clustering

In this section, we conduct experiments on face clustering to further investigate the structure of the feature representations learned by MagFace.

Baselines. We compare the performance of MagFace and ArcFace by integrating their features with various clustering methods. For fair comparison, we constrain the hyper-parameters of the two models to be consistent (e.g., s = 64, mean margin 0.5) during training. Four clustering methods are used in the evaluation: K-means [21], AHC [17], DBSCAN [11] and L-GCN [43]. For the non-deterministic algorithms (K-means and AHC), we report the average results of 10 runs. For L-GCN, we train the model on CASIA-WebFace [47] (0.5M images of 10k individuals) and follow the recommended settings in the paper [43]. A minimal sketch of such a run is shown below.
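This is only meant to illustrate how MagFace features plug into an off-the-shelf clusterer; the feature file name and the DBSCAN hyperparameters are ours, not the exact settings used in the experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# MagFace embeddings for the protocol images; the file name is hypothetical
feats = np.load("ijbb_magface_feats.npy")                  # (N, 512)
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
# cosine distance matches the verification metric; eps/min_samples illustrative
pred = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(feats)
```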
Benchmarks. We adopt the IJB-B [45] dataset as the benchmark, as it contains a clustering protocol of seven sub-tasks varying in the number of ground-truth identities. Following [43], we evaluate on the three largest sub-tasks, where the numbers of identities are 512, 1,024 and 1,845 and the numbers of samples are 18,171, 36,575 and 68,195, respectively. Normalized mutual information (NMI) and BCubed F-measure [3] are employed as the evaluation metrics.

Method         Net       IJB-B-512        IJB-B-1024       IJB-B-1845
                         F      NMI       F      NMI       F      NMI
K-means [21]   ArcFace   66.70  88.83     66.82  89.48     66.93  89.88
               MagFace   66.75  88.86     67.33  89.62     67.06  89.96
AHC [17]       ArcFace   69.72  89.61     70.47  90.54     70.66  90.90
               MagFace   70.24  89.99     70.68  90.67     70.98  91.06
DBSCAN [11]    ArcFace   72.72  90.42     72.50  91.15     73.89  91.96
               MagFace   73.13  90.61     72.68  91.30     74.26  92.13
L-GCN [43]     ArcFace   84.92  93.72     83.50  93.78     80.35  92.30
               MagFace   85.27  93.83     83.79  94.10     81.58  92.79

Table 3: F-score (%) and NMI (%) on clustering benchmarks.

Results. Tab. 3 summarizes the clustering results. We can observe that with stronger clustering methods, from K-means to L-GCN, the overall clustering performance improves. For any combination of clustering method and protocol, MagFace always achieves better performance than ArcFace in terms of both the F-score and NMI metrics. This consistent superiority demonstrates that the MagFace feature is more suitable for clustering. Notice that we keep the same hyper-parameters for clustering; the improvement from using MagFace must therefore come from its better within-class feature distribution, where the high-quality samples around the class centers are more likely to be separated across different classes.

[Figure 7 scatter plot omitted: feature magnitude (y-axis, roughly 20–40) vs. confidence of being a class center (x-axis, 0.2–0.8).]
Figure 7: Visualization of the MagFace magnitudes of 500 samples from IJB-B-1845 w.r.t. their confidences of being class centers.

We further explore the relationship between feature magnitudes and the confidence of being a class center. Following the idea in [46], the confidence of being a class center is estimated for each sample based on its neighborhood structure defined by the face features. Samples with dense and pure local connections have high confidence, while those with sparse connections or residing on the boundary between several clusters have low confidence. From Fig. 7, it is easy to observe that the MagFace magnitude is positively correlated with the confidence of being a class center on the IJB-B-1845 benchmark. This result reflects that the MagFace feature exhibits the expected within-class structure, where high-quality samples are distributed around the class center while low-quality ones are far away from it.

5. Conclusion

In this paper, we propose MagFace to learn unified features for face recognition and quality assessment. By pushing ambiguous samples away from class centers, MagFace improves the within-class feature distribution over previous margin-based work for face recognition. Adequate theoretical and experimental results show that MagFace can simultaneously assess the quality of the input face image. As a general framework, MagFace can potentially be extended to benefit other classification tasks, such as fine-grained object recognition and person re-identification. Moreover, the proposed principle of exploring feature magnitude paves the way for estimating the quality of other objects, e.g., person bodies in re-identification or action snippets in activity classification.
References

[1] Information technology – Biometric data interchange formats – Part 5: Face image data. Standard, International Organization for Standardization, Nov. 2011.
[2] Machine Readable Travel Documents. Standard, International Civil Aviation Organization, 2015.
[3] Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.
[4] Lacey Best-Rowden and Anil K Jain. Learning face image quality from human assessments. IEEE Trans. Information Forensics and Security, 13(12):3064–3077, 2018.
[5] Dong Cao, Xiangyu Zhu, Xingyu Huang, Jianzhu Guo, and Zhen Lei. Domain balancing: Face recognition on long-tailed domains. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5671–5679, 2020.
[6] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In IEEE Int'l Conf. Automatic Face & Gesture Recognition (FG), pages 67–74. IEEE, 2018.
[7] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. Data uncertainty learning in face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5710–5719, 2020.
[8] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 539–546. IEEE, 2005.
[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[10] Hang Du, Hailin Shi, Dan Zeng, and Tao Mei. The elements of end-to-end deep face recognition: A survey of recent advances. arXiv preprint arXiv:2009.13290, 2020.
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[12] Patrick Grother and Elham Tabassi. Performance of biometric quality measures. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(4):531–543, 2007.
[13] Jianzhu Guo, Xiangyu Zhu, Chenxu Zhao, Dong Cao, Zhen Lei, and Stan Z Li. Learning meta face recognition in unseen domains. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6163–6172, 2020.
[14] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[15] Javier Hernandez-Ortega, Javier Galbally, Julian Fierrez, Rudolf Haraksim, and Laurent Beslay. FaceQnet: Quality assessment for face recognition based on deep learning. In International Conference on Biometrics, 2019.
[16] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[17] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[18] Bingyu Liu, Weihong Deng, Yaoyao Zhong, Mei Wang, Jiani Hu, Xunqiang Tao, and Yaohai Huang. Fair loss: Margin-aware reinforcement learning for deep face recognition. In International Conference on Computer Vision, pages 10052–10061, 2019.
[19] Hao Liu, Xiangyu Zhu, Zhen Lei, and Stan Z Li. AdaptiveFace: Adaptive margin and sampling for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11947–11956, 2019.
[20] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.
[21] Stuart Lloyd. Least squares quantization in PCM. IEEE Trans. Information Theory, 28(2):129–137, 1982.
[22] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, et al. IARPA Janus Benchmark-C: Face dataset and protocol. In International Conference on Biometrics, pages 158–165. IEEE, 2018.
[23] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012.
[24] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. AgeDB: The first manually collected, in-the-wild age database. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 51–59, 2017.
[25] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[26] Torsten Schlett, Christian Rathgeb, Olaf Henniger, Javier Galbally, Julian Fierrez, and Christoph Busch. Face image quality assessment: A literature survey. arXiv preprint arXiv:2009.01103, 2020.
[27] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[28] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chellappa, and David W Jacobs. Frontal to profile face verification in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[29] Yichun Shi and Anil K Jain. Probabilistic face embeddings. In International Conference on Computer Vision, pages 6902–6911, 2019.
[30] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Annual Conference on Neural Information Processing Systems, pages 1857–1865, 2016.
[31] Tao Sun, Xingjie Zhu, Jeng-Shyang Pan, Jiajun Wen, and Fanqiang Meng. No-reference image quality assessment in spatial domain. In Genetic and Evolutionary Computing, pages 381–388. Springer, 2015.
[32] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6398–6407, 2020.
[33] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[34] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
[35] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
[36] Philipp Terhorst, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. SER-FIQ: Unsupervised estimation of face image quality based on stochastic embedding robustness. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[37] N Venkatanath, D Praneeth, Maruthi Chandrasekhar Bh, Sumohana S Channappayya, and Swarup S Medasani. Blind image quality evaluation using perception based features. In National Conference on Communications (NCC), pages 1–6. IEEE, 2015.
[38] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018.
[39] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In ACM International Conference on Multimedia, pages 1041–1049, 2017.
[40] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[41] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In International Conference on Computer Vision, pages 2593–2601, 2017.
[42] Xiaobo Wang, Shuo Wang, Shifeng Zhang, Tianyu Fu, Hailin Shi, and Tao Mei. Support vector guided softmax loss for face recognition. arXiv preprint arXiv:1812.11317, 2018.
[43] Zhongdao Wang, Liang Zheng, Yali Li, and Shengjin Wang. Linkage based face clustering via graph convolution network. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[44] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
[45] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, Jocelyn Adams, Tim Miller, Nathan Kalka, Anil K Jain, James A Duncan, Kristen Allen, et al. IARPA Janus Benchmark-B face dataset. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 90–98, 2017.
[46] Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, and Dahua Lin. Learning to cluster faces via confidence and connectivity estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[47] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[48] Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Feature incay for representation regularization. arXiv preprint arXiv:1705.10284, 2017.
[49] Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. AdaCos: Adaptively scaling cosine logits for effectively learning deep face representations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10823–10832, 2019.
[50] Tianyue Zheng and Weihong Deng. Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep. 5, 2018.
[51] Tianyue Zheng, Weihong Deng, and Jiani Hu. Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197, 2017.
A. Proofs for MagFace

Recall the MagFace loss for a sample i:

L_i = -\log \frac{e^{s\cos(\theta_{y_i}+m(a_i))}}{e^{s\cos(\theta_{y_i}+m(a_i))} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} + \lambda_g g(a_i)    (3)

Let A(a_i) = s\cos(\theta_{y_i}+m(a_i)) and B = \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}, and rewrite the loss as

L_i = -\log \frac{e^{A(a_i)}}{e^{A(a_i)}+B} + \lambda_g g(a_i)    (4)

We first introduce and prove Lemma 1.

Lemma 1. Assume that f_i is top-k correctly classified and m(a_i) ∈ [0, π/2]. If the number of identities n is much larger than k (i.e., n ≫ k), the probability of θ_{y_i} + m(a_i) ∈ [0, π/2] approaches 1.

Proof. Denote the angle between the feature f_i and class center W_j, j ∈ {1, · · · , n}, as θ_j. Assuming the distribution of θ_j is uniform, it is easy to prove that P(θ_j + m(a_i) ∈ [0, π/2]) = (π/2 − m(a_i))/π. Let p = (π/2 − m(a_i))/π. If f_i is top-k correctly classified, the probability of θ_{y_i} + m(a_i) ∈ [0, π/2] equals the probability that at least k of the angles θ_j satisfy θ_j + m(a_i) ∈ [0, π/2]. Then the probability is

P(θ_{y_i} + m(a_i) ∈ [0, π/2]) = \sum_{t=k}^{n} \binom{n}{t} p^t (1-p)^{n-t} = 1 - \sum_{t=0}^{k-1} \binom{n}{t} p^t (1-p)^{n-t}    (5)

When n is a large integer and n ≫ k, each term \binom{n}{t} p^t (1-p)^{n-t}, t = 0, 1, · · · , k−1, converges to 0. Therefore, the probability of θ_{y_i} + m(a_i) ∈ [0, π/2] approaches 1.

Lemma 1 is fundamental for the following proofs. The number of identities is large in real-world applications (e.g., 85k identities for MS1M-V2 [14, 9]). Therefore, the probability of θ_{y_i} + m(a_i) ∈ [0, π/2] approaches 1 in most cases.

A.1. Requirements for MagFace

In MagFace, m(a_i), g(a_i) and λ_g are required to have the following properties:

1. m(a_i) is an increasing convex function on [l_a, u_a] with m′(a_i) ∈ (0, K], where K is an upper bound;
2. g(a_i) is a strictly convex function with g′(u_a) = 0;
3. λ_g ≥ sK / (−g′(l_a)).

A.2. Proof for Property of Convergence

We prove the property of convergence by showing the strict convexity of the function L_i (Property 1) and the existence of the optimum (Property 2).

Property 1. For a_i ∈ [l_a, u_a], L_i is a strictly convex function of a_i.

Proof. The first and second derivatives of A(a_i) are

A′(a_i) = −s \sin(\theta_{y_i}+m(a_i)) m′(a_i)
A″(a_i) = −s \cos(\theta_{y_i}+m(a_i)) (m′(a_i))^2 − s \sin(\theta_{y_i}+m(a_i)) m″(a_i)    (6)

According to Lemma 1, we have cos(θ_{y_i}+m(a_i)) ≥ 0 and sin(θ_{y_i}+m(a_i)) ≥ 0. Because we define m(a_i) to be convex and g(a_i) to be strictly convex for a_i ∈ [l_a, u_a], m″(a_i) ≥ 0 and g″(a_i) > 0 always hold. Therefore, A″(a_i) ≤ 0.

The first- and second-order derivatives of the loss L_i are

\frac{\partial L_i}{\partial a_i} = -\frac{B}{e^{A(a_i)}+B} A′(a_i) + \lambda_g g′(a_i)

\frac{\partial^2 L_i}{\partial a_i^2} = -\frac{B}{(e^{A(a_i)}+B)^2} \big[(e^{A(a_i)}+B) A″(a_i) - e^{A(a_i)} A′(a_i)^2\big] + \lambda_g g″(a_i)
                                   = -\frac{B}{e^{A(a_i)}+B} A″(a_i) + \frac{B e^{A(a_i)}}{(e^{A(a_i)}+B)^2} A′(a_i)^2 + \lambda_g g″(a_i)

As B > 0 and e^{A(a_i)}+B > 0, it is easy to verify that the first two terms of ∂²L_i/∂a_i² are non-negative while the third term is always positive. Therefore, ∂²L_i/∂a_i² > 0 and L_i is a strictly convex function with respect to a_i.

Property 2. A unique optimal solution a*_i exists in [l_a, u_a].

Proof. Because the loss function L_i is strictly convex, we have ∂L_i/∂a_i(a_i^1) > ∂L_i/∂a_i(a_i^2) if u_a ≥ a_i^1 > a_i^2 ≥ l_a. Next we prove that an optimal solution a*_i ∈ [l_a, u_a] exists; if it exists, it is unique because of the strict convexity.

As ∂L_i/∂a_i(a_i) = \frac{Bs}{e^{A(a_i)}+B} \sin(\theta_{y_i}+m(a_i)) m′(a_i) + \lambda_g g′(a_i), and considering the constraints m′(a_i) ∈ (0, K], g′(u_a) = 0 and λ_g ≥ sK/(−g′(l_a)), the values of the derivative at l_a and u_a satisfy

\frac{\partial L_i}{\partial a_i}(u_a) = \frac{Bs}{e^{A(u_a)}+B} \sin(\theta_{y_i}+m(u_a)) m′(u_a) > 0
\frac{\partial L_i}{\partial a_i}(l_a) = \frac{Bs}{e^{A(l_a)}+B} \sin(\theta_{y_i}+m(l_a)) m′(l_a) + \lambda_g g′(l_a) < sK + \lambda_g g′(l_a) ≤ 0    (7)

As ∂L_i/∂a_i is monotonically and strictly increasing, there must exist a unique value in [l_a, u_a] at which the derivative is zero. Therefore, an optimal solution exists and is unique.
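Both properties can also be spot-checked numerically. The sketch below (ours) instantiates L_i from Eq. (4) with the linear m(a_i) and hyperbolic g(a_i) of Sec. B.2, verifies the λ_g bound of requirement 3 (which is Eq. 9), and confirms that the grid-searched optimum a*_i grows as θ_{y_i} shrinks; the value of B is an arbitrary stand-in for the sum over negative classes.

```python
import numpy as np

s, l_a, u_a, l_m, u_m, lam_g = 64.0, 10.0, 110.0, 0.4, 0.8, 35.0
B = 5e8   # arbitrary stand-in for sum_j exp(s * cos(theta_j)), j != y_i

def m(a):                                    # linear margin, m'(a) = K > 0
    return (u_m - l_m) / (u_a - l_a) * (a - l_a) + l_m

def g(a):                                    # hyperbola, g'(u_a) = 0
    return 1.0 / a + a / u_a**2

# requirement 3 / Eq. (9): lambda_g >= s*K / (-g'(l_a))
K = (u_m - l_m) / (u_a - l_a)
bound = s * K / (1.0 / l_a**2 - 1.0 / u_a**2)
assert lam_g >= bound                        # 35 >= ~25.8, so the setting is valid

def L_i(a, theta_y):
    """Eq. (4): -log(e^A / (e^A + B)) + lambda_g * g(a), rewritten stably."""
    A = s * np.cos(theta_y + m(a))
    return np.logaddexp(A, np.log(B)) - A + lam_g * g(a)

a_grid = np.linspace(l_a, u_a, 100001)
for theta in [0.9, 0.7, 0.5, 0.3]:           # decreasing angle to the class center
    a_star = a_grid[np.argmin(L_i(a_grid, theta))]
    print(f"theta_yi = {theta:.1f}  ->  a*_i = {a_star:.1f}")
# the printed a*_i increases as theta_yi decreases (Property of Monotonicity)
```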
Method    (l_m, u_m, λ_g, l_a, u_a)    Margin mean/min/max   CFP-FP   IJB-C TAR@FAR 1e-6 / 1e-5 / 1e-4 / 1e-3
ArcFace   -                            0.50 / -    / -       97.32    83.88 / 91.59 / 95.00 / 96.86
MagFace   (0.45, 0.65, 35, 10, 110)    0.50 / 0.49 / 0.52    97.23    81.12 / 91.44 / 94.95 / 96.96
MagFace   (0.40, 0.80, 35, 10, 110)    0.50 / 0.46 / 0.53    97.47    85.82 / 92.06 / 95.12 / 96.92
MagFace   (0.35, 1.00, 35, 10, 110)    0.50 / 0.42 / 0.54    97.40    84.35 / 91.65 / 95.05 / 97.02
MagFace   (0.25, 1.60, 35, 10, 110)    0.50 / 0.35 / 0.61    97.30    81.64 / 91.09 / 94.91 / 96.87

Table 4: Verification accuracy (%) on CFP-FP and IJB-C with different distributions of margins (mean/min/max observed on the training set). Backbone network: ResNet50.

A.3. Proof for Property of Monotonicity

To prove the property of monotonicity, we first show that the optimal a*_i increases with a smaller cosine-distance to its class center (Property 3). As B reflects the overall cosine-distances to the other class centers, we further prove that decreasing B (i.e., increasing the distances to the other class centers) increases the optimal feature magnitude (Property 4). In the end, we can conclude that a*_i is monotonically increasing as the cosine-distance to its class center decreases and the cosine-distances to the other classes increase.

Property 3. With fixed f_i and W_j, j ∈ {1, · · · , n}, j ≠ y_i, the optimal feature magnitude a*_i is monotonically increasing if the cosine-distance to its class center W_{y_i} decreases.

Proof. Assume there are two class centers W_{y_i}^1, W_{y_i}^2 whose angles to the feature f_i are θ_{y_i}^1, θ_{y_i}^2. Assume θ_{y_i}^1 < θ_{y_i}^2 (i.e., class center W_{y_i}^1 has a smaller distance to feature f_i), and denote the corresponding optimal feature magnitudes by a*_{i,1}, a*_{i,2}.

The first derivative of L_i is

\frac{\partial L_i}{\partial a_i} = -\frac{B}{e^{A(a_i)}+B} A′(a_i) + \lambda_g g′(a_i) = \frac{B s m′(a_i)}{e^{s\cos(\theta_{y_i}+m(a_i))}+B} \sin(\theta_{y_i}+m(a_i)) + \lambda_g g′(a_i)    (8)

For θ_{y_i} + m(a_i) ∈ (0, π/2], we have cos(θ_{y_i}^1+m(a_i)) > cos(θ_{y_i}^2+m(a_i)) and sin(θ_{y_i}^1+m(a_i)) < sin(θ_{y_i}^2+m(a_i)). With m′(a_i) > 0, it is obvious that

\frac{B s m′(a_i)}{e^{s\cos(\theta_{y_i}^1+m(a_i))}+B} \sin(\theta_{y_i}^1+m(a_i)) < \frac{B s m′(a_i)}{e^{s\cos(\theta_{y_i}^2+m(a_i))}+B} \sin(\theta_{y_i}^2+m(a_i)).

Therefore, ∂L_i(θ_{y_i}^1)/∂a_i < ∂L_i(θ_{y_i}^2)/∂a_i. Based on the property of the optimal solution of a strictly convex function, we have 0 = ∂L_i(θ_{y_i}^1)/∂a_i |_{a*_{i,1}} = ∂L_i(θ_{y_i}^2)/∂a_i |_{a*_{i,2}} > ∂L_i(θ_{y_i}^1)/∂a_i |_{a*_{i,2}}, which leads to a*_{i,1} > a*_{i,2}.

Property 4. With everything else fixed, the optimal feature magnitude a*_i is monotonically increasing with a decreasing B (i.e., an increasing inter-class distance).

Proof. Assume 0 < B_1 < B_2 with optima a*_{i,1}, a*_{i,2}. Similar to the proof above, it is easy to show that

\frac{B_1 s m′(a_i)}{e^{s\cos(\theta_{y_i}+m(a_i))}+B_1} \sin(\theta_{y_i}+m(a_i)) < \frac{B_2 s m′(a_i)}{e^{s\cos(\theta_{y_i}+m(a_i))}+B_2} \sin(\theta_{y_i}+m(a_i)).

Therefore, ∂L_i(B_1)/∂a_i < ∂L_i(B_2)/∂a_i. Based on the property of the optimal solution of a strictly convex function, we have 0 = ∂L_i(B_1)/∂a_i |_{a*_{i,1}} = ∂L_i(B_2)/∂a_i |_{a*_{i,2}} > ∂L_i(B_1)/∂a_i |_{a*_{i,2}}, which leads to a*_{i,1} > a*_{i,2}.

B. Experimental Settings

B.1. Training settings for Figure 3

We adopt ResNet50 as the backbone network. Models are trained on MS1M-V2 [14, 9] for 20 epochs with batch size 512 and an initial learning rate of 0.1, dropped by a factor of 10 every 5 epochs. 512 samples from the last iteration are used for visualization.

B.2. Settings of m(a_i), g(a_i) and λ_g

In our experiments, we define m(a_i) as a linear function on [l_a, u_a] with m(l_a) = l_m and m(u_a) = u_m, and set g(a_i) = 1/a_i + a_i/u_a^2. Therefore, we have

m(a_i) = \frac{u_m - l_m}{u_a - l_a}(a_i - l_a) + l_m,
\lambda_g \geq \frac{sK}{-g′(l_a)} = \frac{s\, u_a^2 l_a^2 (u_m - l_m)}{(u_a^2 - l_a^2)(u_a - l_a)}.    (9)

C. Ablation Study on Margin Distributions

In this section, the effects of the margin distributions during training are studied. With (λ_g, l_a, u_a) fixed to (35, 10, 110), we carefully select various combinations of (l_m, u_m) to align the mean margin on the training dataset with that of ArcFace (0.5) in our implementation. Features are distributed more separately with a larger maximum margin and a smaller minimum margin.

Table 4 shows the recognition results with various hyperparameters. With (l_m, u_m) = (0.45, 0.65), the penalty of the magnitude loss degrades the recognition performance. With (l_m, u_m) = (0.25, 1.60), the performance is also worse than the baseline, as hard samples are assigned small margins (i.e., hard/noisy samples are down-weighted too aggressively). The setting (0.40, 0.80) balances the feature distribution and the margins for hard/noisy samples, and therefore achieves a significant improvement on the benchmarks.

D. Extended Visualization of Figure 6

We present an extended visualization of Figure 6 in Figure 8, which shows more examples of faces with their feature magnitudes.
[Figure 8 face grid omitted: rows of example IJB-C faces annotated with feature magnitudes ranging from about 14 (bottom row) to 38 (top row).]
Figure 8: Extended visualization of Figure 6.

All the faces are sampled from the IJB-C benchmark. It can be seen that faces with magnitudes around 28 are mostly profile faces, while those around 35 are high-quality, frontal faces. This is consistent with the profile/frontal peaks in the CFP-FP benchmark and indicates that faces with similar magnitudes show similar quality patterns across benchmarks. In real applications, we can set a proper threshold on the magnitude and should be able to filter out similar low-quality faces, even under various scenarios.

Besides directly serving as qualities, our feature magnitudes can also be used as quality labels for faces, which avoids human labelling costs. These labels are more suitable for recognition, and can therefore be used to boost other quality models.

E. Mag-CosFace

In the main text, MagFace is derived from the ArcFace loss. In this section, we show that MagFace based on the CosFace loss (denoted Mag-CosFace) can theoretically achieve the same effects. The Mag-CosFace loss for a sample i is

L_i = -\log \frac{e^{s(\cos\theta_{y_i} - m(a_i))}}{e^{s(\cos\theta_{y_i} - m(a_i))} + \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} + \lambda_g g(a_i)    (10)

Let A(a_i) = s(\cos\theta_{y_i} - m(a_i)) and B = \sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}, and rewrite the loss as

L_i = -\log \frac{e^{A(a_i)}}{e^{A(a_i)}+B} + \lambda_g g(a_i)    (11)

E.1. Property of Convergence for Mag-CosFace

Property 5. For a_i ∈ [l_a, u_a], L_i is a strictly convex function of a_i.

Proof. The first and second derivatives of A(a_i) are

A′(a_i) = -s m′(a_i),  A″(a_i) = -s m″(a_i)    (12)

As A″(a_i) ≤ 0, the property can be proved following the argument presented before.

Property 6. A unique optimal solution a*_i exists in [l_a, u_a].

Proof. We only need to prove

\frac{\partial L_i}{\partial a_i}(u_a) = \frac{Bs}{e^{A(u_a)}+B} m′(u_a) > 0
\frac{\partial L_i}{\partial a_i}(l_a) = \frac{Bs}{e^{A(l_a)}+B} m′(l_a) + \lambda_g g′(l_a) < sK + \lambda_g g′(l_a) ≤ 0    (13)

Then it is easy to see that a unique optimum exists.

E.2. Property of Monotonicity for Mag-CosFace

Property 7. With fixed f_i and W_j, j ∈ {1, · · · , n}, j ≠ y_i, the optimal feature magnitude a*_i is monotonically increasing if the cosine-distance to its class center W_{y_i} decreases.

Proof. The first derivative of L_i is

\frac{\partial L_i}{\partial a_i} = -\frac{B}{e^{A(a_i)}+B} A′(a_i) + \lambda_g g′(a_i) = \frac{B s m′(a_i)}{e^{s(\cos\theta_{y_i}-m(a_i))}+B} + \lambda_g g′(a_i)    (14)

For θ_{y_i}^1 < θ_{y_i}^2, the core is to prove ∂L_i(θ_{y_i}^1)/∂a_i < ∂L_i(θ_{y_i}^2)/∂a_i, which is obvious as cos θ_{y_i}^1 > cos θ_{y_i}^2. The rest of the proof is the same as for the original MagFace.

Property 8. With everything else fixed, the optimal feature magnitude a*_i is monotonically increasing with a decreasing B (i.e., an increasing inter-class distance).

Proof. It is easy to show that ∂L_i(B_1)/∂a_i < ∂L_i(B_2)/∂a_i if B_1 < B_2. The rest of the proof is the same as for the original MagFace.
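A minimal NumPy sketch of Eq. (10) follows (ours, with the same conventions and illustrative defaults as the MagFace sketch in the main text); the only change from Eq. (2) is that the margin m(a_i) is subtracted from the cosine rather than added to the angle.

```python
import numpy as np

def mag_cosface_loss(feats, weights, labels, s=64.0,
                     l_a=10.0, u_a=110.0, l_m=0.4, u_m=0.8, lambda_g=35.0):
    """Mag-CosFace (Eq. 10): subtractive margin m(a_i) on cos(theta_yi)."""
    a = np.clip(np.linalg.norm(feats, axis=1), l_a, u_a)    # magnitudes a_i
    m_a = (u_m - l_m) / (u_a - l_a) * (a - l_a) + l_m       # linear margin
    g_a = 1.0 / a + a / u_a**2                              # hyperbolic regularizer

    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)
    idx = np.arange(len(labels))

    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m_a)      # subtractive margin
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return (-log_prob[idx, labels] + lambda_g * g_a).mean()
```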

F. Authors’ Contributions
Shichao Zhao and Zhida Huang contribute similarly to
this work. Besides involved in discussions, Shichao Zhao
mainly conducted experiments on face clustering and Zhida
Huang implemented baselines as well as evaluation metrics
for quality experiments.
