
YOLO5Face: Why Reinventing a Face Detector

Delong Qi, Weijun Tan*, Qi Yao, Jingfeng Liu


Shenzhen Deepcam Information Technologies
Shenzhen, China
{delong.qi,weijun.tan,qi.yao,jingfeng.liu}@deepcam.com
*LinkSprite Technologies, USA, weijun.tan@linksprite.com

arXiv:2105.12931v3 [cs.CV] 27 Jan 2022

Abstract—Tremendous progress has been made on face detection in recent years using convolutional neural networks. While many face detectors use designs designated for detecting faces, we treat face detection as a general object detection task. We implement a face detector based on the YOLOv5 object detector and call it YOLO5Face. We add a five-point landmark regression head into it and use the Wing loss function. We design detectors of different model sizes, from a large model to achieve the best performance to a super small model for real-time detection on an embedded or mobile device. Experimental results on the WiderFace dataset show that our face detectors achieve state-of-the-art performance on almost all the Easy, Medium, and Hard subsets, exceeding the more complex designated face detectors. The code is available at https://www.github.com/deepcam-cn/yolov5-face.

Index Terms—Face detection, convolutional neural network, YOLO, real-time, embedded device, object detection

I. INTRODUCTION

Face detection is a very important computer vision task. Tremendous progress has been made since deep learning, particularly the convolutional neural network (CNN), was introduced to this task. As the first step of many tasks, including face recognition, verification, tracking, alignment, and expression analysis, face detection attracts much research and development in academia and industry, and its performance has improved significantly over the years. For a survey of face detection, please refer to the benchmark results [1], [2]. There are many methods in this field from different perspectives. Research directions include the design of the CNN network, loss functions, data augmentation, and training strategies. For example, in the YOLOv4 paper, the authors explore all these research directions and propose the YOLOv4 object detector based on optimizations of the network architecture, selection of bags of freebies, and selection of bags of specials [3].

In our approach, we treat face detection as a general object detection task, with the same intuition as TinaFace [4]. Intuitively, a face is an object. As discussed in TinaFace [4], from the perspective of data, the properties that faces have, such as pose, scale, occlusion, illumination, and blur, also exist in other objects. The unique properties of faces, such as expression and makeup, correspond to distortion and color in objects. Landmarks are special to faces, but they are not unique either; they are just key points of an object. For example, landmarks are also used in license plate detection, and adding landmark regression to the object prediction head is straightforward. From the perspective of the challenges encountered by face detection, such as multi-scale, small faces, and dense scenes, they all exist in generic object detection. Thus, face detection is just a sub-task of general object detection.

In this paper, we follow this intuition and design a face detector based on the YOLOv5 object detector [5]. We modify the design for face detection, considering large faces, small faces, and landmark supervision, for different complexities and applications. Our goal is to provide a portfolio of models for different applications, from very complex ones for the best performance to very simple ones for the best trade-off of performance and speed on embedded or mobile devices. Our main contributions are summarized as follows:

• We redesign the YOLOv5 object detector [5] as a face detector and call it YOLO5Face. We implement key modifications to the network to improve the performance in terms of mean average precision (mAP) and speed. The details of these modifications are presented in Section III.

• We design a series of models of different sizes, from large models to medium models to super small models, for the needs of different applications. In addition to the backbone used in YOLOv5 [5], we implement a backbone based on ShuffleNetV2 [6], which gives state-of-the-art (SOTA) performance and fast speed on mobile devices.

• We evaluate our models on the WiderFace [1] dataset. On VGA resolution images, almost all our models achieve SOTA performance at fast speed. This proves our point: as the title of this paper claims, we do not need to reinvent a face detector, since YOLO5Face can accomplish the goal.

II. RELATED WORK

A. Object Detection

General object detection aims at locating and classifying pre-defined objects in a given image. Before deep CNNs were used, traditional face detection relied on hand-crafted features, such as HAAR, HOG, LBP, SIFT, DPM, and ACF. The seminal work by Viola and Jones [7] introduces the integral image to compute HAAR-like features. For a survey of face detection using hand-crafted features, please refer to [8], [9].

Since the deep CNN has shown its power in many machine learning tasks, face detection has become dominated by deep CNN methods. There are two-stage and one-stage object detectors. Typical two-stage methods are the RCNN family, including RCNN [10], Fast-RCNN [11], Faster-RCNN [12], Mask-RCNN [13], and Cascade-RCNN [14]. The two-stage object detectors have very good performance but suffer from long latency and slow speed. To overcome this problem, one-stage object detectors have been studied. Typical one-stage networks include SSD [15] and YOLO [3], [5], [16]–[18]. Other object detection networks include FPN [19], MMDetection [20], EfficientDet [21], the transformer-based DETR [22], CenterNet [23], [24], and so on.

B. Face Detection

Research on face detection follows general object detection. After the most popular and challenging face detection benchmark, the WiderFace dataset [1], was released, face detection developed rapidly, focusing on extreme and real-world variation problems including scale, pose, occlusion, expression, makeup, illumination, and blur. Many methods have been proposed to deal with these problems, particularly scale, context, and anchor design for detecting small faces. These methods include MTCNN [25], FaceBoxes [26], S3FD [27], DSFD [28], RetinaFace [29], RefineFace [30], and the most recent ASFD [31], MaskFace [32], TinaFace [4], MogFace [33], and SCRFD [34]. For a list of popular face detectors, readers are referred to the WiderFace website [2].

It is worth noting that some of these face detectors explore characteristics unique to the human face, while the others are just general object detectors adopted and modified for face detection. Taking RetinaFace [29] as an example, it uses landmark (2D and 3D) regression to help supervise face detection, while TinaFace [4] is simply a general object detector.

C. YOLO

YOLO first appeared in 2015 [16] as a different approach from the popular two-stage approaches. It treats object detection as a regression problem rather than a classification problem and performs all the essential stages to detect an object with a single neural network. As a result, it not only achieves very good detection performance but also runs in real time. Furthermore, it has excellent generalization capability and can easily be trained to detect different objects.

Over the next five years, the YOLO algorithm was upgraded to five versions with many innovative ideas from the object detection community. The first three versions, YOLOv1 [16], YOLOv2 [17], and YOLOv3 [18], were developed by the author of the original YOLO algorithm. Of these three, YOLOv3 [18] is a milestone, with big improvements in performance and speed from introducing multi-scale features (FPN) [19], a better backbone network (Darknet53), and replacing the Softmax classification loss with the binary cross-entropy loss.

In early 2020, after the original YOLO authors withdrew from the research field, YOLOv4 [3] was released by a different research team. The team explored many options in almost all aspects of the YOLOv3 [18] algorithm, including the backbone and what they call bags of freebies and bags of specials. It achieves 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of 65 FPS on a Tesla V100. One month later, YOLOv5 [5] was released by yet another research team. From an algorithm perspective, YOLOv5 [5] does not have many innovations, and the team has not published a paper, which brought some controversy over whether it should be called YOLOv5. However, due to its significantly reduced model size, faster speed, similar performance to YOLOv4 [3], and a full implementation in Python (PyTorch), it has been welcomed by the object detection community.

III. YOLO5FACE FACE DETECTOR

In this section, we present the key modifications we make in YOLOv5 to turn it into a face detector, YOLO5Face.

A. Network Architecture

We use the YOLOv5 object detector [5] as our baseline and optimize it for face detection. We introduce modifications designed for the detection of small faces as well as large faces.

The network architecture of our YOLO5Face face detector is depicted in Fig. 1. It consists of the backbone, the neck, and the head. In YOLOv5, a newly designed backbone called CSPNet [5] is used. In the neck, an SPP [35] and a PAN [36] are used to aggregate the features. In the head, regression and classification are both used.

Fig. 1. The proposed YOLO5Face network architecture.

In Fig. 1 (a), the overall network architecture is depicted. In Fig. 1 (b), a key block called CBS is defined, which consists of a Conv layer, a BN layer, and a SiLU [37] activation function. This CBS block is used in many other blocks. In Fig. 1 (c), the output label of the head is shown, which includes the bounding box (bbox), confidence (conf), classification (cls), and five-point landmarks. The landmarks are our addition to YOLOv5 to make it a face detector with landmark output. Without the landmarks, the last dimension 16 would be 6. Please note that the output dimensions 80*80*16 in P3, 40*40*16 in P4, 20*20*16 in P5, and 10*10*16 in the optional P6 are per anchor; the real dimension should be multiplied by the number of anchors.
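
For illustration, the following sketch (our own, not code from the released repository; the function names are ours) computes the per-anchor channel count and the resulting per-level output shapes for a 640x640 input:

```python
# Sketch: per-anchor output channels and output grid shapes of YOLO5Face.
def head_channels(num_classes: int = 1, num_landmarks: int = 5) -> int:
    # 4 bbox values + 1 confidence + num_classes scores
    # + 2 coordinates per landmark = 16 for one face class and 5 landmarks;
    # without landmarks this reduces to 6.
    return 4 + 1 + num_classes + 2 * num_landmarks

def output_shapes(img_size: int = 640, strides=(8, 16, 32, 64)):
    # P3..P6 grids for a 640x640 input: 80x80, 40x40, 20x20, 10x10.
    c = head_channels()
    return [(img_size // s, img_size // s, c) for s in strides]

print(head_channels())  # 16
print(output_shapes())  # [(80, 80, 16), (40, 40, 16), (20, 20, 16), (10, 10, 16)]
```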

In Fig. 1 (d), the Stem structure [38] is shown, which is used to replace the original Focus layer of YOLOv5. The introduction of the Stem block into YOLOv5 for face detection is one of our innovations.

In Fig. 1 (e), a CSP block (C3) is shown. This block is inspired by DenseNet [39]. However, instead of adding the full input to the output after some CNN layers, the input is separated into two halves. One half is passed through a CBS block, a number of Bottleneck blocks (shown in Fig. 1 (f)), and then another Conv layer; the other half is passed through a Conv layer. The two halves are then concatenated, followed by another CBS block.

In Fig. 1 (g), an SPP block [35] is shown. In this block, the three kernel sizes 13x13, 9x9, and 5x5 of YOLOv5 are revised to 7x7, 5x5, and 3x3 in our face detector. This has been shown to be one of the innovations that improves face detection performance.
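
The CBS, Bottleneck, and C3 blocks can be sketched in PyTorch as follows. This is a minimal reconstruction of the description above; the exact channel ratios and layer counts in the released code may differ.

```python
import torch
import torch.nn as nn

def cbs(c1, c2, k=1, s=1):
    # CBS block of Fig. 1 (b): Conv + BatchNorm + SiLU [37]
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class Bottleneck(nn.Module):
    """Residual bottleneck of Fig. 1 (f)."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(cbs(c, c, 1), cbs(c, c, 3))

    def forward(self, x):
        return x + self.body(x)

class C3(nn.Module):
    """CSP block of Fig. 1 (e): one half goes through a CBS block, n
    Bottlenecks, and a Conv; the other half through a Conv; the two are
    concatenated and fused by a final CBS block."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.branch1 = nn.Sequential(cbs(c_in, c_half, 1),
                                     *[Bottleneck(c_half) for _ in range(n)],
                                     nn.Conv2d(c_half, c_half, 1, bias=False))
        self.branch2 = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.fuse = cbs(c_out, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat((self.branch1(x), self.branch2(x)), dim=1))
```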

Note that we only consider VGA resolution input images. To be more precise, the longer edge of the input image is scaled to 640, and the shorter edge is scaled accordingly. The shorter edge is also adjusted to be a multiple of the largest stride. For example, when P6 is not used, the shorter edge needs to be a multiple of 32; when P6 is used, it needs to be a multiple of 64.
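
This shape rule can be written down directly; the sketch below is our illustration of it (the function name is ours):

```python
import math

def scaled_shape(h: int, w: int, long_edge: int = 640, stride: int = 32):
    """Scale the longer edge to `long_edge`, then round the shorter edge up
    to a multiple of `stride` (32 without P6, 64 with P6)."""
    scale = long_edge / max(h, w)
    short = int(math.ceil(min(h, w) * scale / stride) * stride)
    return (long_edge, short) if h >= w else (short, long_edge)

print(scaled_shape(480, 640))             # (480, 640), multiple of 32
print(scaled_shape(480, 640, stride=64))  # (512, 640), multiple of 64
```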

B. Summary of Key Modifications

The key modifications are summarized as follows:

• We add a landmark regression head to the YOLOv5 network and use the Wing loss [40] as its loss function. This makes the face detector more useful, since landmarks are needed in many applications; the landmark locations are more accurate, and this extra supervision helps the face detector's accuracy.

• We replace the Focus layer of YOLOv5 [5] with a Stem block structure [38]. It increases the network's generalization capability and reduces the computation complexity, while the performance does not degrade.

• We change the SPP block [35] to use smaller kernels. This makes YOLOv5 more suitable for face detection and improves the detection accuracy.

• We add a P6 output block with a stride of 64. It increases the capability to detect large faces. This item is easily overlooked by many researchers, whose focus is more on the detection of small faces.

• We find that some data augmentation methods for general object detection are not appropriate for face detection, including up-down flipping and Mosaic. Removing the up-down flipping improves the performance. The Mosaic augmentation [3] degrades the performance when small images are used; however, when small faces are ignored, it works well. Random cropping helps the performance.

• We design two super lightweight models based on ShuffleNetV2 [6]. This backbone is very different from the CSP network. These models are super small, yet achieve SOTA performance for embedded or mobile devices.

C. Landmark Regression

Landmarks are important characteristics of the human face. They can be used for face alignment, face recognition, facial expression analysis, age analysis, etc. Traditional landmarks consist of 68 points; they were simplified to 5 points in MTCNN [25]. Since then, the five-point landmarks have been used widely in face recognition. The quality of the landmarks affects the quality of face alignment and face recognition.

The general object detector does not include landmarks, and it is straightforward to add them as a regression head. Therefore, we add one to our YOLO5Face. The landmark outputs are used to align face images before they are sent to the face recognition network.

General loss functions for landmark regression are L2, L1, or smooth-L1. MTCNN [25] uses the L2 loss function. However, it has been found that these loss functions are not sensitive to small errors. To overcome this problem, the Wing loss was proposed [40]:

    wing(x) = { w * ln(1 + |x|/e),  if |x| < w
              { |x| - C,            otherwise          (1)

The non-negative w sets the range of the nonlinear part to (-w, w), e limits the curvature of the nonlinear region, and C = w - w*ln(1 + w/e) is a constant that smoothly links the piecewise-defined linear and nonlinear parts. Compared with the L2, L1, or smooth-L1 functions, the response of the Wing loss in the small-error area near zero is boosted.

The loss function for the landmark point vector s = {s_i} and its ground truth s' = {s'_i}, where i = 1, 2, ..., 10, is defined as

    loss_L(s) = sum_i wing(s_i - s'_i)                 (2)

Let the general object detection loss function of YOLOv5 be loss_O (covering bounding box, class, and probability). The new total loss function is then

    loss = loss_O + λ_L * loss_L                       (3)

where λ_L is a weighting factor for the landmark regression loss.
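
Equations (1)-(3) translate directly into code. The sketch below is our reading of the formulas; the parameter values w = 10 and e = 2 and the reductions are illustrative, not necessarily the exact training settings.

```python
import math
import torch

def wing_loss(pred, target, w=10.0, e=2.0):
    """Eq. (1), elementwise: w*ln(1 + |x|/e) if |x| < w, else |x| - C,
    where C = w - w*ln(1 + w/e) joins the two pieces smoothly."""
    x = (pred - target).abs()
    C = w - w * math.log(1.0 + w / e)
    return torch.where(x < w, w * torch.log(1.0 + x / e), x - C)

def landmark_loss(s, s0):
    """Eq. (2): sum of wing losses over the 10 landmark coordinates."""
    return wing_loss(s, s0).sum(dim=-1)

def total_loss(loss_O, s, s0, lambda_L=0.5):
    """Eq. (3): detection loss plus weighted landmark loss
    (lambda_L = 0.5 in this paper, see Section IV-B)."""
    return loss_O + lambda_L * landmark_loss(s, s0).mean()
```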

D. Stem Block Structure

We use a stem block similar to [38], shown in Fig. 1 (d). With this stem block, we implement a stride of 2 in the first spatial down-sampling of the input image and increase the number of channels. With this stem block, the computation complexity increases only marginally, while a strong representation capability is ensured.
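
A plausible reconstruction of such a stem block (following Pelee [38]; the exact layer configuration of the released model may differ) is:

```python
import torch
import torch.nn as nn

def cbs(c1, c2, k=1, s=1):
    # Conv + BatchNorm + SiLU, as in Fig. 1 (b)
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class StemBlock(nn.Module):
    """Pelee-style stem: a stride-2 entry conv, then a downsampling conv
    branch and a max-pool branch, concatenated and fused by a 1x1 conv."""
    def __init__(self, c_in=3, c_out=32):
        super().__init__()
        self.entry = cbs(c_in, c_out, k=3, s=2)   # first stride-2 downsampling
        self.branch = nn.Sequential(cbs(c_out, c_out // 2, 1),
                                    cbs(c_out // 2, c_out, 3, 2))
        self.pool = nn.MaxPool2d(2, 2)
        self.fuse = cbs(2 * c_out, c_out, 1)

    def forward(self, x):
        x = self.entry(x)
        return self.fuse(torch.cat((self.branch(x), self.pool(x)), dim=1))

print(StemBlock()(torch.zeros(1, 3, 640, 640)).shape)  # [1, 32, 160, 160]
```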

Model         Backbone               (D,W)          With P6?
YOLOv5s       YOLOv5-CSPNet [5]      (0.33, 0.50)   No
YOLOv5s6      YOLOv5-CSPNet          (0.33, 0.50)   Yes
YOLOv5m       YOLOv5-CSPNet          (0.50, 0.75)   No
YOLOv5m6      YOLOv5-CSPNet          (0.50, 0.75)   Yes
YOLOv5l       YOLOv5-CSPNet          (1.0, 1.0)     No
YOLOv5l6      YOLOv5-CSPNet          (1.0, 1.0)     Yes
YOLOv5n       ShuffleNetv2 [6]       -              No
YOLOv5n-0.5   ShuffleNetv2-0.5 [6]   -              No

TABLE I: Details of the implemented YOLO5Face models, where (D,W) are the depth and width multiples of the YOLOv5 CSPNet [5]. The numbers of parameters and Flops are listed in Table III.

Modification        Method                    Easy   Medium  Hard   Params(M)  Flops(G)
Stem block          Focus+Conv                93.56  92.54   82.56  7.091      6.174
                    Stem block                94.13  92.87   82.79  7.075      5.751
SPP kernel          (13,9,5)                  93.43  91.12   82.64  -          -
                    (7,5,3)                   94.33  92.61   84.15  -          -
P6 block            No                        94.31  92.52   83.15  7.075      5.751
                    Yes                       95.29  93.61   83.13  12.386     6.28
Data augmentation   Baseline (with Mosaic)    91.34  90.21   83.54  -          -
                    - up-down flipping        91.87  90.56   83.58  -          -
                    + ignore small faces      94.12  92.21   82.21  -          -
                    + random crop             94.34  92.58   83.17  -          -

TABLE II: Ablation study results on the WiderFace validation dataset.

Detector           Backbone               Easy   Medium  Hard   Params(M)  Flops(G)
DSFD [28]          ResNet152 [41]         94.29  91.47   71.39  120.06     259.55
RetinaFace [29]    ResNet50 [41]          94.92  91.90   64.17  29.50      37.59
HAMBox [42]        ResNet50 [41]          95.27  93.76   76.75  30.24      43.28
TinaFace [4]       ResNet50 [41]          95.61  94.25   81.43  37.98      172.95
SCRFD-34GF [34]    Bottleneck ResNet      96.06  94.92   85.29  9.80       34.13
SCRFD-10GF [34]    Basic ResNet [41]      95.16  93.87   83.05  3.86       9.98
Our YOLOv5s        YOLOv5-CSPNet [5]      94.33  92.61   83.15  7.075      5.751
Our YOLOv5s6       YOLOv5-CSPNet          95.48  93.66   82.8   12.386     6.280
Our YOLOv5m        YOLOv5-CSPNet          95.30  93.76   85.28  21.063     18.146
Our YOLOv5m6       YOLOv5-CSPNet          95.66  94.1    85.2   35.485     19.773
Our YOLOv5l        YOLOv5-CSPNet          95.9   94.4    84.5   46.627     41.607
Our YOLOv5l6       YOLOv5-CSPNet          96.38  94.90   85.88  76.674     45.279
Our YOLOv5x6       YOLOv5-CSPNet          96.67  95.08   86.55  141.158    88.665
SCRFD-2.5GF [34]   Basic ResNet           93.78  92.16   77.87  0.67       2.53
SCRFD-0.5GF [34]   Depth-wise Conv        90.57  88.12   68.51  0.57       0.508
RetinaFace [29]    MobileNet0.25 [43]     87.78  81.16   47.32  0.44       0.802
FaceBoxes [26]     -                      76.17  57.17   24.18  1.01       0.275
Our YOLOv5n        ShuffleNetv2 [6]       93.61  91.54   80.53  1.726      2.111
Our YOLOv5n0.5     ShuffleNetv2-0.5 [6]   90.76  88.12   73.82  0.447      0.571

TABLE III: Comparison of our YOLO5Face and existing face detectors on the WiderFace validation dataset [1].

E. SPP with Smaller Kernels

Before being forwarded to the feature aggregation block in the neck, the output feature maps of the YOLOv5 backbone are sent to an additional SPP block [35] to increase the receptive field and separate out the most important features. Unlike many CNN models containing fully connected layers, which only accept input images of specific dimensions, SPP was proposed to generate a fixed-size output irrespective of the input size. In addition, SPP also helps to extract important features by pooling multi-scale versions of the feature map.

In YOLOv5, three kernel sizes 13x13, 9x9, and 5x5 are used [5]. We revise them to the smaller kernels 7x7, 5x5, and 3x3. These smaller kernels help to detect small faces more easily and increase the overall face detection performance.
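
A sketch of this SPP block with the revised kernels (standard SPP structure; the channel arithmetic is illustrative):

```python
import torch
import torch.nn as nn

def cbs(c1, c2, k=1, s=1):
    return nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class SPP(nn.Module):
    """SPP block [35] with kernels (7, 5, 3) as in Fig. 1 (g). Stride-1
    max-pool branches with padding keep the spatial size, so multi-scale
    context is pooled and concatenated at a fixed resolution."""
    def __init__(self, c_in, c_out, kernels=(7, 5, 3)):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = cbs(c_in, c_mid, 1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.cv2 = cbs(c_mid * (len(kernels) + 1), c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))

print(SPP(1024, 1024)(torch.zeros(1, 1024, 20, 20)).shape)  # [1, 1024, 20, 20]
```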

F. P6 Output Block

The backbone of a YOLO object detector has many layers. As the features become more and more abstract in the deeper layers, the spatial resolution of the feature maps decreases due to downsampling, which leads to a loss of spatial information as well as fine-grained features. To preserve these fine-grained features, the FPN [19] was introduced in YOLOv3 [18].

In FPN [19], the fine-grained features take a long path traveling from low-level to high-level layers. To overcome this problem, PAN adds a bottom-up augmentation path along the top-down path used in FPN. In addition, in the connection of the feature maps to the lateral architecture, the element-wise addition operation is replaced with concatenation. In FPN, object predictions are made independently on different scale levels, which does not utilize information from other feature maps and may produce duplicated predictions. In PAN [36], the output feature maps of the bottom-up augmentation pyramid are fused by using ROI (region of interest) align and fully connected layers with an element-wise max operation.

In YOLOv5, there are three output blocks on the PAN output feature maps, called P3, P4, and P5, corresponding to 80x80x16, 40x40x16, and 20x20x16, with strides 8, 16, and 32, respectively. In our YOLO5Face, we add an extra P6 output block, whose feature map is 10x10x16 with stride 64. This modification particularly helps the detection of large faces. While almost all face detectors focus on improving the detection of small faces, the detection of large faces can easily be overlooked. We fill this hole by adding the P6 output block.

G. ShuffleNetV2 as Backbone

ShuffleNet [44] is an extremely efficient CNN for mobile devices. Its key block, the ShuffleNet block, utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce the computation cost while maintaining accuracy.

ShuffleNetV2 [44] is an improved version of ShuffleNet. It borrows a shortcut network architecture similar to DenseNet [39], with the element-wise addition changed to concatenation, similar to the change in PAN [36] in YOLOv5 [5]. But unlike DenseNet, ShuffleNetV2 does not densely concatenate, and after the concatenation, channel shuffling is used to mix the features. This makes ShuffleNetV2 a super fast network. We use ShuffleNetV2 as the backbone in YOLOv5 and implement the super small face detectors YOLOv5n-Face and YOLOv5n0.5-Face.

Face detector      Training dataset                       FNMR
RetinaFace [29]    WiderFace [1]                          0.1065
YOLOv5s            WiderFace                              0.1060
YOLOv5s            WiderFace + Multi-task-facial [47]     0.1058
YOLOv5m            WiderFace                              0.1056
YOLOv5m            WiderFace + Multi-task-facial          0.1051

TABLE IV: Evaluation of YOLO5Face landmarks on face recognition on the Webface test dataset [45].
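
The channel shuffle operation itself is only a few lines; a minimal sketch (our illustration):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Mix channels across groups after concatenation, as in ShuffleNetV2 [44]:
    reshape to (n, groups, c/groups, h, w), swap the two channel axes, flatten."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```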

IV. EXPERIMENTS

A. Dataset

The WiderFace dataset [1] is the largest face detection dataset; it contains 32,203 images and 393,703 faces. With its large variety of scale, pose, occlusion, expression, illumination, and event, it is close to reality and very challenging.

The whole dataset is divided into train/validation/test sets by a 50%/10%/40% ratio within each event class. Furthermore, each subset is divided into three levels of difficulty: Easy, Medium, and Hard. As the name indicates, the Hard subset is the most challenging, so performance on the Hard subset best reflects the effectiveness of a face detector.

Unless specified otherwise, the WiderFace dataset [1] is used in this work. In the face recognition experiments with YOLO5Face landmarks and alignment, the Webface dataset [45] is used. The FDDB dataset [46] is used in testing to demonstrate our model's performance on cross-domain datasets.

B. Implementation Details

We use the YOLOv5-4.0 codebase [5] as our starting point and implement all the modifications described earlier in PyTorch.

The SGD optimizer is used. The initial learning rate is 1E-2, the final learning rate is 1E-5, and the weight decay is 5E-3. A momentum of 0.8 is used in the first three warm-up epochs; after that, the momentum is changed to 0.937. The training runs for 250 epochs with a batch size of 64. The value λ_L = 0.5 is found by exhaustive search.
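
In PyTorch, these hyper-parameters correspond roughly to the configuration below (a sketch of the stated settings; the exact warm-up and decay curves of the YOLOv5 codebase may differ):

```python
import torch

def build_optimizer(model: torch.nn.Module):
    # SGD with the stated hyper-parameters: weight decay 5e-3 and a warm-up
    # momentum of 0.8 that is raised to 0.937 after the first three epochs.
    opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                          momentum=0.8, weight_decay=5e-3)
    # Learning rate decays from 1e-2 to 1e-5 over the 250 training epochs
    # (a cosine schedule is used here for illustration).
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=250, eta_min=1e-5)
    return opt, sched

# After the warm-up epochs:
#   for g in opt.param_groups: g["momentum"] = 0.937
```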
before, the Mosaic has to work with the ignoring small faces,
Implemented Models. We implement a series of face otherwise the performance degrades dramatically.
detector models, as listed in Table I. We implement eight Please note that in these experiments the network configu-
relatively large models, including extra large-size mod- rations are not incremental. However in each of set of experi-
els (YOLOv5x, YOLOv5x6), large-size models (YOLOv5l, ment, the baselines for the two networks are the same to make
YOLOv5l6) medium-size models (YOLOv5m, YOLOv5m6), the comparison fair. For example in the SPP experiments,
and small-size models (YOLOv5s, YOLOv5s6). In the name except for the kernel sizes are different, all other settting are
of the model, the last postfix 6 means it has the P6 output block identical.
in the SPP. These models all use the YOLOv4 CSPNet as the
backbone with different depth and width multiples, denoted as D. YOLO5Face for Face Recognition
D and W in Table I. Landmark is critical for face recognition accuracy. In Reti-
Furthermore, we implement two super small-size models, naFace [29], the accuracy of the landmark is evaluated with
YOLOv5n and YOLOv5n0.5, which use the ShuffleNetv2 the MSE between estimated landmark coordinates and their
and ShuffleNetv2-0.5 [6] as the backbone. Except for the ground truth and with the face recognition accuracy. The
backbone, all other main blocks, including the stem block, results show that the RetinaFace has better landmarks than
SPP, PAN, are the same as in the larger models. the older MTCNN [25].
The number of parameters and number of flops of all these In this work, we also use face recognition to evaluate the ac-
models is listed in Table III for comparison with existing curacy of landmarks of the YOLO5Face. We use the Webface
methods. test dataset, which is the largest face dataset with noisy 4M

Fig. 3. Some examples of detected faces and landmarks, where the first row is from RetinaFace [29] and the second row is from our YOLOv5m.

E. YOLO5Face on WiderFace Dataset

We compare our YOLO5Face with many existing face detectors on the WiderFace dataset. The results are listed in Table III, where the previous SOTA results and our best results are both highlighted.

We first look at the performance of relatively large models, whose number of parameters is larger than 3M and number of flops is larger than 5G. The existing methods achieve mAPs of 94.27-96.06% on the Easy subset, 91.9-94.92% on the Medium subset, and 71.39-85.29% on the Hard subset. The most recently released SCRFD [34] achieves the best performance among them in all subsets. Our YOLO5Face (YOLOv5x6) achieves 96.67%, 95.08%, and 86.55% on the three subsets, respectively. We achieve SOTA performance on all the Easy, Medium, and Hard subsets.

Next, we look at the performance of super small models, whose number of parameters is less than 2M and number of flops is less than 3G. The existing methods achieve mAPs of 76.17-93.78% on the Easy subset, 57.17-92.16% on the Medium subset, and 24.18-77.87% on the Hard subset. Again, SCRFD [34] achieves the best performance among them in all subsets. Our YOLO5Face (YOLOv5n) achieves 93.61%, 91.54%, and 80.53% on the three subsets, respectively. Our face detector performs slightly worse than SCRFD [34] on the Easy and Medium subsets; however, on the Hard subset, our face detector leads by 2.66%. Furthermore, our smallest model, YOLOv5n0.5, has good performance even though its model size is much smaller.

The precision-recall (PR) curves of our YOLO5Face face detector, along with its competitors, are shown in Fig. 2. The leading competitors include DFS [53], ISRN [54], VIM-FD [55], DSFD [28], PyramidBox++ [56], SRN [57], PyramidBox [58], and more. For a full list of the competitors and their results on the WiderFace [1] validation and test datasets, please refer to [2]. On the validation dataset, our YOLOv5x6-Face detector achieves 96.9%, 96.0%, and 91.6% mAP on the Easy, Medium, and Hard subsets, respectively, exceeding the previous SOTA by 0.0%, 0.1%, and 0.4%. On the test dataset, our YOLOv5x6-Face detector achieves 95.8%, 94.9%, and 90.5% mAP on the Easy, Medium, and Hard subsets, respectively, with gaps of 1.1%, 1.0%, and 0.7% to the previous SOTA. Please note that in these evaluations we only use multiple scales and left-right flipping, without other test-time augmentation (TTA) methods. Our focus is more on VGA input images, where we achieve the SOTA in almost all conditions.

Fig. 2. The precision-recall (PR) curves of face detectors: (a) validation-Easy, (b) validation-Medium, (c) validation-Hard, (d) test-Easy, (e) test-Medium, (f) test-Hard.

F. YOLO5Face on FDDB Dataset

The FDDB dataset [46] is a small dataset with 5,171 faces annotated in 2,845 images. To demonstrate YOLO5Face's performance on a cross-domain dataset, we test it on the FDDB dataset without retraining on it. The true positive rate (TPR) when the number of false positives is 1000 is listed in Table V. Please note that, as pointed out in RefineFace [30], the FDDB annotation misses many faces; in order to achieve their performance of 0.9911, RefineFace modifies the FDDB annotation. In our evaluation, we use the original FDDB annotation without modifications. RetinaFace [29] is not evaluated on the FDDB dataset.

Method            TPR
ASFD [31]         0.9911
RefineFace [30]   0.9911
PyramidBox [58]   0.9869
FaceBoxes [26]    0.9598
Our YOLOv5s       0.9843
Our YOLOv5m       0.9849
Our YOLOv5l       0.9867
Our YOLOv5l6      0.9880

TABLE V: Evaluation of YOLO5Face on the FDDB dataset [46] (TPR at 1000 false positives).

V. CONCLUSION

In this paper, we present our YOLO5Face based on the YOLOv5 object detector [5]. We implement eight models. Both the largest model, YOLOv5l6, and the super small model, YOLOv5n, achieve close to or exceeding SOTA performance on the WiderFace [1] validation Easy, Medium, and Hard subsets. This proves the effectiveness of our YOLO5Face in not only achieving the best performance but also running fast. Since we open-sourced the code, many applications and mobile apps have been developed based on our design and achieve impressive performance.

REFERENCES

[1] S. Yang, P. Luo, C. C. Loy, and X. Tang, "Wider face: A face detection benchmark," CVPR, 2016.
[2] S. Yang, P. Luo, C. C. Loy, and X. Tang, "Wider face: A face detection benchmark," http://shuoyang1213.me/WIDERFACE/index.html.
[3] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint 2004.10934, 2020.
[4] Y. Zhu, H. Cai, S. Zhang, C. Wang, and W. Xiong, "TinaFace: Strong but simple baseline for face detection," arXiv preprint 2011.13183, 2020.
[5] YOLOv5, https://github.com/ultralytics/yolov5.
[6] N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," arXiv preprint 1807.11164, 2018.
[7] P. Viola and M. J. Jones, "Robust real-time face detection," IJCV, 2004.
[8] M. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," TPAMI, 2002.
[9] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Microsoft Research Technical Report, 2010.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR, 2014.
[11] R. Girshick, "Fast R-CNN," ICCV, 2015.
[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," TPAMI, 2016.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," ICCV, 2017.
[14] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," CVPR, 2018.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg, "SSD: Single shot multibox detector," ECCV, 2016.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CVPR, 2016.
[17] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," CVPR, 2017.
[18] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint 1804.02767, 2018.
[19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," CVPR, 2017.
[20] K. Chen et al., "MMDetection: Open MMLab detection toolbox and benchmark," arXiv preprint 1906.07155, 2019.
[21] M. Tan, R. Pang, and Q. Le, "EfficientDet: Scalable and efficient object detection," CVPR, 2020.
[22] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," ECCV, 2020.
[23] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," ICCV, 2019.
[24] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint 1904.07850, 2019.
[25] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[26] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "FaceBoxes: A CPU real-time face detector with high accuracy," IJCB, 2017.
[27] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "S3FD: Single shot scale-invariant face detector," ICCV, 2017.
[28] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang, "DSFD: Dual shot face detector," arXiv preprint 1810.10220, 2018.
[29] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-stage dense face localisation in the wild," CVPR, 2020.
[30] S. Zhang, C. Chi, Z. Lei, and S. Z. Li, "RefineFace: Refinement neural network for high performance face detection," arXiv preprint 1909.04376, 2019.
[31] B. Zhang, J. Li, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, Y. Xia, W. Pei, and R. Ji, "ASFD: Automatic and scalable face detector," arXiv preprint 2003.11228, 2020.
[32] D. Yashunin, T. Baydasov, and R. Vlasov, "MaskFace: Multi-task face and landmark detector," arXiv preprint 2005.09412, 2020.
[33] Y. Liu, F. Wang, B. Sun, and H. Li, "MogFace: Rethinking scale augmentation on the face detector," arXiv preprint 2103.11139, 2021.
[34] J. Guo, J. Deng, A. Lattas, and S. Zafeiriou, "Sample and computation redistribution for efficient face detection," arXiv preprint 2105.04714, 2021.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," TPAMI, 2015.
[36] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," arXiv preprint 1803.01534, 2018.
[37] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," arXiv preprint 1702.03118, 2017.
[38] R. J. Wang, X. Li, and C. X. Ling, "Pelee: A real-time object detection system on mobile devices," NeurIPS, 2018.
[39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," CVPR, 2017.
[40] Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu, "Wing loss for robust facial landmark localisation with convolutional neural networks," CVPR, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016.
[42] Y. Liu, X. Tang, X. Wu, J. Han, J. Liu, and E. Ding, "HAMBox: Delving into online high-quality anchors mining for detecting outer faces," CVPR, 2020.
[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," CVPR, 2018.
[44] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint 1707.01083, 2017.
[45] Z. Zhu, G. Huang, J. Deng, Y. Ye, J. Huang, X. Chen, J. Zhu, T. Yang, J. Lu, D. Du, and J. Zhou, "WebFace260M: A benchmark unveiling the power of million-scale deep face recognition," CVPR, 2021.
[46] V. Jain and E. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," University of Massachusetts Report UM-CS-2010-009, 2010.
[47] R. Zhao, T. Liu, J. Xiao, D. P. K. Lun, and K.-M. Lam, "Deep multi-task learning for facial expression recognition and synthesis based on selective feature sharing," ICPR, 2020.
[48] Z. Zhu, G. Huang, J. Deng, Y. Ye, J. Huang, X. Chen, J. Zhu, T. Yang, J. Guo, J. Lu, D. Du, and J. Zhou, "Masked face recognition challenge: The WebFace260M track report," ICCV Workshops, 2021.
[49] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of tricks for image classification with convolutional neural networks," CVPR, 2019.
[50] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," CVPR, 2019.
[51] X. An, X. Zhu, Y. Xiao, L. Wu, M. Zhang, Y. Gao, B. Qin, D. Zhang, and Y. Fu, "Partial FC: Training 10 million identities on a single machine," arXiv preprint 2010.05222, 2021.
[52] D. Qi, K. Hu, W. Tan, Q. Yao, and J. Liu, "Balanced masked and standard face recognition," ICCV Workshops, 2021.
[53] W. Tian, Z. Wang, H. Shen, W. Deng, B. Chen, and X. Zhang, "Learning better features for face detection with feature fusion and segmentation supervision," arXiv preprint 1811.08557, 2018.
[54] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, and S. Z. Li, "ISRN: Improved selective refinement network for face detection," arXiv preprint 1901.06651, 2019.
[55] Y. Zhang, X. Xu, and X. Liu, "Robust and high performance face detector," arXiv preprint 1901.02350, 2019.
[56] Z. Li, X. Tang, J. Han, J. Liu, and Z. He, "PyramidBox++: High performance detector for finding tiny face," arXiv preprint 1904.00386, 2019.
[57] C. Chi, S. Zhang, J. Xing, Z. Lei, and S. Z. Li, "SRN: Selective refinement network for high performance face detection," arXiv preprint 1809.02693, 2018.
[58] X. Tang, D. K. Du, Z. He, and J. Liu, "PyramidBox: A context-assisted single shot face detector," arXiv preprint 1803.07737, 2018.
