YOLO5Face: Why Reinventing a Face Detector
Delong Qi, Weijun Tan*, Qi Yao, Jingfeng Liu
Shenzhen Deepcam Information Technologies, Shenzhen, China
{delong.qi,weijun.tan,qi.yao,jingfeng.liu}@deepcam.com
*LinkSprite Technologies, USA, weijun.tan@linksprite.com
Abstract—Tremendous progress has been made on face detection in recent years using convolutional neural networks. While many face detectors use designs designated for detecting faces, we treat face detection as a general object detection task. We implement a face detector based on the YOLOv5 object detector and call it YOLO5Face. We add a five-point landmark regression head to it and use the Wing loss function. We design detectors of different model sizes, from a large model to achieve the best performance, to a super small model for real-time detection on embedded or mobile devices. Experimental results on the WiderFace dataset show that our face detectors achieve state-of-the-art performance on almost all the Easy, Medium, and Hard subsets, exceeding the more complex designated face detectors. The code is available at https://www.github.com/deepcam-cn/yolov5-face.

Index Terms—Face detection, convolutional neural network, YOLO, real-time, embedded device, object detection

I. INTRODUCTION

Face detection is a very important computer vision task. Tremendous progress has been made since deep learning, particularly the convolutional neural network (CNN), began to be used in this task. As the first step of many tasks, including face recognition, verification, tracking, alignment, and expression analysis, face detection attracts much research and development in academia and industry, and the performance of face detection has improved significantly over the years. For a survey of face detection, please refer to the benchmark results [1], [2]. There are many methods in this field working from different perspectives; research directions include the design of the CNN network, loss functions, data augmentation, and training strategies. For example, in the YOLOv4 paper, the authors explore all these research directions and propose the YOLOv4 object detector based on optimizations of the network architecture, selection of bags of freebies, and selection of bags of specials [3].

In our approach, we treat face detection as a general object detection task. We have the same intuition as TinaFace [4]. Intuitively, a face is an object. As discussed in TinaFace [4], from the perspective of data, the properties that faces have, like pose, scale, occlusion, illumination, and blur, also exist in other objects. The unique properties of faces, like expression and makeup, can correspond to distortion and color in other objects. Landmarks are special to faces, but they are not unique either; they are just key points of an object. For example, in license plate detection, landmarks are also used, and adding landmark regression to the object prediction head is straightforward. From the perspective of the challenges encountered in face detection, like multi-scale, small faces, and dense scenes, they all exist in generic object detection. Thus, face detection is just a subtask of general object detection.

In this paper, we follow this intuition and design a face detector based on the YOLOv5 object detector [5]. We modify the design for face detection, considering large faces, small faces, and landmark supervision, for different complexities and applications. Our goal is to provide a portfolio of models for different applications, from very complex ones to get the best performance, to very simple ones to get the best trade-off between performance and speed on embedded or mobile devices.

Our main contributions are summarized as follows:

• We redesign the YOLOv5 object detector [5] as a face detector and call it YOLO5Face. We implement key modifications to the network to improve the performance in terms of mean average precision (mAP) and speed. The details of these modifications are presented in Section III.

• We design a series of models of different sizes, from large models to medium models to super small models, for the needs of different applications. In addition to the backbone used in YOLOv5 [5], we implement a backbone based on ShuffleNetV2 [6], which gives state-of-the-art (SOTA) performance and fast speed on mobile devices.

• We evaluate our models on the WiderFace [1] dataset. On VGA-resolution images, almost all our models achieve SOTA performance and fast speed. This proves our point: as the title of this paper claims, we do not need to reinvent a face detector, since YOLO5Face can accomplish the task.

II. RELATED WORK

A. Object Detection

General object detection aims at locating and classifying the pre-defined objects in a given image. Before deep CNNs were used, traditional face detection used hand-crafted features, like HAAR, HOG, LBP, SIFT, DPM, ACF, etc. The seminal work by Viola and Jones [7] introduced the integral image to compute HAAR-like features. For a survey of face detection using hand-crafted features, please refer to [8], [9].
Since deep CNNs have shown their power in many machine learning tasks, face detection has become dominated by deep CNN methods. There are two-stage and one-stage object detectors. Typical two-stage methods are the RCNN family, including RCNN [10], Fast-RCNN [11], Faster-RCNN [12], Mask-RCNN [13], and Cascade-RCNN [14].

Two-stage object detectors have very good performance but suffer from long latency and slow speed. To overcome this problem, one-stage object detectors have been studied. Typical one-stage networks include SSD [15] and YOLO [3], [5], [16]–[18].

Other object detection networks include FPN [19], MMDetection [20], EfficientDet [21], transformer (DETR) [22], CenterNet [23], [24], and so on.
B. Face Detection

Research on face detection follows general object detection. After the most popular and challenging face detection benchmark, the WiderFace dataset [1], was released, face detection developed rapidly, focusing on the extreme and real variation problems including scale, pose, occlusion, expression, makeup, illumination, and blur.

A lot of methods have been proposed to deal with these problems, particularly with scale, context, and anchors, in order to detect small faces. These methods include MTCNN [25], FaceBoxes [26], S3FD [27], DSFD [28], RetinaFace [29], RefineFace [30], and the most recent ASFD [31], MaskFace [32], TinaFace [4], MogFace [33], and SCRFD [34]. For a list of popular face detectors, readers are referred to the WiderFace website [2].

It is worth noting that some of these face detectors explore unique characteristics of the human face, while the others are just general object detectors adopted and modified for face detection. Take RetinaFace [29] as an example: it uses landmark (2D and 3D) regression to help the supervision of face detection, while TinaFace [4] is simply a general object detector.
C. YOLO

YOLO first appeared in 2015 [16] as an approach different from the popular two-stage approaches. It treats object detection as a regression problem rather than a classification problem. It performs all the essential stages to detect an object in a single neural network. As a result, it not only achieves very good detection performance but also achieves real-time speed. Furthermore, it has excellent generalization capability and can easily be trained to detect different objects.

Over the next five years, the YOLO algorithm was upgraded to five versions with many innovative ideas from the object detection community. The first three versions, YOLOv1 [16], YOLOv2 [17], and YOLOv3 [18], were developed by the author of the original YOLO algorithm. Of these three, YOLOv3 [18] is a milestone, with big improvements in performance and speed obtained by introducing multi-scale features (FPN) [19], a better backbone network (Darknet53), and replacing the Softmax classification loss with the binary cross-entropy loss.

In early 2020, after the original YOLO authors withdrew from the research field, YOLOv4 [3] was released by a different research team. The team explored a lot of options in almost all aspects of the YOLOv3 [18] algorithm, including the backbone and what they call bags of freebies and bags of specials. It achieves 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of 65 FPS on a Tesla V100.

One month later, YOLOv5 [5] was released by yet another research team. From the algorithm perspective, YOLOv5 [5] does not have many innovations, and the team has not published a paper. This brings quite some controversy about whether it should be called YOLOv5. However, due to its significantly reduced model size, faster speed, similar performance to YOLOv4 [3], and a full implementation in Python (PyTorch), it is welcomed by the object detection community.
III. YOLO5FACE FACE DETECTOR

In this section, we present the key modifications we make to YOLOv5 to turn it into a face detector, YOLO5Face.

A. Network Architecture

We use the YOLOv5 object detector [5] as our baseline and optimize it for face detection. We introduce some modifications designated for the detection of small faces as well as large faces.

The network architecture of our YOLO5Face face detector is depicted in Fig. 1. It consists of the backbone, the neck, and the head. In YOLOv5, a newly designed backbone called CSPNet [5] is used. In the neck, an SPP [35] and a PAN [36] are used to aggregate the features. In the head, regression and classification are both used.

Fig. 1. The proposed YOLO5Face network architecture.

In Fig. 1 (a), the overall network architecture is depicted. In Fig. 1 (b), a key block called CBS is defined, which consists of a Conv layer, a BN layer, and a SiLU [37] activation function. This CBS block is used in many other blocks. In Fig. 1 (c), the output label for the head is shown, which includes the bounding box (bbox), confidence (conf), classification (cls), and five-point landmarks. The landmarks are our addition to YOLOv5 to make it a face detector with landmark output. Without the landmarks, the last dimension would be 6 instead of 16. Please note that the output dimensions 80*80*16 in P3, 40*40*16 in P4, 20*20*16 in P5, and 10*10*16 in the optional P6 are per anchor; the real dimension should be multiplied by the number of anchors.

In Fig. 1 (d), a Stem structure [38] is shown, which is used to replace the original Focus layer in YOLOv5. The introduction of the Stem block into YOLOv5 for face detection is one of our innovations.

In Fig. 1 (e), a CSP block (C3) is shown. This block is inspired by DenseNet [39]. However, instead of adding the full input to the output after some CNN layers, the input is separated into two halves. One half is passed through a CBS block, a number of Bottleneck blocks shown in Fig. 1 (f), and then another Conv layer. The other half is passed through a Conv layer; then the two are concatenated and followed by another CBS block.

In Fig. 1 (g), an SPP block [35] is shown. In this block, the three kernel sizes 13x13, 9x9, and 5x5 in YOLOv5 are revised to 7x7, 5x5, and 3x3 in our face detector. This has been shown to be one of the innovations that improves the face detection performance.
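For illustration, the following is a minimal PyTorch sketch of how the 16 per-anchor output values can be split into these fields. The ordering (box 4, objectness 1, ten landmark coordinates, class 1) follows the open-source yolov5-face code and should be treated as an assumption here, not as a normative specification.

```python
import torch

def split_head_output(p):
    """p: raw head output of shape (batch, n_anchors, H, W, 16).
    The field ordering below is assumed from the yolov5-face code."""
    box = p[..., 0:4]    # cx, cy, w, h
    obj = p[..., 4:5]    # objectness confidence
    lmk = p[..., 5:15]   # 5 landmarks as (x, y) pairs
    cls = p[..., 15:16]  # single "face" class score
    return box, obj, lmk, cls

# Example: P3 output for a 640x640 input, 3 anchors, 80x80 grid.
p3 = torch.randn(1, 3, 80, 80, 16)
box, obj, lmk, cls = split_head_output(p3)
print(box.shape, obj.shape, lmk.shape, cls.shape)
```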
Note that we only consider VGA-resolution input images. To be more precise, the longer edge of the input image is scaled to 640, and the shorter edge is scaled accordingly. The shorter edge is also adjusted to be a multiple of the largest stride of the network. For example, when the P6 block is not used, the shorter edge needs to be a multiple of 32; when P6 is used, it needs to be a multiple of 64.
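A minimal sketch of this sizing rule is given below. Rounding the shorter edge up to the stride multiple (e.g., by padding) is our assumption, since the paper only says the edge is "adjusted"; 640 is itself a multiple of both 32 and 64, so rounding both edges is harmless.

```python
import math

def vga_input_size(h, w, long_side=640, stride=32):
    """Network input size for an (h, w) image: scale the longer edge to
    `long_side` and round each edge up to a multiple of `stride`
    (use stride=64 when the P6 output block is present)."""
    r = long_side / max(h, w)
    nh = math.ceil(h * r / stride) * stride
    nw = math.ceil(w * r / stride) * stride
    return nh, nw

print(vga_input_size(480, 640))               # VGA image -> (480, 640)
print(vga_input_size(720, 1280, stride=64))   # -> (384, 640)
```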
B. Summary of Key Modifications

The key modifications are summarized as follows.

• We add a landmark regression head to the YOLOv5 network and use the Wing loss [40] as its loss function. This makes the face detector more useful, since landmarks are required in many applications, and the landmark locations become more accurate. This extra supervision also helps the face detector accuracy.

• We replace the Focus layer of YOLOv5 [5] with a Stem block structure [38]. It increases the network's generalization capability and reduces the computational complexity, while the performance does not degrade.

• We change the SPP block [35] to use smaller kernels. This makes YOLOv5 more suitable for face detection and improves the detection accuracy.

• We add a P6 output block with a stride of 64. It increases the capability to detect large faces. This item is easily overlooked by many researchers, whose focus is more on the detection of small faces.

• We find that some data augmentation methods for general object detection are not appropriate for face detection, including up-down flipping and Mosaic. Removing up-down flipping improves the performance. When small images are used, the Mosaic augmentation [3] degrades the performance; however, when small faces are ignored, it works well. Random cropping helps the performance.

• We design two super lightweight models based on ShuffleNetV2 [6]. This backbone is very different from the CSP network. These models are super small while achieving SOTA performance for embedded or mobile devices.
C. Landmark Regression

Landmarks are important characteristics of the human face. They can be used for face alignment, face recognition, facial expression analysis, age analysis, etc. Traditional landmarks consist of 68 points; they were simplified to 5 points in MTCNN [25]. Since then, the five-point landmarks have been used widely in face recognition. The quality of the landmarks affects the quality of face alignment and face recognition.

The general object detector does not include landmarks. It is straightforward to add them as a regression head, so we add one to our YOLO5Face. The landmark outputs are used to align face images before they are sent to the face recognition network.

General loss functions for landmark regression are L2, L1, or smooth-L1. MTCNN [25] uses the L2 loss function. However, it has been found that these loss functions are not sensitive to small errors. To overcome this problem, the Wing loss was proposed [40]:

wing(x) = w · ln(1 + |x|/e),  if |x| < w
          |x| − C,            otherwise        (1)

The non-negative w sets the range of the nonlinear part to (−w, w), e limits the curvature of the nonlinear region, and C = w − w · ln(1 + w/e) is a constant that smoothly links the piecewise-defined linear and nonlinear parts. When this Wing loss is plotted for different parameters w and e, it can be seen that the response in the small-error area near zero is boosted compared to the L2, L1, or smooth-L1 functions.

The loss function for the landmark point vector s = {s_i} and its ground truth s' = {s'_i}, where i = 1, 2, ..., 10, is defined as

loss_L(s) = Σ_i wing(s_i − s'_i)        (2)

Let the general object detection loss function of YOLOv5 be loss_O (bounding box, class, probability); then the new total loss function is

loss = loss_O + λ_L · loss_L        (3)

where λ_L is a weighting factor for the landmark regression loss function.
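The three equations translate directly into code. Below is a small PyTorch sketch; the parameter values w = 10 and e = 2 are the defaults from the Wing-loss paper [40] and are our assumption, since this paper does not state them.

```python
import math
import torch

def wing(x, w=10.0, e=2.0):
    """Wing loss of Eq. (1). C = w - w*ln(1 + w/e) links the pieces."""
    C = w - w * math.log(1.0 + w / e)
    ax = x.abs()
    return torch.where(ax < w, w * torch.log(1.0 + ax / e), ax - C)

def landmark_loss(s, s_gt):
    """Eq. (2): sum of wing losses over the 10 landmark coordinates."""
    return wing(s - s_gt).sum(dim=-1)

def total_loss(loss_o, s, s_gt, lambda_l=0.5):
    """Eq. (3); lambda_l = 0.5 is the value found by exhaustive
    search in Sec. IV-B."""
    return loss_o + lambda_l * landmark_loss(s, s_gt).mean()
```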
D. Stem Block Structure

We use a stem block similar to [38], shown in Fig. 1 (d). With this stem block, we implement a stride of 2 in the first spatial down-sampling on the input image and increase the number of channels. With this stem block, the computational complexity increases only marginally, while a strong representation capability is ensured.
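For concreteness, a PyTorch sketch of a PeleeNet-style stem block [38] is shown below. The channel sizes and exact branch layout are illustrative assumptions based on [38] and the open-source yolov5-face code, rather than the exact configuration of every model.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the CBS block of Fig. 1 (b)."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class StemBlock(nn.Module):
    """Strided conv, then a conv branch and a max-pool branch that are
    concatenated and fused; total spatial reduction is 4x."""
    def __init__(self, c1=3, c2=32):
        super().__init__()
        self.stem1 = CBS(c1, c2, 3, 2)              # first 2x down-sampling
        self.stem2a = CBS(c2, c2 // 2, 1, 1)
        self.stem2b = CBS(c2 // 2, c2, 3, 2)        # conv branch, 2x down
        self.pool = nn.MaxPool2d(2, 2, ceil_mode=True)  # pooling branch
        self.stem3 = CBS(2 * c2, c2, 1, 1)          # fuse the two branches
    def forward(self, x):
        x = self.stem1(x)
        return self.stem3(torch.cat([self.stem2b(self.stem2a(x)), self.pool(x)], dim=1))

print(StemBlock()(torch.randn(1, 3, 640, 640)).shape)  # -> (1, 32, 160, 160)
```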
TABLE I
Details of the implemented YOLO5Face models, where (D, W) are the depth and width multiples of the YOLOv5 CSPNet [5]. The numbers of parameters and flops are listed in Table III.

Model | Backbone | (D, W) | With P6?
YOLOv5s | YOLO5-CSPNet [5] | (0.33, 0.50) | No
YOLOv5s6 | YOLO5-CSPNet | (0.33, 0.50) | Yes
YOLOv5m | YOLO5-CSPNet | (0.50, 0.75) | No
YOLOv5m6 | YOLO5-CSPNet | (0.50, 0.75) | Yes
YOLOv5l | YOLO5-CSPNet | (1.0, 1.0) | No
YOLOv5l6 | YOLO5-CSPNet | (1.0, 1.0) | Yes
YOLOv5n | ShuffleNetv2 [6] | - | No
YOLOv5n-0.5 | ShuffleNetv2-0.5 [6] | - | No
TABLE II
Ablation study results on the WiderFace validation dataset.

Modification | Method | Easy | Medium | Hard | Params(M) | Flops(G)
Stem block | Focus + Conv | 93.56 | 92.54 | 82.56 | 7.091 | 6.174
Stem block | Stem block | 94.13 | 92.87 | 82.79 | 7.075 | 5.751
SPP kernel | (13, 9, 5) | 93.43 | 91.12 | 82.64 | - | -
SPP kernel | (7, 5, 3) | 94.33 | 92.61 | 84.15 | - | -
P6 block | No | 94.31 | 92.52 | 83.15 | 7.075 | 5.751
P6 block | Yes | 95.29 | 93.61 | 83.13 | 12.386 | 6.28
Data augmentation | Baseline (with Mosaic) | 91.34 | 90.21 | 83.54 | - | -
Data augmentation | - up-down flipping | 91.87 | 90.56 | 83.58 | - | -
Data augmentation | + ignore small faces | 94.12 | 92.21 | 82.21 | - | -
Data augmentation | + random crop | 94.34 | 92.58 | 83.17 | - | -
TABLE III
Comparison of our YOLO5Face and existing face detectors on the WiderFace validation dataset [1].

Detector | Backbone | Easy | Medium | Hard | Params(M) | Flops(G)
DSFD [28] | ResNet152 [41] | 94.29 | 91.47 | 71.39 | 120.06 | 259.55
RetinaFace [29] | ResNet50 [41] | 94.92 | 91.90 | 64.17 | 29.50 | 37.59
HAMBox [42] | ResNet50 [41] | 95.27 | 93.76 | 76.75 | 30.24 | 43.28
TinaFace [4] | ResNet50 [41] | 95.61 | 94.25 | 81.43 | 37.98 | 172.95
SCRFD-34GF [34] | Bottleneck ResNet | 96.06 | 94.92 | 85.29 | 9.80 | 34.13
SCRFD-10GF [34] | Basic ResNet [41] | 95.16 | 93.87 | 83.05 | 3.86 | 9.98
Our YOLOv5s | YOLOv5-CSPNet [5] | 94.33 | 92.61 | 83.15 | 7.075 | 5.751
Our YOLOv5s6 | YOLOv5-CSPNet | 95.48 | 93.66 | 82.8 | 12.386 | 6.280
Our YOLOv5m | YOLOv5-CSPNet | 95.30 | 93.76 | 85.28 | 21.063 | 18.146
Our YOLOv5m6 | YOLOv5-CSPNet | 95.66 | 94.1 | 85.2 | 35.485 | 19.773
Our YOLOv5l | YOLOv5-CSPNet | 95.9 | 94.4 | 84.5 | 46.627 | 41.607
Our YOLOv5l6 | YOLOv5-CSPNet | 96.38 | 94.90 | 85.88 | 76.674 | 45.279
Our YOLOv5x6 | YOLOv5-CSPNet | 96.67 | 95.08 | 86.55 | 141.158 | 88.665
SCRFD-2.5GF [34] | Basic ResNet | 93.78 | 92.16 | 77.87 | 0.67 | 2.53
SCRFD-0.5GF [34] | Depth-wise Conv | 90.57 | 88.12 | 68.51 | 0.57 | 0.508
RetinaFace [29] | MobileNet0.25 [43] | 87.78 | 81.16 | 47.32 | 0.44 | 0.802
FaceBoxes [26] | - | 76.17 | 57.17 | 24.18 | 1.01 | 0.275
Our YOLOv5n | ShuffleNetv2 [6] | 93.61 | 91.54 | 80.53 | 1.726 | 2.111
Our YOLOv5n0.5 | ShuffleNetv2-0.5 [6] | 90.76 | 88.12 | 73.82 | 0.447 | 0.571
E. SPP with Smaller Kernels

Before being forwarded to the feature aggregation block in the neck, the output feature maps of the YOLOv5 backbone are sent to an additional SPP block [35] to increase the receptive field and separate out the most important features. Unlike many CNN models that contain fully connected layers and therefore only accept input images of specific dimensions, SPP was proposed to generate a fixed-size output irrespective of the input size. In addition, SPP also helps to extract important features by pooling multi-scale versions of the feature map.

In YOLOv5, three kernel sizes 13x13, 9x9, and 5x5 are used [5]. We revise them to the smaller kernels 7x7, 5x5, and 3x3. These smaller kernels help to detect small faces more easily and increase the overall face detection performance.
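The sketch below shows this block in PyTorch with the smaller kernels. The 1x1 channel-halving convolutions follow the standard YOLOv5 SPP structure, which we assume here.

```python
import torch
import torch.nn as nn

def cbs(c1, c2):  # 1x1 Conv + BN + SiLU, the CBS block of Fig. 1 (b)
    return nn.Sequential(nn.Conv2d(c1, c2, 1, 1, bias=False),
                         nn.BatchNorm2d(c2), nn.SiLU())

class SPP(nn.Module):
    """SPP block of Fig. 1 (g) with kernels (3, 5, 7) instead of
    YOLOv5's (5, 9, 13). Stride-1 max pools keep the spatial size."""
    def __init__(self, c1, c2, kernels=(3, 5, 7)):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = cbs(c1, c_)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels])
        self.cv2 = cbs(c_ * (len(kernels) + 1), c2)
    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))

# Example: a P5-level feature map keeps its spatial size through SPP.
print(SPP(512, 512)(torch.randn(1, 512, 20, 20)).shape)  # -> (1, 512, 20, 20)
```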
F. P6 Output Block

The backbone of the YOLO object detector has many layers. As the features become more and more abstract in the deeper layers, the spatial resolution of the feature maps decreases due to downsampling, which leads to a loss of spatial information as well as fine-grained features. To preserve these fine-grained features, the FPN [19] was introduced in YOLOv3 [18].

In FPN [19], the fine-grained features take a long path traveling from low-level to high-level layers. To overcome this problem, the PAN was proposed to add a bottom-up augmentation path along the top-down path used in FPN. In addition, in the connection of the feature maps to the lateral architecture, the element-wise addition operation is replaced with concatenation. In FPN, object predictions are made independently on the different scale levels, which does not utilize information from other feature maps and may produce duplicated predictions. In PAN [36], the output feature maps of the bottom-up augmentation pyramid are fused by using ROI (Region of Interest) align and fully connected layers with an element-wise max operation.

In YOLOv5, there are three output blocks on the PAN output feature maps, called P3, P4, and P5, corresponding to 80x80x16, 40x40x16, and 20x20x16, with strides 8, 16, and 32, respectively. In our YOLO5Face, we add an extra P6 output block, whose feature map is 10x10x16 with stride 64. This modification particularly helps the detection of large faces. While almost all face detectors focus on improving the detection of small faces, the detection of large faces can be easily overlooked. We fill this hole by adding the P6 output block.
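The effect of the extra output block on the prediction grids can be seen with a quick computation (a square 640x640 input is assumed for simplicity):

```python
# Each P-level predicts on a (640 / stride) x (640 / stride) grid, so
# P6 adds a coarse 10x10 grid whose cells each cover 64x64 input
# pixels -- a better match for large faces than P3-P5.
for name, stride in [("P3", 8), ("P4", 16), ("P5", 32), ("P6", 64)]:
    g = 640 // stride
    print(f"{name}: stride {stride:2d} -> {g}x{g} grid")
```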
G. ShuffleNetV2 as Backbone

ShuffleNet [44] is an extremely efficient CNN for mobile devices. Its key block is called the ShuffleNet block. It utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce the computation cost while maintaining accuracy.

ShuffleNetV2 [6] is an improved version of ShuffleNet. It borrows a shortcut network architecture similar to that of DenseNet [39], with the element-wise addition changed to concatenation, similar to the change in PAN [36] in YOLOv5 [5]. But different from DenseNet, ShuffleNetV2 does not concatenate densely, and after the concatenation, channel shuffling is used to mix the features. This makes ShuffleNetV2 a super fast network.

We use ShuffleNetV2 as the backbone in YOLOv5 and implement the super small face detectors YOLOv5n-Face and YOLOv5n0.5-Face.
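The channel-shuffle operation itself is a few lines of tensor reshaping; a minimal sketch follows.

```python
import torch

def channel_shuffle(x, groups=2):
    """Channel shuffle of ShuffleNet [44]: reshape the channel axis into
    (groups, channels/groups), transpose, and flatten back, so channels
    from different branches are interleaved."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .contiguous()
             .view(b, c, h, w))

x = torch.arange(8.0).view(1, 8, 1, 1)        # channels 0..7
print(channel_shuffle(x).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```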
IV. EXPERIMENTS

A. Dataset

The WiderFace dataset [1] is the largest face detection dataset; it contains 32,203 images and 393,703 faces. For its large variety of scale, pose, occlusion, expression, illumination, and event, it is close to reality and is very challenging.

The whole dataset is divided into train/validation/test sets with a 50%/10%/40% ratio within each event class. Furthermore, each subset is divided into three levels of difficulty: Easy, Medium, and Hard. As the name indicates, the Hard subset is the most challenging, so the performance on the Hard subset best reflects the effectiveness of a face detector.

Unless specified otherwise, the WiderFace dataset [1] is used in this work. In the face recognition experiments with YOLO5Face landmarks and alignment, the WebFace dataset [45] is used. The FDDB dataset [46] is used in testing to demonstrate our model's performance on cross-domain datasets.
B. Implementation Details

We use the YOLOv5-4.0 codebase [5] as our starting point and implement all the modifications described earlier in PyTorch.

The SGD optimizer is used. The initial learning rate is 1E-2, the final learning rate is 1E-5, and the weight decay is 5E-3. A momentum of 0.8 is used in the first three warm-up epochs; after that, the momentum is changed to 0.937. The training runs 250 epochs with a batch size of 64. The weighting factor λ_L = 0.5 is optimized by exhaustive search.
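The training setup above maps to a few lines of PyTorch; the sketch below writes the warm-up and decay explicitly. The exact decay shape (linear vs. cosine) is not stated in this paper, so the linear interpolation here is an assumption, and `model` and `train_one_epoch` are hypothetical placeholders.

```python
import torch

opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                      momentum=0.8, weight_decay=5e-3)

EPOCHS, WARMUP = 250, 3
for epoch in range(EPOCHS):
    if epoch == WARMUP:                 # momentum 0.8 -> 0.937 after warm-up
        for g in opt.param_groups:
            g["momentum"] = 0.937
    frac = epoch / (EPOCHS - 1)
    for g in opt.param_groups:          # decay lr from 1e-2 to 1e-5
        g["lr"] = 1e-2 + frac * (1e-5 - 1e-2)
    train_one_epoch(model, opt)         # batch size 64
```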
                                                                    before, the Mosaic has to work with the ignoring small faces,
   Implemented Models. We implement a series of face                otherwise the performance degrades dramatically.
detector models, as listed in Table I. We implement eight              Please note that in these experiments the network configu-
relatively large models, including extra large-size mod-            rations are not incremental. However in each of set of experi-
els (YOLOv5x, YOLOv5x6), large-size models (YOLOv5l,                ment, the baselines for the two networks are the same to make
YOLOv5l6) medium-size models (YOLOv5m, YOLOv5m6),                   the comparison fair. For example in the SPP experiments,
and small-size models (YOLOv5s, YOLOv5s6). In the name              except for the kernel sizes are different, all other settting are
of the model, the last postfix 6 means it has the P6 output block   identical.
in the SPP. These models all use the YOLOv4 CSPNet as the
backbone with different depth and width multiples, denoted as       D. YOLO5Face for Face Recognition
D and W in Table I.                                                    Landmark is critical for face recognition accuracy. In Reti-
   Furthermore, we implement two super small-size models,           naFace [29], the accuracy of the landmark is evaluated with
YOLOv5n and YOLOv5n0.5, which use the ShuffleNetv2                  the MSE between estimated landmark coordinates and their
and ShuffleNetv2-0.5 [6] as the backbone. Except for the            ground truth and with the face recognition accuracy. The
backbone, all other main blocks, including the stem block,          results show that the RetinaFace has better landmarks than
SPP, PAN, are the same as in the larger models.                     the older MTCNN [25].
   The number of parameters and number of flops of all these           In this work, we also use face recognition to evaluate the ac-
models is listed in Table III for comparison with existing          curacy of landmarks of the YOLO5Face. We use the Webface
methods.                                                            test dataset, which is the largest face dataset with noisy 4M
Fig. 3. Some examples of detected faces and landmarks, where the first row is from RetinaFace [29] and the second row is from our YOLOv5m.
In this work, we also use face recognition to evaluate the accuracy of the landmarks of YOLO5Face. We use the WebFace test dataset, the largest face dataset, with noisy 4M identities/260M faces and cleaned 2M identities/42M faces [45]. This dataset is used in the ICCV 2021 Masked Face Recognition (MFR) challenge [48]. In this challenge, both masked face images and standard face images are included, and the metric is the False Non-Match Rate (FNMR) at False Match Rate (FMR) = 1E-5. The final metric combines the FNMR for MFR weighted by 0.25 and the FNMR for standard face recognition weighted by 0.75.

By default, RetinaFace [29] is used as the face detector on this dataset. We compare our YOLO5Face with RetinaFace on this dataset. We use the ArcFace [50] framework with ResNet124 [41] as the backbone. Extracted features of two models trained on the Glint360k dataset [51] are concatenated as the baseline model. We then replace RetinaFace with our YOLO5Face and test two models, a small model YOLOv5s and a medium model YOLOv5m. More details can be found in [52].

TABLE IV
Evaluation of YOLO5Face landmarks for face recognition on the WebFace test dataset [45].

FaceDetect | Training dataset | FNMR
RetinaFace [29] | WiderFace [1] | 0.1065
YOLOv5s | WiderFace | 0.1060
YOLOv5s | +Multi-task-facial [47] | 0.1058
YOLOv5m | WiderFace | 0.1056
YOLOv5m | +Multi-task-facial | 0.1051

The results are listed in Table IV. From the results, we see that both our small and medium models outperform RetinaFace [29]. In addition, we notice that there are very few large face images in the WiderFace dataset, so we add some large face images from the Multi-task-facial dataset [47] to the YOLO5Face training dataset. We find that this technique improves the face recognition performance.

Shown in Fig. 3 are some detected WebFace [45] faces and landmarks using RetinaFace [29] and our YOLOv5m. On faces with a large pose, we can visually observe that our landmarks are more accurate, which is confirmed by our face recognition results in Table IV.
E. YOLO5Face on the WiderFace Dataset

We compare our YOLO5Face with many existing face detectors on the WiderFace dataset. The results are listed in Table III, where the previous SOTA results and our best results are both highlighted.

We first look at the performance of relatively large models, whose number of parameters is larger than 3M and number of flops is larger than 5G. The existing methods achieve mAPs of 94.27-96.06% on the Easy subset, 91.9-94.92% on the Medium subset, and 71.39-85.29% on the Hard subset. The most recently released SCRFD [34] achieves the best performance on all subsets. Our YOLO5Face (YOLOv5x6) achieves 96.67%, 95.08%, and 86.55% on the three subsets, respectively. We achieve SOTA performance on all the Easy, Medium, and Hard subsets.

Next, we look at the performance of super small models, whose number of parameters is less than 2M and number of flops is less than 3G. The existing methods achieve mAPs of 76.17-93.78% on the Easy subset, 57.17-92.16% on the Medium subset, and 24.18-77.87% on the Hard subset. Again, SCRFD [34] achieves the best performance on all subsets. Our YOLO5Face (YOLOv5n) achieves 93.61%, 91.54%, and 80.53% on the three subsets, respectively. Our face detector performs a little worse than SCRFD [34] on the Easy and Medium subsets; however, on the Hard subset, our face detector leads by 2.66%. Furthermore, our smallest model, YOLOv5n0.5, has good performance even though its model size is much smaller.

The precision-recall (PR) curves of our YOLO5Face detector, along with its competitors, are shown in Fig. 2. The leading competitors include DFS [53], ISRN [54], VIM-FD [55], DSFD [28], PyramidBox++ [56], SRN [57], PyramidBox [58], and more. For a full list of the competitors and their results on the WiderFace [1] validation and test datasets, please refer to [2]. On the validation dataset, our YOLOv5x6-Face detector achieves 96.9%, 96.0%, and 91.6% mAP on the Easy, Medium, and Hard subsets, respectively, exceeding the previous SOTA by 0.0%, 0.1%, and 0.4%. On the test dataset, our YOLOv5x6-Face detector achieves 95.8%, 94.9%, and 90.5% mAP on the Easy, Medium, and Hard subsets, respectively, with gaps of 1.1%, 1.0%, and 0.7% to the previous SOTA. Please note that in these evaluations we only use multiple scales and left-right flipping, without other test-time augmentation (TTA) methods. Our focus is more on VGA input images, where we achieve SOTA in almost all conditions.

Fig. 2. The precision-recall (PR) curves of face detectors: (a) validation-Easy, (b) validation-Medium, (c) validation-Hard, (d) test-Easy, (e) test-Medium, (f) test-Hard.

F. YOLO5Face on the FDDB Dataset

The FDDB dataset [46] is a small dataset with 5,171 faces annotated in 2,845 images. To demonstrate the performance of YOLO5Face on a cross-domain dataset, we test it on the FDDB dataset without retraining on it. The true positive rate (TPR) when the number of false positives is 1000 is listed in Table V. Please note that, as pointed out in RefineFace [30], the FDDB annotation misses many faces; to achieve their performance of 0.9911, RefineFace modifies the FDDB annotation. In our evaluation, we use the original FDDB annotation without modifications. RetinaFace [29] is not evaluated on the FDDB dataset.

TABLE V
Evaluation of YOLO5Face on the FDDB dataset [46].

Method | mAP
ASFD [31] | 0.9911
RefineFace [30] | 0.9911
PyramidBox [58] | 0.9869
FaceBoxes [26] | 0.9598
Our YOLOv5s | 0.9843
Our YOLOv5m | 0.9849
Our YOLOv5l | 0.9867
Our YOLOv5l6 | 0.9880
V. CONCLUSION

In this paper, we present our YOLO5Face, based on the YOLOv5 object detector [5]. We implement eight models. Both the largest model, YOLOv5l6, and the super small model, YOLOv5n, achieve close to or exceeding SOTA performance on the WiderFace [1] validation Easy, Medium, and Hard subsets. This proves the effectiveness of our YOLO5Face in not only achieving the best performance but also running fast. Since we open-sourced the code, many applications and mobile apps have been developed based on our design and have achieved impressive performance.
REFERENCES

[1] S. Yang, P. Luo, C. C. Loy, and X. Tang, "Wider face: A face detection benchmark," CVPR, 2016.
[2] S. Yang, P. Luo, C. C. Loy, and X. Tang, "Wider face: A face detection benchmark," http://shuoyang1213.me/WIDERFACE/index.html.
[3] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: Optimal speed and accuracy of object detection," arXiv preprint 2004.10934, 2020.
[4] Y. Zhu, H. Cai, S. Zhang, C. Wang, and W. Xiong, "Tinaface: Strong but simple baseline for face detection," arXiv preprint 2011.13183, 2020.
[5] YOLOv5, "Yolov5," https://github.com/ultralytics/yolov5.
[6] N. Ma, X. Zhang, H. Zheng, and J. Sun, "Shufflenet v2: Practical guidelines for efficient cnn architecture design," arXiv preprint 1807.11164, 2018.
[7] P. Viola and M. J. Jones, "Robust real-time face detection," IJCV, 2004.
[8] M. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: a survey," TPAMI, 2002.
[9] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Microsoft Research Technical Report, 2010.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR, 2014.
[11] R. Girshick, "Fast R-CNN," ICCV, 2015.
[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," TPAMI, 2016.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," ICCV, 2017.
[14] Z. Cai and N. Vasconcelos, "Cascade r-cnn: Delving into high quality object detection," CVPR, 2018.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg, "SSD: Single shot multibox detector," ECCV, 2016.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CVPR, 2016.
[17] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," CVPR, 2017.
[18] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint 1804.02767, 2018.
[19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," CVPR, 2017.
[20] K. Chen et al., "Mmdetection: Open mmlab detection toolbox and benchmark," arXiv preprint 1906.07155, 2019.
[21] M. Tan, R. Pang, and Q. Le, "Efficientdet: Scalable and efficient object detection," CVPR, 2020.
[22] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," ECCV, 2020.
[23] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "Centernet: Keypoint triplets for object detection," ICCV, 2019.
[24] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint 1904.07850, 2019.
[25] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[26] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "Faceboxes: A cpu real-time face detector with high accuracy," IJCB, 2017.
[27] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "S3fd: Single shot scale-invariant face detector," ICCV, 2017.
[28] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang, "Dsfd: Dual shot face detector," arXiv preprint 1810.10220, 2018.
[29] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, "Retinaface: Single-stage dense face localisation in the wild," CVPR, 2020.
[30] S. Zhang, C. Chi, Z. Lei, and S. Z. Li, "Refineface: Refinement neural network for high performance face detection," arXiv preprint 1909.04376, 2019.
[31] B. Zhang, J. Li, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, Y. Xia, W. Pei, and R. Ji, "ASFD: Automatic and scalable face detector," arXiv preprint 2003.11228, 2020.
[32] D. Yashunin, T. Baydasov, and R. Vlasov, "Maskface: Multi-task face and landmark detector," arXiv preprint 2005.09412, 2020.
[33] Y. Liu, F. Wang, B. Sun, and H. Li, "Mogface: Rethinking scale augmentation on the face detector," arXiv preprint 2103.11139, 2021.
[34] J. Guo, J. Deng, A. Lattas, and S. Zafeiriou, "Sample and computation redistribution for efficient face detection," arXiv preprint 2105.04714, 2021.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," TPAMI, 2015.
[36] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," arXiv preprint 1803.01534, 2018.
[37] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," arXiv preprint 1702.03118, 2017.
[38] R. J. Wang, X. Li, and C. X. Ling, "Pelee: A real-time object detection system on mobile devices," NeurIPS, 2018.
[39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," CVPR, 2017.
[40] Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu, "Wing loss for robust facial landmark localisation with convolutional neural networks," CVPR, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CVPR, 2016.
[42] Y. Liu, X. Tang, X. Wu, J. Han, J. Liu, and E. Ding, "Hambox: Delving into online high-quality anchors mining for detecting outer faces," CVPR, 2020.
[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," CVPR, 2018.
[44] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," arXiv preprint 1707.01083, 2017.
[45] Z. Zhu, G. Huang, J. Deng, Y. Ye, J. Huang, X. Chen, J. Zhu, T. Yang, J. Lu, D. Du, and J. Zhou, "Webface260m: A benchmark unveiling the power of million-scale deep face recognition," CVPR, 2021.
[46] V. Jain and E. Learned-Miller, "Fddb: A benchmark for face detection in unconstrained settings," University of Massachusetts Technical Report UM-CS-2010-009, 2010.
[47] R. Zhao, T. Liu, J. Xiao, D. P. K. Lun, and K.-M. Lam, "Deep multi-task learning for facial expression recognition and synthesis based on selective feature sharing," ICPR, 2020.
[48] Z. Zhu, G. Huang, J. Deng, Y. Ye, J. Huang, X. Chen, J. Zhu, T. Yang, J. Guo, J. Lu, D. Du, and J. Zhou, "Masked face recognition challenge: The webface260m track report," ICCV Workshops, 2021.
[49] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of tricks for image classification with convolutional neural networks," CVPR, 2019.
[50] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," CVPR, 2019.
[51] X. An, X. Zhu, Y. Xiao, L. Wu, M. Zhang, Y. Gao, B. Qin, D. Zhang, and Y. Fu, "Partial fc: Training 10 million identities on a single machine," arXiv preprint 2010.05222, 2021.
[52] D. Qi, K. Hu, W. Tan, Q. Yao, and J. Liu, "Balanced masked and standard face recognition," ICCV Workshops, 2021.
[53] W. Tian, Z. Wang, H. Shen, W. Deng, B. Chen, and X. Zhang, "Learning better features for face detection with feature fusion and segmentation supervision," arXiv preprint 1811.08557, 2018.
[54] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, and S. Z. Li, "ISRN: Improved selective refinement network for face detection," arXiv preprint 1901.06651, 2019.
[55] Y. Zhang, X. Xu, and X. Liu, "Robust and high performance face detector," arXiv preprint 1901.02350, 2019.
[56] Z. Li, X. Tang, J. Han, J. Liu, and Z. He, "Pyramidbox++: High performance detector for finding tiny face," arXiv preprint 1904.00386, 2019.
[57] C. Chi, S. Zhang, J. Xing, Z. Lei, and S. Z. Li, "SRN: Selective refinement network for high performance face detection," arXiv preprint 1809.02693, 2018.
[58] X. Tang, D. K. Du, Z. He, and J. Liu, "Pyramidbox: A context-assisted single shot face detector," arXiv preprint 1803.07737, 2018.