Object Detection Survey
Abstract—Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development in the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold weapon era. This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century's time (from the 1990s to 2019). A number of topics have been covered in this paper, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed up techniques, and the recent state of the art detection methods. This paper also reviews some important detection applications, such as pedestrian detection, face detection, text detection, etc., and makes an in-depth analysis of their challenges as well as technical improvements in recent years.
Index Terms—Object detection, Computer vision, Deep learning, Convolutional neural networks, Technical evolution.
1 INTRODUCTION
Fig. 2. A road map of object detection. Milestone detectors in this figure: VJ Det. [10, 11], HOG Det. [12], DPM [13–15], RCNN [16], SPPNet [17],
Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], Pyramid Networks [22], Retina-Net [23].
regression", etc. However, previous reviews lack fundamental analysis to help readers understand the nature of these sophisticated techniques, e.g., "Where did they come from and how did they evolve?" "What are the pros and cons of each group of methods?" This paper makes an in-depth analysis for readers of the above concerns.

3. A comprehensive analysis of detection speed up techniques: The acceleration of object detection has long been a crucial but challenging task. This paper makes an extensive review of the speed up techniques in 20 years of object detection history at multiple levels, including "detection pipeline" (e.g., cascaded detection, feature map shared computation), "detection backbone" (e.g., network compression, lightweight network design), and "numerical computation" (e.g., integral image, vector quantization). This topic is rarely covered by previous reviews.

• Difficulties and Challenges in Object Detection

Although people always ask "what are the difficulties and challenges in object detection?", this question is not easy to answer and may even be over-generalized. As different detection tasks have totally different objectives and constraints, their difficulties may vary from each other. In addition to some common challenges in other computer vision tasks, such as objects under different viewpoints, illuminations, and intraclass variations, the challenges in object detection include but are not limited to the following aspects: object rotation and scale changes (e.g., small objects), accurate object localization, dense and occluded object detection, speed up of detection, etc. In Sections 4 and 5, we will give a more detailed analysis of these topics.

The rest of this paper is organized as follows. In Section 2, we review the 20 years' evolutionary history of object detection. Some speed up techniques in object detection will be introduced in Section 3. Some state of the art detection methods of the recent three years are summarized in Section 4. Some important detection applications will be reviewed in Section 5. In Section 6, we conclude this paper and make an analysis of further research directions.

2 OBJECT DETECTION IN 20 YEARS

In this section, we will review the history of object detection in multiple aspects, including milestone detectors, object detection datasets, metrics, and the evolution of key techniques.

2.1 A Road Map of Object Detection

In the past two decades, it is widely accepted that the progress of object detection has generally gone through two historical periods: the "traditional object detection period (before 2014)" and the "deep learning based detection period (after 2014)", as shown in Fig. 2.

2.1.1 Milestones: Traditional Detectors

If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness "the wisdom of the cold weapon era". Most of the early object detection algorithms were built based on handcrafted features. Due to the lack of effective image representations at that time, people had no choice but to design sophisticated feature representations, together with a variety of speed up skills to make the most of limited computing resources.

• Viola Jones Detectors

18 years ago, P. Viola and M. Jones achieved real-time detection of human faces for the first time without any constraints (e.g., skin color segmentation) [10, 11].
Running on a 700MHz Pentium III CPU, the detector was tens or even hundreds of times faster than any other algorithm of its time under comparable detection accuracy. The detection algorithm, which was later referred to as the "Viola-Jones (VJ) detector", was named after the authors in recognition of their significant contributions.

The VJ detector follows a most straightforward way of detection, i.e., sliding windows: it goes through all possible locations and scales in an image to see whether any window contains a human face. Although it seems to be a very simple process, the calculation behind it was far beyond the computing power of its time. The VJ detector dramatically improved its detection speed by incorporating three important techniques: "integral image", "feature selection", and "detection cascades".

1) Integral image: The integral image is a computational method to speed up the box filtering or convolution process. Like other object detection algorithms of its time [29–31], the Haar wavelet is used in the VJ detector as the feature representation of an image. The integral image makes the computational complexity of each window in the VJ detector independent of its window size.

2) Feature selection: Instead of using a set of manually selected Haar basis filters, the authors used the Adaboost algorithm [32] to select a small set of features that are most helpful for face detection from a huge pool of random features (about 180k-dimensional).

3) Detection cascades: A multi-stage detection paradigm (a.k.a. the "detection cascades") was introduced in the VJ detector to reduce its computational overhead by spending less computation on background windows and more on face targets.
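To make the paradigm concrete, a cascaded sliding-window detector can be sketched in a few lines of Python. This is an illustrative sketch only, not the original VJ implementation; the stage classifiers, thresholds, window size, and scales below are placeholder assumptions.

def cascade_detect(image, stages, window=24, stride=4, scales=(1.0, 1.5, 2.0)):
    # stages: list of (classifier, threshold) pairs ordered from cheap to expensive.
    # A window is rejected as soon as any stage scores it below its threshold,
    # so most background windows only cost the first (cheapest) stage.
    detections = []
    h, w = image.shape[:2]
    for s in scales:
        win = int(window * s)
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = image[y:y + win, x:x + win]
                if all(clf(patch) >= thr for clf, thr in stages):
                    detections.append((x, y, win, win))
    return detections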
The DPM follows the detection philosophy of "divide and conquer", where the training can be simply considered as the learning of a proper way of decomposing an object, and the inference can be considered as an ensemble of detections on different object parts. For example, the problem of detecting a "car" can be considered as the detection of its window, body, and wheels. This part of the work, a.k.a. the "star-model", was completed by P. Felzenszwalb et al. [13]. Later on, R. Girshick further extended the star-model to the "mixture models" [14, 15, 37, 38] to deal with real-world objects under more significant variations.

A typical DPM detector consists of a root-filter and a number of part-filters. Instead of manually specifying the configurations of the part filters (e.g., size and location), a weakly supervised learning method is developed in DPM where all configurations of the part filters can be learned automatically as latent variables. R. Girshick further formulated this process as a special case of Multi-Instance learning [39], and some other important techniques such as "hard negative mining", "bounding box regression", and "context priming" are also applied for improving detection accuracy (to be introduced in Section 2.3). To speed up the detection, Girshick developed a technique for "compiling" detection models into a much faster one that implements a cascade architecture, which achieved over 10 times acceleration without sacrificing any accuracy [14, 38].

Although today's object detectors have far surpassed DPM in terms of detection accuracy, many of them are still deeply influenced by its valuable insights, e.g., mixture models, hard negative mining, bounding box regression, etc. In 2010, P. Felzenszwalb and R. Girshick were awarded the "lifetime achievement" award by PASCAL VOC.
RCNN yields a significant performance boost on VOC07, with a large improvement of mean Average Precision (mAP) from 33.7% (DPM-v5 [43]) to 58.5%.

Although RCNN has made great progress, its drawbacks are obvious: the redundant feature computations on a large number of overlapped proposals (over 2000 boxes from one image) lead to an extremely slow detection speed (14s per image with GPU). Later in the same year, SPPNet [17] was proposed and overcame this problem.

• SPPNet

In 2014, K. He et al. proposed Spatial Pyramid Pooling Networks (SPPNet) [17]. Previous CNN models required a fixed-size input, e.g., a 224x224 image for AlexNet [40]. The main contribution of SPPNet is the introduction of a Spatial Pyramid Pooling (SPP) layer, which enables a CNN to generate a fixed-length representation regardless of the size of the image/region of interest without rescaling it. When using SPPNet for object detection, the feature maps can be computed from the entire image only once, and then fixed-length representations of arbitrary regions can be generated for training the detectors, which avoids repeatedly computing the convolutional features. SPPNet is more than 20 times faster than R-CNN without sacrificing any detection accuracy (VOC07 mAP=59.2%).

Although SPPNet has effectively improved the detection speed, there are still some drawbacks: first, the training is still multi-stage; second, SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers. In the following year, Fast RCNN [18] was proposed and solved these problems.

• Fast RCNN

In 2015, R. Girshick proposed the Fast RCNN detector [18], which is a further improvement of R-CNN and SPPNet [16, 17]. Fast RCNN enables us to simultaneously train a detector and a bounding box regressor under the same network configurations. On the VOC07 dataset, Fast RCNN increased the mAP from 58.5% (RCNN) to 70.0%, with a detection speed over 200 times faster than R-CNN.

Although Fast-RCNN successfully integrates the advantages of R-CNN and SPPNet, its detection speed is still limited by the proposal detection (see Section 2.3.2 for more details). Then, a question naturally arises: "can we generate object proposals with a CNN model?" Later, Faster R-CNN [19] answered this question.

• Faster RCNN

In 2015, S. Ren et al. proposed the Faster RCNN detector [19, 44] shortly after the Fast RCNN. Faster RCNN is the first end-to-end, and the first near-realtime, deep learning detector (COCO mAP@.5=42.7%, COCO mAP@[.5,.95]=21.9%, VOC07 mAP=73.2%, VOC12 mAP=70.4%, 17fps with ZF-Net [45]). The main contribution of Faster-RCNN is the introduction of the Region Proposal Network (RPN) that enables nearly cost-free region proposals. From R-CNN to Faster RCNN, most individual blocks of an object detection system, e.g., proposal detection, feature extraction, bounding box regression, etc., have been gradually integrated into a unified, end-to-end learning framework.

Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computation redundancy at the subsequent detection stage. Later, a variety of improvements have been proposed, including RFCN [46] and Light head RCNN [47]. (See more details in Section 3.)

• Feature Pyramid Networks

In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN) [22] on the basis of Faster RCNN. Before FPN, most of the deep learning based detectors ran detection only on a network's top layer. Although the features in deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. To this end, a top-down architecture with lateral connections is developed in FPN for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, the FPN shows great advances for detecting objects with a wide variety of scales. Using FPN in a basic Faster R-CNN system, it achieves state-of-the-art single-model detection results on the MSCOCO dataset without bells and whistles (COCO mAP@.5=59.1%, COCO mAP@[.5,.95]=36.2%). FPN has now become a basic building block of many of the latest detectors.
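As an illustrative sketch of such a top-down architecture with lateral connections (the channel sizes and module names are assumptions for illustration, not the exact FPN configuration), the fusion can be written in PyTorch as:

import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    # 1x1 lateral convolutions project each backbone stage to a common width,
    # coarser maps are upsampled and added along the top-down pathway, and a
    # 3x3 convolution smooths each merged map.
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                 # feats: backbone maps, finest to coarsest
        p = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(p) - 2, -1, -1):   # top-down pathway
            p[i] = p[i] + F.interpolate(p[i + 1], size=p[i].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, p)]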
2.1.3 Milestones: CNN based One-stage Detectors

• You Only Look Once (YOLO)

YOLO was proposed by R. Joseph et al. in 2015. It was the first one-stage detector in the deep learning era [20]. YOLO is extremely fast: a fast version of YOLO runs at 155fps with VOC07 mAP=52.7%, while its enhanced version runs at 45fps with VOC07 mAP=63.4% and VOC12 mAP=57.9%. YOLO is the abbreviation of "You Only Look Once". It can be seen from its name that the authors completely abandoned the previous detection paradigm of "proposal detection + verification". Instead, it follows a totally different philosophy: to apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. Later, R. Joseph made a series of improvements on the basis of YOLO and proposed its v2 and v3 editions [48, 49], which further improve the detection accuracy while keeping a very high detection speed.

In spite of its great improvement of detection speed, YOLO suffers from a drop of localization accuracy compared with two-stage detectors, especially for some small objects. YOLO's subsequent versions [48, 49] and the later proposed SSD [21] have paid more attention to this problem.

• Single Shot MultiBox Detector (SSD)

SSD [21] was proposed by W. Liu et al. in 2015. It was the second one-stage detector in the deep learning era. The main contribution of SSD is the introduction of the multi-reference and multi-resolution detection techniques (to be introduced in Section 2.3.2), which significantly improves the detection accuracy of a one-stage detector, especially for some small objects. SSD has advantages in terms of both detection speed and accuracy (VOC07 mAP=76.8%, VOC12 mAP=74.9%, COCO mAP@.5=46.5%, mAP@[.5,.95]=26.8%, a fast version runs at 59fps). The main difference between SSD and any previous detector is that the former detects objects of different scales on different layers of the network, while the latter run detection only on their top layers.
1. http://host.robots.ox.ac.uk/pascal/VOC/
2. http://image-net.org/challenges/LSVRC/
3. http://cocodataset.org/
4. https://storage.googleapis.com/openimages/web/index.html
Fig. 4. Some example images and annotations in (a) PASCAL-VOC07, (b) ILSVRC, (c) MS-COCO, and (d) Open Images.
Open Images: 1) the standard object detection, and 2) the visual relationship detection which detects paired objects in particular relations. For the object detection task, the dataset consists of 1,910k images with 15,440k annotated bounding boxes on 600 object categories.

• Datasets of Other Detection Tasks

In addition to general object detection, the past 20 years have also witnessed the prosperity of detection applications in specific areas, such as pedestrian detection, face detection, text detection, traffic sign/light detection, and remote sensing target detection. Tables 2-6 list some of the popular datasets of these detection tasks5. A detailed introduction of the detection methods for these tasks can be found in Section 5.

5. The #Cites shows statistics as of Feb. 2019.

2.2.1 Metrics

How can we evaluate the effectiveness of an object detector? This question may even have different answers at different times.

In the early detection community, there were no widely accepted evaluation criteria for detection performance. For example, in the early research of pedestrian detection [12], the "miss rate vs. false positives per-window (FPPW)" was usually used as a metric. However, the per-window measurement (FPPW) can be flawed and fails to predict full image performance in certain cases [59]. In 2009, the Caltech pedestrian detection benchmark was created [59, 60] and since then, the evaluation metric has changed from per-window (FPPW) to false positives per-image (FPPI).

In recent years, the most frequently used evaluation for object detection is "Average Precision (AP)", which was originally introduced in VOC2007. AP is defined as the average detection precision under different recalls, and is usually evaluated in a category specific manner. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is usually used as the final metric of performance. To measure the object localization accuracy, the Intersection over Union (IoU) is used to check whether the IoU between the predicted box and the ground truth box is greater than a predefined threshold, say, 0.5. If yes, the object will be identified as "successfully detected", otherwise it will be identified as "missed". The 0.5-IoU based mAP has then become the de facto metric for object detection problems for years.

After 2014, due to the popularity of the MS-COCO dataset, researchers started to pay more attention to the accuracy of the bounding box location. Instead of using a fixed IoU threshold, MS-COCO AP is averaged over multiple IoU thresholds between 0.5 (coarse localization) and 0.95 (perfect localization). This change of the metric has encouraged more accurate object localization and may be of great importance for some real-world applications (e.g., imagine there is a robot arm trying to grasp a spanner).
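To make the matching rule concrete, an illustrative NumPy sketch of the IoU test (boxes assumed to be [x1, y1, x2, y2]) and of the threshold sets used by the two protocols is given below; the official VOC/COCO toolkits add precision-recall interpolation on top of this.

import numpy as np

def iou(a, b):
    # a, b: [x1, y1, x2, y2]; IoU = intersection area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-12)

# VOC-style: a detection counts as correct if IoU >= 0.5 with an unmatched
# ground truth box. COCO-style AP repeats the evaluation at ten IoU
# thresholds and averages the results.
voc_correct = lambda pred, gt: iou(pred, gt) >= 0.5
coco_iou_thresholds = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95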
Recently, there have been some further developments of the evaluation in the Open Images dataset, e.g., by considering the group-of boxes and the non-exhaustive image-level category hierarchies. Some researchers have also proposed alternative metrics, e.g., "localization recall precision" [94]. Despite the recent changes, the VOC/COCO-based mAP is still the most frequently used evaluation metric for object detection.

2.3 Technical Evolution in Object Detection

In this section, we will introduce some important building blocks of a detection system and their technical evolution in the past 20 years.

2.3.1 Early Time's Dark Knowledge

The early time's object detection (before 2000) did not follow a unified detection philosophy like sliding window detection. Detectors at that time were usually designed based on low-level and mid-level vision as follows.

• Components, shapes and edges

"Recognition-by-components", as an important cognitive theory [98], has long been the core idea of image recognition and object detection [13, 99, 100]. Some early researchers framed object detection as a measurement of similarity between object components, shapes, and contours, including Distance Transforms [101], Shape Contexts [35], and Edgelet [102], etc. Despite promising initial results, things did not work out well on more complicated detection problems. Therefore, machine learning based detection methods began to prosper.
Machine learning based detection has gone through multiple periods, including the statistical models of appearance (before 1998), wavelet feature representations (1998-2005), and gradient-based representations (2005-2012).

Building statistical models of an object, like Eigenfaces [95, 106] as shown in Fig. 5 (a), was the first wave of learning based approaches in object detection history. In 1991, M. Turk et al. achieved real-time face detection in a lab environment by using Eigenface decomposition [95]. Compared with the rule-based or template based approaches of its time [107, 108], a statistical model better provides holistic descriptions of an object's appearance by learning task-specific knowledge from data.

Wavelet feature transforms started to dominate visual recognition and object detection around 2000. The essence of this group of methods is learning by transforming an image from pixels to a set of wavelet coefficients. Among these methods, the Haar wavelet, owing to its high computational efficiency, has been mostly used in many object detection tasks, such as general object detection [29], face detection [10, 11, 109], pedestrian detection [30, 31], etc. Fig. 5 (d) shows a set of Haar wavelet basis functions learned by a VJ detector [10, 11] for human faces.

• Early time's CNN for object detection
The history of using CNNs to detect objects can be traced back to the 1990s [96], where Y. LeCun et al. made great contributions at that time. Due to limitations in computing resources, CNN models at the time were much smaller and shallower than those of today. Despite this, computational efficiency was still considered one of the tough nuts to crack in early CNN based detection models. Y. LeCun et al. made a series of improvements like the "shared-weight replicated neural network" [96] and the "space displacement network" [97] to reduce the computations by extending each layer of the convolutional network so as to cover the entire input image, as shown in Fig. 5 (b)-(c). In this way, the feature of any location of the entire image can be extracted with only one forward propagation of the network. This can be considered as the prototype of today's fully convolutional networks (FCN) [110, 111], which were proposed almost 20 years later. CNNs were also applied to other tasks of that time such as face detection [112, 113] and hand tracking [114].

2.3.2 Technical Evolution of Multi-Scale Detection

Multi-scale detection of objects with "different sizes" and "different aspect ratios" is one of the main technical challenges in object detection. In the past 20 years, multi-scale detection has gone through multiple historical periods: "feature pyramids and sliding windows (before 2014)", "detection with object proposals (2010-2015)", "deep regression (2013-2016)", "multi-reference detection (after 2015)", and "multi-resolution detection (after 2016)", as shown in Fig. 6.

• Feature pyramids + sliding windows (before 2014)

With the increase of computing power after the VJ detector, researchers started to pay more attention to an intuitive way of detection by building "feature pyramid + sliding windows". From 2004 to 2014, a number of milestone detectors were built based on this detection paradigm, including the HOG detector, DPM, and even the Overfeat detector [103] of the deep learning era (winner of the ILSVRC-13 localization task).

Early detection models like the VJ detector and HOG detector were specifically designed to detect objects with a "fixed aspect ratio" (e.g., faces and upright pedestrians) by simply building the feature pyramid and sliding a fixed size detection window on it. The detection of "various aspect ratios" was not considered at that time. To detect objects with a more complex appearance like those in PASCAL VOC, R. Girshick et al. began to seek better solutions outside the feature pyramid. The "mixture model" [15] was one of the best solutions at that time, which trains multiple models to detect objects with different aspect ratios. Apart from this, exemplar-based detection [36, 115] provided another solution by training individual models for every object instance (exemplar) of the training set.

As objects in modern datasets (e.g., MS-COCO) become more diversified, the mixture model or exemplar-based methods inevitably lead to more miscellaneous detection models. A question then naturally arises: is there a unified multi-scale approach to detect objects of different aspect ratios? The introduction of "object proposals" (to be introduced) has answered this question.
Fig. 6. Evolution of multi-scale detection techniques in object detection from 2001 to 2019: 1) feature pyramids and sliding windows, 2) detection
with object proposals, 3) deep regression, 4) multi-reference detection, and 5) multi-resolution detection. Detectors in this figure: VJ Det. [10], HOG
Det. [12], DPM [13, 15], Exemplar SVM [36], Overfeat [103], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], DNN Det. [104], YOLO
[20], YOLO-v2 [48], SSD [21], Unified Det. [105], FPN [22], RetinaNet [23], RefineDet [55], TridentNet [56].
2.3.3 Technical Evolution of Bounding Box Regression

Bounding Box (BB) regression is an important technique in object detection. It aims to refine the location of a predicted bounding box based on the initial proposal or the anchor box. In the past 20 years, the evolution of BB regression has gone through three historical periods: "without BB regression (before 2008)", "from BB to BB (2008-2013)", and "from feature to BB (after 2013)". Fig. 7 shows the evolution of bounding box regression.

• Without BB regression (before 2008)

Most of the early detection methods such as the VJ detector and HOG detector do not use BB regression, and usually directly take the sliding window as the detection result. To obtain accurate locations of an object, researchers had no choice but to build a very dense pyramid and slide the detector densely on each location.

• From BB to BB (2008-2013)

The first time that BB regression was introduced to an object detection system was in DPM [15]. The BB regression at that time usually acted as a post-processing block, thus it was optional. As the goal in PASCAL VOC is to predict a single bounding box for each object, the simplest way for a DPM to generate the final detection would be to directly use its root filter locations. Later, R. Girshick et al. introduced a more complex way to predict a bounding box based on the complete configuration of an object hypothesis and formulated this process as a linear least-squares regression problem [15]. This method yields noticeable improvements of the detection under the PASCAL criteria.

• From features to BB (after 2013)

After the introduction of Faster RCNN in 2015, BB regression no longer serves as an individual post-processing block but has been integrated with the detector and trained in an end-to-end fashion. At the same time, BB regression has evolved to predicting the BB directly based on CNN features. In order to get more robust predictions, the smooth-L1 function [19] is commonly used,

    L(t) = 5t²            if |t| ≤ 0.1
    L(t) = |t| − 0.05     otherwise,                                  (2)

or the root-square function [20],

    L(x, x*) = (√x − √x*)²,                                           (3)

as the regression loss, which are more robust to outliers than the least square loss used in DPM. Some researchers also choose to normalize the coordinates to get more robust results [18, 19, 21, 23].
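Written out, Eq. (2) with its 0.1 transition point and Eq. (3) correspond to the following illustrative Python functions (a sketch of the loss terms only, not of a full regression branch):

def smooth_l1(t, beta=0.1):
    # Quadratic near zero and linear for large errors (Eq. 2 with beta = 0.1):
    # 0.5 * t^2 / beta = 5 * t^2 and |t| - 0.5 * beta = |t| - 0.05.
    t = abs(t)
    return 0.5 * t * t / beta if t <= beta else t - 0.5 * beta

def root_square(x, x_star):
    # Eq. (3): penalizes the difference of square roots, which damps the
    # influence of large coordinate values compared with a plain squared error.
    return (x ** 0.5 - x_star ** 0.5) ** 2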
2.3.4 Technical Evolution of Context Priming

Visual objects are usually embedded in a typical context with their surrounding environments. Our brain takes advantage of the associations among objects and environments to facilitate visual perception and cognition [160]. Context priming has long been used to improve detection. There are three common approaches in its evolutionary history: 1) detection with local context, 2) detection with global context, and 3) context interactives, as shown in Fig. 8.

• Detection with local context
Fig. 7. Evolution of bounding box regression techniques in object detection from 2001 to 2019. Detectors in this figure: VJ Det. [10], HOG Det. [12],
Exemplar SVM [36], DPM [13, 15], Overfeat [103], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], YOLO-v2
[48], Unified Det. [105], FPN [22], RetinaNet [23], RefineDet [55], TridentNet [56].
Fig. 8. Evolution of context priming in object detection from 2001 to 2019: 1) detection with local context, 2) detection with global context, 3)
detection with context interactives. Detectors in this figure: Face Det. [139], MultiPath [140], GBDNet [141, 142], CC-Net [143], MultiRegion-CNN
[144], CoupleNet [145], DPM [14, 15], StructDet [146], YOLO [20], RFCN++ [147], ION [148], AttenContext [149], CtxSVM [150], PersonContext
[151], SMN [152], RetinaNet [23], SIN [153].
Local context refers to the visual information in the area that surrounds the object to detect. It has long been acknowledged that local context helps improve object detection. In the early 2000s, Sinha and Torralba [139] found that the inclusion of local contextual regions such as the facial bounding contour substantially improves face detection performance. Dalal and Triggs also found that incorporating a small amount of background information improves the accuracy of pedestrian detection [12]. Recent deep learning based detectors can also be improved with local context by simply enlarging the networks' receptive field or the size of object proposals [140–145, 161].

• Detection with global context

Global context exploits scene configuration as an additional source of information for object detection. For early time's object detectors, a common way of integrating global context is to incorporate a statistical summary of the elements that comprise the scene, like Gist [160]. For modern deep learning based detectors, there are two methods to integrate global context. The first way is to take advantage of a large receptive field (even larger than the input image) [20] or a global pooling operation on a CNN feature map [147]. The second way is to think of the global context as a kind of sequential information and to learn it with recurrent neural networks [148, 149].

• Context interactive
Fig. 9. Evolution of non-max suppression (NMS) techniques in object detection from 1994 to 2019: 1) Greedy selection, 2) Bounding box
aggregation, and 3) Learn to NMS. Detectors in this figure: VJ Det. [10], Face Det. [96], HOG Det. [12], DPM [13, 15], RCNN [16], SPPNet
[17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], FPN [22], RetinaNet [23], LearnNMS [154], MAP-Det [155], End2End-DPM [136],
StrucDet [146], Overfeat [103], APC-NMS [156], MAPC [157], SoftNMS [158], FitnessNMS [159].
Context interactive refers to the information conveyed by the interactions of visual elements, such as constraints and dependencies. For most object detectors, object instances are detected and recognized individually without exploiting their relations. Some recent research has suggested that modern object detectors can be improved by considering context interactives. The recent improvements can be grouped into two categories, where the first one is to explore the relationship between individual objects [15, 146, 150, 152, 162], and the second one is to explore modeling the dependencies between objects and scenes [151, 153].

As shown in Fig. 11, greedy selection still leaves room for improvement. First of all, the top-scoring box may not be the best fit. Second, it may suppress nearby objects. Finally, it does not suppress false positives. In recent years, despite the fact that some manual modifications have been made to improve its performance [158, 159, 163] (see Section 4.4 for more details), to the best of our knowledge, greedy selection still performs as the strongest baseline for today's object detection.
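For reference, the greedy selection procedure itself is only a few lines; the NumPy sketch below (boxes assumed to be [x1, y1, x2, y2] with one score per box) is illustrative rather than taken from any specific detector.

import numpy as np

def greedy_nms(boxes, scores, iou_thr=0.5):
    # Repeatedly keep the top-scoring box and discard all remaining boxes
    # whose IoU with it exceeds the threshold.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        overlap = inter / (area_i + areas - inter + 1e-12)
        order = rest[overlap <= iou_thr]
    return keep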
Fig. 10. Evolution of hard negative mining techniques in object detection from 1994 to 2019. Detectors in this figure: Face Det. [164], Haar Det. [29],
VJ Det. [10], HOG Det. [12], DPM [13, 15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], FasterPed [165],
OHEM [166], RetinaNet [23], RefineDet [55].
windows to every object. Modern detection datasets require the prediction of object aspect ratio, further increasing the imbalance ratio to 10⁶∼10⁷ [129]. In this case, using all background data will be harmful to training as the vast number of easy negatives will overwhelm the learning process. Hard negative mining (HNM) aims to deal with the problem of imbalanced data during training. The technical evolution of HNM in object detection is shown in Fig. 10.

• Bootstrap

Bootstrap in object detection refers to a group of training techniques in which the training starts with a small part of the background samples and then iteratively adds new misclassified backgrounds during the training process. In early object detectors, bootstrap was initially introduced with the purpose of reducing the training computations over millions of background samples [10, 29, 164]. Later it became a standard training technique in DPM and HOG detectors [12, 13] for solving the data imbalance problem.

• HNM in deep learning based detectors

Later in the deep learning era, due to the improvement of computing power, bootstrap was shortly discarded in object detection during 2014-2016 [16–20]. To ease the data imbalance problem during training, detectors like Faster RCNN and YOLO simply balance the weights between the positive and negative windows. However, researchers later noticed that weight-balancing cannot completely solve the imbalanced data problem [23]. To this end, after 2016, bootstrap was re-introduced to deep learning based detectors [21, 165–168]. For example, in SSD [21] and OHEM [166], only the gradients of a very small part of the samples (those with the largest loss values) will be back-propagated. In RefineDet [55], an "anchor refinement module" is designed to filter easy negatives. An alternative improvement is to design new loss functions [23, 169, 170], by reshaping the standard cross entropy loss so that it puts more focus on hard, misclassified examples [23].
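The two strategies can be illustrated with a short PyTorch sketch: keeping only the highest-loss negatives for back-propagation (OHEM/SSD-style hard negative mining) and re-weighting the cross entropy so that hard, misclassified examples dominate (focal-loss-style reshaping). The 3:1 negative-to-positive ratio and the label convention (0 = background) are assumptions for illustration.

import torch
import torch.nn.functional as F

def hard_mined_loss(logits, labels, neg_pos_ratio=3):
    # Per-sample cross entropy; keep all positives but only the hardest negatives.
    loss = F.cross_entropy(logits, labels, reduction="none")
    pos = labels > 0
    num_pos = max(int(pos.sum()), 1)
    neg_loss = loss[~pos]
    k = min(neg_pos_ratio * num_pos, neg_loss.numel())
    hardest_neg, _ = neg_loss.topk(k)
    return (loss[pos].sum() + hardest_neg.sum()) / num_pos

def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    # Reshaped cross entropy: the (1 - p_t)^gamma factor down-weights
    # well-classified examples so the hard ones drive the gradient.
    ce = F.cross_entropy(logits, labels, reduction="none")
    p_t = torch.exp(-ce)
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()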
3 SPEED-UP OF DETECTION

The acceleration of object detection has long been an important but challenging problem. In the past 20 years, the object detection community has developed sophisticated acceleration techniques. These techniques can be roughly divided into three levels: "speed up of detection pipeline", "speed up of detection engine", and "speed up of numerical computation", as shown in Fig. 12.
Fig. 14. An overview of speed up methods for a CNN's convolutional layer and a comparison of their computational complexity: (a) Standard convolution: O(dk²c). (b) Factoring convolutional filters (k×k → (k′×k′)² or 1×k, k×1): O(dk′²c) or O(dkc). (c) Factoring convolutional channels: O(d′k²c) + O(dk²d′). (d) Group convolution (#groups=m): O(dk²c/m). (e) Depth-wise separable convolution: O(ck²) + O(dc).
3.4.1 Network Pruning

The research of "network pruning" can be traced back to as early as the 1980s. At that time, Y. LeCun et al. proposed a method called "optimal brain damage" to compress the parameters of a multi-layer perceptron network [186]. In this method, the loss function of a network is approximated with second-order derivatives so as to remove some unimportant weights. Following this idea, the network pruning methods of recent years usually take an iterative training and pruning process, i.e., removing only a small group of unimportant weights after each stage of training, and repeating those operations [187]. As traditional network pruning simply removes unimportant weights, which may result in sparse connectivity patterns in a convolutional filter, it cannot be directly applied to compress a CNN model. A simple solution to this problem is to remove whole filters instead of independent weights [188, 189].

3.4.2 Network Quantification

The recent works on network quantification mainly focus on network binarization, which aims to accelerate a network by quantifying its activations or weights to binary variables (say, 0/1) so that the floating-point operations are converted to AND, OR, NOT logical operations. Network binarization can significantly speed up computations and reduce the network's storage so that it can be much more easily deployed on mobile devices. One possible implementation of the above ideas is to approximate the convolution by binary variables with the least squares method [190]. A more accurate approximation can be obtained by using linear combinations of multiple binary convolutions [191]. In addition, some researchers have further developed GPU acceleration libraries for binarized computation, which obtained more significant acceleration results [192].

3.4.3 Network Distillation

Network distillation is a general framework to compress the knowledge of a large network ("teacher net") into a small one ("student net") [193, 194]. Recently, this idea has been used in the acceleration of object detection [195, 196]. One straightforward approach is to use a teacher net to instruct the training of a (light-weight) student net so that the latter can be used to speed up detection [195]. Another approach is to transform the candidate regions so as to minimize the feature distance between the student net and the teacher net. This method makes the detection model 2 times faster while achieving a comparable accuracy [196].

3.5 Lightweight Network Design

The last group of methods to speed up a CNN based detector is to directly design a lightweight network instead of using off-the-shelf detection engines. Researchers have long been exploring the right configurations of a network so as to gain accuracy under a constrained time cost. In addition to some general design principles like "fewer channels and more layers" [197], some other approaches have been proposed in recent years: 1) factorizing convolutions, 2) group convolution, 3) depth-wise separable convolution, 4) bottle-neck design, and 5) neural architecture search.

3.5.1 Factorizing Convolutions

Factorizing convolutions is the simplest and most straightforward way to build a lightweight CNN model. There are two groups of factorizing methods.

The first group of methods is to factorize a large convolution filter into a set of small ones in their spatial dimension [47, 147, 198], as shown in Fig. 14 (b). For example, one can factorize a 7x7 filter into three 3x3 filters, where they share the same receptive field but the latter is more efficient. Another example is to factorize a k×k filter into a k×1 filter and a 1×k filter [198, 199], which could be more efficient for very large filters, say 15x15 [199]. This idea has recently been used in object detection [200].

The second group of methods is to factorize a large group of convolutions into two small groups in their channel dimension [201, 202], as shown in Fig. 14 (c). For example, one can approximate a convolution layer with d filters and a feature map of c channels by d′ filters + a nonlinear activation + another d filters (d′ < d). In this case, the complexity O(dk²c) of the original layer can be reduced to O(d′k²c) + O(dd′).
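Both factorizations have direct counterparts as stacked standard layers; the PyTorch sketch below uses illustrative channel counts (256 and 64 standing in for d and d′) and is not tied to any particular detector.

import torch.nn as nn

# Spatial factorization (Fig. 14 (b)): a 7x7 filter replaced by three stacked
# 3x3 filters with the same receptive field, or a 15x15 filter replaced by
# a 15x1 filter followed by a 1x15 filter.
seven_by_seven = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)
asymmetric = nn.Sequential(
    nn.Conv2d(256, 256, (15, 1), padding=(7, 0)),
    nn.Conv2d(256, 256, (1, 15), padding=(0, 7)),
)

# Channel factorization (Fig. 14 (c)): d' filters + nonlinearity + d filters,
# reducing O(d k^2 c) to O(d' k^2 c) + O(d d') when d' < d.
channel_factorized = nn.Sequential(
    nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, 1),
)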
3.5.2 Group Convolution

Group convolution aims to reduce the number of parameters in a convolution layer by dividing the feature channels into many different groups, and then convolving on each group independently [189, 203], as shown in Fig. 14 (d). If we evenly divide the feature channels into m groups, without changing other configurations, the computational complexity of the convolution will theoretically be reduced to 1/m of that before.

3.5.3 Depth-wise Separable Convolution

Depth-wise separable convolution, as shown in Fig. 14 (e), is a recently popular way of building lightweight convolution networks [204]. It can be viewed as a special case of group convolution where the number of groups is set equal to the number of channels.

Suppose we have a convolutional layer with d filters and a feature map of c channels, and the size of each filter is k×k. For a depth-wise separable convolution, every k×k×c filter is first split into c slices each with a size of k×k×1, and then the convolutions are performed individually in each channel with each slice of the filter. Finally, a number of 1x1 filters are used to make a dimension transform so that the final output has d channels. By using depth-wise separable convolution, the computational complexity can be reduced from O(dk²c) to O(ck²) + O(dc). This idea has recently been applied to object detection and fine-grained classification [205–207].
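Both designs map onto the groups argument of a standard convolution layer; an illustrative PyTorch sketch with assumed sizes (c = d = 256, k = 3, m = 8) is given below.

import torch.nn as nn

c, d, k, m = 256, 256, 3, 8

# Group convolution (Fig. 14 (d)): the channels are split into m groups that are
# convolved independently, roughly a 1/m reduction of the standard cost O(d k^2 c).
group_conv = nn.Conv2d(c, d, k, padding=k // 2, groups=m)

# Depth-wise separable convolution (Fig. 14 (e)): one k x k filter per channel
# (groups = c) followed by 1x1 filters for the channel mixing, which reduces
# O(d k^2 c) to O(c k^2) + O(d c).
depthwise_separable = nn.Sequential(
    nn.Conv2d(c, c, k, padding=k // 2, groups=c),   # per-channel spatial filtering
    nn.Conv2d(c, d, 1),                             # point-wise projection to d channels
)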
3.5.4 Bottle-neck Design

A bottleneck layer in a neural network contains few nodes compared to the previous layers. It can be used to learn efficient data encodings of the input with reduced dimensionality, which has been commonly used in deep autoencoders [208]. In recent years, the bottle-neck design has been widely used for designing lightweight networks [47, 209–212]. Among these methods, one common approach is to compress the input layer of a detector to reduce the amount of computation from the very beginning of the detection pipeline [209–211]. Another approach is to compress the output of the detection engine to make the feature map thinner, so as to make it more efficient for subsequent detection stages [47, 212].

3.5.5 Neural Architecture Search

More recently, there has been significant interest in designing network architectures automatically by neural architecture search (NAS) instead of relying heavily on expert experience and knowledge. NAS has been applied to large-scale image classification [213, 214], object detection [215], and image segmentation [216] tasks. NAS has also shown promising results in designing lightweight networks very recently, where constraints on both the prediction accuracy and the computational complexity are considered during the searching process [217, 218].

3.6 Numerical Acceleration

In this section, we mainly introduce four important numerical acceleration methods that are frequently used in object detection: 1) speed up with the integral image, 2) speed up in the frequency domain, 3) vector quantization, and 4) reduced rank approximation.

3.6.1 Speed Up with Integral Image

The integral image is an important method in image processing. It helps to rapidly calculate summations over image sub-regions. The essence of the integral image is the integral-differential separability of convolution in signal processing:

    f(x) ∗ g(x) = (∫ f(x)dx) ∗ (dg(x)/dx),                           (4)

where if dg(x)/dx is a sparse signal, then the convolution can be accelerated by the right part of this equation. Although the VJ detector [10] is well known for the integral image acceleration, before it was born, the integral image had already been used to speed up a CNN model [219] and achieved more than 10 times acceleration.

In addition to the above examples, the integral image can also be used to speed up more general features in object detection, e.g., color histograms and gradient histograms [171, 177, 220, 221], etc. A typical example is to speed up HOG by computing integral HOG maps [177, 220]. Instead of accumulating pixel values as in a traditional integral image, the integral HOG map accumulates gradient orientations in an image, as shown in Fig. 15. As the histogram of a cell can be viewed as the summation of the gradient vectors in a certain region, by using the integral image it is possible to compute a histogram in a rectangular region of arbitrary position and size with a constant computational overhead. The integral HOG map has been used in pedestrian detection and has achieved dozens of times' acceleration without losing any accuracy [177].

Later, in 2009, P. Dollár et al. proposed a new type of image feature called Integral Channel Features (ICF), which can be considered a more general case of the integral image features, and it has been successfully used in pedestrian detection [171]. ICF achieved state-of-the-art detection accuracy under near realtime detection speed in its time.

3.6.2 Speed Up in Frequency Domain

Convolution is an important type of numerical operation in object detection. As the detection with a linear detector can be viewed as the window-wise inner product between the feature map and the detector's weights, this process can be implemented by convolutions.

Convolution can be accelerated in many ways, where the Fourier transform is a very practical choice, especially for speeding up large filters. The theoretical basis for accelerating convolution in the frequency domain is the convolution theorem in signal processing, that is, under suitable conditions, the Fourier transform of a convolution of two signals is the point-wise product in their Fourier space:

    I ∗ W = F⁻¹(F(I) ⊙ F(W)),                                        (5)

where F is the Fourier transform, F⁻¹ is the inverse Fourier transform, I and W are the input image and the filter, ∗ is the convolution operation, and ⊙ denotes the element-wise product.
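Eq. (5) translates directly into a frequency-domain filtering routine; the NumPy sketch below zero-pads to the full linear-convolution size and crops a "same"-sized result, which is an implementation choice for illustration rather than part of the methods cited above.

import numpy as np

def fft_convolve2d(image, kernel):
    # Pad both signals, multiply point-wise in the Fourier domain, invert (Eq. 5).
    H = image.shape[0] + kernel.shape[0] - 1
    W = image.shape[1] + kernel.shape[1] - 1
    spectrum = np.fft.rfft2(image, s=(H, W)) * np.fft.rfft2(kernel, s=(H, W))
    full = np.fft.irfft2(spectrum, s=(H, W))
    y0, x0 = (kernel.shape[0] - 1) // 2, (kernel.shape[1] - 1) // 2
    return full[y0:y0 + image.shape[0], x0:x0 + image.shape[1]]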
Fig. 15. An illustration of how to compute the “Integral HOG Map” [177]. With integral image techniques, we can efficiently compute the histogram
feature of any location and any size with constant computational complexity.
Fig. 17. A comparison of detection accuracy of three detectors: Faster RCNN [19], R-FCN [46] and SSD [21] on the MS-COCO dataset with different detection engines. Image from J. Huang et al. CVPR2017 [27].
Fig. 18. An illustration of different feature fusion methods: (a) bottom-up fusion, (b) top-down fusion, (c) element-wise sum, (d) element-wise product, and (e) concatenation.
    AlexNet: AlexNet [40], an eight-layer deep network, was the first CNN model that started the deep learning revolution in computer vision. AlexNet famously won the ImageNet LSVRC-2012 competition by a large margin (15.3% vs. 26.2% error rate for the second place). As of Feb. 2019, the AlexNet paper has been cited over 30,000 times.
    VGG: VGG was proposed by Oxford's Visual Geometry Group (VGG) in 2014 [230]. VGG increased the model's depth to 16–19 layers and used very small (3x3) convolution filters instead of the 5x5 and 7x7 filters previously used in AlexNet. VGG achieved the state-of-the-art performance on the ImageNet dataset of its time.
    GoogLeNet: GoogLeNet, a.k.a. Inception [198, 231–233], is a big family of CNN models proposed by Google Inc. since 2014. GoogLeNet increased both the width and the depth of a CNN (up to 22 layers). The main contributions of the Inception family are the introduction of factorized convolution and batch normalization.
    ResNet: The Deep Residual Networks (ResNet) [234], proposed by K. He et al. in 2015, is a new type of convolutional network architecture that is substantially deeper (up to 152 layers) than those used previously. ResNet aims to ease the training of networks by reformulating its layers as learning residual functions with reference to the layer inputs. ResNet won multiple computer vision competitions in 2015, including ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
    DenseNet: DenseNet [235] was proposed by G. Huang and Z. Liu et al. in 2017. The success of ResNet suggested that the shortcut connection in CNNs enables us to train deeper and more accurate models. The authors embraced this observation and introduced a densely connected block, which connects each layer to every other layer in a feed-forward fashion.
    SENet: Squeeze and Excitation Networks (SENet) was proposed by J. Hu and L. Shen et al. in 2018 [236]. Its main contribution is the integration of global pooling with a channel-wise excitation ("squeeze and excitation") operation that learns the importance of each channel of the feature map. SENet won the 1st place in the ILSVRC 2017 classification competition.

    • Object detectors with new engines
    In the recent three years, many of the latest engines have been applied to object detection. For example, some recent object detection models such as STDN [237], DSOD [238], TinyDSOD [207], and Pelee [209] choose DenseNet [235] as their detection engine. Mask RCNN [4], the state-of-the-art model for instance segmentation, applies ResNeXt [239], the next generation of ResNet, as its detection engine. Besides, to speed up detection, the depth-wise separable convolution operation, which was introduced by Xception [204], an improved version of Inception, has also been used in detectors such as MobileNet [205] and LightHead RCNN [47].
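As a back-of-the-envelope illustration of why the depth-wise separable convolutions used by MobileNet-style engines are cheaper (the channel and kernel sizes below are arbitrary and not taken from any specific network), one can compare parameter counts:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution layer (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise k x k convolution (one filter per input channel)
    followed by a 1 x 1 point-wise convolution."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 256, 256, 3
print(conv_params(c_in, c_out, k))                 # 589824
print(depthwise_separable_params(c_in, c_out, k))  # 67840, roughly 8.7x fewer
```

The saving approaches a factor of k^2 when the number of output channels is large, and the same ratio applies to the multiply-accumulate count.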
4.2 Detection with Better Features
The quality of feature representations is critical for object detection. In recent years, many researchers have made efforts to further improve the quality of image features on the basis of some of the latest engines, where the two most important groups of methods are: 1) feature fusion and 2) learning high-resolution features with large receptive fields.

4.2.1 Why Feature Fusion is Important?
Invariance and equivariance are two important properties of image feature representations. Classification desires invariant feature representations since it aims at learning high-level semantic information. Object localization desires equivariant representations since it aims at discriminating position and scale changes. As object detection consists of the two sub-tasks of object recognition and localization, it is crucial for a detector to learn both invariance and equivariance at the same time.
    Feature fusion has been widely used in object detection in the last three years. As a CNN model consists of a series of convolutional and pooling layers, features in deeper layers have stronger invariance but less equivariance. Although this is beneficial to category recognition, it suffers from low localization accuracy in object detection. On the contrary, features in shallower layers are not conducive to learning semantics, but they help object localization as they contain more information about edges and contours. Therefore, the integration of deep and shallow features in a CNN model helps improve both invariance and equivariance.
4.2.2 Feature Fusion in Different Ways
There are many ways to perform feature fusion in object detection. Here we introduce some recent methods from two aspects: 1) the processing flow and 2) the element-wise operation.

    • Processing flow
    Recent feature fusion methods in object detection can be divided into two categories: 1) bottom-up fusion and 2) top-down fusion, as shown in Fig. 18 (a)-(b). Bottom-up fusion feeds forward shallow features to deeper layers via skip connections [237, 240–242]. In comparison, top-down fusion feeds back the features of deeper layers into the shallower ones [22, 55, 243–246]. Apart from these methods, more complex approaches have been proposed recently, e.g., weaving features across different layers [247].
    As the feature maps of different layers may have different sizes in terms of both their spatial and channel dimensions, one may need to accommodate the feature maps, e.g., by adjusting the number of channels, up-sampling the low-resolution maps, or down-sampling the high-resolution maps to a proper size. The easiest way to do this is to use nearest-neighbor or bilinear interpolation [22, 244]. Besides, fractional strided convolution (a.k.a. transpose convolution) [45, 248] is another recent popular way to resize the feature maps and adjust the number of channels. The advantage of using fractional strided convolution is that it can learn an appropriate way to perform the up-sampling by itself [55, 212, 241–243, 245, 246, 249].

    • Element-wise operation
    From a local point of view, feature fusion can be considered as an element-wise operation between different feature maps. There are three groups of methods: 1) element-wise sum, 2) element-wise product, and 3) concatenation, as shown in Fig. 18 (c)-(e).
    The element-wise sum is the easiest way to perform feature fusion. It has been frequently used in many recent object detectors [22, 55, 241, 243, 246]. The element-wise product [245, 249–251] is very similar to the element-wise sum, where the only difference is the use of multiplication instead of summation. An advantage of the element-wise product is that it can be used to suppress or highlight the features within a certain area, which may further benefit small object detection [245, 250, 251]. Feature concatenation is another way of feature fusion [212, 237, 240, 244]. Its advantage is that it can be used to integrate the context information of different regions [105, 144, 149, 161], while its disadvantage is the increase of memory [235].
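The following minimal NumPy sketch illustrates top-down fusion combined with the three element-wise operations (the feature-map shapes and the nearest-neighbor up-sampling are assumptions for illustration; real detectors typically use learned fractional strided convolutions and 1x1 convolutions to match sizes and channels):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor up-sampling of a (C, H, W) feature map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical feature maps from two stages of a backbone, in (C, H, W) layout.
shallow = np.random.rand(256, 64, 64)   # high resolution, weaker semantics
deep    = np.random.rand(256, 32, 32)   # low resolution, stronger semantics

# Top-down fusion: bring the deep map to the shallow map's resolution first.
deep_up = upsample2x(deep)                                  # (256, 64, 64)

fused_sum    = shallow + deep_up                            # element-wise sum
fused_prod   = shallow * deep_up                            # element-wise product
fused_concat = np.concatenate([shallow, deep_up], axis=0)   # (512, 64, 64)

print(fused_sum.shape, fused_prod.shape, fused_concat.shape)
```

Note that concatenation doubles the number of channels, which is exactly the memory cost mentioned above.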
4.2.3 Learning High Resolution Features with Large Receptive Fields
The receptive field and the feature resolution are two important characteristics of a CNN-based detector. The former refers to the spatial range of input pixels that contribute to the calculation of a single output pixel, while the latter corresponds to the down-sampling rate between the input and the feature map. A network with a larger receptive field is able to capture a larger scale of context information, while one with a smaller receptive field may concentrate more on the local details.
    As mentioned before, the lower the feature resolution is, the harder it will be to detect small objects. The most straightforward way to increase the feature resolution is to remove the pooling layers or to reduce the convolution down-sampling rate. But this causes a new problem: the receptive field will become too small due to the decreased output stride. In other words, this will narrow a detector's "sight" and may result in missed detections of some large objects.
    A practical method to increase both the receptive field and the feature resolution at the same time is to introduce dilated convolution (a.k.a. atrous convolution, or convolution with holes). Dilated convolution was originally proposed for semantic segmentation tasks [252, 253]. Its main idea is to expand the convolution filter and use sparse parameters. For example, a 3x3 filter with a dilation rate of 2 will have the same receptive field as a 5x5 kernel but only 9 parameters. Dilated convolution has now been widely used in object detection [21, 56, 254, 255], and proves to be effective for improving accuracy without any additional parameters or computational cost [56].
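The toy NumPy construction below (purely illustrative; deep learning frameworks implement atrous convolution without materializing the zeros) shows how a dilation rate of 2 enlarges the footprint of a 3x3 kernel to 5x5 while keeping only 9 learnable taps:

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between the taps of a square kernel."""
    k = kernel.shape[0]
    size = rate * (k - 1) + 1
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel
    return dilated

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)   # 3x3 filter, 9 parameters
dilated = dilate_kernel(kernel, rate=2)

print(dilated.shape)               # (5, 5): receptive field of a 5x5 kernel
print(np.count_nonzero(dilated))   # 9: still only 9 learnable parameters
```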
4.3 Beyond Sliding Window
Although object detection has evolved from using hand-crafted features to deep neural networks, detection still follows a paradigm of "sliding window on feature maps" [137]. Recently, some detectors have been built beyond sliding windows.

    • Detection as sub-region search
    Sub-region search [184, 256–258] provides a new way of performing detection. One recent method is to think of detection as a path planning process that starts from initial grids and finally converges to the desired ground-truth boxes [256]. Another method is to think of detection as an iterative updating process that refines the corners of a predicted bounding box [257].

    • Detection as key points localization
    Key points localization is an important computer vision task with extensively broad applications, such as facial expression recognition [259], human pose identification [260], etc. As any object in an image can be uniquely determined by the upper-left and lower-right corners of its ground-truth box, the detection task can be equivalently framed as a pair-wise key points localization problem. One recent implementation of this idea is to predict a heat-map for the corners [261]. The advantage of this approach is that it can be implemented under a semantic segmentation framework, and there is no need to design multi-scale anchor boxes.

4.4 Improvements of Localization
To improve localization accuracy, there are two groups of methods in recent detectors: 1) bounding box refinement, and 2) designing new loss functions for accurate localization.
4.4.1 Bounding Box Refinement
The most intuitive way to improve localization accuracy is bounding box refinement, which can be considered as a post-processing of the detection results. Although bounding box regression has been integrated into most modern object detectors, there are still some objects with unexpected scales that cannot be well captured by any of the predefined anchors. This inevitably leads to inaccurate predictions of their locations. For this reason, "iterative bounding box refinement" [262–264] has been introduced recently, which iteratively feeds the detection results into a bounding box regressor until the prediction converges to the correct location and size. However, some researchers have also claimed that this method does not guarantee the monotonicity of localization accuracy [262]; in other words, the bounding box regression may degrade the localization if it is applied multiple times.

4.4.2 Improving Loss Functions for Accurate Localization
In most modern detectors, object localization is treated as a coordinate regression problem. However, this paradigm has two drawbacks. First, the regression loss function does not correspond to the final evaluation of localization. For example, we cannot guarantee that a lower regression error will always produce a higher IoU prediction, especially when the object has a very large aspect ratio. Second, the traditional bounding box regression method does not provide the confidence of localization. When multiple bounding boxes overlap with each other, this may lead to failures in non-maximum suppression (see more details in subsection 2.3.5).
    The above problems can be alleviated by designing new loss functions. The most intuitive design is to directly use IoU as the localization loss function [265]. Some researchers have further proposed an IoU-guided NMS to improve localization at both the training and detection stages [163]. Besides, some researchers have also tried to improve localization under a probabilistic inference framework [266]. Different from previous methods that directly predict the box coordinates, this method predicts the probability distribution of a bounding box location.
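A minimal sketch of an IoU-based localization loss in the spirit of [265] (the (x1, y1, x2, y2) box format and the 1 - IoU form are illustrative assumptions; published variants differ, e.g., using -ln IoU):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def iou_loss(pred, gt):
    """Localization loss that directly reflects the evaluation metric."""
    return 1.0 - iou(pred, gt)

print(iou_loss((10, 10, 50, 50), (12, 8, 48, 52)))  # small loss for a good prediction
```

Unlike a coordinate regression loss, this objective is by construction aligned with the IoU used at evaluation time.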
4.5 Learning with Segmentation
Object detection and semantic segmentation are both important tasks in computer vision. Recent research suggests that object detection can be improved by learning with semantic segmentation.

4.5.1 Why Segmentation Improves Detection?
There are three reasons why semantic segmentation improves object detection.

    • Segmentation helps category recognition
    Edges and boundaries are the basic elements that constitute human visual cognition [267, 268]. In computer vision, the difference between an object (e.g., a car, a person) and stuff (e.g., sky, water, grass) is that the former usually has a closed and well-defined boundary while the latter does not. As the features of semantic segmentation tasks well capture the boundary of an object, segmentation may be helpful for category recognition.

    • Segmentation helps accurate localization
    The ground-truth bounding box of an object is determined by its well-defined boundary. For some objects with a special shape (e.g., imagine a cat with a very long tail), it is difficult to predict high-IoU locations. As object boundaries can be well encoded in semantic segmentation features, learning with segmentation would be helpful for accurate object localization.

    • Segmentation can be embedded as context
    Objects in daily life are surrounded by different backgrounds, such as the sky, water, grass, etc., and all these elements constitute the context of an object. Integrating the context of semantic segmentation will be helpful for object detection; say, an aircraft is more likely to appear in the sky than on the water.

4.5.2 How Segmentation Improves Detection?
There are two main approaches to improving object detection with segmentation: 1) learning with enriched features and 2) learning with multi-task loss functions.

    • Learning with enriched features
    The simplest way is to think of the segmentation network as a fixed feature extractor and to integrate it into a detection framework as additional features [144, 269, 270]. The advantage of this approach is that it is easy to implement, while the disadvantage is that the segmentation network may bring additional computation.

    • Learning with multi-task loss functions
    Another way is to introduce an additional segmentation branch on top of the original detection framework and to train this model with multi-task loss functions (segmentation loss + detection loss) [4, 269]. In most cases, the segmentation branch is removed at the inference stage. The advantage is that the detection speed is not affected, but the disadvantage is that the training requires pixel-level image annotations. To this end, some researchers have followed the idea of "weakly supervised learning": instead of training based on pixel-wise annotation masks, they simply train the segmentation branch based on bounding-box-level annotations [250, 271].
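A schematic of the multi-task objective described above (the term names and the weight are placeholders rather than the formulation of any particular paper):

```python
def multi_task_loss(cls_loss, box_loss, seg_loss, seg_weight=1.0):
    """Joint detection + segmentation training objective.

    The segmentation branch (and hence seg_loss) is only used at training
    time and is removed at inference, so detection speed is unaffected.
    """
    return cls_loss + box_loss + seg_weight * seg_loss
```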
4.6 Robust Detection of Rotation and Scale Changes
Object rotation and scale changes are important challenges in object detection. As the features learned by CNNs are not invariant to rotation or to large scale changes, in recent years many researchers have made efforts on this problem.

4.6.1 Rotation Robust Detection
Object rotation is very common in detection tasks such as face detection, text detection, etc. The most straightforward solution to this problem is data augmentation, so that an object in any orientation can be well covered by the augmented data [88]. Another solution is to train independent
detectors for every orientation [272, 273]. Apart from these traditional approaches, some new improvement methods have been proposed recently.

    • Rotation invariant loss functions
    The idea of learning with rotation invariant loss functions can be traced back to the 1990s [274]. Some recent works have introduced a constraint on the original detection loss function so as to keep the features of rotated objects unchanged [275, 276].

    • Rotation calibration
    Another way of improving rotation invariant detection is to apply geometric transformations to the object candidates [277–279]. This is especially helpful for multi-stage detectors, where the calibration at early stages benefits the subsequent detections. The representative of this idea is the Spatial Transformer Network (STN) [278]. STN has been used in rotated text detection [278] and rotated face detection [279].

    • Rotation RoI Pooling
    In a two-stage detector, feature pooling aims to extract a fixed-length feature representation for an object proposal of any location and size, by first dividing the proposal evenly into a set of grids and then concatenating the grid features. As the grid meshing is performed in Cartesian coordinates, the features are not invariant to rotation transforms. A recent improvement is to mesh the grids in polar coordinates so that the features become robust to rotation changes [272].
4.6.2 Scale Robust Detection
Recent improvements have been made at both the training and detection stages for scale robust detection.

    • Scale adaptive training
    Most modern detectors re-scale the input image to a fixed size and back-propagate the loss of objects at all scales, as shown in Fig. 19 (a). However, a drawback of doing this is the "scale imbalance" problem. Building an image pyramid during detection can alleviate this problem, but not fundamentally [46, 234]. A recent improvement is Scale Normalization for Image Pyramids (SNIP) [280], which builds image pyramids at both the training and detection stages and only back-propagates the loss of some selected scales, as shown in Fig. 19 (b). Some researchers have further proposed a more efficient training strategy, SNIP with Efficient Resampling (SNIPER) [281], i.e., cropping and re-scaling an image into a set of sub-regions so as to benefit from large-batch training.

    • Scale adaptive detection
    Most modern detectors use fixed configurations for detecting objects of different sizes. For example, in a typical CNN-based detector, we need to carefully define the sizes of the anchors. A drawback of doing this is that the configurations cannot adapt to unexpected scale changes. To improve the detection of small objects, some "adaptive zoom-in" techniques have been proposed in recent detectors to adaptively enlarge small objects into "larger ones" [184, 258]. Another recent improvement is to learn to predict the scale distribution of the objects in an image, and then adaptively re-scale the image according to that distribution [282, 283].
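A simplified sketch of the scale-selective training idea behind SNIP (the valid size ranges, box format, and size measure below are invented for illustration; the actual SNIP/SNIPER criteria are more involved):

```python
# Hypothetical valid object-size ranges (in pixels) for each image-pyramid scale.
VALID_RANGES = {0.5: (120, 1e9), 1.0: (40, 160), 2.0: (0, 60)}

def select_for_backprop(boxes, scale):
    """Keep only ground-truth boxes whose re-scaled size falls inside the valid
    range of this pyramid level; all other boxes are ignored in the loss."""
    lo, hi = VALID_RANGES[scale]
    kept = []
    for (x1, y1, x2, y2) in boxes:
        # Geometric-mean side length after re-scaling the image by `scale`.
        size = ((x2 - x1) * (y2 - y1)) ** 0.5 * scale
        if lo <= size <= hi:
            kept.append((x1, y1, x2, y2))
    return kept

boxes = [(0, 0, 30, 30), (0, 0, 300, 300)]
print(select_for_backprop(boxes, 2.0))   # only the small object survives at the enlarged scale
print(select_for_backprop(boxes, 0.5))   # only the large object survives at the reduced scale
```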
4.7 Training from Scratch
Most deep learning based detectors are first pre-trained on large-scale datasets, say ImageNet, and then fine-tuned on specific detection tasks. People have long believed that pre-training helps to improve generalization ability and training speed, and the question is: do we really need to pre-train a detector on ImageNet? In fact, there are some limitations when adopting pre-trained networks in object detection. The first limitation is the divergence between ImageNet classification and object detection, including their loss functions and their scale/category distributions. The second limitation is the domain mismatch. As images in ImageNet are RGB images while detection is sometimes applied to depth images (RGB-D) or 3D medical images, the pre-trained knowledge cannot be transferred well to these detection tasks.
    In recent years, some researchers have tried to train object detectors from scratch. To speed up training and improve stability, some researchers introduce dense connections and batch normalization to accelerate the back-propagation in shallow layers [238, 284]. The recent work by K. He et al. [285] has questioned the paradigm of pre-training even further by exploring the opposite regime: they reported competitive results on object detection on the COCO dataset using standard models trained from random initialization, with the sole exception of increasing the number of training iterations so that the randomly initialized models can converge. Training from random initialization is also surprisingly robust even when using only 10% of the training data, which indicates that ImageNet pre-training may speed up convergence, but does not necessarily provide regularization or improve final detection accuracy.

4.8 Adversarial Training
The Generative Adversarial Network (GAN) [286], introduced by I. Goodfellow et al. in 2014, has received great attention in recent years. A typical GAN consists of two neural networks, a generator network and a discriminator network, contesting with each other in a minimax optimization framework. Typically, the generator learns to map from a latent space to a particular data distribution of interest, while the discriminator aims to discriminate between instances from the true data distribution and those produced by the generator. GAN has been widely used for many computer vision tasks such as image generation [286, 287], image style transfer [288], and image super-resolution [289]. In the recent two years, GAN has also been applied to object detection, especially for improving the detection of small and occluded objects.
    GAN has been used to enhance the detection of small objects by narrowing the representation gap between small and large ones [290, 291]. To improve the detection of occluded objects, one recent idea is to generate occlusion masks by using adversarial training [292].
Fig. 19. Different training strategies for multi-scale object detection: (a) training on a single-resolution image and back-propagating objects of all scales [17–19, 21]; (b) training on multi-resolution images (an image pyramid) and back-propagating only objects of a selected scale; if an object is too large or too small, its gradient is discarded [56, 280, 281].
Instead of generating examples in pixel space, the adversarial network directly modifies the features to mimic occlusion.
    In addition to these works, "adversarial attack" [293], which studies how to attack a detector with adversarial examples, has drawn increasing attention recently. Research on this topic is especially important for autonomous driving, which cannot be fully trusted before its robustness to adversarial attacks is guaranteed.

4.9 Weakly Supervised Object Detection
The training of a modern object detector usually requires a large amount of manually labeled data, while the labeling process is time-consuming, expensive, and inefficient. Weakly Supervised Object Detection (WSOD) aims to solve this problem by training a detector with only image-level annotations instead of bounding boxes.
    Recently, multi-instance learning has been used for WSOD [294, 295]. Multi-instance learning is a group of supervised learning methods [39, 296]. Instead of learning with a set of individually labeled instances, a multi-instance learning model receives a set of labeled bags, each containing many instances. If we consider the object candidates in one image as a bag and the image-level annotation as its label, then WSOD can be formulated as a multi-instance learning process.
    Class activation mapping is another recent group of methods for WSOD [297, 298]. Research on CNN visualization has shown that the convolutional layers of a CNN behave as object detectors even though there is no supervision on the location of the object. Class activation mapping sheds light on how to endow a CNN with localization ability despite being trained only on image-level labels [299].
    In addition to the above approaches, some other researchers consider WSOD as a proposal ranking process, selecting the most informative regions and then training on these regions with image-level annotations [300]. Another simple method for WSOD is to mask out different parts of the image: if the detection score drops sharply, then an object is covered with high probability [301]. Besides, interactive annotation [295] takes human feedback into consideration during training so as to improve WSOD. More recently, generative adversarial training has been used for WSOD [302].
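A toy sketch of the masking heuristic just mentioned (the grid size, the threshold, and the classify interface are placeholders; [301] operates on the scores of a trained image-level classifier rather than this toy interface):

```python
import numpy as np

def mask_out_localization(image, classify, cell=32, drop_thresh=0.2):
    """Score each grid cell by how much the image-level confidence drops
    when that cell is blanked out; large drops suggest the object is there."""
    base = classify(image)
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    drop = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            masked = image.copy()
            masked[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell] = 0
            drop[i, j] = base - classify(masked)
    # Coarse localization mask obtained from image-level supervision only.
    return drop > drop_thresh

# `classify` can be any function mapping an image to a class confidence in [0, 1].
```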
5 APPLICATIONS
In this section, we will review some important detection applications of the past 20 years, including pedestrian detection, face detection, text detection, traffic sign/light detection, and remote sensing target detection.

5.1 Pedestrian Detection
Pedestrian detection, as an important object detection application, has received extensive attention in many areas such as autonomous driving, video surveillance, and criminal investigation. Some early pedestrian detection methods, such as the HOG detector [12] and the ICF detector [171], laid a solid foundation for general object detection in terms of feature representation [12, 171], classifier design [174], and detection acceleration [177]. In recent years, some general object detection algorithms, e.g., Faster RCNN [19], have been introduced to pedestrian detection [165] and have greatly promoted progress in this area.

5.1.1 Difficulties and Challenges
The challenges and difficulties in pedestrian detection can be summarized as follows.
    Small pedestrians: Fig. 20 (a) shows some examples of small pedestrians captured far from the camera. In the Caltech Dataset [59, 60], 15% of the pedestrians are less than 30 pixels in height.
    Hard negatives: Some backgrounds in street-view images are very similar to pedestrians in their visual appearance, as shown in Fig. 20 (b).
    Dense and occluded pedestrians: Fig. 20 (c) shows some examples of dense and occluded pedestrians. In the Caltech Dataset [59, 60], pedestrians that are not occluded account for only 29% of the total pedestrian instances.
    Real-time detection: Real-time pedestrian detection from HD video is crucial for applications like autonomous driving and video surveillance.
       recognition,” in European conference on computer vision.                detection with structured models. Citeseer, 2012.
       Springer, 2014, pp. 346–361.                                     [39]   S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support
[18]   R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE inter-            vector machines for multiple-instance learning,” in Ad-
       national conference on computer vision, 2015, pp. 1440–1448.            vances in neural information processing systems, 2003, pp.
[19]   S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn:                  577–584.
       Towards real-time object detection with region proposal          [40]   A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
       networks,” in Advances in neural information processing                 classification with deep convolutional neural networks,”
       systems, 2015, pp. 91–99.                                               in Advances in neural information processing systems, 2012,
[20]   J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You                pp. 1097–1105.
       only look once: Unified, real-time object detection,” in         [41]   R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-
       Proceedings of the IEEE conference on computer vision and               based convolutional networks for accurate object de-
       pattern recognition, 2016, pp. 779–788.                                 tection and segmentation,” IEEE transactions on pattern
[21]   W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y.               analysis and machine intelligence, vol. 38, no. 1, pp. 142–
       Fu, and A. C. Berg, “Ssd: Single shot multibox detector,”               158, 2016.
       in European conference on computer vision. Springer, 2016,       [42]   K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W.
       pp. 21–37.                                                              Smeulders, “Segmentation as selective search for object
[22]   T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan,             recognition,” in Computer Vision (ICCV), 2011 IEEE Inter-
       and S. J. Belongie, “Feature pyramid networks for object                national Conference on. IEEE, 2011, pp. 1879–1886.
       detection.” in CVPR, vol. 1, no. 2, 2017, p. 4.                  [43]   R. B. Girshick, P. F. Felzenszwalb, and D. McAllester,
[23]   T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár,                “Discriminatively trained deformable part models, re-
       “Focal loss for dense object detection,” IEEE transactions              lease 5,” http://people.cs.uchicago.edu/ rbg/latent-
       on pattern analysis and machine intelligence, 2018.                     release5/.
[24]   L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu,         [44]   S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn:
       and M. Pietikäinen, “Deep learning for generic object de-              towards real-time object detection with region proposal
       tection: A survey,” arXiv preprint arXiv:1809.02165, 2018.              networks,” IEEE Transactions on Pattern Analysis & Ma-
[25]   S. Agarwal, J. O. D. Terrail, and F. Jurie, “Recent advances            chine Intelligence, no. 6, pp. 1137–1149, 2017.
       in object detection in the age of deep convolutional neural      [45]   M. D. Zeiler and R. Fergus, “Visualizing and understand-
       networks,” arXiv preprint arXiv:1809.03193, 2018.                       ing convolutional networks,” in European conference on
[26]   A. Andreopoulos and J. K. Tsotsos, “50 years of object                  computer vision. Springer, 2014, pp. 818–833.
       recognition: Directions forward,” Computer vision and im-        [46]   J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via
       age understanding, vol. 117, no. 8, pp. 827–891, 2013.                  region-based fully convolutional networks,” in Advances
[27]   J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara,                    in neural information processing systems, 2016, pp. 379–387.
       A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama           [47]   Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun,
       et al., “Speed/accuracy trade-offs for modern convolu-                  “Light-head r-cnn: In defense of two-stage object detec-
       tional object detectors,” in IEEE CVPR, vol. 4, 2017.                   tor,” arXiv preprint arXiv:1711.07264, 2017.
[28]   K. Grauman and B. Leibe, “Visual object recognition              [48]   J. Redmon and A. Farhadi, “Yolo9000: better, faster,
       (synthesis lectures on artificial intelligence and machine              stronger,” arXiv preprint, 2017.
       learning),” Morgan & Claypool, 2011.                             [49]   ——, “Yolov3: An incremental improvement,” arXiv
[29]   C. P. Papageorgiou, M. Oren, and T. Poggio, “A general                  preprint arXiv:1804.02767, 2018.
       framework for object detection,” in Computer vision, 1998.       [50]   M. Everingham, L. Van Gool, C. K. Williams, J. Winn,
       sixth international conference on. IEEE, 1998, pp. 555–562.             and A. Zisserman, “The pascal visual object classes (voc)
[30]   C. Papageorgiou and T. Poggio, “A trainable system for                  challenge,” International journal of computer vision, vol. 88,
       object detection,” International journal of computer vision,            no. 2, pp. 303–338, 2010.
       vol. 38, no. 1, pp. 15–33, 2000.                                 [51]   M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams,
[31]   A. Mohan, C. Papageorgiou, and T. Poggio, “Example-                     J. Winn, and A. Zisserman, “The pascal visual object
       based object detection in images by components,” IEEE                   classes challenge: A retrospective,” International journal of
       Transactions on Pattern Analysis & Machine Intelligence,                computer vision, vol. 111, no. 1, pp. 98–136, 2015.
       no. 4, pp. 349–361, 2001.                                        [52]   O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
[32]   Y. Freund, R. Schapire, and N. Abe, “A short introduction               S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein
       to boosting,” Journal-Japanese Society For Artificial Intelli-          et al., “Imagenet large scale visual recognition challenge,”
       gence, vol. 14, no. 771-780, p. 1612, 1999.                             International Journal of Computer Vision, vol. 115, no. 3, pp.
[33]   D. G. Lowe, “Object recognition from local scale-invariant              211–252, 2015.
       features,” in Computer vision, 1999. The proceedings of the      [53]   T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona,
       seventh IEEE international conference on, vol. 2. Ieee, 1999,           D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco:
       pp. 1150–1157.                                                          Common objects in context,” in European conference on
[34]   ——, “Distinctive image features from scale-invariant                    computer vision. Springer, 2014, pp. 740–755.
       keypoints,” International journal of computer vision, vol. 60,   [54]   M. A. Sadeghi and D. Forsyth, “30hz object detection
       no. 2, pp. 91–110, 2004.                                                with dpm v5,” in European Conference on Computer Vision.
[35]   S. Belongie, J. Malik, and J. Puzicha, “Shape matching and              Springer, 2014, pp. 65–79.
       object recognition using shape contexts,” CALIFORNIA             [55]   S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-
       UNIV SAN DIEGO LA JOLLA DEPT OF COMPUTER                                shot refinement neural network for object detection,” in
       SCIENCE AND ENGINEERING, Tech. Rep., 2002.                              IEEE CVPR, 2018.
[36]   T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble             [56]   Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware
       of exemplar-svms for object detection and beyond,” in                   trident networks for object detection,” arXiv preprint
       Computer Vision (ICCV), 2011 IEEE International Conference              arXiv:1901.01892, 2019.
       on. IEEE, 2011, pp. 89–96.                                       [57]   J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
[37]   R. B. Girshick, P. F. Felzenszwalb, and D. A. Mcallester,               “Imagenet: A large-scale hierarchical image database,” in
       “Object detection with grammar models,” in Advances in                  Computer Vision and Pattern Recognition, 2009. CVPR 2009.
       Neural Information Processing Systems, 2011, pp. 442–450.               IEEE Conference on. Ieee, 2009, pp. 248–255.
[38]   R. B. Girshick, From rigid templates to grammars: Object         [58]   I. Krasin and T. e. a. Duerig, “Openimages: A
       public dataset for large-scale multi-label and multi-                  European Conference on Computer Vision. Springer, 2010,
       class image classification.” Dataset available from                    pp. 591–604.
       https://storage.googleapis.com/openimages/web/index.html,       [75]   C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts
       2017.                                                                  of arbitrary orientations in natural images,” in 2012 IEEE
[59]   P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian           Conference on Computer Vision and Pattern Recognition.
       detection: A benchmark,” in Computer Vision and Pattern                IEEE, 2012, pp. 1083–1090.
       Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE,         [76]   A. Mishra, K. Alahari, and C. Jawahar, “Scene text recog-
       2009, pp. 304–311.                                                     nition using higher order language priors,” in BMVC-
[60]   P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian            British Machine Vision Conference. BMVA, 2012.
       detection: An evaluation of the state of the art,” IEEE         [77]   M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zis-
       transactions on pattern analysis and machine intelligence,             serman, “Synthetic data and artificial neural net-
       vol. 34, no. 4, pp. 743–761, 2012.                                     works for natural scene text recognition,” arXiv preprint
[61]   A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for                  arXiv:1406.2227, 2014.
       autonomous driving? the kitti vision benchmark suite,”          [78]   A. Veit, T. Matera, L. Neumann, J. Matas, and S. Be-
       in Computer Vision and Pattern Recognition (CVPR), 2012                longie, “Coco-text: Dataset and benchmark for text de-
       IEEE Conference on. IEEE, 2012, pp. 3354–3361.                         tection and recognition in natural images,” arXiv preprint
[62]   S. Zhang, R. Benenson, and B. Schiele, “Citypersons: A                 arXiv:1601.07140, 2016.
       diverse dataset for pedestrian detection,” in The IEEE          [79]   R. De Charette and F. Nashashibi, “Real time visual
       Conference on Computer Vision and Pattern Recognition                  traffic lights recognition based on spot light detection and
       (CVPR), vol. 1, no. 2, 2017, p. 3.                                     adaptive traffic lights templates,” in Intelligent Vehicles
[63]   M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,               Symposium, 2009 IEEE. IEEE, 2009, pp. 358–363.
       R. Benenson, U. Franke, S. Roth, and B. Schiele, “The           [80]   A. Møgelmose, M. M. Trivedi, and T. B. Moeslund,
       cityscapes dataset for semantic urban scene understand-                “Vision-based traffic sign detection and analysis for in-
       ing,” in Proceedings of the IEEE conference on computer                telligent driver assistance systems: Perspectives and sur-
       vision and pattern recognition, 2016, pp. 3213–3223.                   vey.” IEEE Trans. Intelligent Transportation Systems, vol. 13,
[64]   M. Braun, S. Krebs, F. Flohr, and D. M. Gavrila, “The                  no. 4, pp. 1484–1497, 2012.
       eurocity persons dataset: A novel benchmark for object          [81]   S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and
       detection,” arXiv preprint arXiv:1805.07193, 2018.                     C. Igel, “Detection of traffic signs in real-world images:
[65]   V. Jain and E. Learned-Miller, “Fddb: A benchmark                      The german traffic sign detection benchmark,” in Neural
       for face detection in unconstrained settings,” Technical               Networks (IJCNN), The 2013 International Joint Conference
       Report UM-CS-2010-009, University of Massachusetts,                    on. IEEE, 2013, pp. 1–8.
       Amherst, Tech. Rep., 2010.                                      [82]   R. Timofte, K. Zimmermann, and L. Van Gool, “Multi-
[66]   M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof,                view traffic sign detection, recognition, and 3d localisa-
       “Annotated facial landmarks in the wild: A large-scale,                tion,” Machine vision and applications, vol. 25, no. 3, pp.
       real-world database for facial landmark localization,” in              633–647, 2014.
       Computer Vision Workshops (ICCV Workshops), 2011 IEEE           [83]   Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu,
       International Conference on. IEEE, 2011, pp. 2144–2151.                “Traffic-sign detection and classification in the wild,” in
[67]   B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney,             Proceedings of the IEEE Conference on Computer Vision and
       K. Allen, P. Grother, A. Mah, and A. K. Jain, “Pushing the             Pattern Recognition, 2016, pp. 2110–2118.
       frontiers of unconstrained face detection and recognition:      [84]   K. Behrendt, L. Novak, and R. Botros, “A deep learning
       Iarpa janus benchmark a,” in Proceedings of the IEEE                   approach to traffic lights: Detection, tracking, and clas-
       conference on computer vision and pattern recognition, 2015,           sification,” in Robotics and Automation (ICRA), 2017 IEEE
       pp. 1931–1939.                                                         International Conference on. IEEE, 2017, pp. 1370–1377.
[68]   S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face:           [85]   G. Heitz and D. Koller, “Learning spatial context: Using
       A face detection benchmark,” in Proceedings of the IEEE                stuff to find things,” in European conference on computer
       conference on computer vision and pattern recognition, 2016,           vision. Springer, 2008, pp. 30–43.
       pp. 5525–5533.                                                  [86]   F. Tanner, B. Colder, C. Pullen, D. Heagy, M. Eppolito,
[69]   H. Nada, V. A. Sindagi, H. Zhang, and V. M. Patel,                     V. Carlan, C. Oertel, and P. Sallee, “Overhead imagery
       “Pushing the limits of unconstrained face detection: a                 research data setan annotated data library & tools to aid
       challenge dataset and baseline results,” arXiv preprint                in the development of computer vision algorithms,” in
       arXiv:1804.10275, 2018.                                                2009 IEEE Applied Imagery Pattern Recognition Workshop
[70]   M. K. Yucel, Y. C. Bilge, O. Oguz, N. Ikizler-Cinbis,                  (AIPR 2009). IEEE, 2009, pp. 1–8.
       P. Duygulu, and R. G. Cinbis, “Wildest faces: Face de-          [87]   K. Liu and G. Mattyus, “Fast multiclass vehicle detec-
       tection and recognition in violent settings,” arXiv preprint           tion on aerial images.” IEEE Geosci. Remote Sensing Lett.,
       arXiv:1805.07566, 2018.                                                vol. 12, no. 9, pp. 1938–1942, 2015.
[71]   S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and       [88]   H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Ori-
       R. Young, “Icdar 2003 robust reading competitions,” in                 entation robust object detection in aerial images using
       null. IEEE, 2003, p. 682.                                              deep convolutional neural network,” in Image Processing
[72]   D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh,                  (ICIP), 2015 IEEE International Conference on. IEEE, 2015,
       A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R.                   pp. 3735–3739.
       Chandrasekhar, S. Lu et al., “Icdar 2015 competition on         [89]   S. Razakarivony and F. Jurie, “Vehicle detection in aerial
       robust reading,” in Document Analysis and Recognition                  imagery: A small target detection benchmark,” Journal of
       (ICDAR), 2015 13th International Conference on.         IEEE,          Visual Communication and Image Representation, vol. 34, pp.
       2015, pp. 1156–1160.                                                   187–203, 2016.
[73]   B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie,   [90]   G. Cheng and J. Han, “A survey on object detection in
       S. Lu, and X. Bai, “Icdar2017 competition on reading                   optical remote sensing images,” ISPRS Journal of Pho-
       chinese text in the wild (rctw-17),” in Document Analysis              togrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
       and Recognition (ICDAR), 2017 14th IAPR International           [91]   Z. Zou and Z. Shi, “Random access memories: A new
       Conference on, vol. 1. IEEE, 2017, pp. 1429–1434.                      paradigm for target detection in high resolution aerial
[74]   K. Wang and S. Belongie, “Word spotting in the wild,” in               remote sensing images,” IEEE Transactions on Image Pro-
        cessing, vol. 27, no. 3, pp. 1100–1111, 2018.                           and A. L. Yuille, “Semantic image segmentation with
[92]    G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo,                deep convolutional nets and fully connected crfs,” arXiv
        M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale                preprint arXiv:1412.7062, 2014.
        dataset for object detection in aerial images,” in Proc.        [112]   C. Garcia and M. Delakis, “A neural architecture for fast
        CVPR, 2018.                                                             and robust face detection,” in Pattern Recognition, 2002.
[93]    D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli,                      Proceedings. 16th International Conference on, vol. 2. IEEE,
        M. Klaric, Y. Bulatov, and B. McCord, “xview: Ob-                       2002, pp. 44–47.
        jects in context in overhead imagery,” arXiv preprint           [113]   M. Osadchy, M. L. Miller, and Y. L. Cun, “Synergistic face
        arXiv:1802.07856, 2018.                                                 detection and pose estimation with energy-based mod-
[94]    K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan, “Local-                   els,” in Advances in Neural Information Processing Systems,
        ization recall precision (lrp): A new performance metric                2005, pp. 1017–1024.
        for object detection,” in European Conference on Computer       [114]   S. J. Nowlan and J. C. Platt, “A convolutional neural
        Vision (ECCV), vol. 6, 2018.                                            network hand tracker,” Advances in neural information
[95]    M. Turk and A. Pentland, “Eigenfaces for recognition,”                  processing systems, pp. 901–908, 1995.
        Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86,    [115]   T. Malisiewicz, Exemplar-based representations for object
        1991.                                                                   detection, association and beyond. Carnegie Mellon Uni-
[96]    R. Vaillant, C. Monrocq, and Y. Le Cun, “Original ap-                   versity, 2011.
        proach for the localisation of objects in images,” IEE          [116]   B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?”
        Proceedings-Vision, Image and Signal Processing, vol. 141,              in Computer Vision and Pattern Recognition (CVPR), 2010
        no. 4, pp. 245–250, 1994.                                               IEEE Conference on. IEEE, 2010, pp. 73–80.
[97]    Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-      [117]   J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W.
        based learning applied to document recognition,” Pro-                   Smeulders, “Selective search for object recognition,” In-
        ceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.             ternational journal of computer vision, vol. 104, no. 2, pp.
[98]    I. Biederman, “Recognition-by-components: a theory                      154–171, 2013.
        of human image understanding.” Psychological review,            [118]   J. Carreira and C. Sminchisescu, “Constrained parametric
        vol. 94, no. 2, p. 115, 1987.                                           min-cuts for automatic object segmentation,” in Computer
[99]    M. A. Fischler and R. A. Elschlager, “The representation                Vision and Pattern Recognition (CVPR), 2010 IEEE Confer-
        and matching of pictorial structures,” IEEE Transactions                ence on. IEEE, 2010, pp. 3241–3248.
        on computers, vol. 100, no. 1, pp. 67–92, 1973.                 [119]   P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and
[100]   B. Leibe, A. Leonardis, and B. Schiele, “Robust object                  J. Malik, “Multiscale combinatorial grouping,” in Proceed-
        detection with interleaved categorization and segmenta-                 ings of the IEEE conference on computer vision and pattern
        tion,” International journal of computer vision, vol. 77, no.           recognition, 2014, pp. 328–335.
        1-3, pp. 259–289, 2008.                                         [120]   B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the ob-
[101]   D. M. Gavrila and V. Philomin, “Real-time object detec-                 jectness of image windows,” IEEE transactions on pattern
        tion for” smart” vehicles,” in Computer Vision, 1999. The               analysis and machine intelligence, vol. 34, no. 11, pp. 2189–
        Proceedings of the Seventh IEEE International Conference on,            2202, 2012.
        vol. 1. IEEE, 1999, pp. 87–93.                                  [121]   M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing:
[102]   B. Wu and R. Nevatia, “Detection of multiple, partially                 Binarized normed gradients for objectness estimation at
        occluded humans in a single image by bayesian combi-                    300fps,” in Proceedings of the IEEE conference on computer
        nation of edgelet part detectors,” in null. IEEE, 2005, pp.             vision and pattern recognition, 2014, pp. 3286–3293.
        90–97.                                                          [122]   C. L. Zitnick and P. Dollár, “Edge boxes: Locating object
[103]   P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus,                 proposals from edges,” in European conference on computer
        and Y. LeCun, “Overfeat: Integrated recognition, localiza-              vision. Springer, 2014, pp. 391–405.
        tion and detection using convolutional networks,” arXiv         [123]   C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe,
        preprint arXiv:1312.6229, 2013.                                         “Scalable, high-quality object detection,” arXiv preprint
[104]   C. Szegedy, A. Toshev, and D. Erhan, “Deep neural net-                  arXiv:1412.1441, 2014.
        works for object detection,” in Advances in neural informa-     [124]   D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scal-
        tion processing systems, 2013, pp. 2553–2561.                           able object detection using deep neural networks,” in
[105]   Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A uni-                Proceedings of the IEEE Conference on Computer Vision and
        fied multi-scale deep convolutional neural network for                  Pattern Recognition, 2014, pp. 2147–2154.
        fast object detection,” in European Conference on Computer      [125]   A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and
        Vision. Springer, 2016, pp. 354–370.                                    L. Van Gool, “Deepproposal: Hunting objects by cas-
[106]   A. Pentland, B. Moghaddam, T. Starner et al., “View-                    cading deep convolutional layers,” in Proceedings of the
        based and modular eigenspaces for face recognition,”                    IEEE International Conference on Computer Vision, 2015, pp.
        1994.                                                                   2578–2586.
[107]   G. Yang and T. S. Huang, “Human face detection in a             [126]   W. Kuo, B. Hariharan, and J. Malik, “Deepbox: Learning
        complex background,” Pattern recognition, vol. 27, no. 1,               objectness with convolutional networks,” in Proceedings of
        pp. 53–63, 1994.                                                        the IEEE International Conference on Computer Vision, 2015,
[108]   I. Craw, D. Tock, and A. Bennett, “Finding face features,”              pp. 2479–2487.
        in European Conference on Computer Vision. Springer, 1992,      [127]   S. Gidaris and N. Komodakis, “Attend refine repeat:
        pp. 92–96.                                                              Active box proposal generation via in-out localization,”
[109]   R. Xiao, L. Zhu, and H.-J. Zhang, “Boosting chain learn-                arXiv preprint arXiv:1606.04446, 2016.
        ing for object detection,” in Computer Vision, 2003. Pro-       [128]   H. Li, Y. Liu, W. Ouyang, and X. Wang, “Zoom out-and-
        ceedings. Ninth IEEE International Conference on. IEEE,                 in network with recursive training for object proposal,”
        2003, pp. 709–715.                                                      arXiv preprint arXiv:1702.05711, 2017.
[110]   J. Long, E. Shelhamer, and T. Darrell, “Fully convolu-          [129]   J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What
        tional networks for semantic segmentation,” in Proceed-                 makes for effective detection proposals?” IEEE transac-
        ings of the IEEE conference on computer vision and pattern              tions on pattern analysis and machine intelligence, vol. 38,
        recognition, 2015, pp. 3431–3440.                                       no. 4, pp. 814–830, 2016.
[111]   L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy,              [130]   J. Hosang, R. Benenson, and B. Schiele, “How
                                                                                                                                         32
        good are detection proposals, really?” arXiv preprint                  S. Yan, “Attentive contexts for object detection,” IEEE
        arXiv:1406.6962, 2014.                                                 Transactions on Multimedia, vol. 19, no. 5, pp. 944–954,
[131]   J. Carreira and C. Sminchisescu, “Cpmc: Automatic object               2017.
        segmentation using constrained parametric min-cuts,”           [150]   Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and
        IEEE Transactions on Pattern Analysis & Machine Intelli-               S. Yan, “Contextualizing object detection and classifica-
        gence, no. 7, pp. 1312–1328, 2011.                                     tion,” IEEE transactions on pattern analysis and machine
[132]   N. Chavali, H. Agrawal, A. Mahendru, and D. Batra,                     intelligence, vol. 37, no. 1, pp. 13–27, 2015.
        “Object-proposal evaluation protocol is’ gameable’,” in        [151]   S. Gupta, B. Hariharan, and J. Malik, “Exploring person
        Proceedings of the IEEE conference on computer vision and              context and local scene context for object detection,”
        pattern recognition, 2016, pp. 835–844.                                arXiv preprint arXiv:1511.08177, 2015.
[133]   K. Lenc and A. Vedaldi, “R-cnn minus r,” arXiv preprint        [152]   X. Chen and A. Gupta, “Spatial memory for con-
        arXiv:1506.06981, 2015.                                                text reasoning in object detection,” arXiv preprint
[134]   P.-A. Savalle, S. Tsogkas, G. Papandreou, and I. Kokkinos,             arXiv:1704.04224, 2017.
        “Deformable part models with cnn features,” in European        [153]   Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure infer-
        Conference on Computer Vision, Parts and Attributes Work-              ence net: Object detection using scene-level context and
        shop, 2014.                                                            instance-level relationships,” in Proceedings of the IEEE
[135]   N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-              Conference on Computer Vision and Pattern Recognition,
        based r-cnns for fine-grained category detection,” in Eu-              2018, pp. 6985–6994.
        ropean conference on computer vision. Springer, 2014, pp.      [154]   J. H. Hosang, R. Benenson, and B. Schiele, “Learning non-
        834–849.                                                               maximum suppression.” in CVPR, 2017, pp. 6469–6477.
[136]   L. Wan, D. Eigen, and R. Fergus, “End-to-end integration       [155]   P. Henderson and V. Ferrari, “End-to-end training of
        of a convolution network, deformable parts model and                   object class detectors for mean average precision,” in
        non-maximum suppression,” in Proceedings of the IEEE                   Asian Conference on Computer Vision. Springer, 2016, pp.
        Conference on Computer Vision and Pattern Recognition,                 198–213.
        2015, pp. 851–859.                                             [156]   R. Rothe, M. Guillaumin, and L. Van Gool, “Non-
[137]   R. Girshick, F. Iandola, T. Darrell, and J. Malik, “De-                maximum suppression for object detection by passing
        formable part models are convolutional neural net-                     messages between windows,” in Asian Conference on Com-
        works,” in Proceedings of the IEEE conference on Computer              puter Vision. Springer, 2014, pp. 290–306.
        Vision and Pattern Recognition, 2015, pp. 437–446.             [157]   D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko,
[138]   B. Li, T. Wu, S. Shao, L. Zhang, and R. Chu, “Object                   and T. Darrell, “Spatial semantic regularisation for large
        detection via end-to-end integration of aspect ratio and               scale object detection,” in Proceedings of the IEEE interna-
        context aware part-based models and fully convolutional                tional conference on computer vision, 2015, pp. 2003–2011.
        networks,” arXiv preprint arXiv:1612.00534, 2016.              [158]   N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-
[139]   A. Torralba and P. Sinha, “Detecting faces in impov-                   nmsimproving object detection with one line of code,” in
        erished images,” MASSACHUSETTS INST OF TECH                            Computer Vision (ICCV), 2017 IEEE International Conference
        CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB, Tech.                           on. IEEE, 2017, pp. 5562–5570.
        Rep., 2001.                                                    [159]   L. Tychsen-Smith and L. Petersson, “Improving object
[140]   S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross,           localization with fitness nms and bounded iou loss,”
        S. Chintala, and P. Dollár, “A multipath network for                  arXiv preprint arXiv:1711.00164, 2017.
        object detection,” arXiv preprint arXiv:1604.02135, 2016.      [160]   S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and
[141]   X. Zeng, W. Ouyang, B. Yang, J. Yan, and X. Wang, “Gated               M. Hebert, “An empirical study of context in object de-
        bi-directional cnn for object detection,” in European Con-             tection,” in Computer Vision and Pattern Recognition, 2009.
        ference on Computer Vision. Springer, 2016, pp. 354–369.               CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1271–
[142]   X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu,           1278.
        Y. Zhou, B. Yang, Z. Wang et al., “Crafting gbd-net for        [161]   C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, “R-cnn for small
        object detection,” IEEE transactions on pattern analysis and           object detection,” in Asian conference on computer vision.
        machine intelligence, vol. 40, no. 9, pp. 2109–2123, 2018.             Springer, 2016, pp. 214–230.
[143]   W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Learning             [162]   H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation
        chained deep features and classifiers for cascade in object            networks for object detection,” in Computer Vision and
        detection,” arXiv preprint arXiv:1702.07054, 2017.                     Pattern Recognition (CVPR), vol. 2, no. 3, 2018.
[144]   S. Gidaris and N. Komodakis, “Object detection via             [163]   B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition
        a multi-region and semantic segmentation-aware cnn                     of localization confidence for accurate object detection,”
        model,” in Proceedings of the IEEE International Conference            in Proceedings of the European Conference on Computer Vi-
        on Computer Vision, 2015, pp. 1134–1142.                               sion, Munich, Germany, 2018, pp. 8–14.
[145]   Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu et al.,        [164]   H. A. Rowley, S. Baluja, and T. Kanade, “Human face de-
        “Couplenet: Coupling global structure with local parts                 tection in visual scenes,” in Advances in Neural Information
        for object detection,” in Proc. of Intl Conf. on Computer              Processing Systems, 1996, pp. 875–881.
        Vision (ICCV), vol. 2, 2017.                                   [165]   L. Zhang, L. Lin, X. Liang, and K. He, “Is faster r-
[146]   C. Desai, D. Ramanan, and C. C. Fowlkes, “Discrimina-                  cnn doing well for pedestrian detection?” in European
        tive models for multi-class object layout,” International              Conference on Computer Vision. Springer, 2016, pp. 443–
        journal of computer vision, vol. 95, no. 1, pp. 1–12, 2011.            457.
[147]   Z. Li, Y. Chen, G. Yu, and Y. Deng, “R-fcn++: Towards          [166]   A. Shrivastava, A. Gupta, and R. Girshick, “Training
        accurate region-based fully convolutional networks for                 region-based object detectors with online hard example
        object detection.” in AAAI, 2018.                                      mining,” in Proceedings of the IEEE Conference on Computer
[148]   S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick,                Vision and Pattern Recognition, 2016, pp. 761–769.
        “Inside-outside net: Detecting objects in context with skip    [167]   T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, “Vehicle
        pooling and recurrent neural networks,” in Proceedings of              detection in aerial images based on region convolutional
        the IEEE conference on computer vision and pattern recogni-            neural networks and hard negative example mining,”
        tion, 2016, pp. 2874–2883.                                             Sensors, vol. 17, no. 2, p. 336, 2017.
[149]   J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and          [168]   X. Sun, P. Wu, and S. C. Hoi, “Face detection using deep
                                                                                                                                         33
        learning: An improved faster rcnn approach,” Neurocom-                  tems, 1990, pp. 598–605.
        puting, vol. 299, pp. 42–50, 2018.                                [187] S. Han, H. Mao, and W. J. Dally, “Deep compres-
[169]   J. Jin, K. Fu, and C. Zhang, “Traffic sign recognition with             sion: Compressing deep neural networks with pruning,
        hinge loss trained convolutional neural networks,” IEEE                 trained quantization and huffman coding,” arXiv preprint
        Transactions on Intelligent Transportation Systems, vol. 15,            arXiv:1510.00149, 2015.
        no. 5, pp. 1991–2000, 2014.                                       [188] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf,
[170]   M. Zhou, M. Jing, D. Liu, Z. Xia, Z. Zou, and Z. Shi,                   “Pruning filters for efficient convnets,” arXiv preprint
        “Multi-resolution networks for ship detection in infrared               arXiv:1608.08710, 2016.
        remote sensing images,” Infrared Physics & Technology,            [189] G. Huang, S. Liu, L. van der Maaten, and K. Q.
        2018.                                                                   Weinberger, “Condensenet: An efficient densenet using
[171]   P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral                learned group convolutions,” group, vol. 3, no. 12, p. 11,
        channel features,” 2009.                                                2017.
[172]   P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast           [190] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi,
        feature pyramids for object detection,” IEEE Transactions               “Xnor-net: Imagenet classification using binary convolu-
        on Pattern Analysis and Machine Intelligence, vol. 36, no. 8,           tional neural networks,” in European Conference on Com-
        pp. 1532–1545, 2014.                                                    puter Vision. Springer, 2016, pp. 525–542.
[173]   R. Benenson, M. Mathias, R. Timofte, and L. Van Gool,             [191] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary
        “Pedestrian detection at 100 frames per second,” in Com-                convolutional neural network,” in Advances in Neural
        puter Vision and Pattern Recognition (CVPR), 2012 IEEE                  Information Processing Systems, 2017, pp. 345–353.
        Conference on. IEEE, 2012, pp. 2903–2910.                         [192] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and
[174]   S. Maji, A. C. Berg, and J. Malik, “Classification using                Y. Bengio, “Binarized neural networks,” in Advances in
        intersection kernel support vector machines is efficient,”              neural information processing systems, 2016, pp. 4107–4115.
        in Computer Vision and Pattern Recognition, 2008. CVPR            [193] G. Hinton, O. Vinyals, and J. Dean, “Distilling
        2008. IEEE Conference on. IEEE, 2008, pp. 1–8.                          the knowledge in a neural network,” arXiv preprint
[175]   A. Vedaldi and A. Zisserman, “Sparse kernel approxima-                  arXiv:1503.02531, 2015.
        tions for efficient classification and detection,” in Com-        [194] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,
        puter Vision and Pattern Recognition (CVPR), 2012 IEEE                  and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv
        Conference on. IEEE, 2012, pp. 2320–2327.                               preprint arXiv:1412.6550, 2014.
[176]   F. Fleuret and D. Geman, “Coarse-to-fine face detection,”         [195] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker,
        International Journal of computer vision, vol. 41, no. 1-2, pp.         “Learning efficient object detection models with knowl-
        85–107, 2001.                                                           edge distillation,” in Advances in Neural Information Pro-
[177]   Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, “Fast hu-                cessing Systems, 2017, pp. 742–751.
        man detection using a cascade of histograms of oriented           [196] Q. Li, S. Jin, and J. Yan, “Mimicking very efficient network
        gradients,” in Computer Vision and Pattern Recognition,                 for object detection,” in 2017 IEEE Conference on Computer
        2006 IEEE Computer Society Conference on, vol. 2. IEEE,                 Vision and Pattern Recognition (CVPR). IEEE, 2017, pp.
        2006, pp. 1491–1498.                                                    7341–7349.
[178]   A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman,               [197] K. He and J. Sun, “Convolutional neural networks at con-
        “Multiple kernels for object detection,” in Computer Vi-                strained time cost,” in Proceedings of the IEEE conference
        sion, 2009 IEEE 12th International Conference on. IEEE,                 on computer vision and pattern recognition, 2015, pp. 5353–
        2009, pp. 606–613.                                                      5360.
[179]   H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A con-            [198] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wo-
        volutional neural network cascade for face detection,” in               jna, “Rethinking the inception architecture for computer
        Proceedings of the IEEE Conference on Computer Vision and               vision,” in Proceedings of the IEEE conference on computer
        Pattern Recognition, 2015, pp. 5325–5334.                               vision and pattern recognition, 2016, pp. 2818–2826.
[180]   K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face               [199] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large
        detection and alignment using multitask cascaded convo-                 kernel mattersimprove semantic segmentation by global
        lutional networks,” IEEE Signal Processing Letters, vol. 23,            convolutional network,” in Computer Vision and Pattern
        no. 10, pp. 1499–1503, 2016.                                            Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017,
[181]   Z. Cai, M. Saberian, and N. Vasconcelos, “Learning                      pp. 1743–1751.
        complexity-aware cascades for deep pedestrian detec-              [200] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park,
        tion,” in Proceedings of the IEEE International Conference              “Pvanet: deep but lightweight neural networks for real-
        on Computer Vision, 2015, pp. 3361–3369.                                time object detection,” arXiv preprint arXiv:1608.08021,
[182]   B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Craft objects from              2016.
        images,” in Proceedings of the IEEE Conference on Computer        [201] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient
        Vision and Pattern Recognition, 2016, pp. 6043–6051.                    and accurate approximations of nonlinear convolutional
[183]   F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast             networks,” in Proceedings of the IEEE Conference on Com-
        and accurate cnn object detector with scale dependent                   puter Vision and Pattern Recognition, 2015, pp. 1984–1992.
        pooling and cascaded rejection classifiers,” in Proceedings       [202] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very
        of the IEEE conference on computer vision and pattern recog-            deep convolutional networks for classification and de-
        nition, 2016, pp. 2129–2137.                                            tection,” IEEE transactions on pattern analysis and machine
[184]   M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis,                   intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
        “Dynamic zoom-in network for fast object detection in             [203] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet:
        large images,” in IEEE Conference on Computer Vision and                An extremely efficient convolutional neural network for
        Pattern Recognition (CVPR), 2018.                                       mobile devices,” 2017.
[185]   W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Chained                 [204] F. Chollet, “Xception: Deep learning with depthwise
        cascade network for object detection,” in Computer Vision               separable convolutions,” arXiv preprint, pp. 1610–02 357,
        (ICCV), 2017 IEEE International Conference on. IEEE, 2017,              2017.
        pp. 1956–1964.                                                    [205] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko,
[186]   Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain                 W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mo-
        damage,” in Advances in neural information processing sys-              bilenets: Efficient convolutional neural networks for mo-
                                                                                                                                         34
        bile vision applications,” arXiv preprint arXiv:1704.04861,             fbfft: A gpu performance evaluation,” arXiv preprint
        2017.                                                                   arXiv:1412.7580, 2014.
[206]   M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and                [225]   O. Rippel, J. Snoek, and R. P. Adams, “Spectral represen-
        L.-C. Chen, “Mobilenetv2: Inverted residuals and linear                 tations for convolutional neural networks,” in Advances in
        bottlenecks,” in 2018 IEEE/CVF Conference on Computer                   neural information processing systems, 2015, pp. 2449–2457.
        Vision and Pattern Recognition. IEEE, 2018, pp. 4510–4520.      [226]   C. Dubout and F. Fleuret, “Exact acceleration of linear ob-
[207]   Y. Li, J. Li, W. Lin, and J. Li, “Tiny-dsod: Lightweight                ject detectors,” in European Conference on Computer Vision.
        object detection for resource-restricted usages,” arXiv                 Springer, 2012, pp. 301–311.
        preprint arXiv:1807.11013, 2018.                                [227]   M. A. Sadeghi and D. Forsyth, “Fast template evaluation
[208]   G. E. Hinton and R. R. Salakhutdinov, “Reducing the                     with vector quantization,” in Advances in neural informa-
        dimensionality of data with neural networks,” science,                  tion processing systems, 2013, pp. 2949–2957.
        vol. 313, no. 5786, pp. 504–507, 2006.                          [228]   I. Kokkinos, “Bounding part scores for rapid detection
[209]   R. J. Wang, X. Li, S. Ao, and C. X. Ling, “Pelee: A real-time           with deformable part models,” in European Conference on
        object detection system on mobile devices,” arXiv preprint              Computer Vision. Springer, 2012, pp. 41–50.
        arXiv:1804.06882, 2018.                                         [229]   J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai,
[210]   F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf,                      T. Liu, X. Wang, L. Wang, G. Wang et al., “Recent ad-
        W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level                 vances in convolutional neural networks,” arXiv preprint
        accuracy with 50x fewer parameters and¡ 0.5 mb model                    arXiv:1512.07108, 2015.
        size,” arXiv preprint arXiv:1602.07360, 2016.                   [230]   K. Simonyan and A. Zisserman, “Very deep convolu-
[211]   B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer,                        tional networks for large-scale image recognition,” arXiv
        “Squeezedet: Unified, small, low power fully convolu-                   preprint arXiv:1409.1556, 2014.
        tional neural networks for real-time object detection for       [231]   C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
        autonomous driving.” in CVPR Workshops, 2017, pp. 446–                  D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich,
        454.                                                                    “Going deeper with convolutions,” in Proceedings of the
[212]   T. Kong, A. Yao, Y. Chen, and F. Sun, “Hypernet: Towards                IEEE conference on computer vision and pattern recognition,
        accurate region proposal generation and joint object de-                2015, pp. 1–9.
        tection,” in Proceedings of the IEEE conference on computer     [232]   S. Ioffe and C. Szegedy, “Batch normalization: Accelerat-
        vision and pattern recognition, 2016, pp. 845–853.                      ing deep network training by reducing internal covariate
[213]   B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning               shift,” arXiv preprint arXiv:1502.03167, 2015.
        transferable architectures for scalable image recognition,”     [233]   C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi,
        in Proceedings of the IEEE conference on computer vision and            “Inception-v4, inception-resnet and the impact of resid-
        pattern recognition, 2018, pp. 8697–8710.                               ual connections on learning.” in AAAI, vol. 4, 2017, p. 12.
[214]   B. Zoph and Q. V. Le, “Neural architecture search with          [234]   K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual
        reinforcement learning,” arXiv preprint arXiv:1611.01578,               learning for image recognition,” in Proceedings of the IEEE
        2016.                                                                   conference on computer vision and pattern recognition, 2016,
[215]   Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun,                pp. 770–778.
        “Detnas: Neural architecture search on object detection,”       [235]   G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Wein-
        arXiv preprint arXiv:1903.10979, 2019.                                  berger, “Densely connected convolutional networks.” in
[216]   C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille,             CVPR, vol. 1, no. 2, 2017, p. 3.
        and L. Fei-Fei, “Auto-deeplab: Hierarchical neural archi-       [236]   J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation net-
        tecture search for semantic image segmentation,” arXiv                  works,” arXiv preprint arXiv:1709.01507, vol. 7, 2017.
        preprint arXiv:1901.02985, 2019.                                [237]   P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu, “Scale-
[217]   X. Chu, B. Zhang, R. Xu, and H. Ma, “Multi-objective re-                transferrable object detection,” in Proceedings of the IEEE
        inforced evolution in mobile neural architecture search,”               Conference on Computer Vision and Pattern Recognition,
        arXiv preprint arXiv:1901.01074, 2019.                                  2018, pp. 528–537.
[218]   C.-H. Hsu, S.-H. Chang, D.-C. Juan, J.-Y. Pan, Y.-T. Chen,      [238]   Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue,
        W. Wei, and S.-C. Chang, “Monas: Multi-objective neural                 “Dsod: Learning deeply supervised object detectors from
        architecture search using reinforcement learning,” arXiv                scratch,” in The IEEE International Conference on Computer
        preprint arXiv:1806.10332, 2018.                                        Vision (ICCV), vol. 3, no. 6, 2017, p. 7.
[219]   P. Simard, L. Bottou, P. Haffner, and Y. LeCun, “Boxlets: a     [239]   S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He,
        fast convolution algorithm for signal processing and neu-               “Aggregated residual transformations for deep neural
        ral networks,” in Advances in Neural Information Processing             networks,” in Computer Vision and Pattern Recognition
        Systems, 1999, pp. 571–577.                                             (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 5987–
[220]   X. Wang, T. X. Han, and S. Yan, “An hog-lbp human                       5995.
        detector with partial occlusion handling,” in Computer          [240]   J. Jeong, H. Park, and N. Kwak, “Enhancement of ssd by
        Vision, 2009 IEEE 12th International Conference on. IEEE,               concatenating feature maps for object detection,” arXiv
        2009, pp. 32–39.                                                        preprint arXiv:1705.09587, 2017.
[221]   F. Porikli, “Integral histogram: A fast way to extract          [241]   K. Lee, J. Choi, J. Jeong, and N. Kwak, “Residual features
        histograms in cartesian spaces,” in Computer Vision and                 and unified prediction network for single stage detec-
        Pattern Recognition, 2005. CVPR 2005. IEEE Computer So-                 tion,” arXiv preprint arXiv:1707.05031, 2017.
        ciety Conference on, vol. 1. IEEE, 2005, pp. 829–836.           [242]   G. Cao, X. Xie, W. Yang, Q. Liao, G. Shi, and J. Wu,
[222]   M. Mathieu, M. Henaff, and Y. LeCun, “Fast training                     “Feature-fused ssd: fast detection for small objects,” in
        of convolutional networks through ffts,” arXiv preprint                 Ninth International Conference on Graphic and Image Pro-
        arXiv:1312.5851, 2013.                                                  cessing (ICGIP 2017), vol. 10615. International Society
[223]   H. Pratt, B. Williams, F. Coenen, and Y. Zheng, “Fcnn:                  for Optics and Photonics, 2018, p. 106151E.
        Fourier convolutional neural networks,” in Joint European       [243]   L. Zheng, C. Fu, and Y. Zhao, “Extend the shallow part
        Conference on Machine Learning and Knowledge Discovery in               of single shot multibox detector via convolutional neural
        Databases. Springer, 2017, pp. 786–798.                                 network,” arXiv preprint arXiv:1801.05918, 2018.
[224]   N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Pi-       [244]   A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta,
        antino, and Y. LeCun, “Fast convolutional nets with                     “Beyond skip connections: Top-down modulation for ob-
                                                                                                                                        35
      ject detection,” arXiv preprint arXiv:1612.06851, 2016.                 2017 Fifteenth IAPR International Conference on.      IEEE,
[245] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, “Ron:              2017, pp. 514–517.
      Reverse connection with objectness prior networks for           [265]   J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, “Unitbox:
      object detection,” in IEEE Conference on Computer Vision                An advanced object detection network,” in Proceedings of
      and Pattern Recognition, vol. 1, 2017, p. 2.                            the 2016 ACM on Multimedia Conference. ACM, 2016, pp.
[246] S. Woo, S. Hwang, and I. S. Kweon, “Stairnet: Top-down                  516–520.
      semantic aggregation for accurate one shot detection,” in       [266]   S. Gidaris and N. Komodakis, “Locnet: Improving local-
      2018 IEEE Winter Conference on Applications of Computer                 ization accuracy for object detection,” in Proceedings of the
      Vision (WACV). IEEE, 2018, pp. 1093–1102.                               IEEE conference on computer vision and pattern recognition,
[247] Y. Chen, J. Li, B. Zhou, J. Feng, and S. Yan, “Weaving                  2016, pp. 789–798.
      multi-scale context for single shot detector,” arXiv preprint   [267]   B. A. Olshausen and D. J. Field, “Emergence of simple-
      arXiv:1712.03149, 2017.                                                 cell receptive field properties by learning a sparse code
[248] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive                    for natural images,” Nature, vol. 381, no. 6583, p. 607,
      deconvolutional networks for mid and high level feature                 1996.
      learning,” in Computer Vision (ICCV), 2011 IEEE Interna-        [268]   A. J. Bell and T. J. Sejnowski, “The independent compo-
      tional Conference on. IEEE, 2011, pp. 2018–2025.                        nents of natural scenes are edge filters,” Vision research,
[249] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C.                         vol. 37, no. 23, pp. 3327–3338, 1997.
      Berg, “Dssd: Deconvolutional single shot detector,” arXiv       [269]   S. Brahmbhatt, H. I. Christensen, and J. Hays, “Stuffnet:
      preprint arXiv:1701.06659, 2017.                                        Using stuffto improve object detection,” in Applications of
[250] J. Wang, Y. Yuan, and G. Yu, “Face attention network:                   Computer Vision (WACV), 2017 IEEE Winter Conference on.
      An effective face detector for the occluded faces,” arXiv               IEEE, 2017, pp. 934–943.
      preprint arXiv:1711.07246, 2017.                                [270]   A. Shrivastava and A. Gupta, “Contextual priming and
[251] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li, “Single             feedback for faster r-cnn,” in European Conference on Com-
      shot text detector with regional attention,” in The IEEE                puter Vision. Springer, 2016, pp. 330–348.
      International Conference on Computer Vision (ICCV), vol. 6,     [271]   Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L.
      no. 7, 2017.                                                            Yuille, “Single-shot object detection with enriched seman-
[252] F. Yu and V. Koltun, “Multi-scale context aggregation                   tics,” Center for Brains, Minds and Machines (CBMM),
      by dilated convolutions,” arXiv preprint arXiv:1511.07122,              Tech. Rep., 2018.
      2015.                                                           [272]   B. Cai, Z. Jiang, H. Zhang, Y. Yao, and S. Nie, “Online
[253] F. Yu, V. Koltun, and T. A. Funkhouser, “Dilated residual               exemplar-based fully convolutional network for aircraft
      networks.” in CVPR, vol. 2, 2017, p. 3.                                 detection in remote sensing images,” IEEE Geoscience and
[254] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun,                   Remote Sensing Letters, no. 99, pp. 1–5, 2018.
      “Detnet: A backbone network for object detection,” arXiv        [273]   G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class
      preprint arXiv:1804.06215, 2018.                                        geospatial object detection and geographic image clas-
[255] S. Liu, D. Huang, and Y. Wang, “Receptive field block                   sification based on collection of part detectors,” ISPRS
      net for accurate and fast object detection,” arXiv preprint             Journal of Photogrammetry and Remote Sensing, vol. 98, pp.
      arXiv:1711.07767, 2017.                                                 119–132, 2014.
[256] M. Najibi, M. Rastegari, and L. S. Davis, “G-cnn: an            [274]   P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri,
      iterative grid based object detector,” in Proceedings of the            “Transformation invariance in pattern recognitiontangent
      IEEE Conference on Computer Vision and Pattern Recogni-                 distance and tangent propagation,” in Neural networks:
      tion, 2016, pp. 2369–2377.                                              tricks of the trade. Springer, 1998, pp. 239–274.
[257] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon,        [275]   G. Cheng, P. Zhou, and J. Han, “Rifd-cnn: Rotation-
      “Attentionnet: Aggregating weak directions for accurate                 invariant and fisher discriminative convolutional neural
      object detection,” in Proceedings of the IEEE International             networks for object detection,” in Proceedings of the IEEE
      Conference on Computer Vision, 2015, pp. 2659–2667.                     Conference on Computer Vision and Pattern Recognition,
[258] Y. Lu, T. Javidi, and S. Lazebnik, “Adaptive object detec-              2016, pp. 2884–2893.
      tion using adjacency and zoom prediction,” in Proceedings       [276]   ——, “Learning rotation-invariant convolutional neural
      of the IEEE Conference on Computer Vision and Pattern                   networks for object detection in vhr optical remote sens-
      Recognition, 2016, pp. 2351–2359.                                       ing images,” IEEE Transactions on Geoscience and Remote
[259] R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface:                   Sensing, vol. 54, no. 12, pp. 7405–7415, 2016.
      A deep multi-task learning framework for face detec-            [277]   X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen, “Real-time
      tion, landmark localization, pose estimation, and gender                rotation-invariant face detection with progressive calibra-
      recognition,” IEEE Transactions on Pattern Analysis and                 tion networks,” in Proceedings of the IEEE Conference on
      Machine Intelligence, vol. 41, no. 1, pp. 121–135, 2019.                Computer Vision and Pattern Recognition, 2018, pp. 2295–
[260] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Real-                      2303.
      time multi-person 2d pose estimation using part affinity        [278]   M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial
      fields,” arXiv preprint arXiv:1611.08050, 2016.                         transformer networks,” in Advances in neural information
[261] H. Law and J. Deng, “Cornernet: Detecting objects as                    processing systems, 2015, pp. 2017–2025.
      paired keypoints,” in Proceedings of the European Confer-       [279]   D. Chen, G. Hua, F. Wen, and J. Sun, “Supervised trans-
      ence on Computer Vision (ECCV), vol. 6, 2018.                           former network for efficient face detection,” in European
[262] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into                 Conference on Computer Vision. Springer, 2016, pp. 122–
      high quality object detection,” in IEEE Conference on Com-              138.
      puter Vision and Pattern Recognition (CVPR), vol. 1, no. 2,     [280]   B. Singh and L. S. Davis, “An analysis of scale invariance
      2018, p. 10.                                                            in object detection–snip,” in Proceedings of the IEEE Con-
[263] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi, “Refinenet:               ference on Computer Vision and Pattern Recognition, 2018,
      Iterative refinement for accurate object localization,” in              pp. 3578–3587.
      Intelligent Transportation Systems (ITSC), 2016 IEEE 19th       [281]   B. Singh, M. Najibi, and L. S. Davis, “Sniper: Efficient
      International Conference on. IEEE, 2016, pp. 1528–1533.                 multi-scale training,” arXiv preprint arXiv:1805.09300,
[264] M.-C. Roh and J.-y. Lee, “Refining faster-rcnn for accurate             2018.
      object detection,” in Machine Vision Applications (MVA),        [282]   S. Qiao, W. Shen, W. Qiu, C. Liu, and A. L. Yuille,
                                                                                                                                             36
        “Scalenet: Guiding object proposal generation in super-                   Applications of Computer Vision (WACV), 2016 IEEE Winter
        markets and beyond.” in ICCV, 2017, pp. 1809–1818.                        Conference on. IEEE, 2016, pp. 1–9.
[283]   Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu, “Scale-         [302]   Y. Shen, R. Ji, S. Zhang, W. Zuo, and Y. Wang, “Generative
        aware face detection,” in The IEEE Conference on Computer                 adversarial learning towards fast weakly supervised de-
        Vision and Pattern Recognition (CVPR), vol. 3, 2017.                      tection,” in Proceedings of the IEEE Conference on Computer
[284]   R. Zhu, S. Zhang, X. Wang, L. Wen, H. Shi, L. Bo, and                     Vision and Pattern Recognition, 2018, pp. 5764–5773.
        T. Mei, “Scratchdet: Exploring to train single-shot object        [303]   M. Enzweiler and D. M. Gavrila, “Monocular pedestrian
        detectors from scratch,” arXiv preprint arXiv:1810.08425,                 detection: Survey and experiments,” IEEE Transactions on
        2018.                                                                     Pattern Analysis & Machine Intelligence, no. 12, pp. 2179–
[285]   K. He, R. Girshick, and P. Dollár, “Rethinking imagenet                  2195, 2008.
        pre-training,” arXiv preprint arXiv:1811.08883, 2018.             [304]   D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf,
[286]   I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,                         “Survey of pedestrian detection for advanced driver as-
        D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio,                   sistance systems,” IEEE transactions on pattern analysis and
        “Generative adversarial nets,” in Advances in neural infor-               machine intelligence, vol. 32, no. 7, pp. 1239–1258, 2010.
        mation processing systems, 2014, pp. 2672–2680.                   [305]   R. Benenson, M. Omran, J. Hosang, and B. Schiele, “Ten
[287]   A. Radford, L. Metz, and S. Chintala, “Unsupervised rep-                  years of pedestrian detection, what have we learned?” in
        resentation learning with deep convolutional generative                   European Conference on Computer Vision. Springer, 2014,
        adversarial networks,” arXiv preprint arXiv:1511.06434,                   pp. 613–627.
        2015.                                                             [306]   S. Zhang, R. Benenson, M. Omran, J. Hosang, and
[288]   J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired                  B. Schiele, “How far are we from solving pedestrian de-
        image-to-image translation using cycle-consistent adver-                  tection?” in Proceedings of the IEEE Conference on Computer
        sarial networks,” arXiv preprint, 2017.                                   Vision and Pattern Recognition, 2016, pp. 1259–1267.
[289]   C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunning-         [307]   ——, “Towards reaching human performance in pedes-
        ham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang                 trian detection,” IEEE transactions on pattern analysis and
        et al., “Photo-realistic single image super-resolution using              machine intelligence, vol. 40, no. 4, pp. 973–986, 2018.
        a generative adversarial network.” in CVPR, vol. 2, no. 3,        [308]   P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians
        2017, p. 4.                                                               using patterns of motion and appearance,” International
[290]   J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Per-                Journal of Computer Vision, vol. 63, no. 2, pp. 153–161, 2005.
        ceptual generative adversarial networks for small object          [309]   P. Sabzmeydani and G. Mori, “Detecting pedestrians by
        detection,” in IEEE CVPR, 2017.                                           learning shapelet features,” in Computer Vision and Pattern
[291]   Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “Sod-mtgan:                     Recognition, 2007. CVPR’07. IEEE Conference on. IEEE,
        Small object detection via multi-task generative adversar-                2007, pp. 1–8.
        ial network,” Computer Vision-ECCV, pp. 8–14, 2018.               [310]   J. Cao, Y. Pang, and X. Li, “Pedestrian detection inspired
[292]   X. Wang, A. Shrivastava, and A. Gupta, “A-fast-rcnn:                      by appearance constancy and shape symmetry,” in Pro-
        Hard positive generation via adversary for object detec-                  ceedings of the IEEE Conference on Computer Vision and
        tion,” in IEEE Conference on Computer Vision and Pattern                  Pattern Recognition, 2016, pp. 1316–1324.
        Recognition, 2017.                                                [311]   R. Benenson, R. Timofte, and L. Van Gool, “Stixels estima-
[293]   S.-T. Chen, C. Cornelius, J. Martin, and D. H. Chau,                      tion without depth map computation,” in Computer Vision
        “Robust physical adversarial attack on faster r-cnn object                Workshops (ICCV Workshops), 2011 IEEE International Con-
        detector,” arXiv preprint arXiv:1804.05810, 2018.                         ference on. IEEE, 2011, pp. 2010–2017.
[294]   R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly super-           [312]   J. Hosang, M. Omran, R. Benenson, and B. Schiele, “Tak-
        vised object localization with multi-fold multiple instance               ing a deeper look at pedestrians,” in Proceedings of the
        learning,” IEEE transactions on pattern analysis and machine              IEEE Conference on Computer Vision and Pattern Recogni-
        intelligence, vol. 39, no. 1, pp. 189–203, 2017.                          tion, 2015, pp. 4073–4082.
[295]   D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari,    [313]   J. Cao, Y. Pang, and X. Li, “Learning multilayer channel
        “We don’t need no bounding-boxes: Training object class                   features for pedestrian detection,” IEEE transactions on
        detectors using only human verification,” in Proceedings                  image processing, vol. 26, no. 7, pp. 3210–3220, 2017.
        of the IEEE Conference on Computer Vision and Pattern             [314]   J. Mao, T. Xiao, Y. Jiang, and Z. Cao, “What can help
        Recognition, 2016, pp. 854–863.                                           pedestrian detection?” in 2017 IEEE Conference on Com-
[296]   T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez,                    puter Vision and Pattern Recognition (CVPR). IEEE, 2017,
        “Solving the multiple instance problem with axis-parallel                 pp. 6034–6043.
        rectangles,” Artificial intelligence, vol. 89, no. 1-2, pp. 31–   [315]   Q. Hu, P. Wang, C. Shen, A. van den Hengel, and
        71, 1997.                                                                 F. Porikli, “Pushing the limits of deep cnns for pedestrian
[297]   Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Soft proposal               detection,” IEEE Transactions on Circuits and Systems for
        networks for weakly supervised object localization,” in                   Video Technology, vol. 28, no. 6, pp. 1358–1368, 2018.
        Proc. IEEE Int. Conf. Comput. Vis.(ICCV), 2017, pp. 1841–         [316]   Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian detec-
        1850.                                                                     tion aided by deep learning semantic tasks,” in Proceed-
[298]   A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and                    ings of the IEEE Conference on Computer Vision and Pattern
        L. Van Gool, “Weakly supervised cascaded convolutional                    Recognition, 2015, pp. 5079–5087.
        networks.” in CVPR, vol. 1, no. 2, 2017, p. 8.                    [317]   D. Xu, W. Ouyang, E. Ricci, X. Wang, and N. Sebe,
[299]   B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Tor-                   “Learning cross-modal deep representations for robust
        ralba, “Learning deep features for discriminative local-                  pedestrian detection,” in Proc. of the IEEE Conf. on Com-
        ization,” in Proceedings of the IEEE Conference on Computer               puter Vision and Pattern Recognition (CVPR), 2017.
        Vision and Pattern Recognition, 2016, pp. 2921–2929.              [318]   X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen,
[300]   H. Bilen and A. Vedaldi, “Weakly supervised deep de-                      “Repulsion loss: Detecting pedestrians in a crowd,” arXiv
        tection networks,” in Proceedings of the IEEE Conference on               preprint arXiv:1711.07752, 2017.
        Computer Vision and Pattern Recognition, 2016, pp. 2846–          [319]   Y. Tian, P. Luo, X. Wang, and X. Tang, “Deep learning
        2854.                                                                     strong parts for pedestrian detection,” in Proceedings of
[301]   L. Bazzani, A. Bergamo, D. Anguelov, and L. Torresani,                    the IEEE international conference on computer vision, 2015,
        “Self-taught object localization with deep networks,” in                  pp. 1904–1912.
                                                                                                                                         37
[320] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, and X. Wang,                  recognition: Recent advances and future trends,” Frontiers
      “Jointly learning deep features, deformable parts, occlu-               of Computer Science, vol. 10, no. 1, pp. 19–36, 2016.
      sion and classification for pedestrian detection,” IEEE         [339]   Q. Ye and D. Doermann, “Text detection and recognition
      transactions on pattern analysis and machine intelligence,              in imagery: A survey,” IEEE transactions on pattern analysis
      vol. 40, no. 8, pp. 1874–1887, 2018.                                    and machine intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
[321] S. Zhang, J. Yang, and B. Schiele, “Occluded pedestrian         [340]   L. Neumann and J. Matas, “Scene text localization and
      detection through guided attention in cnns,” in Proceed-                recognition with oriented stroke detection,” in Proceedings
      ings of the IEEE Conference on Computer Vision and Pattern              of the IEEE International Conference on Computer Vision,
      Recognition, 2018, pp. 6995–7003.                                       2013, pp. 97–104.
[322] P. Hu and D. Ramanan, “Finding tiny faces,” in Computer         [341]   X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, “Robust text
      Vision and Pattern Recognition (CVPR), 2017 IEEE Confer-                detection in natural scene images,” IEEE transactions on
      ence on. IEEE, 2017, pp. 1522–1530.                                     pattern analysis and machine intelligence, vol. 36, no. 5, pp.
[323] M.-H. Yang, D. J. Kriegman, and N. Ahuja, “Detecting                    970–983, 2014.
      faces in images: A survey,” IEEE Transactions on pattern        [342]   K. Wang, B. Babenko, and S. Belongie, “End-to-end scene
      analysis and machine intelligence, vol. 24, no. 1, pp. 34–58,           text recognition,” in Computer Vision (ICCV), 2011 IEEE
      2002.                                                                   International Conference on. IEEE, 2011, pp. 1457–1464.
[324] S. Zafeiriou, C. Zhang, and Z. Zhang, “A survey on face         [343]   T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, “End-to-end
      detection in the wild: past, present and future,” Computer              text recognition with convolutional neural networks,” in
      Vision and Image Understanding, vol. 138, pp. 1–24, 2015.               Pattern Recognition (ICPR), 2012 21st International Confer-
[325] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-                ence on. IEEE, 2012, pp. 3304–3308.
      based face detection,” IEEE Transactions on pattern analysis    [344]   S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. Lim Tan,
      and machine intelligence, vol. 20, no. 1, pp. 23–38, 1998.              “Text flow: A unified text detection system in natural
[326] E. Osuna, R. Freund, and F. Girosit, “Training support                  scene images,” in Proceedings of the IEEE international
      vector machines: an application to face detection,” in                  conference on computer vision, 2015, pp. 4651–4659.
      Computer vision and pattern recognition, 1997. Proceedings.,    [345]   M. Jaderberg, A. Vedaldi, and A. Zisserman, “Deep fea-
      1997 IEEE computer society conference on. IEEE, 1997, pp.               tures for text spotting,” in European conference on computer
      130–136.                                                                vision. Springer, 2014, pp. 512–528.
[327] M. Osadchy, Y. L. Cun, and M. L. Miller, “Synergistic           [346]   X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao, “Multi-
      face detection and pose estimation with energy-based models,” Journal of Machine Learning Research, vol. 8, no. May, pp. 1197–1215, 2007.
[328] S. Yang, P. Luo, C. C. Loy, and X. Tang, “Faceness-net: Face detection through deep facial part responses,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1845–1859, 2018.
[329] S. Yang, Y. Xiong, C. C. Loy, and X. Tang, “Face detection through scale-friendly deep convolutional networks,” arXiv preprint arXiv:1706.02863, 2017.
[330] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, “SSH: Single stage headless face detector,” in ICCV, 2017, pp. 4885–4894.
[331] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “S^3FD: Single shot scale-invariant face detector,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 192–201.
[332] X. Liu, “A camera phone based currency reader for the visually impaired,” in Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 2008, pp. 305–306.
[333] N. Ezaki, K. Kiyota, B. T. Minh, M. Bulacu, and L. Schomaker, “Improved text-detection methods for a camera-based text reading system for blind persons,” in Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on. IEEE, 2005, pp. 257–261.
[334] P. Sermanet, S. Chintala, and Y. LeCun, “Convolutional neural networks applied to house numbers digit classification,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3288–3291.
[335] Z. Wojna, A. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz, “Attention-based extraction of structured information from street view imagery,” arXiv preprint arXiv:1704.03549, 2017.
[336] Y. Liu and L. Jin, “Deep matching prior network: Toward tighter multi-oriented text detection,” in Proc. CVPR, 2017, pp. 3454–3461.
[337] Y. Wu and P. Natarajan, “Self-organized text detection with minimal post-processing via border learning,” in Proc. ICCV, 2017.
[338] Y. Zhu, C. Yao, and X. Bai, “Scene text detection and
      orientation scene text detection with adaptive clustering,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 9, pp. 1930–1937, 2015.
[347] Z. Zhang, W. Shen, C. Yao, and X. Bai, “Symmetry-based text line detection in natural scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2558–2567.
[348] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
[349] W. Huang, Y. Qiao, and X. Tang, “Robust scene text detection with convolution neural network induced MSER trees,” in European Conference on Computer Vision. Springer, 2014, pp. 497–511.
[350] T. He, W. Huang, Y. Qiao, and J. Yao, “Text-attentional convolutional neural network for scene text detection,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2529–2541, 2016.
[351] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, 2018.
[352] ——, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, 2018.
[353] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo, “R2CNN: Rotational region CNN for orientation robust scene text detection,” arXiv preprint arXiv:1706.09579, 2017.
[354] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, “TextBoxes: A fast text detector with a single deep neural network,” in AAAI, 2017, pp. 4161–4167.
[355] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Deep direct regression for multi-oriented scene text detection,” arXiv preprint arXiv:1703.08289, 2017.
[356] Y. Liu and L. Jin, “Deep matching prior network: Toward tighter multi-oriented text detection,” in Proc. CVPR, 2017, pp. 3454–3461.
[357] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An efficient and accurate scene text detector,” in Proc. CVPR, 2017, pp. 2642–2651.
[358] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao, “Scene text detection via holistic, multi-channel prediction,” arXiv preprint arXiv:1606.09002, 2016.
[359] C. Xue, S. Lu, and F. Zhan, “Accurate scene text detection through border semantics awareness and bootstrapping,” in European Conference on Computer Vision. Springer, 2018, pp. 370–387.
[360] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai, “Multi-oriented scene text detection via corner localization and region segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7553–7563.
[361] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, “Detecting text in natural image with connectionist text proposal network,” in European Conference on Computer Vision. Springer, 2016, pp. 56–72.
[362] A. d. l. Escalera, L. Moreno, M. A. Salichs, and J. M. Armingol, “Road traffic sign detection and classification,” 1997.
[363] D. M. Gavrila, U. Franke, C. Wohler, and S. Gorzig, “Real time vision for intelligent vehicles,” IEEE Instrumentation & Measurement Magazine, vol. 4, no. 2, pp. 22–27, 2001.
[364] C. F. Paulo and P. L. Correia, “Automatic detection and classification of traffic signs,” in Image Analysis for Multimedia Interactive Services, 2007. WIAMIS’07. Eighth International Workshop on. IEEE, 2007, pp. 11–11.
[365] A. De la Escalera, J. M. Armingol, and M. Mata, “Traffic sign recognition and analysis for intelligent vehicles,” Image and Vision Computing, vol. 21, no. 3, pp. 247–258, 2003.
[366] W. Shadeed, D. I. Abu-Al-Nadi, and M. J. Mismar, “Road traffic sign detection in color images,” in Electronics, Circuits and Systems, 2003. ICECS 2003. Proceedings of the 2003 10th IEEE International Conference on, vol. 2. IEEE, 2003, pp. 890–893.
[367] S. Maldonado-Bascón, S. Lafuente-Arroyo, P. Gil-Jimenez, H. Gómez-Moreno, and F. López-Ferreras, “Road-sign detection and recognition based on support vector machines,” IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 2, pp. 264–278, 2007.
[368] M. Omachi and S. Omachi, “Traffic light detection with color and edge information,” 2009.
[369] Y. Xie, L.-f. Liu, C.-h. Li, and Y.-y. Qu, “Unifying visual saliency with HOG feature learning for traffic sign detection,” in Intelligent Vehicles Symposium, 2009 IEEE. IEEE, 2009, pp. 24–29.
[370] S. Houben, “A single target voting scheme for traffic sign detection,” in Intelligent Vehicles Symposium (IV), 2011 IEEE. IEEE, 2011, pp. 124–129.
[371] A. Soetedjo and K. Yamada, “Fast and robust traffic sign detection,” in Systems, Man and Cybernetics, 2005 IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1341–1346.
[372] N. Fairfield and C. Urmson, “Traffic light mapping and detection,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5421–5426.
[373] J. Levinson, J. Askeland, J. Dolson, and S. Thrun, “Traffic light mapping, localization, and state detection for autonomous vehicles,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5784–5791.
[374] C. Bahlmann, Y. Zhu, V. Ramesh, M. Pellkofer, and T. Koehler, “A system for traffic sign detection, tracking, and recognition using color, shape, and motion information,” in Intelligent Vehicles Symposium, 2005. Proceedings. IEEE. IEEE, 2005, pp. 255–260.
[375] I. M. Creusen, R. G. Wijnhoven, E. Herbschleb, and P. de With, “Color exploitation in HOG-based traffic sign detection,” in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 2669–2672.
[376] G. Wang, G. Ren, Z. Wu, Y. Zhao, and L. Jiang, “A robust, coarse-to-fine traffic sign detection method,” in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–5.
[377] Z. Shi, Z. Zou, and C. Zhang, “Real-time traffic light detection with adaptive background suppression filter,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 3, pp. 690–700, 2016.
[378] Y. Lu, J. Lu, S. Zhang, and P. Hall, “Traffic signal detection and classification in street views using an attention model,” Computational Visual Media, vol. 4, no. 3, pp. 253–266, 2018.
[379] M. Bach, D. Stumper, and K. Dietmayer, “Deep convolutional traffic light recognition for automated driving,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 851–858.
[380] S. Qiu, G. Wen, and Y. Fan, “Occluded object detection in high-resolution remote sensing images using partial configuration object model,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 5, pp. 1909–1925, 2017.
[381] Z. Zou and Z. Shi, “Ship detection in spaceborne optical image with SVD networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 10, pp. 5832–5845, 2016.
[382] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
[383] N. Proia and V. Pagé, “Characterization of a Bayesian ship detection method in optical satellite images,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 2, pp. 226–230, 2010.
[384] C. Zhu, H. Zhou, R. Wang, and J. Guo, “A novel hierarchical method of ship detection from spaceborne optical image based on shape and texture features,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 9, pp. 3446–3456, 2010.
[385] S. Qi, J. Ma, J. Lin, Y. Li, and J. Tian, “Unsupervised ship detection based on saliency and S-HOG descriptor from optical satellite images,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 7, pp. 1451–1455, 2015.
[386] F. Bi, B. Zhu, L. Gao, and M. Bian, “A visual search inspired computational model for ship detection in optical satellite images,” IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 4, pp. 749–753, 2012.
[387] J. Han, P. Zhou, D. Zhang, G. Cheng, L. Guo, Z. Liu, S. Bu, and J. Wu, “Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 89, pp. 37–48, 2014.
[388] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3325–3337, 2015.
[389] J. Tang, C. Deng, G.-B. Huang, and B. Zhao, “Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3, pp. 1174–1185, 2015.
[390] Z. Shi, X. Yu, Z. Jiang, and B. Li, “Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, pp. 4511–4523, 2014.
[391] A. Kembhavi, D. Harwood, and L. S. Davis, “Vehicle detection using partial least squares,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.
[392] L. Wan, L. Zheng, H. Huo, and T. Fang, “Affine invariant description and large-margin dimensionality reduction for target detection in optical remote sensing images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 7, pp. 1116–1120, 2017.
[393] H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, “Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning,” IEEE Transactions on Geoscience and Remote Sensing, no. 99, pp. 1–12, 2018.
[394] M. ElMikaty and T. Stathaki, “Detection of cars in high-resolution aerial images of complex urban environments,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5913–5924, 2017.
[395] L. Zhang, Z. Shi, and J. Wu, “A hierarchical oil tank detector with deep surrounding features for high-resolution optical satellite imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 10, pp. 4895–4909, 2015.
[396] C. Zhu, B. Liu, Y. Zhou, Q. Yu, X. Liu, and W. Yu, “Framework design and implementation for oil tank detection in optical satellite imagery,” in Geoscience and Remote Sensing Symposium (IGARSS), 2012 IEEE International. IEEE, 2012, pp. 6016–6019.
[397] G. Liu, Y. Zhang, X. Zheng, X. Sun, K. Fu, and H. Wang, “A new method on inshore ship detection in high-resolution satellite images using shape and context information,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 3, pp. 617–621, 2014.
[398] J. Xu, X. Sun, D. Zhang, and K. Fu, “Automatic detection of inshore ships in high-resolution remote sensing images using robust invariant generalized Hough transform,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 12, pp. 2070–2074, 2014.
[399] J. Zhang, C. Tao, and Z. Zou, “An on-road vehicle detection method for high-resolution aerial images based on local and global structure learning,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 8, pp. 1198–1202, 2017.
[400] W. Diao, X. Sun, X. Zheng, F. Dou, H. Wang, and K. Fu, “Efficient saliency-based object detection in remote sensing images using deep belief networks,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 2, pp. 137–141, 2016.
[401] P. Zhang, X. Niu, Y. Dou, and F. Xia, “Airport detection on optical satellite images using deep convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 8, pp. 1183–1187, 2017.
[402] Z. Shi and Z. Zou, “Can a machine generate humanlike language descriptions for a remote sensing image?” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3623–3634, 2017.
[403] X. Han, Y. Zhong, and L. Zhang, “An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery,” Remote Sensing, vol. 9, no. 7, p. 666, 2017.
[404] Z. Xu, X. Xu, L. Wang, R. Yang, and F. Pu, “Deformable ConvNet with aspect ratio constrained NMS for object detection in remote sensing imagery,” Remote Sensing, vol. 9, no. 12, p. 1312, 2017.
[405] W. Li, K. Fu, H. Sun, X. Sun, Z. Guo, M. Yan, and X. Zheng, “Integrated localization and recognition for inshore ships in large scene remote sensing images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 6, pp. 936–940, 2017.
[406] O. A. Penatti, K. Nogueira, and J. A. dos Santos, “Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 44–51.
[407] L. W. Sommer, T. Schuchert, and J. Beyerer, “Fast deep vehicle detection in aerial images,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 311–319.
[408] L. Sommer, T. Schuchert, and J. Beyerer, “Comprehensive analysis of deep learning based vehicle detection in aerial images,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[409] Z. Liu, J. Hu, L. Weng, and Y. Yang, “Rotated region based CNN for ship detection,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 900–904.
[410] H. Lin, Z. Shi, and Z. Zou, “Fully convolutional network with task partitioning for inshore ship detection in optical remote sensing images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1665–1669, 2017.
[411] ——, “Maritime semantic labeling of optical remote sensing images with multi-scale fully convolutional network,” Remote Sensing, vol. 9, no. 5, p. 480, 2017.