
Object Detection in 20 Years: A Survey


Zhengxia Zou, Zhenwei Shi, Member, IEEE, Yuhong Guo, and Jieping Ye, Senior Member, IEEE

Abstract—Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development in the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold weapon era. This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century's time (from the 1990s to 2019). A number of topics are covered in this paper, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed up techniques, and the recent state of the art detection methods. This paper also reviews some important detection applications, such as pedestrian detection, face detection, text detection, etc., and makes an in-depth analysis of their challenges as well as technical improvements in recent years.

arXiv:1905.05055v2 [cs.CV] 16 May 2019

Index Terms—Object detection, Computer vision, Deep learning, Convolutional neural networks, Technical evolution.

1 INTRODUCTION

Object detection is an important computer vision task that deals with detecting instances of visual objects of a certain class (such as humans, animals, or cars) in digital images. The objective of object detection is to develop computational models and techniques that provide one of the most basic pieces of information needed by computer vision applications: What objects are where?
As one of the fundamental problems of computer vision, object detection forms the basis of many other computer vision tasks, such as instance segmentation [1–4], image captioning [5–7], object tracking [8], etc. From the application point of view, object detection can be grouped into two research topics: "general object detection" and "detection applications", where the former aims to explore the methods of detecting different types of objects under a unified framework to simulate human vision and cognition, and the latter refers to detection under specific application scenarios, such as pedestrian detection, face detection, text detection, etc. In recent years, the rapid development of deep learning techniques [9] has brought new blood into object detection, leading to remarkable breakthroughs and pushing it forward to a research hot-spot with unprecedented attention. Object detection has now been widely used in many real-world applications, such as autonomous driving, robot vision, video surveillance, etc. Fig. 1 shows the growing number of publications that are associated with "object detection" over the past two decades.

Fig. 1. The increasing number of publications in object detection from 1998 to 2018. (Data from Google Scholar advanced search: allintitle: "object detection" AND "detecting objects".)

Corresponding Author: Zhengxia Zou (zzhengxi@umich.edu) and Jieping Ye (jpye@umich.edu). Zhengxia Zou is with the Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, U.S.A. Zhenwei Shi is with the Image Processing Center, School of Astronautics, Beihang University, Beijing 100191, China, and with the State Key Laboratory of Virtual Reality Technology and Systems, School of Astronautics, Beihang University, Beijing 100191, China. Yuhong Guo is with the School of Computer Science, Carleton University, Ottawa, HP5167, Canada, and with the DiDi Labs, Toronto, Canada. Jieping Ye is with the Department of Computational Medicine and Bioinformatics, and the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, U.S.A., and with the DiDi AI Labs, DiDi Chuxing, Beijing, 100085, China.

• Difference from other related reviews

A number of reviews of general object detection have been published in recent years [24–28]. The main differences between this paper and the above reviews are summarized as follows:

1. A comprehensive review in the light of technical evolutions: This paper extensively reviews 400+ papers in the development history of object detection, spanning over a quarter-century's time (from the 1990s to 2019). Most of the previous reviews merely focus on a short historical period or on some specific detection tasks without considering the technical evolutions over their entire lifetime. Standing on the highway of history not only helps readers build a complete knowledge hierarchy but also helps to find future directions of this fast-developing field.

2. An in-depth exploration of the key technologies and the recent state of the art: After years of development, the state of the art object detection systems have been integrated with a large number of techniques such as "multi-scale detection", "hard negative mining", "bounding box regression", etc.
However, previous reviews lack a fundamental analysis to help readers understand the nature of these sophisticated techniques, e.g., "Where did they come from and how did they evolve?" "What are the pros and cons of each group of methods?" This paper makes an in-depth analysis for readers of the above concerns.

3. A comprehensive analysis of detection speed up techniques: The acceleration of object detection has long been a crucial but challenging task. This paper makes an extensive review of the speed up techniques in 20 years of object detection history at multiple levels, including "detection pipeline" (e.g., cascaded detection, feature map shared computation), "detection backbone" (e.g., network compression, lightweight network design), and "numerical computation" (e.g., integral image, vector quantization). This topic is rarely covered by previous reviews.

• Difficulties and Challenges in Object Detection

Despite people always asking "what are the difficulties and challenges in object detection?", this question is not easy to answer and may even be over-generalized. As different detection tasks have totally different objectives and constraints, their difficulties may vary from each other. In addition to some common challenges shared with other computer vision tasks, such as objects under different viewpoints, illuminations, and intra-class variations, the challenges in object detection include, but are not limited to, the following aspects: object rotation and scale changes (e.g., small objects), accurate object localization, dense and occluded object detection, and speed up of detection. In Sections 4 and 5, we will give a more detailed analysis of these topics.

The rest of this paper is organized as follows. In Section 2, we review the 20 years' evolutionary history of object detection. Some speed up techniques in object detection will be introduced in Section 3. Some state of the art detection methods of the recent three years are summarized in Section 4. Some important detection applications will be reviewed in Section 5. In Section 6, we conclude this paper and make an analysis of further research directions.

2 OBJECT DETECTION IN 20 YEARS

In this section, we will review the history of object detection in multiple aspects, including milestone detectors, object detection datasets, metrics, and the evolution of key techniques.

2.1 A Road Map of Object Detection

In the past two decades, it is widely accepted that the progress of object detection has generally gone through two historical periods: the "traditional object detection period (before 2014)" and the "deep learning based detection period (after 2014)", as shown in Fig. 2.

Fig. 2. A road map of object detection. Milestone detectors in this figure: VJ Det. [10, 11], HOG Det. [12], DPM [13–15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], Pyramid Networks [22], Retina-Net [23].

2.1.1 Milestones: Traditional Detectors

If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness "the wisdom of the cold weapon era". Most of the early object detection algorithms were built based on handcrafted features. Due to the lack of effective image representations at that time, people had no choice but to design sophisticated feature representations, and a variety of speed up skills to exhaust the usage of limited computing resources.

• Viola-Jones Detectors

18 years ago, P. Viola and M. Jones achieved real-time detection of human faces for the first time without any constraints (e.g., skin color segmentation) [10, 11].
Running on a 700MHz Pentium III CPU, the detector was tens or even hundreds of times faster than any other algorithm of its time under comparable detection accuracy. The detection algorithm, which was later referred to as the "Viola-Jones (VJ) detector", was so named in honor of the authors' significant contributions.

The VJ detector follows the most straightforward way of detection, i.e., sliding windows: to go through all possible locations and scales in an image to see if any window contains a human face. Although it seems to be a very simple process, the computation behind it was far beyond the computer's power of its time. The VJ detector dramatically improved its detection speed by incorporating three important techniques: "integral image", "feature selection", and "detection cascades".

1) Integral image: The integral image is a computational method to speed up the box filtering or convolution process. Like other object detection algorithms of its time [29–31], the Haar wavelet is used in the VJ detector as the feature representation of an image. The integral image makes the computational complexity of each window in the VJ detector independent of its window size.

2) Feature selection: Instead of using a set of manually selected Haar basis filters, the authors used the Adaboost algorithm [32] to select the small set of features that are most helpful for face detection from a huge pool of random features (about 180k-dimensional).

3) Detection cascades: A multi-stage detection paradigm (a.k.a. the "detection cascades") was introduced in the VJ detector to reduce its computational overhead by spending less computation on background windows and more on face targets.

• HOG Detector

The Histogram of Oriented Gradients (HOG) feature descriptor was originally proposed in 2005 by N. Dalal and B. Triggs [12]. HOG can be considered as an important improvement of the scale-invariant feature transform [33, 34] and shape contexts [35] of its time. To balance the feature invariance (including translation, scale, illumination, etc) and the nonlinearity (on discriminating different object categories), the HOG descriptor is designed to be computed on a dense grid of uniformly spaced cells and to use overlapping local contrast normalization (on "blocks") for improving accuracy. Although HOG can be used to detect a variety of object classes, it was motivated primarily by the problem of pedestrian detection. To detect objects of different sizes, the HOG detector rescales the input image multiple times while keeping the size of the detection window unchanged. The HOG detector has long been an important foundation of many object detectors [13, 14, 36] and a large variety of computer vision applications for many years.

• Deformable Part-based Model (DPM)

DPM, as the winner of the VOC-07, -08, and -09 detection challenges, was the peak of the traditional object detection methods. DPM was originally proposed by P. Felzenszwalb [13] in 2008 as an extension of the HOG detector, and then a variety of improvements were made by R. Girshick [14, 15, 37, 38].

The DPM follows the detection philosophy of "divide and conquer", where the training can be simply considered as the learning of a proper way of decomposing an object, and the inference can be considered as an ensemble of detections on different object parts. For example, the problem of detecting a "car" can be considered as the detection of its window, body, and wheels. This part of the work, a.k.a. the "star-model", was completed by P. Felzenszwalb et al. [13]. Later on, R. Girshick further extended the star-model to the "mixture models" [14, 15, 37, 38] to deal with objects in the real world under more significant variations.

A typical DPM detector consists of a root-filter and a number of part-filters. Instead of manually specifying the configurations of the part filters (e.g., size and location), a weakly supervised learning method is developed in DPM where all configurations of the part filters can be learned automatically as latent variables. R. Girshick further formulated this process as a special case of Multi-Instance learning [39], and some other important techniques such as "hard negative mining", "bounding box regression", and "context priming" are also applied for improving detection accuracy (to be introduced in Section 2.3). To speed up the detection, Girshick developed a technique for "compiling" detection models into a much faster one that implements a cascade architecture, which achieved over 10 times acceleration without sacrificing any accuracy [14, 38].

Although today's object detectors have far surpassed DPM in terms of detection accuracy, many of them are still deeply influenced by its valuable insights, e.g., mixture models, hard negative mining, bounding box regression, etc. In 2010, P. Felzenszwalb and R. Girshick were awarded the "lifetime achievement" award by PASCAL VOC.

2.1.2 Milestones: CNN based Two-stage Detectors

As the performance of hand-crafted features became saturated, object detection reached a plateau after 2010. R. Girshick says: "... progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods" [38]. In 2012, the world saw the rebirth of convolutional neural networks [40]. As a deep convolutional network is able to learn robust and high-level feature representations of an image, a natural question is whether we can bring it to object detection. R. Girshick et al. took the lead to break the deadlock in 2014 by proposing the Regions with CNN features (RCNN) for object detection [16, 41]. Since then, object detection started to evolve at an unprecedented speed.

In the deep learning era, object detection can be grouped into two genres: "two-stage detection" and "one-stage detection", where the former frames the detection as a "coarse-to-fine" process while the latter frames it as "completing in one step".

• RCNN

The idea behind RCNN is simple: it starts with the extraction of a set of object proposals (object candidate boxes) by selective search [42]. Then each proposal is rescaled to a fixed-size image and fed into a CNN model trained on ImageNet (say, AlexNet [40]) to extract features. Finally, linear SVM classifiers are used to predict the presence of an object within each region and to recognize object categories.
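To make the proposal-crop-classify recipe concrete, the following is a minimal, illustrative sketch of an RCNN-style pipeline in Python. It is not the original implementation: the proposal step is stubbed out (selective search is assumed to be provided elsewhere), a torchvision AlexNet stands in for the ImageNet-pretrained CNN, and a plain linear layer stands in for the per-class SVMs.

```python
# Illustrative RCNN-style pipeline (a sketch, not the original implementation).
# Assumptions: `propose_regions` is a stand-in for selective search, and a plain
# linear layer replaces the per-class SVMs used in the paper.
import torch
import torchvision
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),          # each proposal is warped to a fixed size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained backbone (the weights argument name varies across torchvision versions).
backbone = torchvision.models.alexnet(weights="IMAGENET1K_V1")
backbone.classifier = backbone.classifier[:-1]   # drop the last FC layer, keep 4096-d features
backbone.eval()

def propose_regions(image):
    """Placeholder for selective search: returns a list of (x1, y1, x2, y2) boxes."""
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, w - 1, h - 1)]   # dummy boxes

def rcnn_like_detect(image, classifier):
    """Crop every proposal, extract CNN features, and score each region."""
    boxes, scores = [], []
    with torch.no_grad():
        for (x1, y1, x2, y2) in propose_regions(image):
            crop = image[y1:y2, x1:x2]                      # HxWx3 uint8 numpy array
            feat = backbone(preprocess(crop).unsqueeze(0))  # 1 x 4096 feature vector
            scores.append(classifier(feat))                 # 1 x num_classes scores
            boxes.append((x1, y1, x2, y2))
    return boxes, torch.cat(scores)

classifier = torch.nn.Linear(4096, 21)   # placeholder for the per-class SVM classifiers
```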
RCNN yields a significant performance boost on VOC07, with a large improvement of mean Average Precision (mAP) from 33.7% (DPM-v5 [43]) to 58.5%.

Although RCNN has made great progress, its drawbacks are obvious: the redundant feature computations on a large number of overlapped proposals (over 2000 boxes from one image) lead to an extremely slow detection speed (14s per image with a GPU). Later in the same year, SPPNet [17] was proposed and overcame this problem.

• SPPNet

In 2014, K. He et al. proposed Spatial Pyramid Pooling Networks (SPPNet) [17]. Previous CNN models require a fixed-size input, e.g., a 224x224 image for AlexNet [40]. The main contribution of SPPNet is the introduction of a Spatial Pyramid Pooling (SPP) layer, which enables a CNN to generate a fixed-length representation regardless of the size of the image/region of interest without rescaling it. When using SPPNet for object detection, the feature maps can be computed from the entire image only once, and then fixed-length representations of arbitrary regions can be generated for training the detectors, which avoids repeatedly computing the convolutional features. SPPNet is more than 20 times faster than R-CNN without sacrificing any detection accuracy (VOC07 mAP=59.2%).

Although SPPNet has effectively improved the detection speed, there are still some drawbacks: first, the training is still multi-stage; second, SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers. Later in the next year, Fast RCNN [18] was proposed and solved these problems.

• Fast RCNN

In 2015, R. Girshick proposed the Fast RCNN detector [18], which is a further improvement of R-CNN and SPPNet [16, 17]. Fast RCNN enables us to simultaneously train a detector and a bounding box regressor under the same network configuration. On the VOC07 dataset, Fast RCNN increased the mAP from 58.5% (RCNN) to 70.0%, with a detection speed over 200 times faster than R-CNN.

Although Fast-RCNN successfully integrates the advantages of R-CNN and SPPNet, its detection speed is still limited by the proposal detection (see Section 2.3.2 for more details). Then, a question naturally arises: "can we generate object proposals with a CNN model?" Later, Faster R-CNN [19] answered this question.

• Faster RCNN

In 2015, S. Ren et al. proposed the Faster RCNN detector [19, 44] shortly after the Fast RCNN. Faster RCNN is the first end-to-end, and the first near-realtime deep learning detector (COCO mAP@.5=42.7%, COCO mAP@[.5,.95]=21.9%, VOC07 mAP=73.2%, VOC12 mAP=70.4%, 17fps with ZF-Net [45]). The main contribution of Faster-RCNN is the introduction of the Region Proposal Network (RPN) that enables nearly cost-free region proposals. From R-CNN to Faster RCNN, most individual blocks of an object detection system, e.g., proposal detection, feature extraction, bounding box regression, etc., have been gradually integrated into a unified, end-to-end learning framework.

Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computation redundancy at the subsequent detection stage. Later, a variety of improvements have been proposed, including RFCN [46] and Light-head RCNN [47]. (See more details in Section 3.)

• Feature Pyramid Networks

In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN) [22] on the basis of Faster RCNN. Before FPN, most of the deep learning based detectors ran detection only on a network's top layer. Although the features in deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. To this end, a top-down architecture with lateral connections is developed in FPN for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, the FPN shows great advances for detecting objects with a wide variety of scales. Using FPN in a basic Faster R-CNN system, it achieves state-of-the-art single-model detection results on the MSCOCO dataset without bells and whistles (COCO mAP@.5=59.1%, COCO mAP@[.5, .95]=36.2%). FPN has now become a basic building block of many of the latest detectors.

2.1.3 Milestones: CNN based One-stage Detectors

• You Only Look Once (YOLO)

YOLO was proposed by R. Joseph et al. in 2015. It was the first one-stage detector in the deep learning era [20]. YOLO is extremely fast: a fast version of YOLO runs at 155fps with VOC07 mAP=52.7%, while its enhanced version runs at 45fps with VOC07 mAP=63.4% and VOC12 mAP=57.9%. YOLO is the abbreviation of "You Only Look Once". It can be seen from its name that the authors have completely abandoned the previous detection paradigm of "proposal detection + verification". Instead, it follows a totally different philosophy: to apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. Later, R. Joseph made a series of improvements on the basis of YOLO and proposed its v2 and v3 editions [48, 49], which further improve the detection accuracy while keeping a very high detection speed.

In spite of its great improvement of detection speed, YOLO suffers from a drop of localization accuracy compared with two-stage detectors, especially for some small objects. YOLO's subsequent versions [48, 49] and the later proposed SSD [21] have paid more attention to this problem.

• Single Shot MultiBox Detector (SSD)

SSD [21] was proposed by W. Liu et al. in 2015. It was the second one-stage detector in the deep learning era. The main contribution of SSD is the introduction of the multi-reference and multi-resolution detection techniques (to be introduced in Section 2.3.2), which significantly improve the detection accuracy of a one-stage detector, especially for some small objects. SSD has advantages in terms of both detection speed and accuracy (VOC07 mAP=76.8%, VOC12 mAP=74.9%, COCO mAP@.5=46.5%, mAP@[.5,.95]=26.8%, a fast version runs at 59fps). The main difference between SSD and previous detectors is that SSD detects objects of different scales on different layers of the network, while the previous ones only run detection on their top layers.
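The multi-resolution idea can be sketched in a few lines. The snippet below is a simplified illustration (not SSD's actual architecture or code): it assumes a backbone that returns three feature maps of decreasing resolution and attaches a small convolutional head to each, so that shallow, high-resolution maps handle small objects and deep, low-resolution maps handle large ones.

```python
# Simplified multi-resolution detection heads (an SSD-style illustration, not the actual SSD code).
import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), num_anchors=4, num_classes=21):
        super().__init__()
        self.num_anchors = num_anchors
        # One 3x3 conv head per feature map: class scores plus 4 box offsets for each anchor.
        self.heads = nn.ModuleList([
            nn.Conv2d(c, num_anchors * (num_classes + 4), kernel_size=3, padding=1)
            for c in in_channels
        ])

    def forward(self, feature_maps):
        # feature_maps: tensors [N, C_i, H_i, W_i], ordered from shallow (high-res) to deep (low-res).
        outputs = []
        for fmap, head in zip(feature_maps, self.heads):
            out = head(fmap)                                   # [N, A*(K+4), H_i, W_i]
            n, _, h, w = out.shape
            out = out.permute(0, 2, 3, 1).reshape(n, h * w * self.num_anchors, -1)
            outputs.append(out)
        return torch.cat(outputs, dim=1)                       # predictions from all scales concatenated

# Dummy feature maps standing in for a backbone's outputs at three resolutions.
heads = MultiScaleHeads()
fmaps = [torch.randn(1, 128, 38, 38), torch.randn(1, 256, 19, 19), torch.randn(1, 512, 10, 10)]
preds = heads(fmaps)   # shape: [1, total_anchor_positions, num_classes + 4]
```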
• RetinaNet

In spite of their high speed and simplicity, one-stage detectors have trailed the accuracy of two-stage detectors for years. T.-Y. Lin et al. discovered the reasons behind this and proposed RetinaNet in 2017 [23]. They claimed that the extreme foreground-background class imbalance encountered during the training of dense detectors is the central cause. To this end, a new loss function named "focal loss" has been introduced in RetinaNet by reshaping the standard cross entropy loss so that the detector puts more focus on hard, misclassified examples during training. Focal Loss enables the one-stage detectors to achieve accuracy comparable to two-stage detectors while maintaining a very high detection speed (COCO mAP@.5=59.1%, mAP@[.5, .95]=39.1%).

2.2 Object Detection Datasets and Metrics

Building larger datasets with less bias is critical for developing advanced computer vision algorithms. In object detection, a number of well-known datasets and benchmarks have been released in the past 10 years, including the datasets of the PASCAL VOC Challenges [50, 51] (e.g., VOC2007, VOC2012), the ImageNet Large Scale Visual Recognition Challenge (e.g., ILSVRC2014) [52], the MS-COCO Detection Challenge [53], etc. The statistics of these datasets are given in Table 1. Fig. 4 shows some image examples of these datasets. Fig. 3 shows the improvements of detection accuracy on the VOC07, VOC12 and MS-COCO datasets from 2008 to 2018.

Fig. 3. The accuracy improvements of object detection on VOC07, VOC12 and MS-COCO datasets. Detectors in this figure: DPM-v1 [13], DPM-v5 [54], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], SSD [21], FPN [22], Retina-Net [23], RefineDet [55], TridentNet [56].

• Pascal VOC

The PASCAL Visual Object Classes (VOC) Challenges (from 2005 to 2012, http://host.robots.ox.ac.uk/pascal/VOC/) [50, 51] was one of the most important competitions in the early computer vision community. There are multiple tasks in PASCAL VOC, including image classification, object detection, semantic segmentation and action detection. Two versions of Pascal-VOC are mostly used in object detection: VOC07 and VOC12, where the former consists of 5k training images + 12k annotated objects, and the latter consists of 11k training images + 27k annotated objects. 20 classes of objects that are common in everyday life are annotated in these two datasets (Person: person; Animal: bird, cat, cow, dog, horse, sheep; Vehicle: aeroplane, bicycle, boat, bus, car, motor-bike, train; Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor). In recent years, as some larger datasets like ILSVRC and MS-COCO (to be introduced) have been released, VOC has gradually fallen out of fashion and has now become a test-bed for most new detectors.

• ILSVRC

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC, http://image-net.org/challenges/LSVRC/) [52] has pushed forward the state of the art in generic object detection. ILSVRC was organized each year from 2010 to 2017. It contains a detection challenge using ImageNet images [57]. The ILSVRC detection dataset contains 200 classes of visual objects. The number of its images/object instances is two orders of magnitude larger than VOC. For example, ILSVRC-14 contains 517k images and 534k annotated objects.

• MS-COCO

MS-COCO (http://cocodataset.org/) [53] is the most challenging object detection dataset available today. The annual competition based on the MS-COCO dataset has been held since 2015. It has fewer object categories than ILSVRC, but more object instances. For example, MS-COCO-17 contains 164k images and 897k annotated objects from 80 categories. Compared with VOC and ILSVRC, the biggest progress of MS-COCO is that apart from the bounding box annotations, each object is further labeled using per-instance segmentation to aid in precise localization. In addition, MS-COCO contains more small objects (whose area is smaller than 1% of the image) and more densely located objects than VOC and ILSVRC. All these features make the object distribution in MS-COCO closer to that of the real world. Just like ImageNet in its time, MS-COCO has become the de facto standard for the object detection community.

• Open Images

The year of 2018 saw the introduction of the Open Images Detection (OID) challenge (https://storage.googleapis.com/openimages/web/index.html) [58], following MS-COCO but at an unprecedented scale. There are two tasks in
Open Images: 1) the standard object detection, and 2) the visual relationship detection, which detects paired objects in particular relations. For the object detection task, the dataset consists of 1,910k images with 15,440k annotated bounding boxes on 600 object categories.

Fig. 4. Some example images and annotations in (a) PASCAL-VOC07, (b) ILSVRC, (c) MS-COCO, and (d) Open Images.

Dataset | train (images / objects) | validation (images / objects) | trainval (images / objects) | test (images / objects)
VOC-2007 | 2,501 / 6,301 | 2,510 / 6,307 | 5,011 / 12,608 | 4,952 / 14,976
VOC-2012 | 5,717 / 13,609 | 5,823 / 13,841 | 11,540 / 27,450 | 10,991 / -
ILSVRC-2014 | 456,567 / 478,807 | 20,121 / 55,502 | 476,688 / 534,309 | 40,152 / -
ILSVRC-2017 | 456,567 / 478,807 | 20,121 / 55,502 | 476,688 / 534,309 | 65,500 / -
MS-COCO-2015 | 82,783 / 604,907 | 40,504 / 291,875 | 123,287 / 896,782 | 81,434 / -
MS-COCO-2018 | 118,287 / 860,001 | 5,000 / 36,781 | 123,287 / 896,782 | 40,670 / -
OID-2018 | 1,743,042 / 14,610,229 | 41,620 / 204,621 | 1,784,662 / 14,814,850 | 125,436 / 625,282

TABLE 1. Some well-known object detection datasets and their statistics.

• Datasets of Other Detection Tasks

In addition to general object detection, the past 20 years have also witnessed the prosperity of detection applications in specific areas, such as pedestrian detection, face detection, text detection, traffic sign/light detection, and remote sensing target detection. Tables 2-6 list some of the popular datasets of these detection tasks (the #Cites column shows citation statistics as of Feb. 2019). A detailed introduction of the detection methods for these tasks can be found in Section 5.

2.2.1 Metrics

How can we evaluate the effectiveness of an object detector? This question may even have different answers at different times.

In the early days of the detection community, there were no widely accepted evaluation criteria on detection performance. For example, in the early research of pedestrian detection [12], the "miss rate vs. false positives per-window (FPPW)" was usually used as a metric. However, the per-window measurement (FPPW) can be flawed and fails to predict full image performance in certain cases [59]. In 2009, the Caltech pedestrian detection benchmark was created [59, 60] and since then, the evaluation metric has changed from per-window (FPPW) to false positives per-image (FPPI).

In recent years, the most frequently used evaluation for object detection is "Average Precision (AP)", which was originally introduced in VOC2007. AP is defined as the average detection precision under different recalls, and is usually evaluated in a category-specific manner. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is usually used as the final metric of performance. To measure the object localization accuracy, the Intersection over Union (IoU) is used to check whether the IoU between the predicted box and the ground truth box is greater than a predefined threshold, say, 0.5. If yes, the object is identified as "successfully detected", otherwise it is identified as "missed". The 0.5-IoU based mAP has then become the de facto metric for object detection problems for years.

After 2014, due to the popularity of the MS-COCO datasets, researchers started to pay more attention to the accuracy of the bounding box location. Instead of using a fixed IoU threshold, MS-COCO AP is averaged over multiple IoU thresholds between 0.5 (coarse localization) and 0.95 (perfect localization). This change of the metric has encouraged more accurate object localization and may be of great importance for some real-world applications (e.g., imagine there is a robot arm trying to grasp a spanner).
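The 0.5-IoU matching rule described above can be made concrete with a small, self-contained sketch. The snippet below is illustrative only (it is not the official VOC or COCO evaluation code); it assumes predictions are already sorted by score, a single image and category, and the function names are ours.

```python
# Minimal IoU-based matching as used in VOC/COCO-style evaluation (illustrative, not the official code).
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def match_detections(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedily mark each prediction (assumed sorted by score) as a true or false positive."""
    matched_gt, results = set(), []
    for pb in pred_boxes:
        best_iou, best_idx = 0.0, -1
        for i, gb in enumerate(gt_boxes):
            if i in matched_gt:
                continue
            overlap = iou(pb, gb)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_iou > iou_thresh:
            matched_gt.add(best_idx)
            results.append(True)     # "successfully detected"
        else:
            results.append(False)    # counted as a false positive; unmatched GT boxes are "missed"
    return results

# Example: one correct detection and one false positive.
print(match_detections([(10, 10, 50, 50), (60, 60, 80, 80)], [(12, 12, 48, 52)]))  # [True, False]
```

Counting the True/False flags per category at increasing score thresholds gives the precision-recall curve from which AP (and mAP over categories, or the COCO average over IoU thresholds) is computed.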
Dataset | Year | Description | #Cites
MIT Ped. [30] | 2000 | One of the first pedestrian detection datasets. Consists of ∼500 training and ∼200 testing images (built based on the LabelMe database). url: http://cbcl.mit.edu/software-datasets/PedestrianData.html | 1515
INRIA [12] | 2005 | One of the most famous and important pedestrian detection datasets of the early days. Introduced by the HOG paper [12]. url: http://pascal.inrialpes.fr/data/human/ | 24705
Caltech [59, 60] | 2009 | One of the most famous pedestrian detection datasets and benchmarks. Consists of ∼190,000 pedestrians in the training set and ∼160,000 in the testing set. The metric is Pascal-VOC @ 0.5 IoU. url: http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/ | 2026
KITTI [61] | 2012 | One of the most famous datasets for traffic scene analysis. Captured in Karlsruhe, Germany. Consists of ∼100,000 pedestrians (∼6,000 individuals). url: http://www.cvlibs.net/datasets/kitti/index.php | 2620
CityPersons [62] | 2017 | Built based on the CityScapes dataset [63]. Consists of ∼19,000 pedestrians in the training set and ∼11,000 in the testing set. Same metric as Caltech. url: https://bitbucket.org/shanshanzhang/citypersons | 50
EuroCity [64] | 2018 | The largest pedestrian detection dataset so far. Captured from 31 cities in 12 European countries. Consists of ∼238,000 instances in ∼47,000 images. Same metric as Caltech. | 1

TABLE 2. An overview of some popular pedestrian detection datasets.

Dataset | Year | Description | #Cites
FDDB [65] | 2010 | Consists of ∼2,800 images and ∼5,000 faces from Yahoo! With occlusions, pose changes, out-of-focus, etc. url: http://vis-www.cs.umass.edu/fddb/index.html | 531
AFLW [66] | 2011 | Consists of ∼26,000 faces and 22,000 images from Flickr with rich facial landmark annotations. url: https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/ | 414
IJB [67] | 2015 | IJB-A/B/C consists of over 50,000 images and video frames, for both recognition and detection tasks. url: https://www.nist.gov/programs-projects/face-challenges | 279
WiderFace [68] | 2016 | One of the largest face detection datasets. Consists of ∼32,000 images and 394,000 faces with rich annotations, i.e., scale, occlusion, pose, etc. url: http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/ | 193
UFDD [69] | 2018 | Consists of ∼6,000 images and ∼11,000 faces. Variations include weather-based degradation, motion blur, focus blur, etc. url: http://www.ufdd.info/ | 1
WildestFaces [70] | 2018 | With ∼68,000 video frames and ∼2,200 shots of 64 fighting celebrities in unconstrained scenarios. The dataset hasn't been released yet. | 2

TABLE 3. An overview of some popular face detection datasets.

Recently, there have been some further developments of the evaluation in the Open Images dataset, e.g., by considering the group-of boxes and the non-exhaustive image-level category hierarchies. Some researchers have also proposed alternative metrics, e.g., "localization recall precision" [94]. Despite the recent changes, the VOC/COCO-based mAP is still the most frequently used evaluation metric for object detection.

2.3 Technical Evolution in Object Detection

In this section, we will introduce some important building blocks of a detection system and their technical evolutions in the past 20 years.

2.3.1 Early Time's Dark Knowledge

Early object detection (before 2000) did not follow a unified detection philosophy like sliding window detection. Detectors at that time were usually designed based on low-level and mid-level vision, as follows.

• Components, shapes and edges

"Recognition-by-components", as an important cognitive theory [98], has long been the core idea of image recognition and object detection [13, 99, 100]. Some early researchers framed object detection as a measurement of similarity between object components, shapes and contours, including Distance Transforms [101], Shape Contexts [35], and Edgelet [102], etc. Despite promising initial results, things did not work out well on more complicated detection problems. Therefore, machine learning based detection methods began to prosper.
Dataset | Year | Description | #Cites
ICDAR [71] | 2003 | ICDAR2003 is one of the first public datasets for text detection. ICDAR 2015 and 2017 are other popular iterations of the ICDAR challenge [72, 73]. url: http://rrc.cvc.uab.es/ | 530
STV [74] | 2010 | Consists of ∼350 images and ∼720 text instances taken from Google StreetView. url: http://tc11.cvc.uab.es/datasets/SVT_1 | 339
MSRA-TD500 [75] | 2012 | Consists of ∼500 indoor/outdoor images with Chinese and English texts. url: http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500) | 413
IIIT5k [76] | 2012 | Consists of ∼1,100 images and ∼5,000 words from both streets and born-digital images. url: http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html | 165
Syn90k [77] | 2014 | A synthetic dataset with 9 million images generated from a 90,000-word vocabulary of multiple fonts. url: http://www.robots.ox.ac.uk/~vgg/data/text/ | 246
COCOText [78] | 2016 | The largest text detection dataset so far. Built based on MS-COCO; consists of ∼63,000 images and ∼173,000 text annotations. url: https://bgshih.github.io/cocotext/ | 69

TABLE 4. An overview of some popular scene text detection datasets.

Dataset | Year | Description | #Cites
TLR [79] | 2009 | Captured by a moving vehicle in Paris. Consists of ∼11,000 video frames and ∼9,200 traffic light instances. url: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition | 164
LISA [80] | 2012 | One of the first traffic sign detection datasets. Consists of ∼6,600 video frames and ∼7,800 instances of 47 US signs. url: http://cvrr.ucsd.edu/LISA/lisa-traffic-sign-dataset.html | 325
GTSDB [81] | 2013 | One of the most popular traffic sign detection datasets. Consists of ∼900 images with ∼1,200 traffic signs captured under various weather conditions during different times of day. url: http://benchmark.ini.rub.de/?section=gtsdb&subsection=news | 259
BelgianTSD [82] | 2012 | Consists of ∼7,300 static images, ∼120,000 video frames, and ∼11,000 traffic sign annotations of 269 types. The 3D location of each sign has been annotated. url: https://btsd.ethz.ch/shareddata/ | 224
TT100K [83] | 2016 | The largest traffic sign detection dataset so far, with ∼100,000 images (2048 x 2048) and ∼30,000 traffic sign instances of 128 classes. Each instance is annotated with class label, bounding box and pixel mask. url: http://cg.cs.tsinghua.edu.cn/traffic%2Dsign/ | 111
BSTL [84] | 2017 | The largest traffic light detection dataset. Consists of ∼5,000 static images, ∼8,300 video frames, and ∼24,000 traffic light instances. url: https://hci.iwr.uni-heidelberg.de/node/6132 | 21

TABLE 5. An overview of some popular traffic light detection and traffic sign detection datasets.

Machine learning based detection has gone through multiple periods, including the statistical models of appearance (before 1998), wavelet feature representations (1998-2005), and gradient-based representations (2005-2012).

Building statistical models of an object, like Eigenfaces [95, 106] as shown in Fig. 5 (a), was the first wave of learning based approaches in object detection history. In 1991, M. Turk et al. achieved real-time face detection in a lab environment by using Eigenface decomposition [95]. Compared with the rule-based or template based approaches of its time [107, 108], a statistical model better provides holistic descriptions of an object's appearance by learning task-specific knowledge from data.

Wavelet feature transforms started to dominate visual recognition and object detection after 2000. The essence of this group of methods is learning by transforming an image from pixels to a set of wavelet coefficients. Among these methods, the Haar wavelet, owing to its high computational efficiency, has been mostly used in many object detection tasks, such as general object detection [29], face detection [10, 11, 109], pedestrian detection [30, 31], etc. Fig. 5 (d) shows a set of Haar wavelet bases learned by a VJ detector [10, 11] for human faces.

• Early time's CNN for object detection

The history of using CNNs to detect objects can be traced back to the 1990s [96], where Y. LeCun et al. made great contributions at that time.
Dataset | Year | Description | #Cites
TAS [85] | 2008 | Consists of 30 images of 729x636 pixels from Google Earth and ∼1,300 vehicles. url: http://ai.stanford.edu/~gaheitz/Research/TAS/ | 419
OIRDS [86] | 2009 | Consists of 900 images (0.08-0.3m/pixel) captured by an aircraft-mounted camera and 1,800 annotated vehicle targets. url: https://sourceforge.net/projects/oirds/ | 32
DLR3K [87] | 2013 | The most frequently used dataset for small vehicle detection. Consists of 9,300 cars and 160 trucks. url: https://www.dlr.de/eoc/en/desktopdefault.aspx/tabid-5431/9230_read-42467/ | 68
UCAS-AOD [88] | 2015 | Consists of ∼900 Google Earth images, ∼2,800 vehicles and ∼3,200 airplanes. url: http://www.ucassdl.cn/resource.asp | 19
VeDAI [89] | 2016 | Consists of ∼1,200 images (0.1-0.25m/pixel), ∼3,600 targets of 9 classes. Designed for detecting small targets in remote sensing images. url: https://downloads.greyc.fr/vedai/ | 65
NWPU-VHR10 [90] | 2016 | The most frequently used remote sensing detection dataset in recent years. Consists of ∼800 images (0.08-2.0m/pixel) and ∼3,800 remote sensing targets of ten classes (e.g., airplanes, ships, baseball diamonds, tennis courts, etc). url: http://jiong.tea.ac.cn/people/JunweiHan/NWPUVHR10dataset.html | 204
LEVIR [91] | 2018 | Consists of ∼22,000 Google Earth images and ∼10,000 independently labeled targets (airplane, ship, oil-pot). url: https://pan.baidu.com/s/1geTwAVD | 15
DOTA [92] | 2018 | The first remote sensing detection dataset to incorporate rotated bounding boxes. Consists of ∼2,800 Google Earth images and ∼200,000 instances of 15 classes. url: https://captain-whu.github.io/DOTA/dataset.html | 32
xView [93] | 2018 | The largest remote sensing detection dataset so far. Consists of ∼1,000,000 remote sensing targets of 60 classes (0.3m/pixel), covering 1,415 km² of land area. url: http://xviewdataset.org | 10

TABLE 6. An overview of some remote sensing target detection datasets.

Due to limitations in computing resources, CNN models at the time were much smaller and shallower than those of today. Despite this, computational efficiency was still considered one of the tough nuts to crack in early CNN based detection models. Y. LeCun et al. made a series of improvements like the "shared-weight replicated neural network" [96] and the "space displacement network" [97] to reduce the computations by extending each layer of the convolutional network so as to cover the entire input image, as shown in Fig. 5 (b)-(c). In this way, the feature at any location of the entire image can be extracted with only one forward propagation of the network. This can be considered as the prototype of today's fully convolutional networks (FCN) [110, 111], which were proposed almost 20 years later. CNNs have also been applied to other tasks of that time, such as face detection [112, 113] and hand tracking [114].

2.3.2 Technical Evolution of Multi-Scale Detection

Multi-scale detection of objects with "different sizes" and "different aspect ratios" is one of the main technical challenges in object detection. In the past 20 years, multi-scale detection has gone through multiple historical periods: "feature pyramids and sliding windows (before 2014)", "detection with object proposals (2010-2015)", "deep regression (2013-2016)", "multi-reference detection (after 2015)", and "multi-resolution detection (after 2016)", as shown in Fig. 6.

• Feature pyramids + sliding windows (before 2014)

With the increase of computing power after the VJ detector, researchers started to pay more attention to an intuitive way of detection by building "feature pyramid + sliding windows" (see the code sketch at the end of this subsection). From 2004 to 2014, a number of milestone detectors were built based on this detection paradigm, including the HOG detector, DPM, and even the Overfeat detector [103] of the deep learning era (winner of the ILSVRC-13 localization task).

Early detection models like the VJ detector and the HOG detector were specifically designed to detect objects with a "fixed aspect ratio" (e.g., faces and upright pedestrians) by simply building the feature pyramid and sliding a fixed-size detection window on it. The detection of "various aspect ratios" was not considered at that time. To detect objects with a more complex appearance like those in PASCAL VOC, R. Girshick et al. began to seek better solutions outside the feature pyramid. The "mixture model" [15] was one of the best solutions at that time, training multiple models to detect objects with different aspect ratios. Apart from this, exemplar-based detection [36, 115] provided another solution by training individual models for every object instance (exemplar) of the training set.

As objects in modern datasets (e.g., MS-COCO) become more diversified, the mixture model or exemplar-based methods inevitably lead to more miscellaneous detection models. A question then naturally arises: is there a unified multi-scale approach to detect objects of different aspect ratios? The introduction of "object proposals" (to be introduced) has answered this question.
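The classic "feature pyramid + sliding windows" recipe referred to above can be summarized in a few lines. The sketch below is an illustration under simplified assumptions (a crude factor-of-2 pyramid, a stubbed window scorer), not the code of any particular detector; in the classical detectors, `score_window` would be a Haar/AdaBoost or HOG/SVM classifier.

```python
# Illustrative "feature pyramid + sliding windows" detection loop (a sketch, not a specific detector).
import numpy as np

def image_pyramid(image, min_size=64):
    """Yield (scale, image) pairs; each level halves the resolution (nearest-neighbor decimation)."""
    scale = 1.0
    while min(image.shape[:2]) >= min_size:
        yield scale, image
        image = image[::2, ::2]   # crude 2x downsampling; a real detector would use smoothed resizing
        scale *= 2.0

def score_window(window):
    """Placeholder classifier; returns a confidence in [0, 1]."""
    return float(window.mean()) / 255.0

def sliding_window_detect(image, win=64, stride=16, thresh=0.8):
    """Slide a fixed-size window over every pyramid level and keep high-scoring locations."""
    detections = []
    for scale, level in image_pyramid(image, min_size=win):
        h, w = level.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                score = score_window(level[y:y + win, x:x + win])
                if score > thresh:
                    # map the window back to original-image coordinates
                    detections.append((x * scale, y * scale, (x + win) * scale, (y + win) * scale, score))
    return detections

dets = sliding_window_detect(np.random.randint(0, 256, (256, 320), dtype=np.uint8))
```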
Fig. 5. Some well-known detection models of the early time: (a) Eigenfaces [95], (b) Shared weight networks [96], (c) Space displacement networks (Lenet-5) [97], (d) Haar wavelets of the VJ detector [10].

• Detection with object proposals (2010-2015)

Object proposals refer to a group of class-agnostic candidate boxes that are likely to contain any objects. Object proposals were first applied in object detection in 2010 [116]. Detection with object proposals helps to avoid the exhaustive sliding window search across an image.

An object proposal detection algorithm should meet the following three requirements: 1) a high recall rate, 2) high localization accuracy, and 3) on the basis of the first two requirements, improved precision and reduced processing time. Modern proposal detection methods can be divided into three categories: 1) segmentation grouping approaches [42, 117–119], 2) window scoring approaches [116, 120–122], and 3) neural network based approaches [123–128]. We refer readers to the following papers for a comprehensive review of these methods [129, 130].

Early proposal detection methods followed a bottom-up detection philosophy [116, 120] and were deeply affected by visual saliency detection. Later, researchers started to move to low-level vision (e.g., edge detection) and more careful handcrafted skills to improve the localization of candidate boxes [42, 117–119, 122, 131]. After 2014, with the popularity of deep CNNs in visual recognition, the top-down, learning-based approaches began to show more advantages in this problem [19, 121, 123, 124]. Since then, object proposal detection has evolved from bottom-up vision to "overfitting to a specific set of object classes", and the distinction between detectors and proposal generators is becoming blurred [132].

As "object proposal" has revolutionized sliding window detection and has quickly dominated deep learning based detectors, in 2014-2015, many researchers began to ask the following questions: what is the main role of the object proposals in detection? Is it for improving accuracy, or simply for detection speed up? To answer this question, some researchers have tried to weaken the role of the proposals [133] or to simply perform sliding window detection on CNN features [134–138], but none of them obtained satisfactory results. Proposal detection soon slipped out of sight after the rise of one-stage detectors and "deep regression" techniques (to be introduced).

• Deep regression (2013-2016)

In recent years, with the increase of GPUs' computing power, the way people deal with multi-scale detection has become more and more straightforward and brute-force. The idea of using deep regression to solve multi-scale problems is very simple, i.e., to directly predict the coordinates of a bounding box based on the deep learning features [20, 104]. The advantage of this approach is that it is simple and easy to implement, while the disadvantage is that the localization may not be accurate enough, especially for some small objects. "Multi-reference detection" (to be introduced) later solved this problem.

• Multi-reference/-resolution detection (after 2015)

Multi-reference detection is the most popular framework for multi-scale object detection [19, 21, 44, 48]. Its main idea is to pre-define a set of reference boxes (a.k.a. anchor boxes) with different sizes and aspect ratios at different locations of an image, and then predict the detection box based on these references.

A typical loss for each predefined anchor box consists of two parts: 1) a cross-entropy loss for category recognition and 2) an L1/L2 regression loss for object localization. A general form of the loss function can be written as follows:

L(p, p^{*}, t, t^{*}) = L_{cls.}(p, p^{*}) + \beta I(t) L_{loc.}(t, t^{*}), \quad I(t) = \begin{cases} 1, & \mathrm{IoU}\{a, a^{*}\} > \eta \\ 0, & \text{else} \end{cases} \qquad (1)

where t and t^{*} are the locations of the predicted and ground-truth bounding boxes, and p and p^{*} are their category probabilities. IoU{a, a^{*}} is the IoU between the anchor a and its ground truth a^{*}. \eta is an IoU threshold, say, 0.5. If an anchor does not cover any object, its localization loss does not count in the final loss (see the code sketch below).

Another popular technique of the last two years is multi-resolution detection [21, 22, 55, 105], i.e., detecting objects of different scales at different layers of the network. Since a CNN naturally forms a feature pyramid during its forward propagation, it is easier to detect larger objects in deeper layers and smaller ones in shallower layers. Multi-reference and multi-resolution detection have now become two basic building blocks in state of the art object detection systems.

Fig. 6. Evolution of multi-scale detection techniques in object detection from 2001 to 2019: 1) feature pyramids and sliding windows, 2) detection
with object proposals, 3) deep regression, 4) multi-reference detection, and 5) multi-resolution detection. Detectors in this figure: VJ Det. [10], HOG
Det. [12], DPM [13, 15], Exemplar SVM [36], Overfeat [103], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], DNN Det. [104], YOLO
[20], YOLO-v2 [48], SSD [21], Unified Det. [105], FPN [22], RetinaNet [23], RefineDet [55], TridentNet [56].

2.3.3 Technical Evolution of Bounding Box Regression

Bounding Box (BB) regression is an important technique in object detection. It aims to refine the location of a predicted bounding box based on the initial proposal or the anchor box. In the past 20 years, the evolution of BB regression has gone through three historical periods: "without BB regression (before 2008)", "from BB to BB (2008-2013)", and "from features to BB (after 2013)". Fig. 7 shows the evolution of bounding box regression.

• Without BB regression (before 2008)

Most of the early detection methods, such as the VJ detector and the HOG detector, do not use BB regression, and usually directly consider the sliding window as the detection result. To obtain accurate locations of an object, researchers have no choice but to build a very dense pyramid and slide the detector densely on each location.

• From BB to BB (2008-2013)

The first time that BB regression was introduced to an object detection system was in DPM [15]. The BB regression at that time usually acted as a post-processing block, thus it is optional. As the goal in PASCAL VOC is to predict a single bounding box for each object, the simplest way for a DPM to generate the final detection would be to directly use its root filter location. Later, R. Girshick et al. introduced a more complex way to predict a bounding box based on the complete configuration of an object hypothesis and formulated this process as a linear least-squares regression problem [15]. This method yields noticeable improvements of the detection under the PASCAL criteria.

• From features to BB (after 2013)

After the introduction of Faster RCNN in 2015, BB regression no longer serves as an individual post-processing block but has been integrated with the detector and trained in an end-to-end fashion. At the same time, BB regression has evolved to predicting the BB directly based on CNN features. In order to get more robust predictions, the smooth-L1 function [19] is commonly used,

L(t) = \begin{cases} 5t^{2}, & |t| \leq 0.1 \\ |t| - 0.05, & \text{else} \end{cases} \qquad (2)

or the root-square function [20],

L(x, x^{*}) = (\sqrt{x} - \sqrt{x^{*}})^{2}, \qquad (3)

as the regression loss, which are more robust to outliers than the least squares loss used in DPM. Some researchers also choose to normalize the coordinates to get more robust results [18, 19, 21, 23].

2.3.4 Technical Evolution of Context Priming

Visual objects are usually embedded in a typical context with their surrounding environments. Our brain takes advantage of the associations among objects and environments to facilitate visual perception and cognition [160]. Context priming has long been used to improve detection. There are three common approaches in its evolutionary history: 1) detection with local context, 2) detection with global context, and 3) context interactives, as shown in Fig. 8.

Fig. 7. Evolution of bounding box regression techniques in object detection from 2001 to 2019. Detectors in this figure: VJ Det. [10], HOG Det. [12],
Exemplar SVM [36], DPM [13, 15], Overfeat [103], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], YOLO-v2
[48], Unified Det. [105], FPN [22], RetinaNet [23], RefineDet [55], TridentNet [56].

Fig. 8. Evolution of context priming in object detection from 2001 to 2019: 1) detection with local context, 2) detection with global context, 3)
detection with context interactives. Detectors in this figure: Face Det. [139], MultiPath [140], GBDNet [141, 142], CC-Net [143], MultiRegion-CNN
[144], CoupleNet [145], DPM [14, 15], StructDet [146], YOLO [20], RFCN++ [147], ION [148], AttenContext [149], CtxSVM [150], PersonContext
[151], SMN [152], RetinaNet [23], SIN [153].

• Detection with local context

Local context refers to the visual information in the area that surrounds the object to detect. It has long been acknowledged that local context helps improve object detection. In the early 2000s, Sinha and Torralba [139] found that inclusion of local contextual regions such as the facial bounding contour substantially improves face detection performance. Dalal and Triggs also found that incorporating a small amount of background information improves the accuracy of pedestrian detection [12]. Recent deep learning based detectors can also be improved with local context by simply enlarging the networks' receptive field or the size of object proposals [140–145, 161].

• Detection with global context

Global context exploits scene configuration as an additional source of information for object detection. For early object detectors, a common way of integrating global context was to integrate a statistical summary of the elements that comprise the scene, like Gist [160]. For modern deep learning based detectors, there are two methods to integrate global context. The first way is to take advantage of a large receptive field (even larger than the input image) [20] or a global pooling operation on a CNN feature [147]. The second way is to think of the global context as a kind of sequential information and to learn it with recurrent neural networks [148, 149].

• Context interactive

Context interactive refers to the piece of information that is conveyed by the interactions of visual elements, such as constraints and dependencies. For most object detectors, object instances are detected and recognized individually without exploiting their relations.

Fig. 9. Evolution of non-max suppression (NMS) techniques in object detection from 1994 to 2019: 1) Greedy selection, 2) Bounding box
aggregation, and 3) Learn to NMS. Detectors in this figure: VJ Det. [10], Face Det. [96], HOG Det. [12], DPM [13, 15], RCNN [16], SPPNet
[17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], FPN [22], RetinaNet [23], LearnNMS [154], MAP-Det [155], End2End-DPM [136],
StrucDet [146], Overfeat [103], APC-NMS [156], MAPC [157], SoftNMS [158], FitnessNMS [159].

Some recent researches have suggested that modern object detectors can be improved by considering context interactives. The recent improvements can be grouped into two categories, where the first one is to explore the relationships between individual objects [15, 146, 150, 152, 162], and the second one is to explore modeling the dependencies between objects and scenes [151, 153].

2.3.5 Technical Evolution of Non-Maximum Suppression

Non-maximum suppression (NMS) is an important group of techniques in object detection. As neighboring windows usually have similar detection scores, non-maximum suppression is used as a post-processing step to remove the replicated bounding boxes and obtain the final detection result. In the early times of object detection, NMS was not always integrated [30]. This is because the desired output of an object detection system was not entirely clear at that time. During the past 20 years, NMS has gradually developed into the following three groups of methods: 1) greedy selection, 2) bounding box aggregation, and 3) learning to NMS, as shown in Fig. 9.

• Greedy selection

Greedy selection is an old-fashioned but the most popular way to perform NMS in object detection. The idea behind this process is simple and intuitive: for a set of overlapped detections, the bounding box with the maximum detection score is selected while its neighboring boxes are removed according to a predefined overlap threshold (say, 0.5). The above processing is iteratively performed in a greedy manner (see the code sketch below).

Although greedy selection has now become the de facto method for NMS, it still has some room for improvement, as shown in Fig. 11. First of all, the top-scoring box may not be the best fit. Second, it may suppress nearby objects. Finally, it does not suppress false positives. In recent years, in spite of the fact that some manual modifications have been made to improve its performance [158, 159, 163] (see Section 4.4 for more details), to the best of our knowledge, greedy selection still serves as the strongest baseline for today's object detection.

• BB aggregation

BB aggregation is another group of techniques for NMS [10, 103, 156, 157] with the idea of combining or clustering multiple overlapped bounding boxes into one final detection. The advantage of this type of method is that it takes full consideration of object relationships and their spatial layout. Some well-known detectors use this method, such as the VJ detector [10] and Overfeat [103].

• Learning to NMS

A group of NMS improvements that has recently received much attention is learning to NMS [136, 146, 154, 155]. The main idea of this group of methods is to think of NMS as a filter to re-score all raw detections and to train the NMS as part of a network in an end-to-end fashion. These methods have shown promising results on improving occlusion and dense object detection over traditional hand-crafted NMS methods.
14

Fig. 10. Evolution of hard negative mining techniques in object detection from 1994 to 2019. Detectors in this figure: Face Det. [164], Haar Det. [29],
VJ Det. [10], HOG Det. [12], DPM [13, 15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], FasterPed [165],
OHEM [166], RetinaNet [23], RefineDet [55].

Fig. 11. Examples of possible failures when using a standard greedy


selection based non-max suppression: (a) the top-scoring box may not
be the best fit, (b) it may suppress nearby objects, and (c) it does not Fig. 12. An overview of the speed up techniques in object detection.
suppress false positives. Images from R. Rothe et al. ACCV2014 [156].

windows to every object. Modern detection datasets require RCNN and YOLO simply balance the weights between the
the prediction of object aspect ratio, further increasing the positive and negative windows. However, researchers later
imbalanced ratio to 106 ∼107 [129]. In this case, using all noticed that the weight-balancing cannot completely solve
background data will be harmful to training as the vast the imbalanced data problem [23]. To this end, after 2016,
number of easy negatives will overwhelm the learning the bootstrap was re-introduced to deep learning based
process. Hard negative mining (HNM) aims to deal with the detectors [21, 165–168]. For example, in SSD [21] and OHEM
problem of imbalanced data during training. The technical [166], only the gradients of a very small part of samples
evolution of HNM in object detection is shown in Fig. 10. (those with the largest loss values) will be back-propagated.
In RefineDet [55], an “anchor refinement module” is de-
• Bootstrap signed to filter easy negatives. An alternative improvement
Bootstrap in object detection refers to a group of training is to design new loss functions [23, 169, 170], by reshaping
techniques in which the training starts with a small part the standard cross entropy loss so that it will put more focus
of background samples and then iteratively add new miss- on hard, misclassified examples [23].
classified backgrounds during the training process. In early
times object detectors, bootstrap was initially introduced
with the purpose of reducing the training computations
over millions of background samples [10, 29, 164]. Later it 3 S PEED -U P OF D ETECTION
became a standard training technique in DPM and HOG
detectors [12, 13] for solving the data imbalance problem. The acceleration of object detection has long been an im-
portant but challenging problem. In the past 20 years, the
• HNM in deep learning based detectors object detection community has developed sophisticated
Later in the deep learning era, due to the improvement acceleration techniques. These techniques can be roughly
of computing power, bootstrap was shortly discarded in divided into three levels of groups: “speed up of detection
object detection during 2014-2016 [16–20]. To ease the data- pipeline”, “speed up of detection engine”, and “speed up of
imbalance problem during training, detectors like Faster numerical computation”, as shown in Fig 12.
15

sliding multiple detectors on one feature map rather than


re-scaling the image or features [173].

3.2 Speed up of Classifiers

Traditional sliding window based detectors, e.g., HOG de-


tector and DPM, prefer using linear classifiers than nonlin-
ear ones due to their low computational complexity. Detec-
tion with nonlinear classifiers such as kernel SVM suggests
higher accuracy, but at the same time brings high compu-
Fig. 13. An illustration of how to compute the HOG map of an image.
tational overhead. As a standard non-parametric method,
the traditional kernel method has no fixed computational
3.1 Feature Map Shared Computation complexity. When we have a very large training set, the
detection speed will become extremely slow.
Among the different computational stages of an object de-
In object detection, there are many ways to speed up
tector, the feature extraction usually dominates the amount
kernelized classifiers, where the “model approximation” is
of computation. For a sliding window based detector, the
most commonly used [30, 174]. Since the decision bound-
computational redundancy starts from both positions and
ary of a classical kernel SVM can only be determined by
scales, where the former one is caused by the overlap
a small set of its training samples (support vectors), the
between adjacent windows, while the later one is by the
computational complexity at the inference stage would be
feature correlation between adjacent scales.
proportional to the number of support vectors: O(Nsv ).
Reduced Set Vectors [30] is an approximation method for
3.1.1 Spatial Computational Redundancy and Speed Up
kernel SVM, which aims to obtain an equivalent decision
The most commonly used idea to reduce the spatial com- boundary in terms of a small number of synthetic vectors.
putational redundancy is feature map shared computation, Another way to speed up kernel SVM in object detection
i.e., to compute the feature map of the whole image only is to approximate its decision boundary to a piece-wise
once before sliding window on it. The “image pyramid” of linear form so as to achieve a constant inference time [174].
a traditional detector herein can be considered as a “feature The kernel method can also be accelerated with the sparse
pyramid”. For example, to speed up HOG pedestrian de- encoding methods [175].
tector, researchers usually accumulate the “HOG map” of
the whole input image, as shown in Fig. 13. However, the
drawback of this method is also obvious, i.e., the feature
3.3 Cascaded Detection
map resolution (the minimum step size of the sliding win-
dow on this feature map) will be limited by the cell size. If a Cascaded detection is a commonly used technique in ob-
small object is located between two cells, it could be ignored ject detection [10, 176]. It takes a coarse to fine detection
by all detection windows. One solution to this problem is to philosophy: to filter out most of the simple background
build an integral feature pyramid, which will be introduced windows using simple calculations, then to process those
in Section 3.6. more difficult windows with complex ones. The VJ detector
The idea of feature map shared computation has also is a representative of cascaded detection. After that, many
been extensively used in convolutional based detectors. subsequent classical object detectors such as the HOG detec-
Some related works can be traced back to the 1990s [96, 97]. tor and DPM have been accelerated by using this technique
Most of the CNN based detectors in recent years, e.g., [14, 38, 54, 177, 178].
SPPNet [17], Fast-RCNN [18], and Faster-RCNN [19], have In recent years, cascaded detection has also been applied
applied similar ideas, which have achieved tens or even to deep learning based detectors, especially for those de-
hundreds of times of acceleration. tection tasks of “small objects in large scenes” , e.g., face
detection [179, 180], pedestrian detection [165, 177, 181],
3.1.2 Scale Computational Redundancy and Speed Up etc. In addition to the algorithm acceleration, cascaded
To reduce the scale computational redundancy, the most detection has been applied to solve other problems, e.g.,
successful way is to directly scale the features rather than to improve the detection of hard examples [182–184], to
the images, which has been first applied in the VJ detector integrate context information [143, 185], and to improve
[10]. However, such an approach cannot be applied directly localization accuracy [104, 125].
to HOG-like features because of blurring effects. For this
problem, P. Dollár et al. discovered the strong (log-linear)
correlation between the neighbor scales of the HOG and 3.4 Network Pruning and Quantification
integral channel features [171] through extensive statistical
analysis. This correlation can be used to accelerate the com- “Network pruning” and “network quantification” are two
putation of a feature pyramid [172] by approximating the commonly used techniques to speed up a CNN model,
feature maps of adjacent scales. Besides, building “detector where the former one refers to pruning the network struc-
pyramid” is another way to avoid scale computational re- ture or weight to reduce its size and the latter one refers to
dundancy, i.e., to detect objects of different scales by simply reducing the code-length of activations or weights.
16

Fig. 14. An overview of speed up methods of a CNN’s convolutional layer and the comparison of their computational complexity: (a) Standard
0 0 0
convolution: O(dk2 c). (b) Factoring convolutional filters (k × k → (k × k )2 or 1 × k, k × 1): O(dk 2 c) or O(dkc). (c) Factoring convolutional
0 2 2 0
channels: O(d k c) + O(dk d ). (d) Group convolution (#groups=m): O(dk c/m). (e) Depth-wise separable convolution: O(ck2 ) + O(dc).
2

3.4.1 Network Pruning one (“student net”) [193, 194]. Recently, this idea has been
The research of “network pruning” can be traced back to used in the acceleration of object detection [195, 196]. One
as early as the 1980s. At that time, Y. LeCun et al. proposed straight forward approach of this idea is to use a teacher net
a method called “optimal brain damage” to compress the to instruct the training of a (light-weight) student net so that
parameters of a multi-layer perceptron network [186]. In this the latter can be used for speed up detection [195]. Another
method, the loss function of a network is approximated by approach is to make transform of the candidate regions so
taking the second-order derivatives so that to remove some as to minimize the features distance between the student net
unimportant weights. Following this idea, the network and teacher net. This method makes the detection model 2
pruning methods in recent years usually take an iterative times faster while achieving a comparable accuracy [196].
training and pruning process, i.e., to remove only a small
group of unimportant weights after each stage of training, 3.5 Lightweight Network Design
and to repeat those operations [187]. As traditional network
pruning simply removes unimportant weights, which may The last group of methods to speed up a CNN based de-
result in some sparse connectivity patterns in a convolu- tector is to directly design a lightweight network instead of
tional filter, it can not be directly applied to compress a CNN using off-the-shelf detection engines. Researchers have long
model. A simple solution to this problem is to remove the been exploring the right configurations of a network so that
whole filters instead of the independent weights [188, 189]. to gain accuracy under a constrained time cost. In addition
to some general designing principles like “fewer channels
3.4.2 Network Quantification and more layers” [197], some other approaches have been
The recent works on network quantification mainly focus on proposed in recent years: 1) factorizing convolutions, 2)
network binarization, which aims to accelerate a network group convolution, 3) depth-wise separable convolution, 4)
by quantifying its activations or weights to binary variables bottle-neck design, and 5) neural architecture search.
(say, 0/1) so that the floating-point operation is converted
to AND, OR, NOT logical operations. Network binariza- 3.5.1 Factorizing Convolutions
tion can significantly speed up computations and reduce Factorizing convolutions is the simplest and most straight
the network’s storage so that it can be much easier to be forward way to build a lightweight CNN model. There are
deployed on mobile devices. One possible implementation two groups of factorizing methods.
of the above ideas is to approximate the convolution by The first group of methods is to factorize a large convo-
binary variables with the least squares method [190]. A lution filter into a set of small ones in their spatial dimension
more accurate approximation can be obtained by using [47, 147, 198], as shown in Fig. 14 (b). For example, one can
linear combinations of multiple binary convolutions [191]. factorize a 7x7 filter into three 3x3 filters, where they share
In addition, some researchers have further developed GPU the same receptive field but the later one is more efficient.
acceleration libraries for binarized computation, which ob- Another example is to factorize a k ×k filter into a k ×1 filter
tained more significant acceleration results [192]. and a 1×k filter [198, 199], which could be more efficient for
very large filters, say 15x15 [199]. This idea has been recently
3.4.3 Network Distillation used in object detection [200].
Network distillation is a general framework to compress the The second group of methods is to factorize a large
knowledge of a large network (“teacher net”) into a small group of convolutions into two small groups in their chan-
17

nel dimension [201, 202], as shown in Fig. 14 (c). For exam- 3.6 Numerical Acceleration
ple, one can approximate a convolution layer with d filters In this section, we mainly introduce four important numer-
and a feature map of c channels by d0 filters + a nonlinear ical acceleration methods that are frequently used in object
activation + another d filters ( d0 < d ). In this case, the detection: 1) speed up with the integral image, 2) speed
complexity O(dk 2 c) of the original layer can be reduced to up in the frequency domain, 3) vector quantization, and 4)
O(d0 k 2 c) + O(dd0 ). reduced rank approximation.
3.5.2 Group Convolution
3.6.1 Speed Up with Integral Image
Group convolution aims to reduce the number of parame-
ters in a convolution layer by dividing the feature channels The integral image is an important method in image pro-
into many different groups, and then convolve on each cessing. It helps to rapidly calculate summations over image
group independently [189, 203], as shown in Fig. 14 (d). sub-regions. The essence of integral image is the integral-
If we evenly divide the feature channels into m groups, differential separability of convolution in signal processing:
without changing other configurations, the computational dg(x)
Z
complexity of the convolution will theoretically be reduced f (x) ∗ g(x) = ( f (x)dx) ∗ ( ), (4)
dx
to 1/m of that before.
where if dg(x)/dx is a sparse signal, then the convolution
3.5.3 Depth-wise Separable Convolution can be accelerated by the right part of this equation. Al-
Depth-wise separable convolution, as shown in Fig. 14 (e), though the VJ detector [10] is well known for the integral
is a recent popular way of building lightweight convolution image acceleration, before it was born, the integral image
networks [204]. It can be viewed as a special case of the has already been used to speed up a CNN model [219] and
group convolution when the number of groups is set equal achieved more than 10 times acceleration.
to the number of channels. In addition to the above examples, integral image can
Suppose we have a convolutional layer with d filters and also be used to speed up more general features in ob-
a feature map of c channels. The size of each filter is k × k . ject detection, e.g., color histogram, gradient histogram
For a depth-wise separable convolution, every k ×k ×c filter [171, 177, 220, 221], etc. A typical example is to speed up
is first to split into c slices each with the size of k ×k ×1, and HOG by computing integral HOG maps [177, 220]. Instead
then the convolutions are performed individually in each of accumulating pixel values in a traditional integral image,
channel with each slice of the filter. Finally, a number of the integral HOG map accumulates gradient orientations in
1x1 filters are used to make a dimension transform so that an image, as shown in Fig. 15. As the histogram of a cell
the final output should have d channels. By using depth- can be viewed as the summation of the gradient vector in
wise separable convolution, the computational complexity a certain region, by using the integral image, it is possible
can be reduced from O(dk 2 c) to O(ck 2 ) + O(dc). This idea to compute a histogram in a rectangle region of an arbitrary
has been recently applied to object detection and fine-grain position and size with a constant computational overhead.
classification [205–207]. The integral HOG map has been used in pedestrian detec-
tion and has achieved dozens of times’ acceleration without
3.5.4 Bottle-neck Design losing any accuracy [177].
A bottleneck layer in a neural network contains few nodes Later in 2009, P. Dollár et al. proposed a new type of
compared to the previous layers. It can be used to learning image feature called Integral Channel Features (ICF), which
efficient data encodings of the input with reduced dimen- can be considered as a more general case of the integral
sionality, which has been commonly used in deep autoen- image features, and has been successfully used in pedes-
coders [208]. In recent years, the bottle-neck design has been trian detection [171]. ICF achieves state-of-the-art detection
widely used for designing lightweight networks [47, 209– accuracy under the near realtime detection speed in its time.
212]. Among these methods, one common approach is to
compress the input layer of a detector to reduce the amount 3.6.2 Speed Up in Frequency Domain
of computation from the very beginning of the detection Convolution is an important type of numerical operation
pipeline [209–211]. Another approach is to compress the in object detection. As the detection of a linear detector
output of the detection engine to make the feature map can be viewed as the window-wise inner product between
thinner, so as to make it more efficient for subsequent the feature map and detector’s weights, this process can be
detection stages [47, 212]. implemented by convolutions.
3.5.5 Neural Architecture Search The convolution can be accelerated in many ways, where
the Fourier transform is a very practical choice especially
More recently, there has been significant interest in de-
for speeding up those large filters. The theoretical basis
signing network architectures automatically by neural ar-
for accelerating convolution in the frequency domain is the
chitecture search (NAS) instead of relying heavily on ex-
convolution theorem in signal processing, that is, under
pert experience and knowledge. NAS has been applied to
suitable conditions, the Fourier transform of a convolution
large-scale image classification [213, 214], object detection
of two signals is the point-wise product in their Fourier
[215] and image segmentation [216] tasks. NAS also shows
space:
promising results in designing lightweight networks very
I ∗ W = F −1 (F (I) F (W )) (5)
recently, where the constraints on the prediction accuracy
and computational complexity are both considered during where F is Fourier transform, F −1 is Inverse Fourier trans-
the searching process [217, 218]. form, I and W are the input image and filter, ∗ is the
18

Fig. 15. An illustration of how to compute the “Integral HOG Map” [177]. With integral image techniques, we can efficiently compute the histogram
feature of any location and any size with constant computational complexity.

implemented by a table-look-up operation. As there is no


floating point multiplication and division in this process,
the speed of a DPM and exemplar SVM detector can be
accelerated over an order of magnitude [227].

3.6.4 Reduced Rank Approximation


In deep networks, the computation in a fully-connected
layer is essentially a multiplication of two matrices. When
the parameter matrix W ∈ Ru×v is large, the computing
burden of a detector will be heavy. For example, in Fast
RCNN detector [18] nearly half of the forward pass time is
spent in computing the fully connected layers. The reduced
rank approximation is a method to accelerate matrix multi-
plications. It aims to make a low-rank decomposition of the
matrix W:
W ≈ UΣt V, (6)
where U is a u × t matrix comprising of the first t left-
singular vectors of W, Σ t is a t × t diagonal matrix contain-
ing the top t singular values of W, and V is v × t matrix
Fig. 16. An illustration of how to speed up a linear detector (e.g., HOG comprising of the first t right-singular vectors of W. The
detector, DPM, etc) in frequency domain with fast Fourier transform and above process, also known as the Truncated SVD, reduces
inverse fast Fourier transform [226].
the parameter count from uv to t(u + v), which can be
significant if t is much smaller than min(u, v ). Truncated
convolution operation, and is the point-wise product. SVD has been used to accelerate the Fast RCNN detector
The above calculation can be accelerated by using the Fast [18] and achieves x2 speed up.
Fourier Transform (FFT) and the Inverse Fast Fourier Trans-
form (IFFT). FFT and IFFT have now been frequently used to 4 R ECENT A DVANCES IN O BJECT D ETECTION
speed up CNN models [222–225] and some classical linear
In this section, we will review the state of the art object
object detectors [226], which has improved the detection
detection methods in recent three years.
speed over an order of magnitude. Fig. 16 shows a standard
pipeline to speed up a linear object detector (e.g., HOG and
DPM) in the frequency domain. 4.1 Detection with Better Engines
In recent years, deep CNN has played a central role in
3.6.3 Vector Quantization many computer vision tasks. As the accuracy of a detector
The Vector Quantization (VQ) is a classical quantization depends heavily on its feature extraction networks, in this
method in signal processing that aims to approximate the paper, we refer to the backbone networks, e.g. the ResNet
distribution of a large group of data by a small set of and VGG, as the “engine” of a detector. Fig. 17 shows the
prototype vectors. It can be used for data compression and detection accuracy of three well-known detection systems:
accelerating the inner product operation in object detection Faster RCNN [19], R-FCN [46] and SSD [21] with different
[227, 228]. For example, with VQ, the HOG histograms can choices of the engines [27].
be grouped and quantified into a set of prototype histogram In this section, we will introduce some of the important
vectors. Then in the detection stage, the inner production detection engines in deep learning era. We refer readers to
between the feature vector and detection weights can be the following survey for more details on this topic [229].
19

Fig. 17. A comparison of detection accuracy of three detectors: Faster Fig. 18. An illustration of different feature fusion methods: (a) bottom-
RCNN [19], R-FCN [46] and SSD [21] on MS-COCO dataset with up fusion, (b) top-down fusion, (c) element-wise sum, (d) element-wise
different detection engines. Image from J. Huang et al. CVPR2017 [27]. product, and (e) concatenation.

AlexNet: AlexNet [40], an eight-layer deep network, object detection models such as STDN [237], DSOD [238],
was the first CNN model that started the deep learning TinyDSOD [207], and Pelee [209] choose DenseNet [235] as
revolution in computer vision. AlexNet famously won the their detection engine. The Mask RCNN [4], as the state of
2012 ImageNet LSVRC-2012 competition by a large margin the art model for instance segmentation, applied the next
[15.3% VS 26.2% (second place) error rates]. As of Feb. 2019, generation of ResNet: ResNeXt [239] as its detection engine.
the Alexnet paper has been cited over 30,000 times. Besides, to speed up detection, the depth-wise separable
VGG: VGG was proposed by Oxford’s Visual Geometry convolution operation, which was introduced by Xception
Group (VGG) in 2014 [230]. VGG increased the model’s [204], an improved version of Incepion, has also been used
depth to 16-19 layers and used very small (3x3) convolution in detectors such as MobileNet [205] and LightHead RCNN
filters instead of 5x5 and 7x7 those were previously used in [47].
AlexNet. VGG has achieved the state of the art performance
on the ImageNet dataset of its time.
GoogLeNet: GoogLeNet, a.k.a Inception [198, 231–233], 4.2 Detection with Better Features
is a big family of CNN models proposed by Google Inc.
since 2014. GoogLeNet increased both of a CNN’s width The quality of feature representations is critical for object
and depth (up to 22 layers). The main contribution of the detection. In recent years, many researchers have made
Inception family is the introduction of factorizing convolu- efforts to further improve the quality of image features on
tion and batch normalization. basis of some latest engines, where the most important two
ResNet: The Deep Residual Networks (ResNet) [234], groups of methods are: 1) feature fusion and 2) learning
proposed by K. He et al. in 2015, is a new type of convolu- high-resolution features with large receptive fields.
tional network architecture that is substantially deeper (up
to 152 layers) than those used previously. ResNet aims to 4.2.1 Why Feature Fusion is Important?
ease the training of networks by reformulating its layers as
learning residual functions with reference to the layer in- Invariance and equivariance are two important properties
puts. ResNet won multiple computer vision competitions in in image feature representations. Classification desires in-
2015, including ImageNet detection, ImageNet localization, variant feature representations since it aims at learning
COCO detection, and COCO segmentation. high-level semantic information. Object localization desires
DenseNet: DenseNet [235] was proposed by G. Huang equivariant representations since it aims at discriminating
and Z. Liu et al. in 2017. The success of ResNet suggested position and scale changes. As object detection consists of
that the short cut connection in CNN enables us to train two sub-tasks of object recognition and localization, it is cru-
deeper and more accurate models. The authors embraced cial for a detector to learn both invariance and equivariance
this observation and introduced a densely connected block, at the same time.
which connects each layer to every other layer in a feed- Feature fusion has been widely used in object detection
forward fashion. in the last three years. As a CNN model consists of a series
SENet: Squeeze and Excitation Networks (SENet) was of convolutional and pooling layers, features in deeper
proposed by J. Hu and L. Shen et al. in 2018 [236]. Its layers will have stronger invariance but less equivariance.
main contribution is the integration of global pooling and Although this could be beneficial to category recognition, it
shuffling to learn channel-wise importance of the feature suffers from low localization accuracy in object detection.
map. SENet won the 1st place in ILSVRC 2017 classification On the contrary, features in shallower layers is not con-
competition. ducive to learning semantics, but it helps object localization
as it contains more information about edges and contours.
• Object detectors with new engines Therefore, the integration of deep and shallow features in
In recent three years, many of the latest engines have a CNN model helps improve both invariance and equivari-
been applied to object detection. For example, some latest ance.
20

4.2.2 Feature Fusion in Different Ways As we mentioned before, the lower the feature resolution
There are many ways to perform feature fusion in object is, the harder will be to detect small objects. The most
detection. Here we introduce some recent methods in two straight forward way to increase the feature resolution is to
aspects: 1) processing flow and 2) element-wise operation. remove pooling layer or to reduce the convolution down-
sampling rate. But this will cause a new problem, the
• Processing flow receptive field will become too small due to the decreasing
of output stride. In other words, this will narrow a detector’s
Recent feature fusion methods in object detection can be
”sight” and may result in the miss detection of some large
divided into two categories: 1) bottom-up fusion, 2) top-
objects.
down fusion, as shown in Fig. 18 (a)-(b). Bottom-up fusion
feeds forward shallow features to deeper layers via skip A piratical method to increase both of the receptive field
connections [237, 240–242]. In comparison, top-down fusion and feature resolution at the same time is to introduce di-
feeds back the features of deeper layers into the shallower lated convolution (a.k.a. atrous convolution, or convolution
ones [22, 55, 243–246]. Apart from these methods, there are with holes). Dilated convolution is originally proposed in
more complex approaches proposed recently, e.g., weaving semantic segmentation tasks [252, 253]. Its main idea is to
features across different layers [247]. expand the convolution filter and use sparse parameters.
As the feature maps of different layers may have differ- For example, a 3x3 filter with a dilation rate of 2 will have
ent sizes both in terms of their spatial and channel dimen- the same receptive field as a 5x5 kernel but only have
sions, one may need to accommodate the feature maps, such 9 parameters. Dilated convolution has now been widely
as by adjusting the number of channels, up-sampling low- used in object detection [21, 56, 254, 255], and proves to
resolution maps, or down-sampling high-resolution maps to be effective for improved accuracy without any additional
a proper size. The easiest ways to do this is to use nearest- parameters and computational cost [56].
or bilinear-interpolation [22, 244]. Besides, fractional strided
convolution (a.k.a. transpose convolution) [45, 248], is an-
4.3 Beyond Sliding Window
other recent popular way to resize the feature maps and
adjust the number of channels. The advantage of using Although object detection has evolved from using hand-
fractional strided convolution is that it can learn an appro- crafted features to deep neural networks, the detection still
priate way to perform up-sampling by itself [55, 212, 241– follows a paradigm of “sliding window on feature maps”
243, 245, 246, 249]. [137]. Recently, there are some detectors built beyond sliding
windows.
• Element-wise operation
• Detection as sub-region search
From a local point of view, feature fusion can be consid-
ered as the element-wise operation between different feature Sub-region search [184, 256–258] provides a new way
maps. There are three groups of methods: 1) element-wise of performing detection. One recent method is to think
sum, 2) element-wise product, and 3) concatenation, as of detection as a path planning process that starts from
shown in Fig. 18 (c)-(e). initial grids and finally converges to the desired ground
The element-wise sum is the easiest way to perform truth boxes [256]. Another method is to think of detection
feature fusion. It has been frequently used in many recent as an iterative updating process to refine the corners of a
object detectors [22, 55, 241, 243, 246]. The element-wise predicted bounding box [257].
product [245, 249–251] is very similar to the element-wise
sum, while the only difference is the use of multiplication • Detection as key points localization
instead of summation. An advantage of element-wise prod-
uct is that it can be used to suppress or highlight the features Key points localization is an important computer vision
within a certain area, which may further benefit small object task that has extensively broad applications, such as facial
detection [245, 250, 251]. Feature concatenation is another expression recognition [259], human poses identification
way of feature fusion [212, 237, 240, 244]. Its advantage [260], etc. As any object in an image can be uniquely
is that it can be used to integrate context information of determined by its upper left corner and lower right corner
different regions [105, 144, 149, 161], while its disadvantage of the ground truth box, the detection task, therefore, can be
is the increase of the memory [235]. equivalently framed as a pair-wise key points localization
problem. One recent implementation of this idea is to pre-
4.2.3 Learning High Resolution Features with Large Re- dict a heat-map for the corners [261]. The advantage of this
ceptive Fields approach is that it can be implemented under a semantic
The receptive field and feature resolution are two important segmentation framework, and there is no need to design
characteristics of a CNN based detector, where the former multi-scale anchor boxes.
one refers to the spatial range of input pixels that contribute
to the calculation of a single pixel of the output, and the
4.4 Improvements of Localization
latter one corresponds to the down-sampling rate between
the input and the feature map. A network with a larger To improve localization accuracy, there are two groups of
receptive field is able to capture a larger scale of context methods in recent detectors: 1) bounding box refinement,
information, while that with a smaller one may concentrate and 2) designing new loss functions for accurate localiza-
more on the local details. tion.
21

4.4.1 Bounding Box Refinement the boundary of an object, segmentation may be helpful for
The most intuitive way to improve localization accuracy category recognition.
is bounding box refinement, which can be considered as • Segmentation helps accurate localization
a post-processing of the detection results. Although the
bounding box regression has been integrated into most of The ground-truth bounding box of an object is deter-
the modern object detectors, there are still some objects mined by its well-defined boundary. For some objects with
with unexpected scales that cannot be well captured by any a special shape (e.g., imagine a cat with a very long tail),
of the predefined anchors. This will inevitably lead to an it will be difficult to predict high IoU locations. As object
inaccurate prediction of their locations. For this reason, the boundaries can be well encoded in semantic segmentation
“iterative bounding box refinement” [262–264] has been in- features, learning with segmentation would be helpful for
troduced recently by iteratively feeding the detection results accurate object localization.
into a BB regressor until the prediction converges to a correct • Segmentation can be embedded as context
location and size. However, some researchers also claimed
that this method does not guarantee the monotonicity of Objects in daily life are surrounded by different back-
localization accuracy [262], in other words, the BB regression grounds, such as the sky, water, grass, etc, and all these
may degenerate the localization if it is applied for multiple elements constitute the context of an object. Integrating the
times. context of semantic segmentation will be helpful for object
detection, say, an aircraft is more likely to appear in the sky
4.4.2 Improving Loss Functions for Accurate Localization than on the water.
In most modern detectors, object localization is considered 4.5.2 How Segmentation Improves Detection?
as a coordinate regression problem. However, there are
two drawbacks of this paradigm. First, the regression loss There are two main approaches to improve object detection
function does not correspond to the final evaluation of by segmentation: 1) learning with enriched features and 2)
localization. For example, we can not guarantee that a lower learning with multi-task loss functions.
regression error will always produce a higher IoU predic- • Learning with enriched features
tion, especially when the object has a very large aspect ratio.
Second, the traditional bounding box regression method The simplest way is to think of the segmentation net-
does not provide the confidence of localization. When there work as a fixed feature extractor and to integrate it into a de-
are multiple BB’s overlapping with each other, this may lead tection framework as additional features [144, 269, 270]. The
to failure in non-maximum suppression (see more details in advantage of this approach is that it is easy to implement,
subsection 2.3.5). while the disadvantage is that the segmentation network
The above problems can be alleviated by designing may bring additional calculation.
new loss functions. The most intuitive design is to directly • Learning with multi-task loss functions
use IoU as the localization loss function [265]. Some other
researchers have further proposed an IoU-guided NMS to Another way is to introduce an additional segmentation
improve localization in both training and detection stages branch on top of the original detection framework and to
[163]. Besides, some researchers have also tried to improve train this model with multi-task loss functions (segmenta-
localization under a probabilistic inference framework [266]. tion loss + detection loss) [4, 269]. In most cases, the segmen-
Different from the previous methods that directly predict tation brunch will be removed at the inference stage. The
the box coordinates, this method predicts the probability advantage is the detection speed will not be affected, but the
distribution of a bounding box location. disadvantage is that the training requires pixel-level image
annotations. To this end, some researchers have followed the
idea of “weakly supervised learning”: instead of training
4.5 Learning with Segmentation based on pixel-wise annotation masks, they simply train
Object detection and semantic segmentation are all impor- the segmentation brunch based on the bounding-box level
tant tasks in computer vision. Recent researches suggest annotations [250, 271].
object detection can be improved by learning with semantic
segmentation. 4.6 Robust Detection of Rotation and Scale Changes
Object rotation and scale changes are important challenges
4.5.1 Why Segmentation Improves Detection? in object detection. As the features learned by CNN are
There are three reasons why the semantic segmentation not invariant to rotation and large degree of scale changes,
improves object detection. in recent years, many people have made efforts in this
problem.
• Segmentation helps category recognition
Edges and boundaries are the basic elements that consti- 4.6.1 Rotation Robust Detection
tute human visual cognition [267, 268]. In computer vision, Object rotation is very common in detection tasks such as
the difference between an object (e.g., a car, a person) and a face detection, text detection, etc. The most straight forward
stuff (e.g., sky, water, grass) is that the former usually has a solution to this problem is data augmentation so that an
closed and well defined boundary while the latter does not. object in any orientation can be well covered by the aug-
As the feature of semantic segmentation tasks well captures mented data [88]. Another solution is to train independent
22

detectors for every orientation [272, 273]. Apart from these “larger ones” [184, 258]. Another recent improvement is
traditional approaches, recently, there are some new im- learning to predict the scale distribution of objects in an
provement methods. image, and then adaptively re-scaling the image according
to the distribution [282, 283].
• Rotation invariant loss functions
The idea of learning with rotation invariant loss function
can be traced back to the 1990s [274]. Some recent works 4.7 Training from Scratch
have introduced a constraint on the original detection loss Most deep learning based detectors are first pre-trained on
function so that to make the features of rotated objects large scale datasets, say ImageNet, and then fine-tuned on
unchanged [275, 276]. specific detection tasks. People have always believed that
pre-training helps to improve generalization ability and
• Rotation calibration
training speed and the question is, do we really need to
Another way of improving rotation invariant detection is pre-training a detector on ImageNet? In fact, there are some
to make geometric transformations of the objects candidates limitations when adopting the pre-trained networks in ob-
[277–279]. This will be especially helpful for multi-stage ject detection. The first limitation is the divergence between
detectors, where the correlation at early stages will benefit ImageNet classification and object detection, including their
the subsequent detections. The representative of this idea loss functions and scale/category distributions. The second
is Spatial Transformer Networks (STN) [278]. STN has now limitation is the domain mismatch. As images in ImageNet
been used in rotated text detection [278] and rotated face are RGB images while detection sometimes will be applied
detection [279]. to depth image (RGB-D) or 3D medical images, the pre-
trained knowledge can not be well transfer to these detec-
• Rotation RoI Pooling
tion tasks.
In a two-stage detector, feature pooling aims to extract In recent years, some researchers have tried to train an
a fixed length feature representation for an object proposal object detector from scratch. To speed up training and im-
with any location and size by first dividing the proposal prove stability, some researchers introduce dense connection
evenly into a set of grids, and then concatenating the grid and batch normalization to accelerate the back-propagation
features. As the grid meshing is performed in Cartesian in shallow layers [238, 284]. The recent work by K. He
coordinates, the features are not invariance to rotation trans- et al. [285] has further questioned the paradigm of pre-
form. A recent improvement is to mesh the grids in polar training even further by exploring the opposite regime:
coordinates so that the features could be robust to the they reported competitive results on object detection on
rotation changes [272]. the COCO dataset using standard models trained from
random initialization, with the sole exception of increasing
4.6.2 Scale Robust Detection the number of training iterations so the randomly initialized
Recent improvements have been made at both training and models may converge. Training from random initialization
detection stages for scale robust detection. is also surprisingly robust even using only 10% of the train-
ing data, which indicates that ImageNet pre-training may
• Scale adaptive training
speed up convergence, but does not necessarily provide
Most of the modern detectors re-scale the input image regularization or improve final detection accuracy.
to a fixed size and back propagate the loss of the objects
in all scales, as shown in Fig. 19 (a). However, a drawback
4.8 Adversarial Training
of doing this is there will be a “scale imbalance” problem.
Building an image pyramid during detection could alleviate The Generative Adversarial Networks (GAN) [286], intro-
this problem but not fundamentally [46, 234]. A recent duced by A. Goodfellow et al. in 2014, has received great
improvement is Scale Normalization for Image Pyramids attention in recent years. A typical GAN consists of two
(SNIP) [280], which builds image pyramids at both of neural networks: a generator networks and a discriminator
training and detection stages and only backpropagates the networks, contesting with each other in a minimax opti-
loss of some selected scales, as shown in Fig. 19 (b). Some mization framework. Typically, the generator learns to map
researchers have further proposed a more efficient training from a latent space to a particular data distribution of inter-
strategy: SNIP with Efficient Resampling (SNIPER) [281], i.e. est, while the discriminator aims to discriminate between in-
to crop and re-scale an image to a set of sub-regions so that stances from the true data distribution and those produced
to benefit from large batch training. by the generator. GAN has been widely used for many
computer vision tasks such as image generation[286, 287],
• Scale adaptive detection
image style transfer [288], and image super-resolution [289].
Most of the modern detectors use the fixed configura- In recent two years, GAN has also been applied to object
tions for detecting objects of different sizes. For example, detection, especially for improving the detection of small
in a typical CNN based detector, we need to carefully and occluded object.
define the size of anchors. A drawback of doing this is GAN has been used to enhance the detection on small
the configurations cannot be adaptive to unexpected scale objects by narrowing the representations between small and
changes. To improve the detection of small objects, some large ones [290, 291]. To improve the detection of occluded
“adaptive zoom-in” techniques are proposed in some recent objects, one recent idea is to generate occlusion masks
detectors to adaptively enlarge the small objects into the by using adversarial training [292]. Instead of generating
23

Fig. 19. Different training strategies for multi-scale object detection: (a): Training on a single resolution image, back propagate objects of all scales
[17–19, 21]. (b) Training on multi-resolution images (image pyramid), back propagate objects of selected scale. If an object is too large or too small,
its gradient will be discarded [56, 280, 281].

examples in pixel space, the adversarial network directly prove WSOD. More recently, generative adversarial training
modifies the features to mimic occlusion. has been used for WSOD [302].
In addition to these works, “adversarial attack” [293],
which aims to study how to attack a detector with adver-
sarial examples, has drawn increasing attention recently. 5 A PPLICATIONS
The research on this topic is especially important for au-
In this section, we will review some important detection
tonomous driving, as it cannot be fully trusted before guar-
applications in the past 20 years, including pedestrian
anteeing the robustness to adversarial attacks.
detection, face detection, text detection, traffic sign/light
detection, and remote sensing target detection.
4.9 Weakly Supervised Object Detection
The training of a modern object detector usually requires 5.1 Pedestrian Detection
a large amount of manually labeled data, while the label- Pedestrian detection, as an important object detection ap-
ing process is time-consuming, expensive, and inefficient. plication, has received extensive attention in many areas
Weakly Supervised Object Detection (WSOD) aims to solve such as autonomous driving, video surveillance, criminal
this problem by training a detector with only image level investigation, etc. Some early time’s pedestrian detection
annotations instead of bounding boxes. methods, such as HOG detector [12], ICF detector [171], laid
Recently, multi-instance learning has been used for a solid foundation for general object detection in terms of
WSOD [294, 295]. Multi-instance learning is a group of su- the feature representation [12, 171], the design of classifier
pervised learning method [39, 296]. Instead of learning with [174], and the detection acceleration [177]. In recent years,
a set of instances which are individually labeled, a multi- some general object detection algorithms, e.g., Faster RCNN
instance learning model receives a set of labeled bags, each [19], have been introduced to pedestrian detection [165], and
containing many instances. If we consider object candidates has greatly promoted the progress of this area.
in one image as a bag, and image-level annotation as the
label, then the WSOD can be formulated as a multi-instance
5.1.1 Difficulties and Challenges
learning process.
Class activation mapping is another recently group of The challenges and difficulties in pedestrian detection can
methods for WSOD [297, 298]. The research on CNN visu- be summarized as follows.
alization has shown that the convolution layer of a CNN Small pedestrian: Fig. 20 (a) shows some examples of
behaves as object detectors despite there is no supervision the small pedestrians that are captured far from the camera.
on the location of the object. Class activation mapping shed In Caltech Dataset [59, 60], 15% of the pedestrians are less
light on how to enable a CNN to have localization ability than 30 pixels in height.
despite being trained on image level labels [299]. Hard negatives: Some backgrounds in street view im-
In addition to the above approaches, some other re- ages are very similar to pedestrians in their visual appear-
searchers considered the WSOD as a proposal ranking pro- ance, as shown in Fig. 20 (b).
cess by selecting the most informative regions and then Dense and occluded pedestrian: Fig 20 (c) shows some
training these regions with image-level annotation [300]. examples of dense and occluded pedestrians. In the Caltech
Another simple method for WSOD is to mask out different Dataset [59, 60], pedestrians that haven’t been occluded only
parts of the image. If the detection score drops sharply, account for 29% of the total pedestrian instances.
then an object would be covered with high probability Real-time detection: The real-time pedestrian detection
[301]. Besides, interactive annotation [295] takes human from HD video is crucial for some applications like au-
feedback into consideration during training so that to im- tonomous driving and video surveillance.
24

of hard negatives by using both RGB and infrared images


[317].
To improve dense and occluded pedestrian detection:
As we have mentioned in Section 2.3.2, the features in
deeper layers of CNN have richer semantics but are not
effective for detecting dense objects. To this end, some
researchers have designed new loss function by considering
the attraction of target and the repulsion of other surround-
ing objects [318]. Target occlusion is another problem that
usually comes up with dense pedestrians. The ensemble of
Fig. 20. Some hard examples of pedestrian detection from Caltech part detectors [319, 320] and the attention mechanism [321]
dataset [59, 60]: (a) small pedestrians, (b) hard negatives, and (c) dense are the most common ways to improve occluded pedestrian
and occluded pedestrians. detection.

5.1.2 Literature Review 5.2 Face Detection


Pedestrian detection has a very long research history [30, Face detection is one of the oldest computer vision appli-
31, 101]. Its development can be divided into two technical cations [96, 164]. Early time’s face detection, such as the
periods: 1) traditional pedestrian detection and 2) deep VJ detector [10], has greatly promoted the object detection
learning based pedestrian detection. We refer readers to the where many of its remarkable ideas are still playing impor-
following surveys for more details on this topic [60, 303– tant roles even in today’s object detection. Face detection
307]. has now been applied in all walks of life, such as the “smile”
detection in digital cameras, “face swiping” in e-commerce,
• Traditional pedestrian detection methods facial makeup in mobile apps, etc.
Due to the limitations of computing resources, the Haar
5.2.1 Difficulties and Challenges
wavelet feature has been broadly used in early time’s pedes-
trian detection [30, 31, 308]. To improve the detection of The difficulties and challenges in face detection can be
occluded pedestrians, one popular idea of that time was summarized as follows:
“detection by components” [31, 102, 220], i.e., to think of Intra-class variation: Human faces may present a variety
the detection as an ensemble of multiple part detectors that of expressions, skin colors, poses, and movements, as shown
trained individually on different human parts, e.g. head, in Fig. 21 (a).
legs, and arms. As the increase of computing power, people Occlusion: Faces may be partially occluded by other
started to design more complex detection models, and since objects, as shown in Fig. 21 (b).
2005, gradient-based representation [12, 37, 177, 220, 309] Multi-scale detection: Detecting faces in a large variety
and DPM [15, 37, 54] have become the mainstream of pedes- of scales, especially for some tiny faces, as shown in Fig. 21
trian detection. In 2009, by using the integral image accelera- (c).
tion, an effective and lightweight feature representation: the Real-time detection: Face detection on mobile devices
Integral Channel Features (ICF), was proposed [171]. ICF usually requires a CPU real-time detection speed.
then became the new benchmark of pedestrian detection
5.2.2 Literature review
at that time [60]. In addition to the feature representation,
some domain knowledge also has been considered, such as The research of face detection can be traced back to the
appearance constancy and shape symmetry [310] and stereo early 1990s [95, 106, 108]. It then has gone through multiple
information [173, 311]. historical periods: early time’s face detection (before 2001),
traditional face detection (2001-2015), and deep learning
• Deep learning based pedestrian detection methods based face detection (2015-now). We refer readers to the
Pedestrian detection is one of the first computer vision following surveys for more details [323, 324].
task that applies deep learning [312]. • Early time’s face detection (before 2001)
To improve small pedestrian detection: Although deep
The early time’s face detection algorithms can be divided
learning object detectors such as Fast/Faster R-CNN have
into three groups: 1) Rule-based methods. This group of
shown state of the art performance for general object detec-
methods encode human knowledge of what constitutes a
tion, they have limited success for detecting small pedestri-
typical face and capture the relationships between facial
ans due to the low resolution of their convolutional features
elements [107, 108]. 2) Subspace analysis-based methods.
[165]. Some recent solutions to this problem include feature
This group of methods analyze the face distribution in
fusion [165], introducing extra high-resolution handcrafted
underlying linear subspace [95, 106]. Eigenfaces is the rep-
features [313, 314], and ensembling detection results on
resentative of this group of methods [95]. 3) Learning based
multiple resolutions [315].
methods: To frame the face detection as a sliding window
To improve hard negative detection: Some recent im-
+ binary classification (target vs background) process. Some
provements include the integration of boosted decision tree
commonly used models of this group include neural net-
[165], and semantics segmentation (as the context of the
work [96, 164, 325] and SVM [29, 326].
pedestrians) [316]. In addition, the idea of “cross-modal
learning” has also been introduced to enrich the feature • Traditional face detection (2000-2015)
25

Fig. 22. Challenges in text detection and recognition: (a) Variation of


fonts, colors and languages. Image from maxpixel (free of copyrights).
(b) Text rotation and perspective distortion. Image from Y. Liu et al.
CVPR2017 [336]. (c) Densely arranged text localization. Image from Y.
Wu et al. ICCV2017 [337].
Fig. 21. Challenges in face detection: (a) Intra-class variation, image
from WildestFaces Dataset [70]. (b) Face occlusion, image from UFDD
Dataset [69]. (c) Multi-scale face detection. Image from P. Hu et al. detection is to determine whether or not there is text in a
CVPR2017 [322].
There are two groups of face detectors in this period. The first group of methods is built on boosted decision trees [10, 11, 109]. These methods are easy to compute, but usually suffer from low detection accuracy under complex scenes. The second group is based on early time’s convolutional neural networks, where the shared computation of features is used to speed up detection [112, 113, 327].

• Deep learning based face detection (after 2015)

In the deep learning era, most of the face detection algorithms follow the detection ideas of general object detectors such as Faster RCNN and SSD.

To speed up face detection: Cascaded detection (see more details in Section 3.3) is the most common way to speed up a face detector in the deep learning era [179, 180]. Another speed-up method is to predict the scale distribution of the faces in an image [283] and then run detection on some selected scales.

To improve multi-pose and occluded face detection: The idea of “face calibration” has been used to improve multi-pose face detection by estimating the calibration parameters [279] or using progressive calibration through multiple detection stages [277]. To improve occluded face detection, two methods have been proposed recently. The first one is to incorporate an “attention mechanism” so as to highlight the features of the underlying face targets [250]. The second one is “detection based on parts” [328], which inherits ideas from DPM.

To improve multi-scale face detection: Recent works on multi-scale face detection [322, 329–331] use similar detection strategies as those in general object detection, including multi-scale feature fusion and multi-resolution detection (see Sections 2.3.2 and 4.2.2 for more details).

5.3 Text Detection

Text has long been the major information carrier of humans for thousands of years. The fundamental goal of text detection is to determine whether or not there is text in a given image, and if there is, to localize and recognize it. Text detection has very broad applications. It helps people who are visually impaired to “read” street signs and currency [332, 333]. In geographic information systems, the detection and recognition of house numbers and street signs make it easier to build digital maps [334, 335].

5.3.1 Difficulties and Challenges

The difficulties and challenges of text detection can be summarized as follows:

Different fonts and languages: Texts may have different fonts, colors, and languages, as shown in Fig. 22 (a).

Text rotation and perspective distortion: Texts may have different orientations and may even have perspective distortion, as shown in Fig. 22 (b).

Densely arranged text localization: Text lines with large aspect ratios and dense layout are difficult to localize accurately, as shown in Fig. 22 (c).

Broken and blurred characters: Broken and blurred characters are common in street view images.

5.3.2 Literature Review

Text detection consists of two related but relatively independent tasks: 1) text localization, and 2) text recognition. The existing text detection methods can be divided into two groups: “step-wise detection” and “integrated detection”. We refer readers to the following surveys for more details [338, 339].

• Step-wise detection vs integrated detection

Step-wise detection methods [340, 341] consist of a series of processing steps including character segmentation, candidate region verification, character grouping, and word recognition. The advantage of this group of methods is that most of the background can be filtered out in the coarse segmentation step, which greatly reduces the computational cost of the following process. The disadvantage is that the parameters of all steps need to be set carefully, and errors will occur and accumulate throughout each of these steps. By contrast,
integrated methods [342–345] frame the text detection as a
joint probability inference problem, where the steps of char-
acter localization, grouping, and recognition are processed
under a unified framework. The advantage of these methods is that they avoid cumulative errors and can easily integrate language models. The disadvantage is that the inference becomes computationally expensive when considering a large number of character classes and candidate windows [339].
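As a toy illustration of this joint inference view (not the formulation of any specific cited method), the sketch below scores a candidate word hypothesis by combining per-character detection scores with a language-model prior, so that localization, grouping, and recognition are evaluated under one objective. The score structure and the input format are illustrative assumptions.

import math

def word_hypothesis_score(char_detections, word, lm_log_prob):
    """Score a word hypothesis under a toy joint model: the sum of
    per-character recognition log-probabilities plus a language-model
    log-prior for the whole word.

    char_detections: list of dicts like {'label_probs': {'A': 0.7, ...}},
                     one per grouped character region, left to right.
    word:            candidate transcription for this grouping.
    lm_log_prob:     log P(word) from any language model (assumed given).
    """
    if len(word) != len(char_detections):
        return float('-inf')  # grouping and transcription must agree
    score = lm_log_prob
    for det, ch in zip(char_detections, word):
        p = det['label_probs'].get(ch, 1e-6)  # recognition probability
        score += math.log(p)
    return score

# Usage: pick the transcription with the highest joint score, e.g.
# best = max(candidates, key=lambda w: word_hypothesis_score(dets, w, lm(w)))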
• Traditional methods vs deep learning methods
Most of the traditional text detection methods generate
text candidates in an unsupervised way, where the com-
monly used techniques include Maximally Stable Extremal
Regions (MSER) segmentation [341] and morphological fil-
tering [346]. Some domain knowledge, such as the symme-
try of texts and the structures of strokes, has also been considered in these methods [340, 341, 347].
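As a concrete illustration of this unsupervised candidate generation step, the sketch below uses OpenCV’s MSER implementation to extract stable regions and keeps only those whose size and aspect ratio roughly match text components. The filtering thresholds are illustrative assumptions, not values taken from the cited methods.

import cv2

def text_region_candidates(image_path, max_aspect=8.0, min_area=30):
    """Generate text candidate boxes with MSER segmentation followed by
    simple geometric filtering (a rough, unsupervised first stage)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)  # each bbox is (x, y, w, h)
    candidates = []
    for (x, y, w, h) in bboxes:
        if w * h < min_area:
            continue  # drop tiny regions that are unlikely to be characters
        aspect = max(w, h) / max(1.0, float(min(w, h)))
        if aspect > max_aspect:
            continue  # drop extremely elongated regions
        candidates.append((int(x), int(y), int(w), int(h)))
    return candidates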
In recent years, researchers have paid more attention to the problem of text localization rather than recognition. Two groups of methods have been proposed recently. The first group of methods frames text detection as a special case of general object detection [251, 348–357]. These methods have a unified detection framework, but are less effective for detecting texts with orientation or with large aspect ratios. The second group of methods frames text detection as an image segmentation problem [336, 337, 358–360]. The advantage of these methods is that there are no special restrictions on the shape and orientation of text, while the disadvantage is that it is not easy to distinguish densely arranged text lines from each other based on the segmentation result. The recent deep learning based text detection methods have proposed some solutions to the above problems.

For text rotation and perspective changes: The most common solution to this problem is to introduce additional parameters in the anchor boxes and RoI pooling layer that are associated with rotation and perspective changes [351–353, 355–357].

To improve densely arranged text detection: The segmentation-based approach shows more advantages in detecting densely arranged texts. To distinguish adjacent text lines, two groups of solutions have been proposed recently. The first one is “segment and linking”, where “segment” refers to the character heatmap, and “linking” refers to the connection between two adjacent segments indicating that they belong to the same word or line of text [336, 358]. The second group is to introduce an additional corner/border detection task to help separate densely arranged texts, where a group of corners or a closed boundary corresponds to an individual line of text [337, 359, 360].

To improve broken and blurred text detection: A recent idea to deal with broken and blurred texts is to use word-level recognition [77, 361] and sentence-level recognition [335]. To deal with texts with different fonts, the most effective way is training with synthetic samples [77, 348].

Fig. 23. Challenges in traffic sign detection and traffic light detection: (a) Illumination changes. Image from pxhere (free of copyrights). (b) Motion blur. Image from GTSRB Dataset [81]. (c) Detection under bad weather. Image from Flickr and Max Pixel (free of copyrights).

5.4 Traffic Sign and Traffic Light Detection

With the development of self-driving technology, the automatic detection of traffic signs and traffic lights has attracted great attention in recent years. Over the past decades, although the computer vision community has largely pushed towards the detection of general objects rather than fixed patterns like traffic lights and traffic signs, it would still be a mistake to believe that their recognition is not challenging.

5.4.1 Difficulties and Challenges

The challenges and difficulties of traffic sign/light detection can be summarized as follows:

Illumination changes: The detection will be particularly difficult when driving into sun glare or at night, as shown in Fig. 23 (a).

Motion blur: The image captured by an on-board camera will become blurred due to the motion of the car, as shown in Fig. 23 (b).

Bad weather: In bad weather, e.g., on rainy and snowy days, the image quality will be affected, as shown in Fig. 23 (c).

Real-time detection: This is particularly important for autonomous driving.

5.4.2 Literature Review

Existing traffic sign/light detection methods can be divided into two groups: 1) traditional detection methods and 2) deep learning based detection methods. We refer readers to the following survey [80] for more details on this topic.

• Traditional detection methods

The research of vision based traffic sign/light detection can date back to as far as 20 years ago [362, 363]. As traffic signs/lights have particular shapes and colors, the traditional detection methods are usually based on color thresholding [364–368], visual saliency detection [369], morphological filtering [79], and edge/contour analysis [370, 371]. As the above methods are merely designed based on low-level vision, they usually fail under complex environments (as shown in Fig. 23); therefore, some researchers began to find other solutions beyond vision-based approaches, e.g., to combine GPS and digital maps in traffic light detection [372, 373]. Although “feature pyramid + sliding window”
had become a standard framework for general object detection and pedestrian detection at that time, apart from a very small number of works [374], the mainstream of traffic sign/light detection methods did not follow this paradigm until 2010 [375–377].
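As a minimal illustration of the color-thresholding idea mentioned above, the sketch below segments red-ish pixels in HSV space and returns connected components of plausible size as sign candidates. The HSV ranges and size limit are illustrative assumptions rather than values from the cited works, and OpenCV 4’s findContours signature is assumed.

import cv2
import numpy as np

def red_sign_candidates(bgr_image, min_area=100):
    """Return bounding boxes of red-ish blobs as traffic sign candidates."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so combine two hue ranges.
    lower = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255))
    upper = cv2.inRange(hsv, (170, 80, 60), (180, 255, 255))
    mask = cv2.bitwise_or(lower, upper)
    # Morphological filtering to suppress isolated noisy pixels.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area:
            boxes.append((x, y, w, h))
    return boxes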
• Deep learning based detection methods
In the deep learning era, some well-known detectors such as Faster RCNN and SSD were applied to traffic sign/light detection tasks [83, 84, 378, 379]. On the basis of these detectors, some new techniques, such as the attention mechanism and adversarial training, have been used to improve detection under complex traffic environments [290, 378].
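The snippet below sketches how such an off-the-shelf detector can be adapted to traffic signs by replacing the classification head of a COCO-pretrained torchvision Faster R-CNN; the number of classes is a placeholder, and the training loop is omitted, so this is a sketch of the common fine-tuning recipe rather than the method of any particular work cited above.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_traffic_sign_detector(num_classes=4):
    """Adapt a COCO-pretrained Faster R-CNN to a small traffic-sign
    label set (background + 3 hypothetical sign categories)."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the box classification/regression head for the new classes.
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model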
5.5 Remote Sensing Target Detection
Remote sensing imaging techniques have opened a door for
people to better understand the earth. In recent years, as the
resolution of remote sensing images has increased, remote
sensing target detection (e.g., the detection of airplanes, ships, oil-pots, etc.) has become a research hot-spot. Remote sensing target detection has broad applications, such as military investigation, disaster rescue, and urban traffic management.

Fig. 24. Challenges in remote sensing target detection: (a) Detection in “big data”: data volume comparison between a single-view remote sensing imagery and an average image size of VOC, ImageNet, and MS-COCO. (b) Targets occluded by cloud. Images from S. Qiu et al. JSTARS2017 [380] and Z. Zou et al. TGRS2016 [381].

5.5.1 Difficulties and Challenges

The challenges and difficulties in remote sensing target detection are summarized as follows:

Detection in “big data”: Due to the huge data volume of remote sensing images, how to quickly and accurately detect remote sensing targets remains a problem. Fig. 24 (a) shows a comparison of data volume between remote sensing images and natural images.

Occluded targets: Over 50% of the earth’s surface is covered by cloud every day. Some examples of occluded targets are shown in Fig. 24 (b).

Domain adaptation: Remote sensing images captured by different sensors (e.g., with different modalities and resolutions) present a high degree of difference.

5.5.2 Literature Review

We refer readers to the following surveys for more details on this topic [90, 382].

• Traditional detection methods

Most of the traditional remote sensing target detection methods follow a two-stage detection paradigm: 1) candidate extraction and 2) target verification. In the candidate extraction stage, some frequently used methods include gray value filtering based methods [383, 384], visual saliency-based methods [385–388], wavelet transform based methods [389], anomaly detection based methods [390], etc. One similarity of the above methods is that they are all unsupervised and thus usually fail in complex environments. In the target verification stage, some frequently used features include HOG [390, 391], LBP [384], SIFT [386, 388, 392], etc. Besides, there are also some other methods following the sliding window detection paradigm [391–394].

To detect targets with a particular structure and shape, such as oil-pots and inshore ships, some domain knowledge is used. For example, oil-pot detection can be considered as a circle/arc detection problem [395, 396]. Inshore ship detection can be considered as the detection of the foredeck and the stern [397, 398]. To improve occluded target detection, one commonly used idea is “detection by parts” [380, 399]. To detect targets with different orientations, the “mixture model” is used by training different detectors for targets of different orientations [273].

• Deep learning based detection methods

After the great success of RCNN in 2014, deep CNNs were soon applied to remote sensing target detection [275, 276, 400, 401]. General object detection frameworks like Faster RCNN and SSD have attracted increasing attention in the remote sensing community [91, 167, 381, 402–405]. Due to the huge difference between a remote sensing image and an everyday image, some investigations have been made on the effectiveness of deep CNN features for remote sensing images [406–408]. People discovered that, in spite of its great success, the deep CNN is no better than traditional methods for spectral data [406]. To detect targets with different orientations, some researchers have improved the RoI Pooling layer for better rotation invariance [272, 409]. To improve domain adaptation, some researchers formulated the detection from a Bayesian view such that, at the detection stage, the model is adaptively updated based on the distribution of test images [91]. In addition, attention mechanisms and feature fusion strategies have also been used to improve small target detection [410, 411].
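To summarize the traditional two-stage paradigm reviewed above in code, the sketch below couples a simple intensity-based candidate extraction step with HOG + linear SVM verification. The threshold values, window size, and the pre-trained svm classifier are illustrative assumptions rather than components of any cited method.

import cv2
from skimage.feature import hog

def detect_targets(gray, svm, win=64, intensity_thresh=200, min_area=200):
    """Stage 1: extract bright candidate regions by gray-value filtering.
    Stage 2: verify each candidate with HOG features and a linear SVM
    (svm is assumed to be trained on HOG vectors of the same layout)."""
    # Stage 1: candidate extraction (unsupervised gray-value filtering).
    _, mask = cv2.threshold(gray, intensity_thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    detections = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h < min_area:
            continue
        # Stage 2: target verification on a fixed-size patch.
        patch = cv2.resize(gray[y:y + h, x:x + w], (win, win))
        feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
        if svm.decision_function(feat.reshape(1, -1))[0] > 0:
            detections.append((x, y, w, h))
    return detections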
6 CONCLUSION AND FUTURE DIRECTIONS

Remarkable achievements have been made in object detection over the past 20 years. This paper not only extensively reviews some milestone detectors (e.g., the VJ detector, HOG detector, DPM, Faster-RCNN, YOLO, SSD, etc.), key technologies, speed up methods, detection applications, datasets, and metrics in its 20 years of history, but also discusses the challenges currently met by the community, and how these detectors can be further extended and improved.

The future research of object detection may focus on, but is not limited to, the following aspects:

Lightweight object detection: To speed up the detection algorithm so that it can run smoothly on mobile devices. Some important applications include mobile augmented reality, smart cameras, face verification, etc. Although a great effort has been made in recent years, the speed gap between a machine and human eyes still remains large, especially for detecting some small objects.

Detection meets AutoML: Recent deep learning based detectors are becoming more and more sophisticated and rely heavily on experience. A future direction is to reduce human intervention when designing the detection model (e.g., how to design the engine and how to set anchor boxes) by using neural architecture search. AutoML could be the future of object detection.

Detection meets domain adaptation: The training process of any target detector can be essentially considered as a likelihood estimation process under the assumption of independent and identically distributed (i.i.d.) data. Object detection with non-i.i.d. data, especially for some real-world applications, still remains a challenge. GANs have shown promising results in domain adaptation and may be of great help to object detection in the future.

Weakly supervised detection: The training of a deep learning based detector usually relies on a large amount of well-annotated images. The annotation process is time-consuming, expensive, and inefficient. Developing weakly supervised detection techniques where the detectors are trained only with image-level annotations, or partially with bounding box annotations, is of great importance for reducing labor costs and improving detection flexibility.

Small object detection: Detecting small objects in large scenes has long been a challenge. Some potential applications of this research direction include counting the population of wild animals with remote sensing images and detecting the state of some important military targets. Some further directions may include the integration of visual attention mechanisms and the design of high-resolution lightweight networks.

Detection in videos: Real-time object detection/tracking in HD videos is of great importance for video surveillance and autonomous driving. Traditional object detectors are usually designed for image-wise detection and simply ignore the correlations between video frames. Improving detection by exploring the spatial and temporal correlations is an important research direction.

Detection with information fusion: Object detection with multiple sources/modalities of data, e.g., RGB-D images, 3D point clouds, LIDAR, etc., is of great importance for autonomous driving and drone applications. Some open questions include: how to migrate well-trained detectors to different modalities of data, how to make information fusion to improve detection, etc.

Standing on the highway of technical evolutions, we believe this paper will help readers to build a big picture of object detection and to find future directions of this fast-moving research field.

REFERENCES

[1] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in European Conference on Computer Vision. Springer, 2014, pp. 297–312.
[2] ——, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 447–456.
[3] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3150–3158.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
[5] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137.
[6] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
[7] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel, “Image captioning and visual question answering based on attributes and external knowledge,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1367–1381, 2018.
[8] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang et al., “T-cnn: Tubelets with convolutional neural networks for object detection from videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2896–2907, 2018.
[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
[10] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–I.
[11] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
[12] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
[13] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[14] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in Computer vision and pattern recognition (CVPR), 2010 IEEE conference on. IEEE, 2010, pp. 2241–2248.
[15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual
recognition,” in European conference on computer vision. detection with structured models. Citeseer, 2012.
Springer, 2014, pp. 346–361. [39] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support
[18] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE inter- vector machines for multiple-instance learning,” in Ad-
national conference on computer vision, 2015, pp. 1440–1448. vances in neural information processing systems, 2003, pp.
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: 577–584.
Towards real-time object detection with region proposal [40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
networks,” in Advances in neural information processing classification with deep convolutional neural networks,”
systems, 2015, pp. 91–99. in Advances in neural information processing systems, 2012,
[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You pp. 1097–1105.
only look once: Unified, real-time object detection,” in [41] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-
Proceedings of the IEEE conference on computer vision and based convolutional networks for accurate object de-
pattern recognition, 2016, pp. 779–788. tection and segmentation,” IEEE transactions on pattern
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. analysis and machine intelligence, vol. 38, no. 1, pp. 142–
Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” 158, 2016.
in European conference on computer vision. Springer, 2016, [42] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W.
pp. 21–37. Smeulders, “Segmentation as selective search for object
[22] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, recognition,” in Computer Vision (ICCV), 2011 IEEE Inter-
and S. J. Belongie, “Feature pyramid networks for object national Conference on. IEEE, 2011, pp. 1879–1886.
detection.” in CVPR, vol. 1, no. 2, 2017, p. 4. [43] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester,
[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Discriminatively trained deformable part models, re-
“Focal loss for dense object detection,” IEEE transactions lease 5,” http://people.cs.uchicago.edu/ rbg/latent-
on pattern analysis and machine intelligence, 2018. release5/.
[24] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, [44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn:
and M. Pietikäinen, “Deep learning for generic object de- towards real-time object detection with region proposal
tection: A survey,” arXiv preprint arXiv:1809.02165, 2018. networks,” IEEE Transactions on Pattern Analysis & Ma-
[25] S. Agarwal, J. O. D. Terrail, and F. Jurie, “Recent advances chine Intelligence, no. 6, pp. 1137–1149, 2017.
in object detection in the age of deep convolutional neural [45] M. D. Zeiler and R. Fergus, “Visualizing and understand-
networks,” arXiv preprint arXiv:1809.03193, 2018. ing convolutional networks,” in European conference on
[26] A. Andreopoulos and J. K. Tsotsos, “50 years of object computer vision. Springer, 2014, pp. 818–833.
recognition: Directions forward,” Computer vision and im- [46] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via
age understanding, vol. 117, no. 8, pp. 827–891, 2013. region-based fully convolutional networks,” in Advances
[27] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, in neural information processing systems, 2016, pp. 379–387.
A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama [47] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun,
et al., “Speed/accuracy trade-offs for modern convolu- “Light-head r-cnn: In defense of two-stage object detec-
tional object detectors,” in IEEE CVPR, vol. 4, 2017. tor,” arXiv preprint arXiv:1711.07264, 2017.
[28] K. Grauman and B. Leibe, “Visual object recognition [48] J. Redmon and A. Farhadi, “Yolo9000: better, faster,
(synthesis lectures on artificial intelligence and machine stronger,” arXiv preprint, 2017.
learning),” Morgan & Claypool, 2011. [49] ——, “Yolov3: An incremental improvement,” arXiv
[29] C. P. Papageorgiou, M. Oren, and T. Poggio, “A general preprint arXiv:1804.02767, 2018.
framework for object detection,” in Computer vision, 1998. [50] M. Everingham, L. Van Gool, C. K. Williams, J. Winn,
sixth international conference on. IEEE, 1998, pp. 555–562. and A. Zisserman, “The pascal visual object classes (voc)
[30] C. Papageorgiou and T. Poggio, “A trainable system for challenge,” International journal of computer vision, vol. 88,
object detection,” International journal of computer vision, no. 2, pp. 303–338, 2010.
vol. 38, no. 1, pp. 15–33, 2000. [51] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams,
[31] A. Mohan, C. Papageorgiou, and T. Poggio, “Example- J. Winn, and A. Zisserman, “The pascal visual object
based object detection in images by components,” IEEE classes challenge: A retrospective,” International journal of
Transactions on Pattern Analysis & Machine Intelligence, computer vision, vol. 111, no. 1, pp. 98–136, 2015.
no. 4, pp. 349–361, 2001. [52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
[32] Y. Freund, R. Schapire, and N. Abe, “A short introduction S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein
to boosting,” Journal-Japanese Society For Artificial Intelli- et al., “Imagenet large scale visual recognition challenge,”
gence, vol. 14, no. 771-780, p. 1612, 1999. International Journal of Computer Vision, vol. 115, no. 3, pp.
[33] D. G. Lowe, “Object recognition from local scale-invariant 211–252, 2015.
features,” in Computer vision, 1999. The proceedings of the [53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona,
seventh IEEE international conference on, vol. 2. Ieee, 1999, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco:
pp. 1150–1157. Common objects in context,” in European conference on
[34] ——, “Distinctive image features from scale-invariant computer vision. Springer, 2014, pp. 740–755.
keypoints,” International journal of computer vision, vol. 60, [54] M. A. Sadeghi and D. Forsyth, “30hz object detection
no. 2, pp. 91–110, 2004. with dpm v5,” in European Conference on Computer Vision.
[35] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and Springer, 2014, pp. 65–79.
object recognition using shape contexts,” CALIFORNIA [55] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-
UNIV SAN DIEGO LA JOLLA DEPT OF COMPUTER shot refinement neural network for object detection,” in
SCIENCE AND ENGINEERING, Tech. Rep., 2002. IEEE CVPR, 2018.
[36] T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble [56] Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware
of exemplar-svms for object detection and beyond,” in trident networks for object detection,” arXiv preprint
Computer Vision (ICCV), 2011 IEEE International Conference arXiv:1901.01892, 2019.
on. IEEE, 2011, pp. 89–96. [57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
[37] R. B. Girshick, P. F. Felzenszwalb, and D. A. Mcallester, “Imagenet: A large-scale hierarchical image database,” in
“Object detection with grammar models,” in Advances in Computer Vision and Pattern Recognition, 2009. CVPR 2009.
Neural Information Processing Systems, 2011, pp. 442–450. IEEE Conference on. Ieee, 2009, pp. 248–255.
[38] R. B. Girshick, From rigid templates to grammars: Object [58] I. Krasin and T. e. a. Duerig, “Openimages: A
public dataset for large-scale multi-label and multi- European Conference on Computer Vision. Springer, 2010,
class image classification.” Dataset available from pp. 591–604.
https://storage.googleapis.com/openimages/web/index.html, [75] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts
2017. of arbitrary orientations in natural images,” in 2012 IEEE
[59] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Conference on Computer Vision and Pattern Recognition.
detection: A benchmark,” in Computer Vision and Pattern IEEE, 2012, pp. 1083–1090.
Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, [76] A. Mishra, K. Alahari, and C. Jawahar, “Scene text recog-
2009, pp. 304–311. nition using higher order language priors,” in BMVC-
[60] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian British Machine Vision Conference. BMVA, 2012.
detection: An evaluation of the state of the art,” IEEE [77] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zis-
transactions on pattern analysis and machine intelligence, serman, “Synthetic data and artificial neural net-
vol. 34, no. 4, pp. 743–761, 2012. works for natural scene text recognition,” arXiv preprint
[61] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for arXiv:1406.2227, 2014.
autonomous driving? the kitti vision benchmark suite,” [78] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Be-
in Computer Vision and Pattern Recognition (CVPR), 2012 longie, “Coco-text: Dataset and benchmark for text de-
IEEE Conference on. IEEE, 2012, pp. 3354–3361. tection and recognition in natural images,” arXiv preprint
[62] S. Zhang, R. Benenson, and B. Schiele, “Citypersons: A arXiv:1601.07140, 2016.
diverse dataset for pedestrian detection,” in The IEEE [79] R. De Charette and F. Nashashibi, “Real time visual
Conference on Computer Vision and Pattern Recognition traffic lights recognition based on spot light detection and
(CVPR), vol. 1, no. 2, 2017, p. 3. adaptive traffic lights templates,” in Intelligent Vehicles
[63] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, Symposium, 2009 IEEE. IEEE, 2009, pp. 358–363.
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The [80] A. Møgelmose, M. M. Trivedi, and T. B. Moeslund,
cityscapes dataset for semantic urban scene understand- “Vision-based traffic sign detection and analysis for in-
ing,” in Proceedings of the IEEE conference on computer telligent driver assistance systems: Perspectives and sur-
vision and pattern recognition, 2016, pp. 3213–3223. vey.” IEEE Trans. Intelligent Transportation Systems, vol. 13,
[64] M. Braun, S. Krebs, F. Flohr, and D. M. Gavrila, “The no. 4, pp. 1484–1497, 2012.
eurocity persons dataset: A novel benchmark for object [81] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and
detection,” arXiv preprint arXiv:1805.07193, 2018. C. Igel, “Detection of traffic signs in real-world images:
[65] V. Jain and E. Learned-Miller, “Fddb: A benchmark The german traffic sign detection benchmark,” in Neural
for face detection in unconstrained settings,” Technical Networks (IJCNN), The 2013 International Joint Conference
Report UM-CS-2010-009, University of Massachusetts, on. IEEE, 2013, pp. 1–8.
Amherst, Tech. Rep., 2010. [82] R. Timofte, K. Zimmermann, and L. Van Gool, “Multi-
[66] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, view traffic sign detection, recognition, and 3d localisa-
“Annotated facial landmarks in the wild: A large-scale, tion,” Machine vision and applications, vol. 25, no. 3, pp.
real-world database for facial landmark localization,” in 633–647, 2014.
Computer Vision Workshops (ICCV Workshops), 2011 IEEE [83] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu,
International Conference on. IEEE, 2011, pp. 2144–2151. “Traffic-sign detection and classification in the wild,” in
[67] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, Proceedings of the IEEE Conference on Computer Vision and
K. Allen, P. Grother, A. Mah, and A. K. Jain, “Pushing the Pattern Recognition, 2016, pp. 2110–2118.
frontiers of unconstrained face detection and recognition: [84] K. Behrendt, L. Novak, and R. Botros, “A deep learning
Iarpa janus benchmark a,” in Proceedings of the IEEE approach to traffic lights: Detection, tracking, and clas-
conference on computer vision and pattern recognition, 2015, sification,” in Robotics and Automation (ICRA), 2017 IEEE
pp. 1931–1939. International Conference on. IEEE, 2017, pp. 1370–1377.
[68] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: [85] G. Heitz and D. Koller, “Learning spatial context: Using
A face detection benchmark,” in Proceedings of the IEEE stuff to find things,” in European conference on computer
conference on computer vision and pattern recognition, 2016, vision. Springer, 2008, pp. 30–43.
pp. 5525–5533. [86] F. Tanner, B. Colder, C. Pullen, D. Heagy, M. Eppolito,
[69] H. Nada, V. A. Sindagi, H. Zhang, and V. M. Patel, V. Carlan, C. Oertel, and P. Sallee, “Overhead imagery
“Pushing the limits of unconstrained face detection: a research data setan annotated data library & tools to aid
challenge dataset and baseline results,” arXiv preprint in the development of computer vision algorithms,” in
arXiv:1804.10275, 2018. 2009 IEEE Applied Imagery Pattern Recognition Workshop
[70] M. K. Yucel, Y. C. Bilge, O. Oguz, N. Ikizler-Cinbis, (AIPR 2009). IEEE, 2009, pp. 1–8.
P. Duygulu, and R. G. Cinbis, “Wildest faces: Face de- [87] K. Liu and G. Mattyus, “Fast multiclass vehicle detec-
tection and recognition in violent settings,” arXiv preprint tion on aerial images.” IEEE Geosci. Remote Sensing Lett.,
arXiv:1805.07566, 2018. vol. 12, no. 9, pp. 1938–1942, 2015.
[71] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and [88] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Ori-
R. Young, “Icdar 2003 robust reading competitions,” in entation robust object detection in aerial images using
null. IEEE, 2003, p. 682. deep convolutional neural network,” in Image Processing
[72] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, (ICIP), 2015 IEEE International Conference on. IEEE, 2015,
A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. pp. 3735–3739.
Chandrasekhar, S. Lu et al., “Icdar 2015 competition on [89] S. Razakarivony and F. Jurie, “Vehicle detection in aerial
robust reading,” in Document Analysis and Recognition imagery: A small target detection benchmark,” Journal of
(ICDAR), 2015 13th International Conference on. IEEE, Visual Communication and Image Representation, vol. 34, pp.
2015, pp. 1156–1160. 187–203, 2016.
[73] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, [90] G. Cheng and J. Han, “A survey on object detection in
S. Lu, and X. Bai, “Icdar2017 competition on reading optical remote sensing images,” ISPRS Journal of Pho-
chinese text in the wild (rctw-17),” in Document Analysis togrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
and Recognition (ICDAR), 2017 14th IAPR International [91] Z. Zou and Z. Shi, “Random access memories: A new
Conference on, vol. 1. IEEE, 2017, pp. 1429–1434. paradigm for target detection in high resolution aerial
[74] K. Wang and S. Belongie, “Word spotting in the wild,” in remote sensing images,” IEEE Transactions on Image Pro-
cessing, vol. 27, no. 3, pp. 1100–1111, 2018. and A. L. Yuille, “Semantic image segmentation with
[92] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, deep convolutional nets and fully connected crfs,” arXiv
M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale preprint arXiv:1412.7062, 2014.
dataset for object detection in aerial images,” in Proc. [112] C. Garcia and M. Delakis, “A neural architecture for fast
CVPR, 2018. and robust face detection,” in Pattern Recognition, 2002.
[93] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, Proceedings. 16th International Conference on, vol. 2. IEEE,
M. Klaric, Y. Bulatov, and B. McCord, “xview: Ob- 2002, pp. 44–47.
jects in context in overhead imagery,” arXiv preprint [113] M. Osadchy, M. L. Miller, and Y. L. Cun, “Synergistic face
arXiv:1802.07856, 2018. detection and pose estimation with energy-based mod-
[94] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan, “Local- els,” in Advances in Neural Information Processing Systems,
ization recall precision (lrp): A new performance metric 2005, pp. 1017–1024.
for object detection,” in European Conference on Computer [114] S. J. Nowlan and J. C. Platt, “A convolutional neural
Vision (ECCV), vol. 6, 2018. network hand tracker,” Advances in neural information
[95] M. Turk and A. Pentland, “Eigenfaces for recognition,” processing systems, pp. 901–908, 1995.
Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, [115] T. Malisiewicz, Exemplar-based representations for object
1991. detection, association and beyond. Carnegie Mellon Uni-
[96] R. Vaillant, C. Monrocq, and Y. Le Cun, “Original ap- versity, 2011.
proach for the localisation of objects in images,” IEE [116] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?”
Proceedings-Vision, Image and Signal Processing, vol. 141, in Computer Vision and Pattern Recognition (CVPR), 2010
no. 4, pp. 245–250, 1994. IEEE Conference on. IEEE, 2010, pp. 73–80.
[97] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient- [117] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W.
based learning applied to document recognition,” Pro- Smeulders, “Selective search for object recognition,” In-
ceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. ternational journal of computer vision, vol. 104, no. 2, pp.
[98] I. Biederman, “Recognition-by-components: a theory 154–171, 2013.
of human image understanding.” Psychological review, [118] J. Carreira and C. Sminchisescu, “Constrained parametric
vol. 94, no. 2, p. 115, 1987. min-cuts for automatic object segmentation,” in Computer
[99] M. A. Fischler and R. A. Elschlager, “The representation Vision and Pattern Recognition (CVPR), 2010 IEEE Confer-
and matching of pictorial structures,” IEEE Transactions ence on. IEEE, 2010, pp. 3241–3248.
on computers, vol. 100, no. 1, pp. 67–92, 1973. [119] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and
[100] B. Leibe, A. Leonardis, and B. Schiele, “Robust object J. Malik, “Multiscale combinatorial grouping,” in Proceed-
detection with interleaved categorization and segmenta- ings of the IEEE conference on computer vision and pattern
tion,” International journal of computer vision, vol. 77, no. recognition, 2014, pp. 328–335.
1-3, pp. 259–289, 2008. [120] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the ob-
[101] D. M. Gavrila and V. Philomin, “Real-time object detec- jectness of image windows,” IEEE transactions on pattern
tion for” smart” vehicles,” in Computer Vision, 1999. The analysis and machine intelligence, vol. 34, no. 11, pp. 2189–
Proceedings of the Seventh IEEE International Conference on, 2202, 2012.
vol. 1. IEEE, 1999, pp. 87–93. [121] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing:
[102] B. Wu and R. Nevatia, “Detection of multiple, partially Binarized normed gradients for objectness estimation at
occluded humans in a single image by bayesian combi- 300fps,” in Proceedings of the IEEE conference on computer
nation of edgelet part detectors,” in null. IEEE, 2005, pp. vision and pattern recognition, 2014, pp. 3286–3293.
90–97. [122] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object
[103] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, proposals from edges,” in European conference on computer
and Y. LeCun, “Overfeat: Integrated recognition, localiza- vision. Springer, 2014, pp. 391–405.
tion and detection using convolutional networks,” arXiv [123] C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe,
preprint arXiv:1312.6229, 2013. “Scalable, high-quality object detection,” arXiv preprint
[104] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural net- arXiv:1412.1441, 2014.
works for object detection,” in Advances in neural informa- [124] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scal-
tion processing systems, 2013, pp. 2553–2561. able object detection using deep neural networks,” in
[105] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A uni- Proceedings of the IEEE Conference on Computer Vision and
fied multi-scale deep convolutional neural network for Pattern Recognition, 2014, pp. 2147–2154.
fast object detection,” in European Conference on Computer [125] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and
Vision. Springer, 2016, pp. 354–370. L. Van Gool, “Deepproposal: Hunting objects by cas-
[106] A. Pentland, B. Moghaddam, T. Starner et al., “View- cading deep convolutional layers,” in Proceedings of the
based and modular eigenspaces for face recognition,” IEEE International Conference on Computer Vision, 2015, pp.
1994. 2578–2586.
[107] G. Yang and T. S. Huang, “Human face detection in a [126] W. Kuo, B. Hariharan, and J. Malik, “Deepbox: Learning
complex background,” Pattern recognition, vol. 27, no. 1, objectness with convolutional networks,” in Proceedings of
pp. 53–63, 1994. the IEEE International Conference on Computer Vision, 2015,
[108] I. Craw, D. Tock, and A. Bennett, “Finding face features,” pp. 2479–2487.
in European Conference on Computer Vision. Springer, 1992, [127] S. Gidaris and N. Komodakis, “Attend refine repeat:
pp. 92–96. Active box proposal generation via in-out localization,”
[109] R. Xiao, L. Zhu, and H.-J. Zhang, “Boosting chain learn- arXiv preprint arXiv:1606.04446, 2016.
ing for object detection,” in Computer Vision, 2003. Pro- [128] H. Li, Y. Liu, W. Ouyang, and X. Wang, “Zoom out-and-
ceedings. Ninth IEEE International Conference on. IEEE, in network with recursive training for object proposal,”
2003, pp. 709–715. arXiv preprint arXiv:1702.05711, 2017.
[110] J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- [129] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What
tional networks for semantic segmentation,” in Proceed- makes for effective detection proposals?” IEEE transac-
ings of the IEEE conference on computer vision and pattern tions on pattern analysis and machine intelligence, vol. 38,
recognition, 2015, pp. 3431–3440. no. 4, pp. 814–830, 2016.
[111] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, [130] J. Hosang, R. Benenson, and B. Schiele, “How
good are detection proposals, really?” arXiv preprint S. Yan, “Attentive contexts for object detection,” IEEE
arXiv:1406.6962, 2014. Transactions on Multimedia, vol. 19, no. 5, pp. 944–954,
[131] J. Carreira and C. Sminchisescu, “Cpmc: Automatic object 2017.
segmentation using constrained parametric min-cuts,” [150] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and
IEEE Transactions on Pattern Analysis & Machine Intelli- S. Yan, “Contextualizing object detection and classifica-
gence, no. 7, pp. 1312–1328, 2011. tion,” IEEE transactions on pattern analysis and machine
[132] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, intelligence, vol. 37, no. 1, pp. 13–27, 2015.
“Object-proposal evaluation protocol is’ gameable’,” in [151] S. Gupta, B. Hariharan, and J. Malik, “Exploring person
Proceedings of the IEEE conference on computer vision and context and local scene context for object detection,”
pattern recognition, 2016, pp. 835–844. arXiv preprint arXiv:1511.08177, 2015.
[133] K. Lenc and A. Vedaldi, “R-cnn minus r,” arXiv preprint [152] X. Chen and A. Gupta, “Spatial memory for con-
arXiv:1506.06981, 2015. text reasoning in object detection,” arXiv preprint
[134] P.-A. Savalle, S. Tsogkas, G. Papandreou, and I. Kokkinos, arXiv:1704.04224, 2017.
“Deformable part models with cnn features,” in European [153] Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure infer-
Conference on Computer Vision, Parts and Attributes Work- ence net: Object detection using scene-level context and
shop, 2014. instance-level relationships,” in Proceedings of the IEEE
[135] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part- Conference on Computer Vision and Pattern Recognition,
based r-cnns for fine-grained category detection,” in Eu- 2018, pp. 6985–6994.
ropean conference on computer vision. Springer, 2014, pp. [154] J. H. Hosang, R. Benenson, and B. Schiele, “Learning non-
834–849. maximum suppression.” in CVPR, 2017, pp. 6469–6477.
[136] L. Wan, D. Eigen, and R. Fergus, “End-to-end integration [155] P. Henderson and V. Ferrari, “End-to-end training of
of a convolution network, deformable parts model and object class detectors for mean average precision,” in
non-maximum suppression,” in Proceedings of the IEEE Asian Conference on Computer Vision. Springer, 2016, pp.
Conference on Computer Vision and Pattern Recognition, 198–213.
2015, pp. 851–859. [156] R. Rothe, M. Guillaumin, and L. Van Gool, “Non-
[137] R. Girshick, F. Iandola, T. Darrell, and J. Malik, “De- maximum suppression for object detection by passing
formable part models are convolutional neural net- messages between windows,” in Asian Conference on Com-
works,” in Proceedings of the IEEE conference on Computer puter Vision. Springer, 2014, pp. 290–306.
Vision and Pattern Recognition, 2015, pp. 437–446. [157] D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko,
[138] B. Li, T. Wu, S. Shao, L. Zhang, and R. Chu, “Object and T. Darrell, “Spatial semantic regularisation for large
detection via end-to-end integration of aspect ratio and scale object detection,” in Proceedings of the IEEE interna-
context aware part-based models and fully convolutional tional conference on computer vision, 2015, pp. 2003–2011.
networks,” arXiv preprint arXiv:1612.00534, 2016. [158] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-
[139] A. Torralba and P. Sinha, “Detecting faces in impov- nmsimproving object detection with one line of code,” in
erished images,” MASSACHUSETTS INST OF TECH Computer Vision (ICCV), 2017 IEEE International Conference
CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB, Tech. on. IEEE, 2017, pp. 5562–5570.
Rep., 2001. [159] L. Tychsen-Smith and L. Petersson, “Improving object
[140] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, localization with fitness nms and bounded iou loss,”
S. Chintala, and P. Dollár, “A multipath network for arXiv preprint arXiv:1711.00164, 2017.
object detection,” arXiv preprint arXiv:1604.02135, 2016. [160] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and
[141] X. Zeng, W. Ouyang, B. Yang, J. Yan, and X. Wang, “Gated M. Hebert, “An empirical study of context in object de-
bi-directional cnn for object detection,” in European Con- tection,” in Computer Vision and Pattern Recognition, 2009.
ference on Computer Vision. Springer, 2016, pp. 354–369. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1271–
[142] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, 1278.
Y. Zhou, B. Yang, Z. Wang et al., “Crafting gbd-net for [161] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, “R-cnn for small
object detection,” IEEE transactions on pattern analysis and object detection,” in Asian conference on computer vision.
machine intelligence, vol. 40, no. 9, pp. 2109–2123, 2018. Springer, 2016, pp. 214–230.
[143] W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Learning [162] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation
chained deep features and classifiers for cascade in object networks for object detection,” in Computer Vision and
detection,” arXiv preprint arXiv:1702.07054, 2017. Pattern Recognition (CVPR), vol. 2, no. 3, 2018.
[144] S. Gidaris and N. Komodakis, “Object detection via [163] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition
a multi-region and semantic segmentation-aware cnn of localization confidence for accurate object detection,”
model,” in Proceedings of the IEEE International Conference in Proceedings of the European Conference on Computer Vi-
on Computer Vision, 2015, pp. 1134–1142. sion, Munich, Germany, 2018, pp. 8–14.
[145] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu et al., [164] H. A. Rowley, S. Baluja, and T. Kanade, “Human face de-
“Couplenet: Coupling global structure with local parts tection in visual scenes,” in Advances in Neural Information
for object detection,” in Proc. of Intl Conf. on Computer Processing Systems, 1996, pp. 875–881.
Vision (ICCV), vol. 2, 2017. [165] L. Zhang, L. Lin, X. Liang, and K. He, “Is faster r-
[146] C. Desai, D. Ramanan, and C. C. Fowlkes, “Discrimina- cnn doing well for pedestrian detection?” in European
tive models for multi-class object layout,” International Conference on Computer Vision. Springer, 2016, pp. 443–
journal of computer vision, vol. 95, no. 1, pp. 1–12, 2011. 457.
[147] Z. Li, Y. Chen, G. Yu, and Y. Deng, “R-fcn++: Towards [166] A. Shrivastava, A. Gupta, and R. Girshick, “Training
accurate region-based fully convolutional networks for region-based object detectors with online hard example
object detection.” in AAAI, 2018. mining,” in Proceedings of the IEEE Conference on Computer
[148] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, Vision and Pattern Recognition, 2016, pp. 761–769.
“Inside-outside net: Detecting objects in context with skip [167] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, “Vehicle
pooling and recurrent neural networks,” in Proceedings of detection in aerial images based on region convolutional
the IEEE conference on computer vision and pattern recogni- neural networks and hard negative example mining,”
tion, 2016, pp. 2874–2883. Sensors, vol. 17, no. 2, p. 336, 2017.
[149] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and [168] X. Sun, P. Wu, and S. C. Hoi, “Face detection using deep
learning: An improved faster rcnn approach,” Neurocom- tems, 1990, pp. 598–605.
puting, vol. 299, pp. 42–50, 2018. [187] S. Han, H. Mao, and W. J. Dally, “Deep compres-
[169] J. Jin, K. Fu, and C. Zhang, “Traffic sign recognition with sion: Compressing deep neural networks with pruning,
hinge loss trained convolutional neural networks,” IEEE trained quantization and huffman coding,” arXiv preprint
Transactions on Intelligent Transportation Systems, vol. 15, arXiv:1510.00149, 2015.
no. 5, pp. 1991–2000, 2014. [188] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf,
[170] M. Zhou, M. Jing, D. Liu, Z. Xia, Z. Zou, and Z. Shi, “Pruning filters for efficient convnets,” arXiv preprint
“Multi-resolution networks for ship detection in infrared arXiv:1608.08710, 2016.
remote sensing images,” Infrared Physics & Technology, [189] G. Huang, S. Liu, L. van der Maaten, and K. Q.
2018. Weinberger, “Condensenet: An efficient densenet using
[171] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral learned group convolutions,” group, vol. 3, no. 12, p. 11,
channel features,” 2009. 2017.
[172] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast [190] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi,
feature pyramids for object detection,” IEEE Transactions “Xnor-net: Imagenet classification using binary convolu-
on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, tional neural networks,” in European Conference on Com-
pp. 1532–1545, 2014. puter Vision. Springer, 2016, pp. 525–542.
[173] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, [191] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary
“Pedestrian detection at 100 frames per second,” in Com- convolutional neural network,” in Advances in Neural
puter Vision and Pattern Recognition (CVPR), 2012 IEEE Information Processing Systems, 2017, pp. 345–353.
Conference on. IEEE, 2012, pp. 2903–2910. [192] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and
[174] S. Maji, A. C. Berg, and J. Malik, “Classification using Y. Bengio, “Binarized neural networks,” in Advances in
intersection kernel support vector machines is efficient,” neural information processing systems, 2016, pp. 4107–4115.
in Computer Vision and Pattern Recognition, 2008. CVPR [193] G. Hinton, O. Vinyals, and J. Dean, “Distilling
2008. IEEE Conference on. IEEE, 2008, pp. 1–8. the knowledge in a neural network,” arXiv preprint
[175] A. Vedaldi and A. Zisserman, “Sparse kernel approxima- arXiv:1503.02531, 2015.
tions for efficient classification and detection,” in Com- [194] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,
puter Vision and Pattern Recognition (CVPR), 2012 IEEE and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv
Conference on. IEEE, 2012, pp. 2320–2327. preprint arXiv:1412.6550, 2014.
[176] F. Fleuret and D. Geman, “Coarse-to-fine face detection,” [195] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker,
International Journal of computer vision, vol. 41, no. 1-2, pp. “Learning efficient object detection models with knowl-
85–107, 2001. edge distillation,” in Advances in Neural Information Pro-
[177] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, “Fast hu- cessing Systems, 2017, pp. 742–751.
man detection using a cascade of histograms of oriented [196] Q. Li, S. Jin, and J. Yan, “Mimicking very efficient network
gradients,” in Computer Vision and Pattern Recognition, for object detection,” in 2017 IEEE Conference on Computer
2006 IEEE Computer Society Conference on, vol. 2. IEEE, Vision and Pattern Recognition (CVPR). IEEE, 2017, pp.
2006, pp. 1491–1498. 7341–7349.
[178] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, [197] K. He and J. Sun, “Convolutional neural networks at con-
“Multiple kernels for object detection,” in Computer Vi- strained time cost,” in Proceedings of the IEEE conference
sion, 2009 IEEE 12th International Conference on. IEEE, on computer vision and pattern recognition, 2015, pp. 5353–
2009, pp. 606–613. 5360.
[179] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A con- [198] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wo-
volutional neural network cascade for face detection,” in jna, “Rethinking the inception architecture for computer
Proceedings of the IEEE Conference on Computer Vision and vision,” in Proceedings of the IEEE conference on computer
Pattern Recognition, 2015, pp. 5325–5334. vision and pattern recognition, 2016, pp. 2818–2826.
[180] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face [199] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large
detection and alignment using multitask cascaded convo- kernel mattersimprove semantic segmentation by global
lutional networks,” IEEE Signal Processing Letters, vol. 23, convolutional network,” in Computer Vision and Pattern
no. 10, pp. 1499–1503, 2016. Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017,
[181] Z. Cai, M. Saberian, and N. Vasconcelos, “Learning pp. 1743–1751.
complexity-aware cascades for deep pedestrian detec- [200] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park,
tion,” in Proceedings of the IEEE International Conference “Pvanet: deep but lightweight neural networks for real-
on Computer Vision, 2015, pp. 3361–3369. time object detection,” arXiv preprint arXiv:1608.08021,
[182] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Craft objects from 2016.
images,” in Proceedings of the IEEE Conference on Computer [201] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient
Vision and Pattern Recognition, 2016, pp. 6043–6051. and accurate approximations of nonlinear convolutional
[183] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast networks,” in Proceedings of the IEEE Conference on Com-
and accurate cnn object detector with scale dependent puter Vision and Pattern Recognition, 2015, pp. 1984–1992.
pooling and cascaded rejection classifiers,” in Proceedings [202] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very
of the IEEE conference on computer vision and pattern recog- deep convolutional networks for classification and de-
nition, 2016, pp. 2129–2137. tection,” IEEE transactions on pattern analysis and machine
[184] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis, intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
“Dynamic zoom-in network for fast object detection in [203] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet:
large images,” in IEEE Conference on Computer Vision and An extremely efficient convolutional neural network for
Pattern Recognition (CVPR), 2018. mobile devices,” 2017.
[185] W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Chained [204] F. Chollet, “Xception: Deep learning with depthwise
cascade network for object detection,” in Computer Vision separable convolutions,” arXiv preprint, pp. 1610–02 357,
(ICCV), 2017 IEEE International Conference on. IEEE, 2017, 2017.
pp. 1956–1964. [205] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko,
[186] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mo-
damage,” in Advances in neural information processing sys- bilenets: Efficient convolutional neural networks for mo-
[206] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018, pp. 4510–4520.
[207] Y. Li, J. Li, W. Lin, and J. Li, "Tiny-dsod: Lightweight object detection for resource-restricted usages," arXiv preprint arXiv:1807.11013, 2018.
[208] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[209] R. J. Wang, X. Li, S. Ao, and C. X. Ling, "Pelee: A real-time object detection system on mobile devices," arXiv preprint arXiv:1804.06882, 2018.
[210] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[211] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, "Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," in CVPR Workshops, 2017, pp. 446–454.
[212] T. Kong, A. Yao, Y. Chen, and F. Sun, "Hypernet: Towards accurate region proposal generation and joint object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 845–853.
[213] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[214] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[215] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun, "Detnas: Neural architecture search on object detection," arXiv preprint arXiv:1903.10979, 2019.
[216] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei, "Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation," arXiv preprint arXiv:1901.02985, 2019.
[217] X. Chu, B. Zhang, R. Xu, and H. Ma, "Multi-objective reinforced evolution in mobile neural architecture search," arXiv preprint arXiv:1901.01074, 2019.
[218] C.-H. Hsu, S.-H. Chang, D.-C. Juan, J.-Y. Pan, Y.-T. Chen, W. Wei, and S.-C. Chang, "Monas: Multi-objective neural architecture search using reinforcement learning," arXiv preprint arXiv:1806.10332, 2018.
[219] P. Simard, L. Bottou, P. Haffner, and Y. LeCun, "Boxlets: a fast convolution algorithm for signal processing and neural networks," in Advances in Neural Information Processing Systems, 1999, pp. 571–577.
[220] X. Wang, T. X. Han, and S. Yan, "An hog-lbp human detector with partial occlusion handling," in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 32–39.
[221] F. Porikli, "Integral histogram: A fast way to extract histograms in cartesian spaces," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 829–836.
[222] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through ffts," arXiv preprint arXiv:1312.5851, 2013.
[223] H. Pratt, B. Williams, F. Coenen, and Y. Zheng, "Fcnn: Fourier convolutional neural networks," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017, pp. 786–798.
[224] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A gpu performance evaluation," arXiv preprint arXiv:1412.7580, 2014.
[225] O. Rippel, J. Snoek, and R. P. Adams, "Spectral representations for convolutional neural networks," in Advances in Neural Information Processing Systems, 2015, pp. 2449–2457.
[226] C. Dubout and F. Fleuret, "Exact acceleration of linear object detectors," in European Conference on Computer Vision. Springer, 2012, pp. 301–311.
[227] M. A. Sadeghi and D. Forsyth, "Fast template evaluation with vector quantization," in Advances in Neural Information Processing Systems, 2013, pp. 2949–2957.
[228] I. Kokkinos, "Bounding part scores for rapid detection with deformable part models," in European Conference on Computer Vision. Springer, 2012, pp. 41–50.
[229] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, L. Wang, G. Wang et al., "Recent advances in convolutional neural networks," arXiv preprint arXiv:1512.07108, 2015.
[230] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[231] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[232] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[233] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in AAAI, vol. 4, 2017, p. 12.
[234] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[235] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, vol. 1, no. 2, 2017, p. 3.
[236] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," arXiv preprint arXiv:1709.01507, vol. 7, 2017.
[237] P. Zhou, B. Ni, C. Geng, J. Hu, and Y. Xu, "Scale-transferrable object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 528–537.
[238] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, "Dsod: Learning deeply supervised object detectors from scratch," in The IEEE International Conference on Computer Vision (ICCV), vol. 3, no. 6, 2017, p. 7.
[239] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 5987–5995.
[240] J. Jeong, H. Park, and N. Kwak, "Enhancement of ssd by concatenating feature maps for object detection," arXiv preprint arXiv:1705.09587, 2017.
[241] K. Lee, J. Choi, J. Jeong, and N. Kwak, "Residual features and unified prediction network for single stage detection," arXiv preprint arXiv:1707.05031, 2017.
[242] G. Cao, X. Xie, W. Yang, Q. Liao, G. Shi, and J. Wu, "Feature-fused ssd: fast detection for small objects," in Ninth International Conference on Graphic and Image Processing (ICGIP 2017), vol. 10615. International Society for Optics and Photonics, 2018, p. 106151E.
[243] L. Zheng, C. Fu, and Y. Zhao, "Extend the shallow part of single shot multibox detector via convolutional neural network," arXiv preprint arXiv:1801.05918, 2018.
[244] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, "Beyond skip connections: Top-down modulation for object detection," arXiv preprint arXiv:1612.06851, 2016.
[245] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, "Ron: Reverse connection with objectness prior networks for object detection," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2017, p. 2.
[246] S. Woo, S. Hwang, and I. S. Kweon, "Stairnet: Top-down semantic aggregation for accurate one shot detection," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1093–1102.
[247] Y. Chen, J. Li, B. Zhou, J. Feng, and S. Yan, "Weaving multi-scale context for single shot detector," arXiv preprint arXiv:1712.03149, 2017.
[248] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2018–2025.
[249] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "Dssd: Deconvolutional single shot detector," arXiv preprint arXiv:1701.06659, 2017.
[250] J. Wang, Y. Yuan, and G. Yu, "Face attention network: An effective face detector for the occluded faces," arXiv preprint arXiv:1711.07246, 2017.
[251] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li, "Single shot text detector with regional attention," in The IEEE International Conference on Computer Vision (ICCV), vol. 6, no. 7, 2017.
[252] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[253] F. Yu, V. Koltun, and T. A. Funkhouser, "Dilated residual networks," in CVPR, vol. 2, 2017, p. 3.
[254] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, "Detnet: A backbone network for object detection," arXiv preprint arXiv:1804.06215, 2018.
[255] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," arXiv preprint arXiv:1711.07767, 2017.
[256] M. Najibi, M. Rastegari, and L. S. Davis, "G-cnn: an iterative grid based object detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2369–2377.
[257] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon, "Attentionnet: Aggregating weak directions for accurate object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2659–2667.
[258] Y. Lu, T. Javidi, and S. Lazebnik, "Adaptive object detection using adjacency and zoom prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2351–2359.
[259] R. Ranjan, V. M. Patel, and R. Chellappa, "Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 121–135, 2019.
[260] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Real-time multi-person 2d pose estimation using part affinity fields," arXiv preprint arXiv:1611.08050, 2016.
[261] H. Law and J. Deng, "Cornernet: Detecting objects as paired keypoints," in Proceedings of the European Conference on Computer Vision (ECCV), vol. 6, 2018.
[262] Z. Cai and N. Vasconcelos, "Cascade r-cnn: Delving into high quality object detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2018, p. 10.
[263] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi, "Refinenet: Iterative refinement for accurate object localization," in Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE, 2016, pp. 1528–1533.
[264] M.-C. Roh and J.-y. Lee, "Refining faster-rcnn for accurate object detection," in Machine Vision Applications (MVA), 2017 Fifteenth IAPR International Conference on. IEEE, 2017, pp. 514–517.
[265] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, "Unitbox: An advanced object detection network," in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 516–520.
[266] S. Gidaris and N. Komodakis, "Locnet: Improving localization accuracy for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 789–798.
[267] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, p. 607, 1996.
[268] A. J. Bell and T. J. Sejnowski, "The independent components of natural scenes are edge filters," Vision Research, vol. 37, no. 23, pp. 3327–3338, 1997.
[269] S. Brahmbhatt, H. I. Christensen, and J. Hays, "Stuffnet: Using stuff to improve object detection," in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 934–943.
[270] A. Shrivastava and A. Gupta, "Contextual priming and feedback for faster r-cnn," in European Conference on Computer Vision. Springer, 2016, pp. 330–348.
[271] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille, "Single-shot object detection with enriched semantics," Center for Brains, Minds and Machines (CBMM), Tech. Rep., 2018.
[272] B. Cai, Z. Jiang, H. Zhang, Y. Yao, and S. Nie, "Online exemplar-based fully convolutional network for aircraft detection in remote sensing images," IEEE Geoscience and Remote Sensing Letters, no. 99, pp. 1–5, 2018.
[273] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 98, pp. 119–132, 2014.
[274] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri, "Transformation invariance in pattern recognition: tangent distance and tangent propagation," in Neural Networks: Tricks of the Trade. Springer, 1998, pp. 239–274.
[275] G. Cheng, P. Zhou, and J. Han, "Rifd-cnn: Rotation-invariant and fisher discriminative convolutional neural networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2884–2893.
[276] ——, "Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp. 7405–7415, 2016.
[277] X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen, "Real-time rotation-invariant face detection with progressive calibration networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2295–2303.
[278] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[279] D. Chen, G. Hua, F. Wen, and J. Sun, "Supervised transformer network for efficient face detection," in European Conference on Computer Vision. Springer, 2016, pp. 122–138.
[280] B. Singh and L. S. Davis, "An analysis of scale invariance in object detection – snip," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.
[281] B. Singh, M. Najibi, and L. S. Davis, "Sniper: Efficient multi-scale training," arXiv preprint arXiv:1805.09300, 2018.
[282] S. Qiao, W. Shen, W. Qiu, C. Liu, and A. L. Yuille, "Scalenet: Guiding object proposal generation in supermarkets and beyond," in ICCV, 2017, pp. 1809–1818.
[283] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu, "Scale-aware face detection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
[284] R. Zhu, S. Zhang, X. Wang, L. Wen, H. Shi, L. Bo, and T. Mei, "Scratchdet: Exploring to train single-shot object detectors from scratch," arXiv preprint arXiv:1810.08425, 2018.
[285] K. He, R. Girshick, and P. Dollár, "Rethinking imagenet pre-training," arXiv preprint arXiv:1811.08883, 2018.
[286] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[287] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[288] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint, 2017.
[289] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in CVPR, vol. 2, no. 3, 2017, p. 4.
[290] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, "Perceptual generative adversarial networks for small object detection," in IEEE CVPR, 2017.
[291] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, "Sod-mtgan: Small object detection via multi-task generative adversarial network," Computer Vision – ECCV, pp. 8–14, 2018.
[292] X. Wang, A. Shrivastava, and A. Gupta, "A-fast-rcnn: Hard positive generation via adversary for object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[293] S.-T. Chen, C. Cornelius, J. Martin, and D. H. Chau, "Robust physical adversarial attack on faster r-cnn object detector," arXiv preprint arXiv:1804.05810, 2018.
[294] R. G. Cinbis, J. Verbeek, and C. Schmid, "Weakly supervised object localization with multi-fold multiple instance learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 189–203, 2017.
[295] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari, "We don't need no bounding-boxes: Training object class detectors using only human verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 854–863.
[296] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1-2, pp. 31–71, 1997.
[297] Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, "Soft proposal networks for weakly supervised object localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 1841–1850.
[298] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool, "Weakly supervised cascaded convolutional networks," in CVPR, vol. 1, no. 2, 2017, p. 8.
[299] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[300] H. Bilen and A. Vedaldi, "Weakly supervised deep detection networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
[301] L. Bazzani, A. Bergamo, D. Anguelov, and L. Torresani, "Self-taught object localization with deep networks," in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 2016, pp. 1–9.
[302] Y. Shen, R. Ji, S. Zhang, W. Zuo, and Y. Wang, "Generative adversarial learning towards fast weakly supervised detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5764–5773.
[303] M. Enzweiler and D. M. Gavrila, "Monocular pedestrian detection: Survey and experiments," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 12, pp. 2179–2195, 2008.
[304] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, "Survey of pedestrian detection for advanced driver assistance systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1239–1258, 2010.
[305] R. Benenson, M. Omran, J. Hosang, and B. Schiele, "Ten years of pedestrian detection, what have we learned?" in European Conference on Computer Vision. Springer, 2014, pp. 613–627.
[306] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, "How far are we from solving pedestrian detection?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1259–1267.
[307] ——, "Towards reaching human performance in pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 973–986, 2018.
[308] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," International Journal of Computer Vision, vol. 63, no. 2, pp. 153–161, 2005.
[309] P. Sabzmeydani and G. Mori, "Detecting pedestrians by learning shapelet features," in Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[310] J. Cao, Y. Pang, and X. Li, "Pedestrian detection inspired by appearance constancy and shape symmetry," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1316–1324.
[311] R. Benenson, R. Timofte, and L. Van Gool, "Stixels estimation without depth map computation," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 2010–2017.
[312] J. Hosang, M. Omran, R. Benenson, and B. Schiele, "Taking a deeper look at pedestrians," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4073–4082.
[313] J. Cao, Y. Pang, and X. Li, "Learning multilayer channel features for pedestrian detection," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3210–3220, 2017.
[314] J. Mao, T. Xiao, Y. Jiang, and Z. Cao, "What can help pedestrian detection?" in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 6034–6043.
[315] Q. Hu, P. Wang, C. Shen, A. van den Hengel, and F. Porikli, "Pushing the limits of deep cnns for pedestrian detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 6, pp. 1358–1368, 2018.
[316] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5079–5087.
[317] D. Xu, W. Ouyang, E. Ricci, X. Wang, and N. Sebe, "Learning cross-modal deep representations for robust pedestrian detection," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[318] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen, "Repulsion loss: Detecting pedestrians in a crowd," arXiv preprint arXiv:1711.07752, 2017.
[319] Y. Tian, P. Luo, X. Wang, and X. Tang, "Deep learning strong parts for pedestrian detection," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1904–1912.
[320] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, and X. Wang, "Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1874–1887, 2018.
[321] S. Zhang, J. Yang, and B. Schiele, "Occluded pedestrian detection through guided attention in cnns," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6995–7003.
[322] P. Hu and D. Ramanan, "Finding tiny faces," in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 1522–1530.
[323] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.
[324] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.
[325] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, 1998.
[326] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: an application to face detection," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on. IEEE, 1997, pp. 130–136.
[327] M. Osadchy, Y. L. Cun, and M. L. Miller, "Synergistic face detection and pose estimation with energy-based models," Journal of Machine Learning Research, vol. 8, no. May, pp. 1197–1215, 2007.
[328] S. Yang, P. Luo, C. C. Loy, and X. Tang, "Faceness-net: Face detection through deep facial part responses," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1845–1859, 2018.
[329] S. Yang, Y. Xiong, C. C. Loy, and X. Tang, "Face detection through scale-friendly deep convolutional networks," arXiv preprint arXiv:1706.02863, 2017.
[330] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, "Ssh: Single stage headless face detector," in ICCV, 2017, pp. 4885–4894.
[331] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "S^3fd: Single shot scale-invariant face detector," in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 192–201.
[332] X. Liu, "A camera phone based currency reader for the visually impaired," in Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 2008, pp. 305–306.
[333] N. Ezaki, K. Kiyota, B. T. Minh, M. Bulacu, and L. Schomaker, "Improved text-detection methods for a camera-based text reading system for blind persons," in Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on. IEEE, 2005, pp. 257–261.
[334] P. Sermanet, S. Chintala, and Y. LeCun, "Convolutional neural networks applied to house numbers digit classification," in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3288–3291.
[335] Z. Wojna, A. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz, "Attention-based extraction of structured information from street view imagery," arXiv preprint arXiv:1704.03549, 2017.
[336] Y. Liu and L. Jin, "Deep matching prior network: Toward tighter multi-oriented text detection," in Proc. CVPR, 2017, pp. 3454–3461.
[337] Y. Wu and P. Natarajan, "Self-organized text detection with minimal post-processing via border learning," in Proc. ICCV, 2017.
[338] Y. Zhu, C. Yao, and X. Bai, "Scene text detection and recognition: Recent advances and future trends," Frontiers of Computer Science, vol. 10, no. 1, pp. 19–36, 2016.
[339] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
[340] L. Neumann and J. Matas, "Scene text localization and recognition with oriented stroke detection," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 97–104.
[341] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, "Robust text detection in natural scene images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 970–983, 2014.
[342] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1457–1464.
[343] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3304–3308.
[344] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. Lim Tan, "Text flow: A unified text detection system in natural scene images," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4651–4659.
[345] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in European Conference on Computer Vision. Springer, 2014, pp. 512–528.
[346] X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao, "Multi-orientation scene text detection with adaptive clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 9, pp. 1930–1937, 2015.
[347] Z. Zhang, W. Shen, C. Yao, and X. Bai, "Symmetry-based text line detection in natural scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2558–2567.
[348] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
[349] W. Huang, Y. Qiao, and X. Tang, "Robust scene text detection with convolution neural network induced mser trees," in European Conference on Computer Vision. Springer, 2014, pp. 497–511.
[350] T. He, W. Huang, Y. Qiao, and J. Yao, "Text-attentional convolutional neural network for scene text detection," IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2529–2541, 2016.
[351] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, "Arbitrary-oriented scene text detection via rotation proposals," IEEE Transactions on Multimedia, 2018.
[352] ——, "Arbitrary-oriented scene text detection via rotation proposals," IEEE Transactions on Multimedia, 2018.
[353] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo, "R2cnn: rotational region cnn for orientation robust scene text detection," arXiv preprint arXiv:1706.09579, 2017.
[354] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, "Textboxes: A fast text detector with a single deep neural network," in AAAI, 2017, pp. 4161–4167.
[355] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, "Deep direct regression for multi-oriented scene text detection," arXiv preprint arXiv:1703.08289, 2017.
[356] Y. Liu and L. Jin, "Deep matching prior network: Toward tighter multi-oriented text detection," in Proc. CVPR, 2017, pp. 3454–3461.
[357] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, "East: an efficient and accurate scene text detector," in Proc. CVPR, 2017, pp. 2642–2651.
[358] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao, "Scene text detection via holistic, multi-channel prediction," arXiv preprint arXiv:1606.09002, 2016.
[359] C. Xue, S. Lu, and F. Zhan, "Accurate scene text detection through border semantics awareness and bootstrapping," in European Conference on Computer Vision. Springer, 2018, pp. 370–387.
[360] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai, "Multi-oriented scene text detection via corner localization and region segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7553–7563.
[361] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, "Detecting text in natural image with connectionist text proposal network," in European Conference on Computer Vision. Springer, 2016, pp. 56–72.
[362] A. d. l. Escalera, L. Moreno, M. A. Salichs, and J. M. Armingol, "Road traffic sign detection and classification," 1997.
[363] D. M. Gavrila, U. Franke, C. Wohler, and S. Gorzig, "Real time vision for intelligent vehicles," IEEE Instrumentation & Measurement Magazine, vol. 4, no. 2, pp. 22–27, 2001.
[364] C. F. Paulo and P. L. Correia, "Automatic detection and classification of traffic signs," in Image Analysis for Multimedia Interactive Services, 2007. WIAMIS'07. Eighth International Workshop on. IEEE, 2007, pp. 11–11.
[365] A. De la Escalera, J. M. Armingol, and M. Mata, "Traffic sign recognition and analysis for intelligent vehicles," Image and Vision Computing, vol. 21, no. 3, pp. 247–258, 2003.
[366] W. Shadeed, D. I. Abu-Al-Nadi, and M. J. Mismar, "Road traffic sign detection in color images," in Electronics, Circuits and Systems, 2003. ICECS 2003. Proceedings of the 2003 10th IEEE International Conference on, vol. 2. IEEE, 2003, pp. 890–893.
[367] S. Maldonado-Bascón, S. Lafuente-Arroyo, P. Gil-Jimenez, H. Gómez-Moreno, and F. López-Ferreras, "Road-sign detection and recognition based on support vector machines," IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 2, pp. 264–278, 2007.
[368] M. Omachi and S. Omachi, "Traffic light detection with color and edge information," 2009.
[369] Y. Xie, L.-f. Liu, C.-h. Li, and Y.-y. Qu, "Unifying visual saliency with hog feature learning for traffic sign detection," in Intelligent Vehicles Symposium, 2009 IEEE. IEEE, 2009, pp. 24–29.
[370] S. Houben, "A single target voting scheme for traffic sign detection," in Intelligent Vehicles Symposium (IV), 2011 IEEE. IEEE, 2011, pp. 124–129.
[371] A. Soetedjo and K. Yamada, "Fast and robust traffic sign detection," in Systems, Man and Cybernetics, 2005 IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1341–1346.
[372] N. Fairfield and C. Urmson, "Traffic light mapping and detection," in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5421–5426.
[373] J. Levinson, J. Askeland, J. Dolson, and S. Thrun, "Traffic light mapping, localization, and state detection for autonomous vehicles," in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5784–5791.
[374] C. Bahlmann, Y. Zhu, V. Ramesh, M. Pellkofer, and T. Koehler, "A system for traffic sign detection, tracking, and recognition using color, shape, and motion information," in Intelligent Vehicles Symposium, 2005. Proceedings. IEEE. IEEE, 2005, pp. 255–260.
[375] I. M. Creusen, R. G. Wijnhoven, E. Herbschleb, and P. de With, "Color exploitation in hog-based traffic sign detection," in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 2669–2672.
[376] G. Wang, G. Ren, Z. Wu, Y. Zhao, and L. Jiang, "A robust, coarse-to-fine traffic sign detection method," in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–5.
[377] Z. Shi, Z. Zou, and C. Zhang, "Real-time traffic light detection with adaptive background suppression filter," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 3, pp. 690–700, 2016.
[378] Y. Lu, J. Lu, S. Zhang, and P. Hall, "Traffic signal detection and classification in street views using an attention model," Computational Visual Media, vol. 4, no. 3, pp. 253–266, 2018.
[379] M. Bach, D. Stumper, and K. Dietmayer, "Deep convolutional traffic light recognition for automated driving," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 851–858.
[380] S. Qiu, G. Wen, and Y. Fan, "Occluded object detection in high-resolution remote sensing images using partial configuration object model," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 5, pp. 1909–1925, 2017.
[381] Z. Zou and Z. Shi, "Ship detection in spaceborne optical image with svd networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 10, pp. 5832–5845, 2016.
[382] L. Zhang, L. Zhang, and B. Du, "Deep learning for remote sensing data: A technical tutorial on the state of the art," IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
[383] N. Proia and V. Pagé, "Characterization of a bayesian ship detection method in optical satellite images," IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 2, pp. 226–230, 2010.
[384] C. Zhu, H. Zhou, R. Wang, and J. Guo, "A novel hierarchical method of ship detection from spaceborne optical image based on shape and texture features," IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 9, pp. 3446–3456, 2010.
[385] S. Qi, J. Ma, J. Lin, Y. Li, and J. Tian, "Unsupervised ship detection based on saliency and s-hog descriptor from optical satellite images," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 7, pp. 1451–1455, 2015.
[386] F. Bi, B. Zhu, L. Gao, and M. Bian, "A visual search inspired computational model for ship detection in optical satellite images," IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 4, pp. 749–753, 2012.
[387] J. Han, P. Zhou, D. Zhang, G. Cheng, L. Guo, Z. Liu, S. Bu, and J. Wu, "Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 89, pp. 37–48, 2014.
[388] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3325–3337, 2015.
[389] J. Tang, C. Deng, G.-B. Huang, and B. Zhao, "Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3, pp. 1174–1185, 2015.
[390] Z. Shi, X. Yu, Z. Jiang, and B. Li, "Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, pp. 4511–4523, 2014.
[391] A. Kembhavi, D. Harwood, and L. S. Davis, "Vehicle detection using partial least squares," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1250–1265, 2011.
[392] L. Wan, L. Zheng, H. Huo, and T. Fang, "Affine invariant description and large-margin dimensionality reduction for target detection in optical remote sensing images," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 7, pp. 1116–1120, 2017.
[393] H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, "Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning," IEEE Transactions on Geoscience and Remote Sensing, no. 99, pp. 1–12, 2018.
[394] M. ElMikaty and T. Stathaki, "Detection of cars in high-resolution aerial images of complex urban environments," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5913–5924, 2017.
[395] L. Zhang, Z. Shi, and J. Wu, "A hierarchical oil tank detector with deep surrounding features for high-resolution optical satellite imagery," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 10, pp. 4895–4909, 2015.
[396] C. Zhu, B. Liu, Y. Zhou, Q. Yu, X. Liu, and W. Yu, "Framework design and implementation for oil tank detection in optical satellite imagery," in Geoscience and Remote Sensing Symposium (IGARSS), 2012 IEEE International. IEEE, 2012, pp. 6016–6019.
[397] G. Liu, Y. Zhang, X. Zheng, X. Sun, K. Fu, and H. Wang, "A new method on inshore ship detection in high-resolution satellite images using shape and context information," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 3, pp. 617–621, 2014.
[398] J. Xu, X. Sun, D. Zhang, and K. Fu, "Automatic detection of inshore ships in high-resolution remote sensing images using robust invariant generalized hough transform," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 12, pp. 2070–2074, 2014.
[399] J. Zhang, C. Tao, and Z. Zou, "An on-road vehicle detection method for high-resolution aerial images based on local and global structure learning," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 8, pp. 1198–1202, 2017.
[400] W. Diao, X. Sun, X. Zheng, F. Dou, H. Wang, and K. Fu, "Efficient saliency-based object detection in remote sensing images using deep belief networks," IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 2, pp. 137–141, 2016.
[401] P. Zhang, X. Niu, Y. Dou, and F. Xia, "Airport detection on optical satellite images using deep convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 8, pp. 1183–1187, 2017.
[402] Z. Shi and Z. Zou, "Can a machine generate humanlike language descriptions for a remote sensing image?" IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3623–3634, 2017.
[403] X. Han, Y. Zhong, and L. Zhang, "An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery," Remote Sensing, vol. 9, no. 7, p. 666, 2017.
[404] Z. Xu, X. Xu, L. Wang, R. Yang, and F. Pu, "Deformable convnet with aspect ratio constrained nms for object detection in remote sensing imagery," Remote Sensing, vol. 9, no. 12, p. 1312, 2017.
[405] W. Li, K. Fu, H. Sun, X. Sun, Z. Guo, M. Yan, and X. Zheng, "Integrated localization and recognition for inshore ships in large scene remote sensing images," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 6, pp. 936–940, 2017.
[406] O. A. Penatti, K. Nogueira, and J. A. dos Santos, "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 44–51.
[407] L. W. Sommer, T. Schuchert, and J. Beyerer, "Fast deep vehicle detection in aerial images," in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 311–319.
[408] L. Sommer, T. Schuchert, and J. Beyerer, "Comprehensive analysis of deep learning based vehicle detection in aerial images," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[409] Z. Liu, J. Hu, L. Weng, and Y. Yang, "Rotated region based cnn for ship detection," in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 900–904.
[410] H. Lin, Z. Shi, and Z. Zou, "Fully convolutional network with task partitioning for inshore ship detection in optical remote sensing images," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1665–1669, 2017.
[411] ——, "Maritime semantic labeling of optical remote sensing images with multi-scale fully convolutional network," Remote Sensing, vol. 9, no. 5, p. 480, 2017.