Object Detection With DL
Article history: Available online 8 November 2022

Keywords: Computer vision; Deep convolutional neural network; Object detection; Deep learning; Conventional methods

Abstract

In the realm of computer vision, Deep Convolutional Neural Networks (DCNNs) have demonstrated excellent performance. Video processing, object detection, image segmentation, image classification, speech recognition and natural language processing are some of the application areas of CNNs. Object detection is the most crucial and challenging task of computer vision, with numerous applications in the fields of security, military, transportation and medical sciences. In this review, object detection and its different aspects are covered in detail. With the gradual evolution of deep learning algorithms for detecting objects, a significant improvement in the performance of object detection models has been observed. However, this does not imply that the conventional object detection methods, which had been evolving for decades before the emergence of deep learning, have become outdated. There are some cases where conventional methods with global features are a superior choice. This review paper starts with a quick overview of object detection, followed by object detection frameworks, backbone convolutional neural networks, and an overview of common datasets along with the evaluation metrics. Object detection problems and applications are also studied in detail. Some future research challenges in designing deep neural networks are discussed. Lastly, the performance of object detection models on the PASCAL VOC and MS COCO datasets is compared and conclusions are drawn.

© 2022 Elsevier Inc. All rights reserved.
https://doi.org/10.1016/j.dsp.2022.103812
Fig. 1. Classification of generic object detection models. (a) Two-stage detectors from the period 2014 to 2017 [20,21,25–29]. (b) One-stage detectors from the period 2013 to 2020 [22,23,30–37].
jects. However, this approach has a high computational cost and also produces a large number of non-essential candidate windows.
     ii. Extraction of features – After locating an object, the process of feature extraction is carried out to provide a robust representation. Methods such as HOG [9], Haar-like features [10] and SIFT [11] are used to extract features for object recognition and to provide a meaningful representation. However, because of contrasting backgrounds, lighting conditions and perspective variances, it is extremely hard to manually build a comprehensive feature descriptor that correctly identifies all kinds of objects.
     iii. Classification – At this stage, a classifier such as AdaBoost [12] is used to identify the target objects and to make the models more organized and meaningful for visual perception.
     It is clear from the above points that in traditional methods, handcrafted features are not always adequate to correctly represent the objects. Along with this, the sliding window approach used for generating bounding boxes is computationally expensive and ineffective. The traditional techniques include HOG [9], SIFT [11], Haar [10], the VJ detector [13,14] and other algorithms such as [15,16]. HOG [9] takes a long time to recognize an object since it employs a sliding window approach to extract features [17]. The SIFT [11] algorithm is extremely slow, has a high computational cost and does not cope well with illumination changes [18]. For the VJ detector [13], the training duration is very long and it is limited to binary classification only [19]. Therefore, deep learning techniques are being utilized to overcome the problems of traditional methods.
  b. Deep learning based methods
     The advent of deep learning has the potential to address several limitations of conventional techniques. Lately, deep learning methods have become prominent for learning feature representations from data automatically. These approaches have significantly improved object detection. The deep learning based approaches include Faster RCNN [20,21], SSD [22], YOLO [23] and many more (refer to Section 2).

     The major strengths of the paper are as follows:

  1. The study examines state-of-the-art object detection models, providing an in-depth analysis of major object detectors along with their characteristics.
  2. The work provides a detailed explanation of backbone architectures. Furthermore, benchmark datasets and evaluation criteria are discussed and challenges are explored.
  3. A comprehensive performance comparison of different object detectors is provided on two popular datasets, namely the PASCAL VOC dataset and the COCO dataset.

     The rest of this paper is organized as follows.

  • Section 2 provides extensive details about object detection frameworks, two-stage detectors and one-stage detectors, along with their characteristics in tabular form.
  • Backbone architectures are described in Section 3 and their performance is compared and analyzed.
  • Section 4 discusses the popular datasets and criteria for assessing the performance of object detection algorithms.
  • Section 5 and Section 6 elaborate on various object detection problems and their applications.
  • Section 7 covers the future research areas.
  • Comparative results are presented in Section 8.
  • Finally, Section 9 draws the conclusion.

2. Object detection frameworks

   Considerable advancement has been made in the domain of generic object detection with the evolution of deep learning networks [24].
   Object detection is a fusion of object localization and object classification tasks. Because deep CNNs have high feature representation power, they are used in object detection architectures. The classification of object detection models is depicted in Fig. 1. There are two types of detectors: two-stage and one-stage detectors [1].

2.1. Two-stage object detectors: region based

   The two-stage object detection framework splits the task into object localization and object classification. In simpler terms, first the region proposals are generated, localizing the objects, and then each proposed region is classified into its particular category. This is the reason why it is called two-stage. Fig. 1(a) shows various two-stage object detectors. These architectures are also called region-based frameworks [2]. The main advantage of two-stage object detectors is their high detection accuracy; their disadvantage is slow detection speed. The features and characteristics of these detectors are explained below.

2.1.1. RCNN
   The region-based convolutional neural network (RCNN), proposed by [25], was an advanced step in using deep learning methods for the detection of objects [38]. Its architecture is shown in Fig. 2. The process of RCNN is explained below in four stages [2,6,39]:
   1st stage – Region proposals are extracted using the selective search method. Selective search identifies these regions based on varying scales, enclosures, textures, and color patterns. It extracts around 2000 regions from each image [39].
   2nd stage – All these region proposals are rescaled to the same image size to match the CNN input size, since the fully connected layers require fixed-length input vectors. The features of each candidate region are then extracted using the CNN.
   3rd stage – After the extraction of features, an SVM classifier is used to detect whether an object is present within each region.
   4th stage – Finally, for each identified object in an image, a tighter bounding box is generated around it using a linear regression model.
   Although RCNN has shown great improvement in object detection, it still has some limitations, such as slow detection, multi-stage pipeline training, and the rigidity of the selective search method.
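   To make the four-stage flow concrete, the following is a minimal, illustrative Python sketch of an RCNN-style inference loop. It is not the authors' implementation: selective_search, cnn, svm_classifiers, bbox_regressor and the 0.5 score threshold are hypothetical stand-ins for the components described above.

# Minimal sketch of an RCNN-style inference pipeline (illustrative only).
# `selective_search`, `cnn`, `svm_classifiers` and `bbox_regressor` are
# hypothetical stand-ins; RCNN itself uses AlexNet features with
# per-class SVMs and class-specific linear box regressors.
import torchvision.transforms.functional as TF

def rcnn_style_detect(image, selective_search, cnn, svm_classifiers, bbox_regressor):
    detections = []
    # Stage 1: ~2000 class-agnostic region proposals.
    for (x1, y1, x2, y2) in selective_search(image):
        # Stage 2: warp each proposal to the fixed CNN input size (227x227 in RCNN).
        crop = TF.resized_crop(image, y1, x1, y2 - y1, x2 - x1, [227, 227])
        feat = cnn(crop.unsqueeze(0))            # fixed-length feature vector
        # Stage 3: per-class SVM scores decide whether an object is present.
        scores = {c: svm(feat) for c, svm in svm_classifiers.items()}
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if score > 0.5:                          # arbitrary threshold for the sketch
            # Stage 4: class-specific linear regression tightens the box.
            detections.append((label, score, bbox_regressor(feat, (x1, y1, x2, y2))))
    return detections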
2.1.2. SPP-Net
   As RCNN generates 2000 region proposals per image, CNN feature extraction from these regions was the main bottleneck. The constraint of a fixed input size comes only from the fully connected layers [40]. To overcome this difficulty, [26] introduced a new technique called the Spatial Pyramid Pooling (SPP-Net) layer. The SPP layer is added on top of the final convolutional layer to produce fixed-length features for the fully connected layers, irrespective of the size of the RoI (region of interest) and without rescaling it, which can lead to information loss [4,40].
   By using the SPP layer, a great improvement in the speed of RCNN was obtained without any loss in detection quality. This is because the convolutional layers need to be run only once on the complete test image to create fixed-length features for region proposals of arbitrary size.
   The network structure of SPP-Net is demonstrated in Fig. 3. Here the output of the SPP layer is a 256×M-d vector, where 256 is the number of convolutional filters and M is the number of bins. The fully connected layer receives this fixed-length vector [2,26].

Fig. 3. Spatial Pyramid Pooling [26].

2.1.3. Fast RCNN
   Although SPP-Net outperforms RCNN in terms of efficiency and accuracy, it still roughly follows the same procedure as RCNN, including fine-tuning of the network, extraction of features, and bounding box regression [6]. Girshick, R. showed further improvement over RCNN and SPP-Net and put forward a new detector named Fast RCNN [27]. It allows end-to-end training of the detector, learning the softmax classifier and class-specific bounding box regression concurrently with a multi-task loss, rather than training them separately as in RCNN and SPP-Net. In Fast RCNN, rather than executing the CNN 2000 times per image, it is run only once to obtain all the regions of interest. An RoI pooling layer is then added between the final convolutional layer and the first fully connected layer so that a fixed-length feature vector is extracted for every region proposal [2,4,39]. The working of the Fast RCNN detector is as follows:
   1st – Fast RCNN takes a complete input image and passes it to a CNN to produce a feature map.
   2nd – Regions of interest (RoI) are generated using the selective search method.
   3rd – The RoI pooling layer is applied to each extracted RoI to generate a feature vector of fixed length. It ensures that all the regions have the same dimensions.
   4th – The extracted features are then sent to fully connected layers for categorization and localization, using a softmax layer and a linear regression layer at the same time.
   Fast RCNN consumes less computational time and has improved detection accuracy. However, it still relies on a traditional region proposal method, selective search, which makes it time consuming.

2.1.4. Faster RCNN
   Although Fast RCNN showed considerable advancement in speed and accuracy, it uses the selective search method to generate 2000 region proposals, which is a very slow process. Ren, S. et al. [20,21] worked on this issue and developed a new detector named Faster RCNN, the first end-to-end deep learning detector [41]. It improves the detection speed of Fast RCNN by replacing traditional region proposal algorithms such as selective search [42], multiscale combinatorial grouping [43] or edge boxes [44] with a CNN called the Region Proposal Network (RPN). The procedure for Faster RCNN is as follows:
   a) A CNN takes an image as input and provides the feature maps of the image as output.
   b) The RPN is applied to the generated feature maps, returning the object proposals (RoIs) as well as their objectness scores.
   c) Once the RoIs are extracted, an RoI pooling layer is applied to bring all the proposals to a fixed dimension.
   d) The derived feature vectors are supplied to a succession of fully connected layers with a softmax layer and a regression layer at the top, to classify objects and output their bounding boxes [39].
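   The pooling ideas behind SPP-Net and Fast RCNN can be sketched in a few lines of PyTorch. This is an illustration under assumed shapes, not code from the original papers: multi-level adaptive pooling stands in for the SPP layer, and torchvision's roi_pool stands in for Fast RCNN's single-level RoI pooling.

import torch
import torch.nn.functional as F
from torchvision.ops import roi_pool

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    # SPP: pool an (N, C, H, W) feature map into a fixed-length vector.
    # With C = 256 filters and M = 1 + 4 + 16 = 21 bins, the output is
    # 256*M-d, matching the 256 x M description of SPP-Net above.
    pooled = [F.adaptive_max_pool2d(fmap, (n, n)).flatten(1) for n in levels]
    return torch.cat(pooled, dim=1)              # (N, C * sum(n*n))

fmap = torch.randn(1, 256, 13, 17)               # arbitrary spatial size
vec = spatial_pyramid_pool(fmap)                 # always (1, 256*21)

# Fast RCNN-style RoI pooling: one box (x1, y1, x2, y2) in feature-map
# coordinates, prefixed by its batch index, pooled to a fixed 7x7 grid.
rois = torch.tensor([[0, 2.0, 3.0, 11.0, 12.0]])
roi_feats = roi_pool(fmap, rois, output_size=(7, 7))   # (1, 256, 7, 7)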
Fig. 4. Region Proposal Network [20].

Fig. 5. Feature Pyramid Network [28]. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
Table 1
Characteristics of Two-Stage Object Detection Models.

RCNN [25], 2014. Input size: fixed; Backbone: AlexNet; Region proposals: selective search; Optimization: SGD, BP; Loss: hinge loss, bounding box regressor loss.
  Strengths: 1. First neural network based on region proposals for higher detection quality. 2. Improved performance over traditional state-of-the-art methods.
  Shortcomings: 1. Training is costly, requiring a huge amount of space and time. 2. Multi-stage pipeline training. 3. The CNN is applied repeatedly to about 2000 image regions, so feature extraction is the main time constraint in testing. 4. Extracting 2000 image regions is a difficult task, as features are extracted for every image region.

SPP-Net [26], 2014. Input size: arbitrary; Backbone: ZFNet; Region proposals: selective search; Optimization: SGD; Loss: hinge loss, bounding box regressor loss.
  Strengths: 1. Extracts the features of the entire image at once. 2. Outputs fixed-length features regardless of image size. 3. Faster than RCNN.
  Shortcomings: 1. The architecture is similar to RCNN, so it shares RCNN's drawbacks. 2. No end-to-end training.

Fast RCNN [27], 2015. Input size: arbitrary; Backbone: AlexNet, VGGM, VGG16; Region proposals: selective search; Optimization: SGD; Loss: classification loss, bounding box regression loss.
  Strengths: 1. First detector trained end to end. 2. Faster and more accurate than RCNN and SPP-Net. 3. Single-stage training network. 4. RoI pooling layer used.
  Shortcomings: 1. Sluggish for real-time applications because of selective search. 2. Region proposal computation is a bottleneck. 3. Not end to end once external proposal generation is included.

Faster RCNN [20,21], 2015. Input size: arbitrary; Backbone: ZFNet, VGG16; Region proposals: Region Proposal Network; Optimization: SGD; Loss: classification loss (class log loss), bounding box regression loss.
  Strengths: 1. Introduces the RPN, which generates nearly cost-free region proposals. 2. Established translation-invariant and multi-scale anchors. 3. An integrated network comprising the RPN and Fast RCNN with shared convolutional layers. 4. Provides end-to-end training.
  Shortcomings: 1. Training is complex; inefficient for real-time applications. 2. Lacks performance for small and multi-scale objects. 3. Speed is slow.

Feature Pyramid Network [28], 2017. Input size: arbitrary; Backbone: ResNet50, ResNet101; Region proposals: Region Proposal Network; Optimization: synchronized SGD; Loss: classification loss (class log loss), bounding box regression loss.
  Strengths: 1. Multi-level feature fusion (FPN) is designed. 2. An accurate solution to multi-scale object detection. 3. Follows a top-down structure with lateral connections.
  Shortcomings: 1. A pyramid representation is still required to tackle multi-scale challenges. 2. Speed is still the bottleneck for detection; cannot fulfill real-time needs.

Mask RCNN [29], 2017. Input size: arbitrary; Backbone: ResNet101, ResNeXt101; Region proposals: Region Proposal Network; Optimization: SGD; Loss: classification loss, bounding box regression loss, mask loss (average binary cross-entropy loss).
  Strengths: 1. An RoIAlign pooling layer is used rather than RoI pooling, increasing detection accuracy. 2. Simple and flexible architecture for object instance segmentation. 3. Pixel-to-pixel alignment is carried out.
  Shortcomings: 1. Detection speed is too low to satisfy real-time requirements.
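   Table 1 credits Faster RCNN with translation-invariant, multi-scale anchors. A minimal sketch of how such anchors are commonly laid out (3 scales x 3 aspect ratios per feature-map cell, mapped back to image coordinates by the feature stride) is given below; the stride and sizes are illustrative assumptions, not the paper's exact settings.

import torch

def make_anchors(fh, fw, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Generate Faster RCNN-style anchors (x1, y1, x2, y2) for an fh x fw
    # feature map: 3 scales x 3 ratios = 9 anchors centered on every cell.
    anchors = []
    for cy in range(fh):
        for cx in range(fw):
            # Center of this cell in input-image coordinates.
            x, y = (cx + 0.5) * stride, (cy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Width/height chosen so the anchor area stays s*s.
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append([x - w / 2, y - h / 2, x + w / 2, y + h / 2])
    return torch.tensor(anchors)                 # (fh * fw * 9, 4)

print(make_anchors(2, 3).shape)                  # torch.Size([54, 4])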
2.2. One-stage object detectors: regression/classification based

   One-stage object detection frameworks locate and categorize objects simultaneously using DCNNs, without partitioning the task into two portions. These are also called region proposal free frameworks. Several one-stage detectors are shown in Fig. 1(b).
   Only one pass through the neural network is needed. A feed-forward neural network predicts all the bounding boxes at one time [7], mapping image pixels directly to bounding box coordinates and class probabilities [1,6]. One-stage object detectors are described below.

2.2.1. DetectorNet
   Szegedy, C. et al. [30] implemented the DetectorNet framework as a regression problem. It is capable of learning features for classification and acquiring some geometric information. It uses AlexNet as a backbone network, with the softmax layer replaced by a regression layer. To predict the foreground pixels, DetectorNet splits the input image into a coarse grid. It has a very slow training process, as the network has to be trained for each object type and mask type. Also, DetectorNet cannot handle multiple objects of the same class. When used in conjunction with a multi-scale coarse-to-fine method, DNN-based object mask regression produces excellent results [2,30,45].

2.2.2. OverFeat
   Sermanet, P. et al. [31] presented a unified structure for using convolutional networks for localization, classification and detection using a multi-scale sliding window approach. It is one of the most powerful object detection frameworks; applied to the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC), it ranked first in detection and localization [31]. It is the first fully convolutional deep network based one-stage detector that detects objects using a single forward pass through fully convolutional layers. OverFeat acts as a base model for algorithms that emerged later, namely YOLO and its versions, SSD, etc. The primary difference is that in OverFeat the classifiers and regressors are trained in succession [2].

2.2.3. YOLO
   You Only Look Once (YOLO) is a single-stage object detector designed by Redmon, J. et al. [23], where object detection is carried out as a regression problem. It predicts the coordinates of the bounding boxes for the objects and determines the likelihood of the category to which each belongs. Due to the use of only a single network, end-to-end optimization can be achieved [53]. It predicts the detections directly using a limited selection of candidate regions. Unlike region based approaches, which employ features from a specific region, YOLO uses features from the whole image [2].
   In YOLO object detection, an image is divided into an S × S grid; each grid cell predicts five values (x, y, w, h and a confidence score). The confidence score of an individual object is based on probability. This score is given to every class, and whichever class has the highest probability is given precedence.
The parameters width (w) and height (h) of the bounding box are predicted with respect to the size of the object. From the overlapping bounding boxes, the box having the highest IoU is selected and the remaining boxes are removed [45].
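   The grid encoding just described can be decoded into boxes in a few lines. The sketch below is deliberately simplified (one box per cell, no class scores), an illustrative assumption rather than any YOLO version's actual output head.

import torch

def decode_yolo_grid(pred, S=7, conf_thresh=0.25):
    # Decode a YOLO-style (S, S, 5) tensor of (x, y, w, h, confidence)
    # per cell into image-relative boxes. x, y are offsets inside the
    # cell; w, h are relative to the whole image.
    boxes = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf = pred[row, col].tolist()
            if conf < conf_thresh:
                continue
            # Cell offset -> image-relative center coordinates.
            cx, cy = (col + x) / S, (row + y) / S
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, conf))
    return boxes

print(len(decode_yolo_grid(torch.rand(7, 7, 5))))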
2.2.4. SSD
   SSD, a fast single-shot multi-box detector for several classes, was implemented by Liu, W., et al. [22]. It builds a unified detector framework that is as fast as YOLO and as accurate as Faster RCNN. The design of SSD combines the regression idea from YOLO's model and the anchor procedure from Faster R-CNN's algorithm. By using YOLO's regression, SSD reduces the computational complexity of neural networks to assure real-time performance. With the anchor procedure, SSD is capable of extracting features of various sizes and aspect ratios to ensure detection accuracy [54]. SSD uses VGG-16 as a backbone network.
   The process of SSD is based on a feed-forward CNN that generates bounding boxes of fixed size and objectness scores for the existence of object class instances in those boxes, and then applies NMS (non-maximum suppression) to produce the final detections [22]. It also uses the concept of the RPN to attain fast detection speed while maintaining high detection quality [2]. With some auxiliary data augmentation and hard negative mining approaches, SSD accomplished state-of-the-art performance on various benchmark datasets [47].

2.2.5. YOLOv2
   YOLOv2, an enhanced version of YOLOv1 [23], was introduced by Redmon, J. et al. [32]. In this version, ideas such as batch normalization, convolution with anchor boxes, a high-resolution classifier, fine-grained features, and multi-scale training are applied to improve YOLO's performance. It uses Darknet-19 as a backbone classifier, containing 19 convolutional layers and 5 max-pooling layers, which requires fewer operations to analyze an image while achieving the best accuracy [24].

2.2.6. YOLOv3
   YOLOv3 [33] is an incremental development of YOLOv2 [32] that uses logistic regression to estimate an objectness score for each bounding box. A bounding box may contain multiple classes, and to predict those classes, multi-label classification is used. It also uses binary cross-entropy loss, data augmentation techniques, and batch normalization. YOLOv3 uses a robust feature extractor called Darknet-53 [24,33,47].

2.2.7. YOLOv4
   YOLOv4 [36] is a state-of-the-art object detector that is more accurate and faster than all the previous versions of YOLO [23,32,33]. It includes a set of methods called the "Bag of freebies", which increases the training time without influencing the inference time. It exploits data augmentation techniques, self-adversarial training, Cross mini-Batch Normalization (CmBN), CIoU loss [55], DropBlock regularization [56] and a cosine annealing scheduler [57] to improve training. YOLOv4 also incorporates methods that solely impact the inference time, known as the "Bag of specials"; this includes Mish activation, Multi-input Weighted Residual Connections (MiWRC), the SPP block [26], the PAN path-aggregation block [58], Cross-Stage Partial connections (CSP) [59] and the Spatial Attention Module block. YOLOv4 can be trained on a single GPU and uses a genetic algorithm to select hyper-parameters [36].

2.2.8. YOLOv5
   Soon after the release of YOLOv4, the Ultralytics company launched the YOLOv5 repository with considerable enhancements over previous YOLO models [60]. As YOLOv5 was not published as peer-reviewed research, it created many debates about its legitimacy [34]; still, it is being used in various applications, giving effective results and establishing the reliability of the model. It operates at an inference speed of 140 fps. YOLOv5 uses PyTorch, which makes deployment of the model faster, easier and more accurate [60]. Although the YOLOv4 and YOLOv5 frameworks are similar, which makes comparing them hard, YOLOv5 has attained higher performance than YOLOv4 under certain situations. There are five types of YOLOv5 model: nano, small, medium, large, and extra-large. The type of model is chosen according to the dataset. Further, a lightweight version of the YOLOv5 model was released with version 6.0, with an improved inference speed of 1666 fps [35,60].
   The characteristics of one-stage object detection models are described in Table 2. It provides concise details for each object detector, covering the same parameters as Table 1 except the region proposal method.
   Finally, it can be concluded that the YOLOv5 model acts as a good detector for small objects and is the fastest model compared to the other object detectors. For the detection of large objects, any object detector can be used. If results are required in real time, then any one-stage object detector can be used, but if accuracy is the main concern, then Faster RCNN (a two-stage object detector) is a good choice.
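   Both the overlap-removal step in YOLO (Section 2.2.3) and the final NMS stage in SSD rest on the same operation: keep the highest-scoring box and suppress boxes that overlap it beyond an IoU threshold. A minimal sketch using torchvision's nms follows; the boxes, scores and threshold are made-up illustrative values.

import torch
from torchvision.ops import nms

# Overlapping candidate boxes (x1, y1, x2, y2) with confidence scores.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],     # heavy overlap with box 0
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.9, 0.8, 0.75])

# Keep the highest-scoring box and suppress any box whose IoU with an
# already-kept box exceeds the threshold.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)                                      # tensor([0, 2])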
Table 2
Characteristics of One-Stage Object Detection Models.

DetectorNet [30], 2013. Input size: arbitrary; Backbone: AlexNet; Optimization: stochastic gradient descent with AdaGrad; Loss: least square error (L2 loss).
  Strengths: 1. Multi-scale inference method that produces object detections of high resolution. 2. Represents strong geometric information. 3. Simple model with a high detection rate over a large number of objects; can be conveniently applied to a wide variety of classes.
  Shortcomings: 1. Training is expensive. 2. Cannot deal with multiple objects of the same class.

OverFeat [31], 2013. Input size: arbitrary; Backbone: AlexNet; Optimization: SGD; Loss: least square error (L2 loss).
  Strengths: 1. Multi-scale sliding window approach used for classification, localization and detection. 2. Winner of the ILSVRC2013 localization task. 3. Fast due to sharing of convolutional features.
  Shortcomings: 1. A single bounding box regressor per class. 2. Unable to deal with multiple instances of the same class. 3. Multi-stage pipeline trained sequentially.

YOLO v1 [23], 2016. Input size: fixed; Backbone: GoogLeNet; Optimization: SGD; Loss: sum squared error (classification loss, localization loss, confidence loss).
  Strengths: 1. First unified end-to-end framework. 2. Completely removes the concept of region proposals. 3. Real-time object detection.
  Shortcomings: 1. Difficult to localize low-resolution objects. 2. Less flexible. 3. Cannot predict more than one box per grid region without anchor boxes.

SSD [22], 2016. Input size: fixed; Backbone: VGG-16; Optimization: SGD; Loss: confidence loss (categorical cross-entropy) + localization loss (regression loss).
  Strengths: 1. Multi-scale feature maps enhance object detection at different spatial levels. 2. Faster than YOLO and on par with Faster RCNN.
  Shortcomings: 1. Performs poorly when detecting small objects. 2. Small objects can only be identified in higher-resolution layers, but these layers carry low-level features, such as edges, that are not very effective for classification.

YOLO v2 [32], 2017. Input size: fixed; Backbone: Darknet-19; Optimization: SGD; Loss: sum squared error.
  Strengths: 1. Faster and stronger than YOLO v1. 2. Batch normalization. 3. A high-resolution classifier to increase accuracy. 4. The k-means clustering algorithm is used to yield anchor boxes. 5. Multi-scale training.
  Shortcomings: 1. Difficulty in detecting small objects. 2. Complex training.

YOLO v3 [33], 2018. Input size: fixed; Backbone: Darknet-53; Optimization: SGD; Loss: binary cross-entropy.
  Strengths: 1. Multi-level feature fusion to boost multi-scale detection accuracy. 2. Detections are done at feature maps of different sizes to detect features at different scales.
  Shortcomings: 1. May not be ideal for niche models where large datasets can be hard to obtain. 2. Not suitable for detecting small objects.

YOLO v4 [36], 2020. Input size: fixed; Backbone: CSPDarknet-53; Optimization: SGD; Loss: binary cross-entropy.
  Strengths: 1. Introduces Mosaic data augmentation. 2. Bag of Freebies (BoF) and Bag of Specials (BoS) are used for the backbone and for detection. 3. Hyper-parameters are selected using genetic algorithms.
  Shortcomings: –

YOLO v5 [34,35], 2020. Input size: fixed; Backbone: Focus structure, CSP network; Optimization: SGD; Loss: binary cross-entropy with logits loss.
  Strengths: 1. Faster than YOLOv4. 2. Detects objects in real time with great accuracy.
  Shortcomings: –
3. Backbone networks

   DCNNs serve as the backbone networks of object detection models. To improve the feature representation capability, the structure of the network becomes more complex, which means the network gets deeper and its parameters increase. A backbone CNN is used to extract features in DCNN-based object detection systems [1,38].
   The backbone network acts as the primary feature extractor for an object detection method, taking images as input and generating feature maps as output for each input image. According to the need for accuracy and efficiency, densely connected backbones such as ResNet [61] and ResNeXt [62] can be used. Complex backbones are required when there is a need for high precision and accurate applications [24].
   Before the deep learning paradigm, constructing feature descriptors required extensive effort and expertise. In contrast, CNNs incorporate the capability of learning features through their abstract hierarchical layers. In this section, some common backbone CNN architectures are discussed [45].

3.1. AlexNet

   AlexNet [63] is an important CNN architecture consisting of five convolutional layers and three fully connected layers. Given an input image of fixed size (224 × 224), the network convolves over it repeatedly and pools the activations, then transmits the result to the fully connected layers. The network was trained on ImageNet and combines several methods of regularization, such as data augmentation and dropout. To accelerate data processing and increase convergence speed, the ReLU activation function and GPUs were used for the first time. This ultimately paid off, and AlexNet turned out to be the first CNN to win the ILSVRC2012 competition, with great accuracy and a huge drop in the error rate [45,63]. The triumph of the AlexNet architecture is based on the following mechanics [1]:

   • The Rectified Linear Unit (ReLU) activation function is used instead of sigmoid and tanh.
   • Multi-GPU processing is used to speed up network training.
   • To enlarge the dataset, augmentation techniques such as random clipping and color illumination transformations are used.
   • The dropout regularization method is used during training to remove a portion of the neurons, which brings down the chances of overfitting.

3.2. ZFNet

   After the success of AlexNet, researchers wanted to understand the mechanism behind the convolutional layers: how a CNN learns features, and how the feature maps differ from layer to layer. A method was therefore designed by Zeiler, M. D. et al. [64] to visualize the feature maps using deconvolutional layers, unpooling layers and ReLU non-linearities. In AlexNet, the filter size of the first layer is 11×11 with a stride of 4, but in ZFNet it is reduced to 7×7 and the stride is set to 2. The reasoning was that the first-layer filters capture mainly high and low frequency information, with only a very small percentage of mid frequencies. ZFNet performs better than AlexNet and proved that the depth of the network influences the performance of deep learning models [1,64,65].

3.3. VGGNet

   VGG [66] further enlarges the depth of AlexNet to 16-19 layers, which refines the feature representation of the network. VGG16 and VGG19 are two popular VGG network architectures. In each layer, VGG employs a kernel of size 3×3 with a stride of 1. A small kernel and stride are more favorable for extracting the details of an object's location in the image, and have the benefit of expanding the network's depth through additional convolutional layers. Minimizing the parameters leads to an improved feature representation ability of the network [1,5].
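   The effect of stacking small kernels can be sketched as follows; this is an illustrative VGG-style block, not VGG's published configuration. Two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution while using fewer weights (2 × 9C² versus 25C² for C channels) and adding an extra non-linearity.

import torch.nn as nn

# VGG-style block: stacking small 3x3 convolutions deepens the network
# while keeping parameters low; two 3x3 layers see a 5x5 receptive field.
def vgg_block(in_ch, out_ch, n_convs=2):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halve resolution
    return nn.Sequential(*layers)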
3.4. GoogLeNet or Inception v1

   The main aim of GoogLeNet [67], a.k.a. the Inception v1 architecture, was to achieve high accuracy while decreasing the computational cost. By adding 1×1 convolutional layers to the network, its depth is increased. This filter size was first used in the technique named Network-in-Network [68], mainly as a dimensionality reduction to remove computational bottlenecks while increasing the width and depth of the network [67]. GoogLeNet is a 22-layer deep architecture and the winner of the ILSVRC 2014 competition. Based on this idea, the authors developed an inception module [67] with dimensionality reductions. By using inception modules, the number of GoogLeNet parameters is decreased in contrast to [63,64,66]. The inception module comprises 1×1, 3×3 and 5×5 convolution layers and max-pooling layers assembled in parallel with one another. The Inception v2 series was the first network to propose batch normalization [69], resulting in speedy training [2,45,47,70].
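   A minimal sketch of the inception module described above is shown below. The channel sizes follow the commonly cited inception(3a) split and are illustrative; the essential point is the parallel branches with 1×1 reductions, concatenated along the channel dimension.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Parallel 1x1 / 3x3 / 5x5 / pool branches; 1x1 convs reduce channels
    # before the expensive 3x3 and 5x5 filters (the bottleneck idea).
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # Same spatial size on every branch, so outputs concatenate on channels.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
y = InceptionModule(192, 64, 96, 128, 16, 32, 32)(x)   # -> (1, 256, 28, 28)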
Table 3
Summary of DCNN architectures.
  DCNN                 Year   Depth (No.   No. of       Dataset used   Test Error      Accuracy    Category       Highlights
  Architecture                of Layers)   parameters                  (Top 5)         (Top-5)
  AlexNet [63]         2012   8            60M          ImageNet       15.3%           84.7%       Spatial        1. First deep CNN architecture.
                                                                                                   exploitation   2. ReLu activation function used
                                                                                                                  instead of Sigmoid and tanh.
                                                                                                                  3. Multi-GPU’s parallel computing
                                                                                                                  technology is used.
4. Shift from hand-crafted feature engineering to deep convolutional neural networks.

ZFNet [64] (2014): 8 layers, 60M parameters, ImageNet, top-5 error 14.8%, top-5 accuracy 85.2%, category: spatial exploitation.
Highlights: 1. Introduced a visualization technique that gives insights into the intermediate layers. 2. Analogous to the AlexNet architecture, with small differences in filter size, number of filters and convolution stride.

VGGNet [66] (2014): 16 layers, 138M parameters, ImageNet, top-5 error 6.8%, top-5 accuracy 93.2%, category: spatial exploitation.
Highlights: 1. Increases the depth of the network using very small 3x3 convolution filters.

GoogleNet [67] (2015): 22 layers, 6M parameters, ImageNet, top-5 error 6.67%, top-5 accuracy 93.3%, category: spatial exploitation.
Highlights: 1. Increased the depth and width without raising the computational requirements. 2. Uses the Inception module, consisting of conv layers with different filter sizes. 3. Makes use of global average pooling. 4. First bottleneck architecture.

ResNet50 [61] (2016): 50 layers, 25.6M parameters, ImageNet, top-5 error 3.57%, top-5 accuracy 96.43%, category: depth + multi-path.
Highlights: 1. Using identity mappings, much deeper networks can be trained. 2. Skip connections are used. 3. Increases accuracy by preserving the gradient in the deeper layers.

ResNet101 [61] (2016): 101 layers, 44.5M parameters, ImageNet, category: depth + multi-path.
Highlights: 1. Performance comparable to VGG with far fewer parameters. 2. Uses the bottleneck design and global average pooling introduced in GoogleNet.

DenseNet [71] (2017): 201 layers, 20M parameters, category: multi-path.
Highlights: 1. The framework uses dense blocks. 2. Every layer is linked to every subsequent layer in a feed-forward manner. 3. Reduces the problem of vanishing gradients.
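The skip connections listed for ResNet in Table 3 can be made concrete with a short sketch. The following is a minimal PyTorch-style residual block, written for illustration only; the layer sizes and names are our own assumptions, not the exact ResNet-50 configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = activation(F(x) + x).

    The identity shortcut lets gradients flow directly to earlier
    layers, which is what "preserving the gradient" refers to.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                            # identity shortcut (skip connection)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)        # add the input back before activation

# Example: a feature map passes through with its shape unchanged.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))           # -> torch.Size([1, 64, 56, 56])
```

Because the block learns only the residual F(x) rather than the full mapping, stacking many such blocks avoids the degradation problem described in Section 3.5 below.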
3.5. ResNet

    As a network grows deeper, accuracy can saturate and then drop. This is known as the degradation problem, and to solve it a residual learning (ResNet) module was proposed by [61]. It has lower computational complexity than earlier architectures such as AlexNet [63] and VGGNet [66]. Generally, ResNet backbone networks with 50 and 101 layers are used [1,70].
    In ResNet50, skip connections are used to preserve the gradient in the deeper layers, and a rise in accuracy was observed. ResNet101 performs comparably to the VGG network with fewer parameters, adopting the global average pooling and bottleneck design introduced in GoogLeNet [45].

3.6. DenseNet

    Huang, G. et al. [71] presented the DenseNet architecture, composed of dense blocks that link each layer to every other layer in a feed-forward manner, giving rise to benefits such as feature reuse, parameter efficiency and implicit deep supervision. DenseNet reduces the problem of vanishing gradients [2,45].
    Table 3 shows performance comparisons of the backbone architectures discussed above. It gives a brief description of the number of layers and parameters used, the benchmark dataset, the top-5 test error, the top-5 accuracy and the category to which the corresponding architecture belongs; the highlights list the main features of each DCNN architecture. The top-5 error rate is the percentage of test images for which the correct label is not among the five labels the model considers most likely; the top-5 accuracy is simply 100% minus the top-5 error. CNNs can be divided into categories such as spatial exploitation based, depth based or multi-path based [70]. Spatial exploitation based CNNs adjust the spatial filters so that they perform well on both coarse-grained information (extracted by large filters) and fine-grained information (extracted by small filters). In depth based CNNs, deeper networks perform better than shallow ones, since depth increases the network's learning capability and lets it handle complex tasks effectively. Multi-path based CNNs connect one layer to another while bypassing some intermediary layers, so that information flows across all layers; this also mitigates the vanishing gradient problem. Readers can follow the survey [70] for more details.

4. Datasets for object detection and performance assessment

4.1. Datasets

    Datasets play a very important part in research. Owing to the progress driven by large image datasets, they are used in image classification, object detection and segmentation tasks [1,65]. There are many object detection datasets in the research domain, such as LISA [72], CIFAR-10 [73], PASCAL VOC [74], CIFAR-100 [73], MS COCO [75], ImageNet [76], Tiny Images [77], SUN
[78], Open Images v5 [79] etc. Fig. 6 shows some sample images from commonly used datasets. A brief description of these datasets follows.

4.1.1. PASCAL VOC
    The PASCAL VOC [74] datasets are extensively used for object detection tasks. With good quality images and corresponding labels for each image, the evaluation of algorithms becomes easy. The challenge was launched in 2005 with four classes, and over time this grew to 20 classes in 2007. These 20 classes are divided into four primary sections: vehicles, people, household objects and animals. PASCAL VOC 2007 and 2012 are the two most used versions of the dataset. The 2007 edition also contains some imbalanced classes; for example, instances of the class person are far more numerous than instances of the class sheep [2,5,24,74].

4.1.2. MS-COCO
    The Microsoft Common Objects in Context (MS COCO) [75] dataset has 91 common object categories found in everyday life, for detecting and segmenting objects. Of these 91 categories, 20 are shared with the PASCAL VOC dataset. The dataset has more than 2,500,000 labeled instances in 328,000 images. MS COCO contains diverse viewpoints and is rich in contextual information. It is a more challenging dataset than PASCAL VOC, containing a large number of small objects with huge scale variation [6,24,60].

4.1.3. ImageNet
    ImageNet [76] is an extensive and diverse image dataset for assessing the performance of algorithms; complex datasets like this can drive improvement in practical applications and computer vision tasks. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [80] is derived from ImageNet [2]. The ILSVRC object detection challenge assesses an algorithm's ability to categorize and locate all target objects in an image. It has become a benchmark dataset containing 1000 object classes and millions of images [1,24].

4.1.4. OpenImages
    The Open Images dataset [79] is one of the largest publicly available datasets, containing 9.2 million images annotated with object bounding boxes, image-level labels and segmentation masks. Open Images v5 is a standard dataset comprising 1.9 million images with 16 million annotated bounding boxes for 600 object categories. The images in this dataset are heterogeneous and contain complicated scenes with various objects (on average, 8.3 object categories per image) [24,60].

    The most popular object detection datasets are summarized in Table 4. It gives the year in which each dataset was launched, the number of classes, and the numbers of images and objects (bounding boxes) in the training and validation sets; the objects/image column gives the average number of bounding boxes per image. A reference link is also provided for each dataset.

4.2. Evaluation metrics

    Several metrics can be used to measure the effectiveness of object detectors: accuracy, precision, IoU, recall, the PR curve, Average Precision etc. [1,2,24,45,81–83]. Average Precision (AP), obtained from recall and precision, is the most often used metric.
    The goal of an object detector is to predict the object location by placing a bounding box over the object of a given class in an image or video with a high confidence score. A detection can therefore be considered a collection of three elements: the object class, the bounding box (BB) around that object and the confidence score [81]. The metric terminology used in assessing the performance of object detection algorithms is explained below.

4.2.1. IoU (Intersection over Union)
    IoU is the ratio of the overlap area between the predicted BB (BB_predict) and the ground truth BB (BB_ground) to the area of their union. It uses the concept of the Jaccard index, which measures the similarity between the two sets. Fig. 7 shows the concept of IoU.
    Values of IoU range between 0 and 1; the closer to 1, the more accurate the detection. If the predicted BB and the ground truth BB overlap perfectly, the IoU is 1; if they do not intersect at all, the IoU is 0. If the IoU of the two BBs is larger than a predefined threshold (most commonly 0.5), the object is considered correctly detected [1,2,4,45,81,84]. IoU is calculated as follows:
Table 4
Statistics for well-known object detection datasets.

Dataset            Launched  Challenge        Classes  Images (train/val)   Objects (train/val)   Objects/image  Link
PASCAL VOC [74]    2005      VOC 2007         20       2501 / 2510          6301 / 6307           2.5            http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html
                             VOC 2012         20       5717 / 5823          13,609 / 13,841       2.4            http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html
ImageNet [76]      2009      ILSVRC 2013      200      395,909 / 20,121     345,854 / 55,502      1.0            https://image-net.org/challenges/LSVRC/2013/index.php
                             ILSVRC 2017      200      456,567 / 20,121     478,807 / 55,502      1.1            https://image-net.org/challenges/LSVRC/2017/index.php#det
MS COCO [75]       2014      COCO 2017        80       118,287 / 5000       860,001 / 36,781      7.3            https://cocodataset.org/
Open Images [79]   2016      OpenImages 2018  600      1,743,042 / 41,620   14,610,229 / 303,980  8.3            https://g.co/dataset/openimages/
IoU = J(BB_{predict}, BB_{ground}) = \frac{\text{Area of intersection of the predicted and ground truth boxes}}{\text{Area of union of the predicted and ground truth boxes}}    (1)

Fig. 7. Demonstration of IoU.

    For each object detection task, precision and recall are evaluated using the IoU value for a given threshold t: if IoU ≥ t, the prediction is counted as correct, and if IoU < t, the prediction is counted as incorrect. To compute precision and recall, every BB must be classified as one of [45,83]:

  • TP (True Positive) - the model predicted positive and this is actually true.
  • TN (True Negative) - the model predicted negative and this is actually true.
  • FP (False Positive) - the model predicted positive and this is actually false.
  • FN (False Negative) - the model predicted negative and this is actually false.

4.2.2. Accuracy
    Accuracy is the proportion of correctly classified samples, both positive and negative, among all samples. It is given by:

Accuracy_{C_{mn}} = \frac{TP_{C_{mn}} + TN_{C_{mn}}}{TP_{C_{mn}} + FP_{C_{mn}} + TN_{C_{mn}} + FN_{C_{mn}}}    (2)

4.2.3. Precision
    Precision measures how many of the positive identifications were actually correct. In other words, it is the ratio of the number of correctly identified positive samples to the total count of samples identified as positive [81,83]. It is given by:

Precision_{C_{mn}} = \frac{TP_{C_{mn}}}{TP_{C_{mn}} + FP_{C_{mn}}}    (3)

4.2.4. Recall
    Recall measures how many of the actual positives were identified correctly. It is the proportion of correctly identified positive samples to the total number of actual positive samples. Recall is also known as sensitivity [45,82,83]. It is given by:

Recall_{C_{mn}} = \frac{TP_{C_{mn}}}{TP_{C_{mn}} + FN_{C_{mn}}}    (4)

4.2.5. Average Precision (AP)
    The most common metric for the accuracy of detections is Average Precision (AP). It is calculated independently for each object category C_m [4,45,81]:

AP_{C_m} = \frac{1}{j} \sum_{n=1}^{i} Precision_{C_{mn}}    (5)

4.2.6. Mean Average Precision (mAP)
    The mAP is calculated by averaging AP over all object categories, and thereby evaluates the overall performance of an object detector [45,81,82]:

mAP = \frac{1}{j} \sum_{m=1}^{j} AP_{C_m}    (6)
4.2.8. PR curve (Precision-Recall curve)

    The precision-recall curve depicts the tradeoff between recall and precision for distinct confidence thresholds. A wide area under the curve indicates both high recall and high precision. The precision-recall plot is more informative than the ROC (Receiver Operating Characteristics) plot when evaluating binary classifiers on unevenly distributed datasets. If precision remains high as recall increases, the detector is performing well; if precision starts declining as recall grows, the detector must hold precision at a certain level in order to keep recall high [1,86,87]. Since PR curves focus on the positive (detection) cases, they are widely used in research analyses.

4.2.9. AUC-ROC curve

    AUC stands for the Area Under the ROC Curve, which measures the performance of a classifier across all thresholds. The ROC (Receiver Operating Characteristics) curve is a probability curve related to the precision-recall curve; the distinction is that the ROC plots the TPR (True Positive Rate) against the FPR (False Positive Rate). The closer the ROC curve comes to the upper-left coordinate (0,1), the better the performance. The AUC value is the magnitude of the area beneath the ROC curve and ranges from 0.5 to 1; the greater the AUC value, the more accurate the detector [1,45,88].

5. Problems of object detection and its solutions

    Even though object detection has achieved remarkable performance in computer vision, it remains a complicated task with open challenges. Some of the fundamental challenges that networks encounter in real-world applications, and solutions to overcome them, are discussed below.

5.1. Small object detection

    Detecting small objects is among the most difficult problems in object detection. Algorithms such as Faster RCNN [20,21] and YOLO [23] are inadequate at detecting small objects. In a deep convolutional neural network, the individual feature layers carry little information about small objects, since these occupy only a few pixels in the actual image. Low-resolution small objects are hard to detect because they carry limited contextual detail [1,47]. To overcome this issue, more data can be generated by augmentation, or the model's input resolution can be increased [89].

5.2. Multi-scale object detection

    Multi-scale object detection is a challenging task. Each layer of a deep CNN generates feature maps, and the information in each feature map is independent of the others. Discriminative details for multi-scale objects can appear in any layer of the backbone network; for small objects they emerge in the early layers and dissipate in the later ones. In both one-stage and two-stage detection algorithms, predictions are carried out from the topmost layer, which hinders the detection of multi-scale objects, especially small ones. To overcome this difficulty, multi-layer detection and feature fusion have been proposed, combining information fusion with the hierarchical structure of DCNNs [1,45].
    Multiple layers are combined for detection using backbone networks such as the Inside-Outside Network (ION) [90], HyperNet [91] and Hypercolumns [92]. Because the semantic characteristics of each layer are represented differently, feature maps at different layers can be utilized to detect objects of varying sizes and resolutions. Representative methods include the multi-scale deep CNN [93], Deeply Supervised Object Detection (DSOD) [94] and SSD [22]. To increase the reliability of multi-scale object detection, multi-layer feature fusion and multi-layer detection can be merged; the Feature Pyramid Network (FPN) [28], Deconvolutional Single-Shot Detector (DSSD) [95], Scale-Transferrable Detection Network (STDN) [41], Reverse connection with Objectness prior Networks (RON) [96] and Top-Down Modulation (TDM) [97] are representative frameworks.

5.3. Intraclass variation

    Intraclass variation refers to the variation between different images of the same class: instances vary in shape, size, color, material, texture etc. Object instances are flexible and can easily be transformed by scaling and rotation; these are called intrinsic factors. Noticeable effects are also caused by external factors, including improper lighting, weather conditions, illumination changes, low-quality cameras etc. The differences can stem from a variety of factors such as occlusion, lighting, position and perspective [2,45,60]. This problem can be mitigated by ensuring that the training data has a good amount of variety covering all the factors mentioned above [98].

5.4. Efficiency and scalability

    As the number of object classes increases, the computational complexity rises, creating a demand for high computation resources, since a single image contains a huge number of candidate locations. The scalability of a detector ensures that it can recognize unseen objects. With ever-growing numbers of images and categories, annotating images manually becomes impractical, so weakly supervised techniques are used [2].

5.5. Generalization issues

    Generalization problems in object detection emerge when the model either underfits or overfits. Underfitting can be identified in the preliminary stages of the training phase, and can be fixed by increasing the number of training epochs or the complexity of the model. For overfitting, established remedies include increasing the training data, early stopping, regularization (L1, L2) and dropout layers [45].

5.6. Class imbalance

    An irregular data distribution between classes is referred to as class imbalance: one class contains a disproportionate number of instances relative to another. From the viewpoint of object detection, class imbalance can be of two types, foreground-background imbalance and foreground-foreground imbalance. The former occurs during the training process and is independent of the number of categories in the dataset; the latter refers to imbalance among the positive classes at the batch level. Generally, one-stage object detectors have lower accuracy than two-stage object detectors, and one of the reasons behind this is class imbalance [99]. To solve this issue, the classes can be upsampled or downsampled, or synthetic data can be generated, e.g. with the Synthetic Minority Oversampling Technique (SMOTE) [100,101], as sketched below.
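As one concrete, deliberately simple illustration of the remedies mentioned in Section 5.6, the sketch below uses a class-weighted cross-entropy loss in PyTorch. The class counts are assumed values for illustration; resampling or SMOTE would instead rebalance the data before training:

```python
import torch
import torch.nn as nn

# Suppose a detector's classification head sees, on average, 900 background
# and 100 foreground samples per batch (an assumed 9:1 skew). Weighting the
# loss inversely to class frequency keeps the rare class from being ignored.
counts = torch.tensor([900.0, 100.0])            # samples per class (assumed)
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(32, 2)                      # dummy predictions (N, classes)
labels = torch.randint(0, 2, (32,))              # dummy labels
loss = criterion(logits, labels)                 # rare-class errors cost ~9x more
```

Loss weighting and resampling address the same imbalance from different ends: the former changes the gradient contribution of each class, while the latter changes how often each class is seen.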
6. Applications of object detection

    Object detection has extensive scope in real-time systems. It is utilized in various areas of image processing such as monitoring systems, robotics, vehicle detection, autonomous driving etc. Important applications [1,4,102] of object detection are explained as follows.

6.1. Self-driving cars

    Self-driving cars are a distinctive application of object detection. A self-driving car can travel safely on the road only if it can detect the objects around it, such as persons, cars and road signs, to determine the next activity to perform: whether to brake, accelerate or turn. For this purpose, the car is trained to perform object detection [102].

6.2. Remote sensing target detection

    With the speedy development of remote sensing automation, it is being used in many application areas, such as the military, urban planning, traffic navigation, disaster rescue etc. In the last few years, remote sensing target detection of ships, aircraft, roads etc. has become a current research trend, and DCNN-based object detection frameworks such as Faster RCNN [20,21] and SSD [22] are gaining popularity in the remote sensing field.
    However, there are challenges in this field, such as the difficulty of detecting remote sensing targets correctly and quickly, because remote sensing images involve an immense volume of data. Remote sensing scenes have large and intricate backgrounds, which leads to many false detections, and images captured by different sensors present a high degree of variation. Small object detection can also be difficult, making the detection process slow. To rectify this, the resolution of the feature map is increased; attention mechanisms and feature fusion procedures have also been utilized to enhance small target identification [4,24].
    Datasets used for remote sensing target detection include LEVIR (LEarning, VIsion and Remote sensing laboratory) [103], DOTA (Dataset for Object deTection in Aerial images) [104], xView [105], VeDAI (Vehicle Detection in Aerial Imagery) [106], TAS (Things and Stuff) [107] etc.

6.3. Pedestrian detection

    Pedestrian detection is a critical application of object detection, commonly used in video surveillance, autonomous driving etc. Traditional methods of pedestrian detection relied on hand-crafted features such as Histograms of Oriented Gradients (HOG) [9] and Integral Channel Features (ICF) [108]; these built a powerful base for object detection, but over time DCNNs have taken their place and become more appropriate for pedestrian detection.
    Difficulties such as detecting dense and occluded pedestrians, detecting small pedestrians, and handling hard negatives impose great challenges in real applications. There are several methods through which these difficulties can be ameliorated: semantic segmentation [109] and the integration of boosted decision trees [110] help with hard negative detection; feature fusion [110] is used for small pedestrians; and ensembles of part detectors [111,112] and attention mechanisms [113] address the occlusion problem [4,6,24].
    To detect pedestrians, various datasets are in use, such as Caltech [114], INRIA [9], KITTI [115], CityPersons [116] etc.

6.4. Event detection

    Due to the ubiquitous use of social media, continuous growth can be seen in multimedia content, and real-world incidents can be discovered through their online traces. Methods such as multimodal graphs [117], multi-domain analysis [118] and social interaction graph modeling [119] are used for event detection. The objective of a multimodal graph is to identify and detect an event from a collection of 100 million photos or videos and briefly summarize it for consumers. In [119], online social interaction features are integrated using the social affinity of two photos, which helps in the detection of events; social affinity uses the interaction graph to figure out the similarity between two pictures. Multi-domain event detection collects heterogeneous data from multiple domains, such as social media and news media, to detect real-world incidents.

6.5. Medical detection

    The task of medical object detection is to identify medical objects within an image. CNN-based algorithms play a key role in medical image classification; they can help doctors analyze the exact area of a wound, thus enhancing the accuracy of medical diagnosis [1].
    In [120], a CNN combined with a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) is used to detect end-systolic and end-diastolic frames in MRI images. To classify skin lesions, a multi-stream CNN was designed by [121], extracting information from images at different resolutions; a challenge for melanoma detection was also organized by [122]. Li, L. et al. [123] proposed an attention mechanism for glaucoma detection. For automated detection of synapses and automated neuron reconstruction, [124] introduced cellular morphology neural networks (CMNs) [1,24].

6.6. Face detection and face recognition

    The objective of face detection is to detect and localize face regions in an image. Every face has a unique structure and attributes. The most popular detector in the early years was the Viola-Jones detector [13,14], which showed remarkable performance by detecting human faces for the first time with real-time efficiency [13,14].
    Face detection has various problems such as occlusion, illumination and multi-scale detection, since some faces may be tiny while others are large, and illumination or resolution can vary; human faces can also have heterogeneous expressions, poses, or skin colors. To solve these problems, various methods are designed: face calibration improves multi-pose detection [125,126]; attention mechanisms [127] and part-based detection [128] improve occluded face detection; and multi-scale feature fusion and multi-resolution detection enhance multi-scale face detection [4,24].
    Several datasets are used for face detection, such as WiderFace [129], FDDB (Face Detection Data set and Benchmark) [130], AFLW (Annotated Facial Landmarks in the Wild) [131], UFDD (Unconstrained Face Detection Dataset) [132] and many more.

6.7. Text detection

    Text detection aims to determine whether an image or video contains text and, if so, to recognize and localize it. Text detection has gained much significance in recent years, as it helps visually impaired persons read street signs. It is also utilized in classification, video analysis etc. [4,24].
    Text detection faces many problems: text can appear in different fonts and languages, perspective distortion or unusual orientations can occur, characters in street images can be blurred, and lighting can be irregular. The problem of blurred text can be addressed using word-level and sentence-level recognition [133], and the problem of font size can be rectified by training with synthetic samples [134].
    Datasets such as COCOText [135], the synthetic dataset Syn90k [134], and ICDAR [136] are used for text detection.

6.8. Traffic sign and traffic light detection

    In the past few years, the automated detection of traffic lights and traffic signs has drawn a lot of attention, yet it remains a challenging recognition task with many difficulties in the detection process.
    Bad weather is a main cause of false detection, as it degrades image quality; real-time detection and illumination changes are also challenging. Techniques such as adversarial training [137] and attention mechanisms [138] have been used to refine detection in difficult traffic scenes. CNN-based detectors such as Faster RCNN and the Single Shot Detector (SSD) are used for traffic sign and light detection [4,139–141].
    Some popular traffic light and sign datasets are LISA [72], TT100K (Tsinghua-Tencent 100K) [139], GTSDB (German Traffic Sign Detection Benchmark) [142] etc.

7. Future research challenges

    Despite the rapid evolution of object detection, there are still many areas where research needs to be done. In this section, various research directions are discussed.

7.1. Weakly supervised detection

7.2. RGB-D detection

    Due to the popularity of research in autonomous driving, depth information has been added to images to understand them better. A LIDAR point cloud localizes the position of objects accurately in 3D space using depth information. To correctly place ground truth 3D bounding boxes around objects, the 3D proposal network of [149] can be referred to [6,24].

7.3. Video object detection

    Detecting objects in real-time video, such as surveillance or autonomous driving footage, is of great importance at present. It faces difficulties such as poor image quality, which leads to poor accuracy. To associate objects across different frames and understand an object's actions, several video detectors have been designed around temporal factors. These include deep feature flow [150], flow-guided feature aggregation (FGFA) [151], spatial-temporal memory networks (STMN), and a tubelet network [152] for spatiotemporal proposals whose temporal information is integrated using an LSTM.

7.4. Automatic Neural Architecture Search (NAS)

    The utilization of deep learning models is becoming more popular day by day. One option is to obtain the backbone architecture through AutoML (Automated Machine Learning), which is already being used in object detection for specific purposes; NAS is a part of AutoML, alongside transfer learning and feature engineering. Reducing human involvement in designing the model by using NAS makes AutoML a promising future research direction [2,4,153,154].
Table 5
Performance comparison on the PASCAL VOC 2007 and 2012 test datasets.

Type     Method                  Model used   Proposals  FPS   VOC 2007 test set             VOC 2012 test set
                                                                Training data   mAP@.5       Training data    mAP@.5
2-stage  RCNN [25]               AlexNet      2000       0.03   07              58.5         12               53.3
2-stage  SPP-Net [26]            ZFNet        2000       0.44   07              59.2         -                -
2-stage  Fast RCNN [27]          VGG16        2000       0.5    07              66.9         12               65.7
                                              2000       0.5    07+12           70.0         07++12           68.4
2-stage  Faster RCNN [20,21]     VGG16        300        5      07              69.9         12               67.0
                                              300        5      07+12           73.2         07++12           70.4
                                              300        5      COCO+07+12      78.8         07++12+COCO      75.9
1-stage  YOLO [23]               -            98         45     07+12           63.4         07++12           57.9
1-stage  SSD300 [22,34]          VGG16        8732       46     07              68.0         07++12           72.4
                                              8732       46     07+12           74.3         07++12+COCO      77.5
                                              8732       46     07+12+COCO      79.6         -                -
1-stage  SSD512 [22]             VGG16        24564      19     07              71.6         07++12           74.9
                                              24564      19     07+12           76.8         07++12+COCO      80.0
                                              24564      19     07+12+COCO      81.6         -                -
1-stage  YOLOv2 [32]             Darknet19    -          40     07+12           78.6         07++12           73.4
1-stage  YOLOv5x 692 [34,35,37]  CSPDarknet   -          140    07+12           91.0         -                -
Table 6
Description of the training data given in Table 5: 07 denotes the VOC 2007 trainval set; 12 denotes the VOC 2012 trainval set; 07+12 denotes the union of the VOC 2007 and VOC 2012 trainval sets; 07++12 denotes the union of VOC 2007 trainval+test and VOC 2012 trainval; entries including COCO additionally use the MS COCO trainval data.
Table 7
Performance comparison on the COCO 2015 and 2017 test-dev datasets.

Type     Method                  Model used       Proposals  FPS   Training data   MS COCO test-dev 2015
                                                                                    mAP@.5    mAP@[.5,.95]
2-stage  Fast RCNN [27]          VGG16            2000       0.03  train           35.9      19.7
2-stage  Faster RCNN [20,21]     VGG16            300        5     trainval        42.7      21.9
2-stage  Mask RCNN [29]          ResNeXt-101-FPN  -          -     trainval35k     62.3      39.8
1-stage  SSD300 [22]             VGG16            8732       46    trainval35k     41.2      23.2
1-stage  SSD512 [22]             VGG16            24564      19    trainval35k     46.5      26.8
1-stage  YOLOv2 [32]             Darknet19        -          40    trainval35k     44.0      21.6
                                                                                    MS COCO test-dev 2017
                                                                                    mAP@.5    mAP@[.5,.95]
1-stage  YOLOv3 320 [33]         Darknet53        -          45    trainval        51.5      28.2
1-stage  YOLOv4 512 [36]         CSPDarknet53     -          31    trainval        64.9      43.0
1-stage  YOLOv5x 640 [34,35,37]  CSPDarknet       -          140   trainval        68.9      50.7
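The mAP@[.5,.95] column in Table 7 is obtained by averaging AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95. A minimal sketch of that final averaging step, assuming the per-threshold AP computation is available as a function:

```python
import numpy as np

def coco_map(ap_at_threshold):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95.

    ap_at_threshold: callable mapping an IoU threshold to the AP
    computed at that threshold (e.g. via a PR curve, as in Eqs. (5)-(6)).
    """
    thresholds = np.linspace(0.50, 0.95, 10)   # ten thresholds, step 0.05
    return float(np.mean([ap_at_threshold(t) for t in thresholds]))

# Toy example: an AP that degrades linearly as the IoU threshold tightens.
print(coco_map(lambda t: max(0.0, 0.9 - (t - 0.5))))  # ≈ 0.675
```

Averaging over stricter thresholds rewards well-localized boxes, which is why COCO-style scores are markedly lower than the mAP@.5 numbers in the same table.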
7.6. Optimization

    The structure of DCNNs can be optimized using various meta-heuristic optimization algorithms. These algorithms can be used to improve convolutional neural networks in diverse research tasks and applications, such as fine-tuning DCNN hyperparameters or training the DCNN itself, so the applicability of meta-heuristic techniques can be explored further. Optimization techniques such as [156–160] can be used; readers can also refer to [161] for more details.

8. Comparative results and discussion

    In this section, various object detection algorithms are compared on two popular datasets, PASCAL VOC [74] and MS COCO [75]. The comparison is based on the results reported in the respective object detector papers, and models are compared using mean average precision (mAP). The choice of backbone network for feature extraction has a great impact on the performance of the models.
    Table 5 compares the performance of object detectors on the PASCAL VOC 2007 and 2012 test sets. It gives brief details about the backbone model used, the number of region proposals and the frames per second (FPS), all of which affect the performance of an object detector. PASCAL VOC reports mAP@0.5, where 0.5 is the IoU threshold t; as discussed in section 4.2.1, IoU ≥ 0.5 means a prediction is counted as correct. The table shows that YOLOv5x performs best on the VOC 2007 test set with an mAP of 91.0%, while on the VOC 2012 test set SSD512 achieves the highest performance with an mAP of 80.0%. The training data notation used in Table 5 is described in Table 6.
    In Table 7, performance is evaluated on the COCO 2015 and 2017 test-dev datasets. The COCO dataset uses the metric mAP@[0.5,0.95], which averages AP over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. Here again YOLOv5x outperforms all other models on the COCO 2017 test-dev set with an mAP@[.5,.95] of 50.7. The mAP values for YOLOv5x are taken from its official GitHub repository [37], as no formal paper is available for it.

9. Conclusion

    Deep learning based CNNs have accomplished great development in recent years, and object detection progressed quickly following the introduction of deep learning. This review provides a thorough analysis of state-of-the-art object detection models (one-stage and two-stage) and backbone architectures, and evaluates the performance of models using standard datasets and metrics. Challenges of object detection are also discussed, along with applications and future research directions, to provide in-depth coverage of the field. It is clear from the results that, even after achieving remarkable performance in detecting objects, there is still considerable scope for improvement.

Declaration of competing interest

    The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

    No data was used for the research described in the article.

References

 [1] Y. Xiao, Z. Tian, J. Yu, Y. Zhang, S. Liu, S. Du, X. Lan, A review of object detection based on deep learning, Multimed. Tools Appl. 79 (33) (2020) 23729–23791.
 [2] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, M. Pietikäinen, Deep learning for generic object detection: a survey, Int. J. Comput. Vis. 128 (2) (2020) 261–318.
 [3] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, C. Gao, Object class detection: a survey, ACM Comput. Surv. 46 (1) (2013) 1–53.
 [4] Z. Zou, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: a survey, arXiv preprint, arXiv:1905.05055, 2019.
 [5] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
 [6] Z.-Q. Zhao, P. Zheng, S.-t. Xu, X. Wu, Object detection with deep learning: a review, IEEE Trans. Neural Netw. Learn. Syst. 30 (11) (2019) 3212–3232.
 [7] A.K. Shetty, I. Saha, R.M. Sanghvi, S.A. Save, Y.J. Patel, A review: object detection models, in: 2021 6th International Conference for Convergence in Technology (I2CT), IEEE, 2021, pp. 1–8.
 [8] S. Mohan, 6 different types of object detection algorithms in nutshell, https://machinelearningknowledge.ai/different-types-of-object-detection-algorithms/, Jun. 2020. (Accessed 11 February 2022).
 [9] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, IEEE, 2005, pp. 886–893.
 [10] R. Lienhart, J. Maydt, An extended set of Haar-like features for rapid object detection, in: Proceedings. International Conference on Image Processing, vol. 1, IEEE, 2002.
 [11] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
 [12] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
 [13] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1, IEEE, 2001.
 [14] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
 [15] H. Bay, T. Tuytelaars, L.V. Gool, Surf: speeded up robust features, in: European Conference on Computer Vision, Springer, 2006, pp. 404–417.
 [16] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
 [17] W.Y. Kyaw, Histogram of oriented gradients, https://waiyankyawmc.medium.com/histogram-of-oriented-gradients-90567ea6490a, May 2021. (Accessed 9 April 2022).
 [18] D.S. Aljutaili, R.A. Almutlaq, S.A. Alharbi, D.M. Ibrahim, A speeded up robust scale-invariant feature transform currency recognition algorithm, Int. J. Comput. Inf. Eng. 12 (6) (2018) 365–370.
 [19] AaronWard, Facial detection - understanding viola Jones' algorithm, https://medium.com/@aaronward6210/facial-detection-understanding-viola-jones-algorithm-116d1a9db218, Jan. 2020. (Accessed 29 January 2022).
 [20] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. 28 (2015).
 [21] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
 [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
 [23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
 [24] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, R. Qu, A survey of deep learning-based object detection, IEEE Access 7 (2019) 128837–128868.
 [25] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
 [26] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9)
 [35] D. Thuan, Evolution of yolo algorithm and yolov5: the state-of-the-art object detection algorithm, 2021.
 [36] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, Yolov4: optimal speed and accuracy of object detection, arXiv preprint, arXiv:2004.10934, 2020.
 [37] Yolov5, https://github.com/ultralytics/yolov5. (Accessed 6 March 2022).
 [38] A. Boukerche, Z. Hou, Object detection using deep learning methods in traffic scenarios, ACM Comput. Surv. 54 (2) (2021) 1–35.
 [39] PulkitS, Introduction to object detection algorithms, https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/, Oct. 2018. (Accessed 6 March 2022).
 [40] S. Park, A guide to two-stage object detection: R-CNN, FPN, mask R-CNN, https://medium.com/codex/a-guide-to-two-stage-object-detection-r-cnn-fpn-mask-r-cnn-and-more-54c2e168438c, Jul. 2021. (Accessed 15 March 2022).
 [41] P. Zhou, B. Ni, C. Geng, J. Hu, Y. Xu, Scale-transferrable object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 528–537.
 [42] J.R. Uijlings, K.E. Van De Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
 [43] P. Arbeláez, J. Pont-Tuset, J.T. Barron, F. Marques, J. Malik, Multiscale combinatorial grouping, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
 [44] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: European Conference on Computer Vision, Springer, 2014, pp. 391–405.
 [45] E. Arulprakash, M. Aruldoss, A study on generic object detection with emphasis on future research directions, J. King Saud Univ., Comput. Inf. Sci. (2021).
 [46] J. Hui, Understanding feature pyramid networks for object detection (FPN), https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c, Mar. 2018. (Accessed 21 February 2022).
 [47] Y. Liu, P. Sun, N. Wergeles, Y. Shang, A survey and performance evaluation of deep learning methods for small object detection, Expert Syst. Appl. 172 (2021) 114602.
 [48] F. Sultana, A. Sufian, P. Dutta, A review of object detection models based on convolutional neural network, in: Intelligent Computing: Image Processing Based Applications, 2020, pp. 1–16.
 [49] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.
 [50] C. Gentile, M.K. Warmuth, Linear hinge loss and average margin, Adv. Neural Inf. Process. Syst. 11 (1998).
 [51] K. Janocha, W.M. Czarnecki, On loss functions for deep neural networks in classification, arXiv preprint, arXiv:1702.05659, 2017.
 [52] P.-T. De Boer, D.P. Kroese, S. Mannor, R.Y. Rubinstein, A tutorial on the cross-entropy method, Ann. Oper. Res. 134 (1) (2005) 19–67.
 [53] J. Shetty, P.S. Jogi, Study on different region-based object detection models applied to live video stream and images using deep learning, in: International Conference on ISMAC in Computational Vision and Bio-Engineering, Springer, 2018, pp. 51–60.
 [54] C. Tang, Y. Feng, X. Yang, C. Zheng, Y. Zhou, The object detection based on deep learning, in: 2017 4th International Conference on Information Science and Control Engineering (ICISCE), IEEE, 2017, pp. 723–728.
 [55] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-IoU loss: faster and better learning for bounding box regression, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 12993–13000.
 [56] G. Ghiasi, T.-Y. Lin, Q.V. Le, Dropblock: a regularization method for convolutional networks, Adv. Neural Inf. Process. Syst. 31 (2018).
 [57] I. Loshchilov, F. Hutter, Sgdr: stochastic gradient descent with warm restarts, arXiv preprint, arXiv:1608.03983, 2016.
 [58] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance seg-
      (2015) 1904–1916.                                                                                 mentation, in: Proceedings of the IEEE Conference on Computer Vision and
 [27] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference                     Pattern Recognition, 2018, pp. 8759–8768.
      on Computer Vision, 2015, pp. 1440–1448.                                                     [59] C.-Y. Wang, H.-Y.M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, Cspnet: a
 [28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyra-                new backbone that can enhance learning capability of cnn, in: Proceedings of
      mid networks for object detection, in: Proceedings of the IEEE Conference on                      the IEEE/CVF Conference on Computer Vision and Pattern Recognition Work-
      Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.                                     shops, 2020, pp. 390–391.
 [29] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the               [60] S.S.A. Zaidi, M.S. Ansari, A. Aslam, N. Kanwal, M. Asghar, B. Lee, A survey of
      IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.                            modern deep learning based object detection models, Digit. Signal Process.
 [30] C. Szegedy, A. Toshev, D. Erhan, Deep neural networks for object detection,                       (2022) 103514.
      Adv. Neural Inf. Process. Syst. 26 (2013).                                                   [61] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni-
 [31] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: in-                   tion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
      tegrated recognition, localization and detection using convolutional networks,                    Recognition, 2016, pp. 770–778.
      arXiv preprint, arXiv:1312.6229, 2013.                                                       [62] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations
 [32] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: Proceedings of                     for deep neural networks, in: Proceedings of the IEEE Conference on Com-
      the IEEE Conference on Computer Vision and Pattern Recognition, 2017,                             puter Vision and Pattern Recognition, 2017, pp. 1492–1500.
      pp. 7263–7271.                                                                               [63] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep con-
 [33] J. Redmon, A. Farhadi, Yolov3: an incremental improvement, arXiv preprint,                        volutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012).
      arXiv:1804.02767, 2018.                                                                      [64] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks,
 [34] J. Solawetz, YOLOv5 new version - improvements and evaluation, https://blog.                      in: European Conference on Computer Vision, Springer, 2014, pp. 818–833.
      roboflow.com/yolov5-improvements-and-evaluation/, Jun. 2020. (Accessed 1                      [65] A.R. Pathak, M. Pandey, S. Rautaray, Application of deep learning for object
      April 2022).                                                                                      detection, Proc. Comput. Sci. 132 (2018) 1706–1717.
[66] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint, arXiv:1409.1556, 2014.
[67] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[68] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint, arXiv:1312.4400, 2013.
[69] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[70] A. Khan, A. Sohail, U. Zahoora, A.S. Qureshi, A survey of the recent architectures of deep convolutional neural networks, Artif. Intell. Rev. 53 (8) (2020) 5455–5516.
[71] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[72] A. Mogelmose, M.M. Trivedi, T.B. Moeslund, Vision-based traffic sign detection and analysis for intelligent driver assistance systems: perspectives and survey, IEEE Trans. Intell. Transp. Syst. 13 (4) (2012) 1484–1497.
[73] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, 2009.
[74] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[75] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[76] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[77] A. Torralba, R. Fergus, W.T. Freeman, 80 million tiny images: a large data set for nonparametric object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 30 (11) (2008) 1958–1970.
[78] J. Xiao, K.A. Ehinger, J. Hays, A. Torralba, A. Oliva, SUN database: exploring a large collection of scene categories, Int. J. Comput. Vis. 119 (1) (2016) 3–22.
[79] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., The Open Images dataset v4, Int. J. Comput. Vis. 128 (7) (2020) 1956–1981.
[80] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[81] R. Padilla, W.L. Passos, T.L. Dias, S.L. Netto, E.A. da Silva, A comparative analysis of object detection metrics with a companion open-source toolkit, Electronics 10 (3) (2021) 279.
[82] A. Gad, Evaluating object detection models using mean average precision, https://www.kdnuggets.com/2021/03/evaluating-object-detection-models-using-mean-average-precision.html. (Accessed 7 August 2022).
[83] A. Gad, Evaluating deep learning models: the confusion matrix, accuracy, precision, and recall, https://www.kdnuggets.com/2021/02/evaluating-deep-learning-models-confusion-matrix-accuracy-precision-recall.html. (Accessed 2 August 2022).
[84] R. Padilla, S.L. Netto, E.A. Da Silva, A survey on performance metrics for object-detection algorithms, in: 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), IEEE, 2020, pp. 237–242.
[85] J. Brownlee, How to calculate precision, recall, and F-measure for imbalanced classification, https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/, Jan. 2020. (Accessed 1 May 2022).
[86] Precision-Recall, https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html. (Accessed 17 April 2022).
[87] J. Brownlee, How to use ROC curves and precision-recall curves for classification in Python, https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/, Aug. 2018. (Accessed 17 April 2022).
[88] S. Narkhede, Understanding AUC - ROC curve, https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5, Jun. 2018. (Accessed 12 April 2022).
[89] J. Solawetz, Small object detection guide, https://blog.roboflow.com/detect-small-objects/, Aug. 2020. (Accessed 7 August 2022).
[90] S. Bell, C.L. Zitnick, K. Bala, R. Girshick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[91] T. Kong, A. Yao, Y. Chen, F. Sun, HyperNet: towards accurate region proposal generation and joint object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 845–853.
[92] B. Hariharan, P. Arbelaez, R. Girshick, J. Malik, Object instance segmentation and fine-grained localization using hypercolumns, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2016) 627–639.
[93] Z. Cai, Q. Fan, R.S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: European Conference on Computer Vision, Springer, 2016, pp. 354–370.
[94] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, X. Xue, DSOD: learning deeply supervised object detectors from scratch, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1919–1927.
[95] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, A.C. Berg, DSSD: deconvolutional single shot detector, arXiv preprint, arXiv:1701.06659, 2017.
[96] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, Y. Chen, RON: reverse connection with objectness prior networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5936–5944.
[97] A. Shrivastava, R. Sukthankar, J. Malik, A. Gupta, Beyond skip connections: top-down modulation for object detection, arXiv preprint, arXiv:1612.06851, 2016.
[98] B. Dipert, Overcome these 6 problems with object detection, https://www.edge-ai-vision.com/2022/02/overcome-these-6-problems-with-object-detection/, Feb. 2022. (Accessed 24 April 2022).
[99] K. Oksuz, B.C. Cam, S. Kalkan, E. Akbas, Imbalance problems in object detection: a review, IEEE Trans. Pattern Anal. Mach. Intell. 43 (10) (2020) 3388–3415.
[100] S. Mazumder, 5 techniques to handle imbalanced data for a classification problem, https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/, Jun. 2021. (Accessed 25 April 2022).
[101] S. Kumar, 5 techniques to work with imbalanced data in machine learning, https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c, Sep. 2021. (Accessed 25 April 2022).
[102] A. Vahab, M.S. Naik, P.G. Raikar, S. Prasad, Applications of object detection system, Int. J. Res. Eng. Technol. 6 (4) (2019) 4186–4192.
[103] Z. Zou, Z. Shi, Random access memories: a new paradigm for target detection in high resolution aerial remote sensing images, IEEE Trans. Image Process. 27 (3) (2017) 1100–1111.
[104] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang, DOTA: a large-scale dataset for object detection in aerial images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.
[105] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, B. McCord, xView: objects in context in overhead imagery, arXiv preprint, arXiv:1802.07856, 2018.
[106] S. Razakarivony, F. Jurie, Vehicle detection in aerial imagery: a small target detection benchmark, J. Vis. Commun. Image Represent. 34 (2016) 187–203.
[107] G. Heitz, D. Koller, Learning spatial context: using stuff to find things, in: European Conference on Computer Vision, Springer, 2008, pp. 30–43.
[108] P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in: British Machine Vision Conference, 2009.
[109] Y. Tian, P. Luo, X. Wang, X. Tang, Pedestrian detection aided by deep learning semantic tasks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5079–5087.
[110] L. Zhang, L. Lin, X. Liang, K. He, Is Faster R-CNN doing well for pedestrian detection?, in: European Conference on Computer Vision, Springer, 2016, pp. 443–457.
[111] Y. Tian, P. Luo, X. Wang, X. Tang, Deep learning strong parts for pedestrian detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1904–1912.
[112] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, X. Wang, Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection, IEEE Trans. Pattern Anal. Mach. Intell. 40 (8) (2017) 1874–1887.
[113] S. Zhang, J. Yang, B. Schiele, Occluded pedestrian detection through guided attention in CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6995–7003.
[114] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2011) 743–761.
[115] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354–3361.
[116] S. Zhang, R. Benenson, B. Schiele, CityPersons: a diverse dataset for pedestrian detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3213–3221.
[117] M. Schinas, S. Papadopoulos, G. Petkos, Y. Kompatsiaris, P.A. Mitkas, Multimodal graph-based event detection and summarization in social media streams, in: Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 189–192.
[118] Z. Yang, Q. Li, W. Liu, J. Lv, Shared multi-view data representation for multi-domain event detection, IEEE Trans. Pattern Anal. Mach. Intell. 42 (5) (2019) 1243–1256.
[119] Y. Wang, H. Sundaram, L. Xie, Social event detection with interaction graph modeling, in: Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 865–868.
[120] B. Kong, Y. Zhan, M. Shin, T. Denny, S. Zhang, Recognizing end-diastole and end-systole frames via deep temporal regression network, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 264–272.
[121] J. Kawahara, G. Hamarneh, Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers, in: International Workshop on Machine Learning in Medical Imaging, Springer, 2016, pp. 164–171.
[122] N.C. Codella, D. Gutman, M.E. Celebi, B. Helba, M.A. Marchetti, S.W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al., Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC), in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 168–172.
[123] L. Li, M. Xu, X. Wang, L. Jiang, H. Liu, Attention based glaucoma detection: a large-scale database and CNN model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10571–10580.
[124] P.J. Schubert, S. Dorkenwald, M. Januszewski, V. Jain, J. Kornfeld, Learning cellular morphology with neural networks, Nat. Commun. 10 (1) (2019) 1–12.
[125] X. Shi, S. Shan, M. Kan, S. Wu, X. Chen, Real-time rotation-invariant face detection with progressive calibration networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2295–2303.
[126] D. Chen, G. Hua, F. Wen, J. Sun, Supervised transformer network for efficient face detection, in: European Conference on Computer Vision, Springer, 2016, pp. 122–138.
[127] J. Wang, Y. Yuan, G. Yu, Face attention network: an effective face detector for the occluded faces, arXiv preprint, arXiv:1711.07246, 2017.
[128] S. Yang, P. Luo, C.C. Loy, X. Tang, Faceness-Net: face detection through deep facial part responses, IEEE Trans. Pattern Anal. Mach. Intell. 40 (8) (2017) 1845–1859.
[129] S. Yang, P. Luo, C.-C. Loy, X. Tang, WIDER FACE: a face detection benchmark, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
[130] V. Jain, E. Learned-Miller, FDDB: a benchmark for face detection in unconstrained settings, Tech. Rep., UMass Amherst, 2010.
[131] M. Koestinger, P. Wohlhart, P.M. Roth, H. Bischof, Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization, in: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011, pp. 2144–2151.
[132] H. Nada, V.A. Sindagi, H. Zhang, V.M. Patel, Pushing the limits of unconstrained face detection: a challenge dataset and baseline results, in: 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), IEEE, 2018, pp. 1–10.
[133] Z. Wojna, A.N. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, J. Ibarz, Attention-based extraction of structured information from street view imagery, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, IEEE, 2017, pp. 844–850.
[134] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, arXiv preprint, arXiv:1406.2227, 2014.
[135] A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie, COCO-Text: dataset and benchmark for text detection and recognition in natural images, arXiv preprint, arXiv:1601.07140, 2016.
[136] S. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, ICDAR 2003 robust reading competitions, in: Seventh International Conference on Document Analysis and Recognition, 2003, pp. 682–687.
[137] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, S. Yan, Perceptual generative adversarial networks for small object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1222–1230.
[138] Y. Lu, J. Lu, S. Zhang, P. Hall, Traffic signal detection and classification in street views using an attention model, Comput. Vis. Media 4 (3) (2018) 253–266.
[139] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, S. Hu, Traffic-sign detection and classification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2110–2118.
[140] K. Behrendt, L. Novak, R. Botros, A deep learning approach to traffic lights: detection, tracking, and classification, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 1370–1377.
[141] D. Li, D. Zhao, Y. Chen, Q. Zhang, DeepSign: deep learning based traffic sign recognition, in: 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 2018, pp. 1–6.
[142] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, C. Igel, Detection of traffic signs in real-world images: the German traffic sign detection benchmark, in: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, 2013, pp. 1–8.
[143] H. Bilen, A. Vedaldi, Weakly supervised deep detection networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
[144] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, L. Van Gool, Weakly supervised cascaded convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 914–922.
[145] C. Cao, Y. Huang, Y. Yang, L. Wang, Z. Wang, T. Tan, Feedback convolutional neural network for visual localization and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 41 (7) (2018) 1627–1640.
[146] F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, Q. Ye, C-MIL: continuation multiple instance learning for weakly supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2199–2208.
[147] F. Wan, P. Wei, J. Jiao, Z. Han, Q. Ye, Min-entropy latent model for weakly supervised object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1297–1306.
[148] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
[149] X. Chen, K. Kundu, Y. Zhu, A.G. Berneshawi, H. Ma, S. Fidler, R. Urtasun, 3D object proposals for accurate object class detection, Adv. Neural Inf. Process. Syst. 28 (2015).
[150] X. Zhu, Y. Xiong, J. Dai, L. Yuan, Y. Wei, Deep feature flow for video recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2349–2358.
[151] X. Zhu, Y. Wang, J. Dai, L. Yuan, Y. Wei, Flow-guided feature aggregation for video object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 408–417.
[152] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, X. Wang, Object detection in videos with tubelet proposal networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 727–735.
[153] M. Heller, What is neural architecture search? AutoML for deep learning, https://www.infoworld.com/article/3648408/what-is-neural-architecture-search.html, Jan. 2022. (Accessed 26 February 2022).
[154] Everything you need to know about AutoML and neural architecture search, https://www.kdnuggets.com/2018/09/everything-need-know-about-automl-neural-architecture-search.html. (Accessed 26 February 2022).
[155] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Adv. Neural Inf. Process. Syst. 27 (2014).
[156] S. Mahajan, A.K. Pandit, Hybrid method to supervise feature selection using signal processing and complex algebra techniques, Multimed. Tools Appl. (2021) 1–22.
[157] S. Mahajan, L. Abualigah, A.K. Pandit, M. Altalhi, Hybrid Aquila optimizer with arithmetic optimization algorithm for global optimization tasks, Soft Comput. 26 (10) (2022) 4863–4881.
[158] S. Mahajan, L. Abualigah, A.K. Pandit, A. Nasar, M. Rustom, H.A. Alkhazaleh, M. Altalhi, Fusion of modern meta-heuristic optimization methods using arithmetic optimization algorithm for global optimization tasks, Soft Comput. (2022) 1–15.
[159] S. Mahajan, L. Abualigah, A.K. Pandit, Hybrid arithmetic optimization algorithm with hunger games search for global optimization, Multimed. Tools Appl. (2022) 1–24.
[160] S. Mahajan, A.K. Pandit, Image segmentation and optimization techniques: a short overview, Medicon Eng. Themes 2 (2) (2022) 47–49.
[161] M. Abd Elaziz, A. Dahou, L. Abualigah, L. Yu, M. Alshinwan, A.M. Khasawneh, S. Lu, Advanced metaheuristic optimization techniques in applications of deep neural networks: a review, Neural Comput. Appl. 33 (21) (2021) 14079–14099.

Ravpreet Kaur received her B.Tech degree from Chandigarh University, India, and her M.Tech degree from CGC Landran, India, both in Computer Science and Engineering. She is currently pursuing a Ph.D. at UIET, Panjab University, Chandigarh. Her areas of interest include Deep Learning and Machine Learning.

Sarbjeet Singh is a Professor at the University Institute of Engineering and Technology, Panjab University, India. He received his B.Tech degree in Computer Science and Engineering from Punjab Technical University, Jalandhar, India, in 2001, and his Ph.D. degree in Computer Science and Engineering from Thapar University, Patiala, India, in 2009. His research areas include Machine Learning, Deep Learning, Object Detection, Activity Recognition, Cloud Computing, Social Network Analysis and Sentiment Analysis.