CONTENT
TITLE                                         PAGE NO.
CHAPTER-1 INTRODUCTION                         1
CHAPTER-2 BACKGROUND STUDY AND RELATED WORK    4
CHAPTER-3 OBJECT DETECTION AND TRACKING MODELS 6
    3.1 ANCHOR BOXES                           7
    3.2 FASTER RCNN                            7
    3.3 SINGLE SHOT DETECTOR (SSD)             8
    3.4 YOLO                                   9
    3.5 DEEP SORT                              11
CHAPTER-4 PROPOSED APPROACH                    14
    4.1 WORKFLOW                               15
CHAPTER-5 EXPERIMENTS AND RESULTS              16
CHAPTER-6 EXECUTION                            18
CHAPTER-7 OUTPUT                               22
CHAPTER-8 FUTURE SCOPE AND CHALLENGES          23
CHAPTER-9 CONCLUSION                           24
CHAPTER-10 REFERENCES                          25
          LIST OF FIGURES
FIG NO.   FIG NAME                             PAGE NO.
1         An Outcome of Social Distancing.         1
2         Performance Overview                     5
3         FASTER RCNN                              8
4         Schematic Representation of SSD          8
5         Schematic Representation of YOLO V3      9
6         Data Samples Showing                     16
7         Losses Per Iteration of The Object       16
8         Sample Output                            22
                             LIST OF TABLES
TABLE NO.                                      PAGE NO.
Table 1 Performance of the feature extraction network on ImageNet challenge      6
Table 2 Outlines Hyperparameters                  7
                                 CHAPTER-1
                             INTRODUCTION
       COVID-19 belongs to the family of coronavirus-caused diseases and was initially
reported in Wuhan, China, in late December 2019. By March 11, 2020, it had spread to over
114 countries with 118,000 active cases and 4,000 deaths, and the WHO declared it a
pandemic. By May 4, 2020, 3,519,901 cases and 247,630 deaths had been
reported worldwide. Despite efforts by healthcare organizations, medical experts, and
scientists to develop medicines and vaccines, no success has been reported so far. This
situation prompts the global community to explore alternate ways to stop the spread of
this infectious virus. Social distancing is considered the best method in the current
scenario, and affected countries are implementing lockdowns to enforce it. This
research aims to support and mitigate the coronavirus pandemic with minimal
economic loss.
Fig. 1: An outcome of social distancing, showing the reduced peak of the epidemic
matching the available healthcare capacity.
This work proposes a solution to detect violations of social distancing among people
gathered at any public place. "Social distancing" is a best practice that aims to
minimize or interrupt the transmission of COVID-19, following WHO norms that recommend
maintaining at least 6 feet of distance between individuals. Recent studies indicate
that social distancing is crucial to prevent the spread of SARS-CoV-2, as individuals
with mild or no symptoms may unknowingly carry the virus and infect others. Proper
social distancing reduces infectious physical contact, thereby lowering the infection rate.
This reduced peak aligns with available healthcare infrastructure, providing better
facilities to patients battling the coronavirus pandemic. Epidemiology, the study of
factors and reasons for the spread of infectious diseases, often employs mathematical
models. The classical SIR model of Kermack and McKendrick (1927) serves as a
foundation, and various researchers have explored its deterministic and stochastic
extensions. Treatment and containment strategies for respiratory diseases consider the
rate and mode of virus transmission. While efforts are underway to develop COVID-19
vaccines, no well-known medicine is available for treatment or prevention. Eksin
et al. proposed a modified SIR model incorporating a social distancing parameter a(I, R),
determined by the number of infected (I) and recovered (R) individuals:
\[
\frac{dS}{dt} = -a\,\beta S \frac{I}{N}, \qquad
\frac{dI}{dt} = -\delta I + a\,\beta S \frac{I}{N}, \qquad
\frac{dR}{dt} = \delta I \tag{1}
\]
Here, β represents the infection rate, δ represents the recovery rate, and the population
size N is computed as N = S + I + R. The social distancing term a(I, R): ℝ² → [0, 1]
scales the transition rate from the susceptible state (S) to the infected state (I), which
is given by aβSI/N. Social distancing models include "long-term awareness," where
interaction reduces proportionally with the cumulative percentage of affected
individuals, and "short-term awareness," where interaction reduction is directly
proportional to the proportion of infectious individuals.
\[
a = \left(1 - \frac{I + R}{N}\right)^{k} \tag{2}
\]
\[
a = \left(1 - \frac{I}{N}\right)^{k} \tag{3}
\]
Eq. 2 gives the long-term awareness model and Eq. 3 the short-term awareness model. The
behavior parameter k (k ≥ 0) defines sensitivity to disease prevalence, with a
higher value indicating increased sensitivity.
On April 16, 2020, Landing AI, under the leadership of Dr. Andrew Ng, announced an AI
tool to monitor social distancing at the workplace. Inspired by this, the authors aim to
check and compare the performance of popular object detection and tracking schemes in
monitoring social distancing. The rest of this report is organized as follows: recent work
in Chapter 2, state-of-the-art object detection and tracking models in Chapter 3, the
proposed deep learning-based framework in Chapter 4, experimentation and results in
Chapter 5, execution and output in Chapters 6 and 7, future scope and challenges in
Chapter 8, and the conclusion in Chapter 9.
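The modified SIR model described earlier in this chapter can be explored numerically. The following is a minimal sketch, assuming a forward Euler step and the awareness terms a = (1 - (I + R)/N)^k (long-term) and a = (1 - I/N)^k (short-term). The function name simulate_sir, the step size dt, and all parameter values are assumptions chosen for illustration and are not taken from Eksin et al.

```python
def simulate_sir(beta, delta, k, s0, i0, r0, days, dt=0.1, awareness="short"):
    """Forward Euler integration of SIR with a social distancing factor a(I, R).

    Returns the final (S, I, R) and the largest infected count seen.
    All names and default values here are illustrative assumptions.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r                      # N = S + I + R stays constant
    peak = i
    for _ in range(int(days / dt)):
        if awareness == "short":
            a = (1.0 - i / n) ** k     # short-term awareness
        else:
            a = (1.0 - (i + r) / n) ** k   # long-term awareness
        new_infections = a * beta * s * i / n * dt   # S -> I at rate a*beta*S*I/N
        new_recoveries = delta * i * dt              # I -> R at rate delta*I
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return s, i, r, peak


# Compare no behavior change (k = 0) with a sensitive population (k = 3).
_, _, _, peak_without = simulate_sir(0.5, 0.1, 0, 999, 1, 0, days=200)
_, _, _, peak_with = simulate_sir(0.5, 0.1, 3, 999, 1, 0, days=200)
print(f"peak infected without distancing: {peak_without:.0f}")
print(f"peak infected with distancing:    {peak_with:.0f}")
```

Under these assumed parameters, a larger k lowers the infected peak, which is the flattening effect that Fig. 1 depicts.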
                                  CHAPTER-2
       BACKGROUND STUDY AND RELATED WORK
       Social distancing has emerged as a reliable technique to curb the spread of
infectious diseases. COVID-19 emerged in Wuhan, China, in December 2019, and social
distancing was adopted there as an unprecedented measure on January 23, 2020.
Within a month, the outbreak peaked in the first week of February, with
2,000 to 4,000 new confirmed cases per day. Notably, there was a sign of relief in
China when no new confirmed cases were reported for five consecutive days up to
March 23, 2020. Social distancing measures enacted in China were later adopted
worldwide to control COVID-19.
Prem et al. studied the effects of social distancing measures on the spread of the
COVID-19 epidemic, using synthetic location-specific contact patterns and
susceptible-exposed-infected-removed (SEIR) models. It was suggested that
prematurely lifting social distancing could lead to an earlier secondary peak,
emphasizing the importance of gradual interventions. Adolph et al. highlighted
challenges in the United States, where a lack of consensus among policymakers
resulted in ongoing harm to public health. Although social distancing has an economic
cost, Kylie et al. explored the correlation between the strictness of social distancing
and the economic status of a region, indicating that intermediate levels of activity
could be permitted while avoiding a massive outbreak.
Technology-based solutions have been employed globally to contain the outbreak,
including the use of GPS and Bluetooth in apps like India's Aarogya Setu to track
suspected or infected persons. Some countries use drones and surveillance cameras to
detect mass gatherings and disperse crowds. While manual interventions may help
flatten the curve, they also pose threats and challenges to the workforce.
Human detection in visual surveillance relies on manual methods, but recent
advancements advocate for intelligent systems due to constraints like low-resolution
video and varying environmental factors. Object detection involves two stages:
detection and classification. Background subtraction, optical flow, and spatiotemporal
filtering are traditional methods for object detection. Advanced techniques, including
convolutional neural networks (CNN), region-based CNN, and faster region-based
CNN, have improved object detection efficiency. The YOLO approach treats detection as a
regression problem over a grid of image cells, which enables faster detection.
Crowd counting research, focused on societal applications, has addressed challenges
in video surveillance. Shape-based, texture-based, and motion-based features aid in
human identification. Face and gait recognition techniques are used for further
identification, but occlusion in crowded scenarios remains challenging. Datasets like
KTH human motion, INRIA XMAS multi-view, and Weizmann human action have
been widely used for research. In this study, the Open Images Dataset is used for
fine-tuning object detection and tracking models, with the aim of monitoring social
distancing in the Oxford Town Center surveillance footage. Unified datasets with annotations for
various tasks enable efficient object detection research and progress in scene
understanding.
Fig. 2: Performance overview of the most popular object detection models on the
PASCAL-VOC and MS-COCO datasets.
                                  CHAPTER-3
     OBJECT DETECTION AND TRACKING MODELS
       As indicated by Fig. 2, prominent object detection models such as RCNN, fast
RCNN, faster RCNN, SSD, YOLO v1, YOLO v2, and YOLO v3, evaluated on
PASCAL-VOC and MS-COCO datasets, exhibit a trade-off between speed and
detection accuracy. This trade-off is influenced by factors like backbone architecture
(e.g., VGG-16, ResNet-101, Inception v2), input sizes, model depth, and varying
software and hardware environments. Feature extraction networks, such as VGG-16,
ResNet-101, and Inception v2, encode input images into feature representations,
crucial for learning object patterns. These networks employ predefined anchor boxes
covering the entire image to identify objects of various scales or sizes.
Table 1 presents the performance and accuracy of these feature extraction networks on
the ILSVRC ImageNet challenge, along with the number of trainable parameters,
impacting training speed. Notably, Inception v2 achieves high accuracy with minimal
parameters and is used as a backbone for faster RCNN and SSD, while YOLO v3
employs Darknet-53.
   Table 1 Performance of the feature extraction network on ImageNet challenge.
3.1 ANCHOR BOXES
Every popular object detection model incorporates anchor boxes, overlaid on input
images at various spatial locations with different sizes and aspect ratios. For an image
of dimension breadth (b) × height (h), anchor boxes are generated using parameters
size (p) and aspect ratio (r). Anchor box dimensions are computed as bp√r × hp√r.
Table 2: Hyperparameters for generating the anchor boxes.
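As an illustration of how anchor box dimensions follow from the size p and aspect ratio r, the sketch below enumerates the boxes for one spatial location. It assumes the common convention in which the width is b·p·√r and the height is h·p/√r, so that r changes the shape of the box; this convention is an assumption and differs from the expression printed above, which scales both sides by √r. The function name and the example values are hypothetical.

```python
import math


def anchor_box_dims(b, h, sizes, ratios):
    """Return (width, height) for every (size, ratio) pair at one location.

    Assumed convention: width = b * p * sqrt(r), height = h * p / sqrt(r).
    """
    dims = []
    for p in sizes:            # p in (0, 1], fraction of the image covered
        for r in ratios:       # r > 0, stretches width and shrinks height
            width = b * p * math.sqrt(r)
            height = h * p / math.sqrt(r)
            dims.append((width, height))
    return dims


for width, height in anchor_box_dims(640, 480, sizes=[0.25, 0.5], ratios=[1.0, 4.0]):
    print(f"{width:.0f} x {height:.0f}")
```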
3.2 FASTER RCNN
Proposed by Ren et al., Faster RCNN improves on the speed of RCNN and fast RCNN by
introducing the Region Proposal Network (RPN). The RPN, built on CNN backbones like
VGGNet or ResNet, generates region proposals, significantly improving processing
speed. The schematic architecture of Faster RCNN is shown in Fig. 3.
                               Figure 3 FASTER RCNN
3.3 SINGLE SHOT DETECTOR (SSD)
       In this research, the Single Shot Detector (SSD) [63] is also used as another object
detection method to detect people in real-time video surveillance systems. As
discussed earlier, Faster R-CNN relies on region proposals to create bounding boxes
around objects; it shows better accuracy but processes fewer frames per
second (FPS). For real-time processing, SSD further improves accuracy and FPS by
using multiscale features and default boxes in a single pass. It follows the principle
of the feed-forward convolutional network, which generates bounding boxes of fixed
sizes along with a score based on the presence of object class instances in those
boxes, followed by a non-maximum suppression (NMS) step to produce the final
detections. Thus, it consists of two steps, extracting feature maps and applying
convolution filters to detect objects, using an architecture with three main parts. The
first part is a base pretrained network to extract feature maps. In the second part,
multiscale feature layers are used, in which a series of convolution filters are cascaded
after the base network. The last part is a non-maximum suppression unit that eliminates
overlapping boxes so that only one box is kept per object. The architecture of SSD is
shown in Fig. 4.
               Figure 4 Schematic Representation Of SSD Architecture
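The non-maximum suppression step mentioned above can be sketched as a greedy procedure: boxes are visited in order of decreasing score, and a box is kept only if it does not overlap too much with a box that was already kept. This is a minimal sketch and not the SSD implementation; the box format (x1, y1, x2, y2), the function names, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept after greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda idx: scores[idx], reverse=True)
    keep = []
    for idx in order:
        # Keep the box only if it overlaps little with every box kept so far.
        if all(iou(boxes[idx], boxes[k]) <= iou_threshold for k in keep):
            keep.append(idx)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # the second box overlaps the first
```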
1) Loss function:
Similar to the above-discussed Faster RCNN model, the overall loss function of the
SSD model is equal to the sum of the multi-class classification loss (L_cls) and the
bounding box regression loss (localization loss, L_reg), as shown in Eq. 4, where L_reg
and L_cls are defined by Eq. 6 and 7:
where l is the predicted box, g is the ground truth box, x_ij^p is an indicator that
matches the i-th anchor box to the j-th ground truth box, and cx and cy are offsets to the
anchor box a.
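For reference, the loss terms described above can be written out following the original SSD formulation. This is a reconstruction under that assumption rather than a copy of the equations of this report; the number of matched boxes N, the weight α, the class scores c, and the encoded ground truth offsets ĝ are symbols taken from the SSD formulation and do not appear in the text above.

```latex
\[
\begin{aligned}
L(x, c, l, g) &= \frac{1}{N}\left( L_{cls}(x, c) + \alpha\, L_{reg}(x, l, g) \right) \\
L_{reg}(x, l, g) &= \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}}
    x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left( l_i^{m} - \hat{g}_j^{m} \right) \\
L_{cls}(x, c) &= -\sum_{i \in Pos}^{N} x_{ij}^{p} \log \hat{c}_i^{p}
    - \sum_{i \in Neg} \log \hat{c}_i^{0},
    \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}
\end{aligned}
\]
```

Here Pos and Neg denote the anchor boxes matched and not matched to a ground truth box, and w and h are the width and height offsets that accompany cx and cy.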
3.4 YOLO
       For object detection, another competitor of SSD is YOLO [40]. This method
can predict the type and location of an object by looking only once at the image.
YOLO treats object detection as a regression task instead of a classification task,
assigning class probabilities to the anchor boxes. A single convolutional
network simultaneously predicts multiple bounding boxes and class probabilities.
There are three major versions of YOLO: v1, v2, and v3. YOLO v1 is
inspired by GoogLeNet (Inception network), designed for object classification in an
image. This network consists of 24 convolutional layers and 2 fully connected layers.
Instead of the Inception modules used by GoogLeNet, YOLO v1 simply uses a
reduction layer followed by convolutional layers. Later, YOLO v2 [64] was proposed
with the objective of improving accuracy significantly while making it faster. YOLO
v2 uses Darknet-19 as a backbone network consisting of 19 convolutional layers along
with 5 max pooling layers and an output softmax layer for object classification.
YOLO v2 outperformed its predecessor (YOLO v1) with significant improvements in
mAP, FPS, and object classification score. In contrast, YOLO v3 performs multi-label
classification with the help of logistic classifiers instead of using softmax as in the
case of YOLO v1 and v2. In YOLO v3, Redmon et al. proposed Darknet-53 as a
backbone architecture that extracts feature maps for classification. In contrast to
Darknet-19, Darknet-53 consists of residual blocks (shortcut connections) along with
up-sampling layers for concatenation, adding depth to the network. YOLO v3
generates three predictions for each spatial location at different scales in an image,
which mitigates the difficulty of detecting small objects [77]. Each
prediction is monitored by computing objectness, bounding box regressor, and
classification scores. Fig. 5 presents a schematic description of the YOLO v3
architecture.
             Figure 5 Schematic representation of YOLO v3 architecture
1) Loss function:
The overall loss function of YOLO v3 consists of localization loss (bounding box
regressor), cross entropy, and confidence loss for classification score, defined as
follows:
where λ_coord indicates the weight of the coordinate error, S² indicates the number
of grid cells in the image, and B is the number of bounding boxes generated per grid cell.
1_{i,j}^{obj} = 1 indicates that the object is confined in the j-th bounding box of grid cell i.
3.5 DEEP SORT
       Deep Sort is a deep learning-based approach to track custom objects in a video
[78]. In the present research, Deep Sort is utilized to track individuals present in the
surveillance footage. It makes use of patterns learned via detected objects in the
images, which are later combined with temporal information for predicting associated
trajectories of the objects of interest. It keeps track of each object under consideration
by mapping unique identifiers for further statistical analysis. Deep Sort is also useful
to handle associated challenges such as occlusion, multiple viewpoints, non-stationary
cameras, and annotating training data. For effective tracking, the Kalman filter and
the Hungarian algorithm are used. The Kalman filter is recursively used for better
association, and it can predict future positions based on the current position. The
Hungarian algorithm is used for association and ID attribution that identifies if an
object in the current frame is the same as the one in the previous frame. Initially, a
Faster RCNN is trained for person identification, and for tracking, a linear constant
velocity model is utilized to describe each target with an eight-dimensional state space as
follows:
\[
x = [u, v, a, h, \dot{u}, \dot{v}, \dot{a}, \dot{h}]^{T}
\]
where (u, v) is the centroid of the bounding box, a is the aspect ratio, and h is the
height of the bounding box. The other variables are the respective velocities of these
variables.
Later, the standard Kalman filter is used with constant velocity motion and a linear
observation model, where the bounding box coordinates (u, v, a, h) are taken as direct
observations of the object state. For each track k, the total number of frames since the
last successful measurement association a_k is counted. The Hungarian
algorithm is then utilized to solve the assignment problem between the newly arrived
measurements and the predicted Kalman states by considering the motion and
appearance information. Motion information is incorporated through the Mahalanobis
distance between them, as defined in Eq. 10:
\[
d^{(1)}(i, j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i) \tag{10}
\]
where the projection of the i-th track distribution into measurement space is
represented by (y_i, S_i) and the j-th bounding box detection by d_j. The Mahalanobis
distance accounts for the estimation uncertainty by measuring how many standard
deviations the detection is away from the mean track location. Further, using this
metric, it is possible to exclude unlikely associations by thresholding the Mahalanobis
distance. This decision is denoted with an indicator that evaluates to 1 if the
association between the i-th track and the j-th detection is admissible (Eq. 11):
\[
b^{(1)}_{i,j} = \mathbb{1}\left[ d^{(1)}(i, j) \le t^{(1)} \right] \tag{11}
\]
where t^{(1)} is the gating threshold for the Mahalanobis distance.
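Although the Mahalanobis distance performs well, it fails in environments where
camera motion is possible; therefore, another metric is introduced for the assignment
problem. This second metric measures the smallest cosine distance between the i-th
track and the j-th detection in appearance space, as given in Eq. 12:
\[
d^{(2)}(i, j) = \min \left\{ 1 - r_j^{T} r_k^{(i)} \;\middle|\; r_k^{(i)} \in R_i \right\} \tag{12}
\]
where r_j is the appearance descriptor of the j-th detection and R_i is the set of
descriptors stored for the i-th track. Again, a binary variable is introduced to indicate
if an association is admissible according to this metric (Eq. 13):
\[
b^{(2)}_{i,j} = \mathbb{1}\left[ d^{(2)}(i, j) \le t^{(2)} \right] \tag{13}
\]
and a suitable threshold t^{(2)} is determined for this indicator on a separate training
dataset. To build the association problem, both metrics are combined using a weighted sum
(Eq. 14):
\[
c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j) \tag{14}
\]
where d^{(1)} is the Mahalanobis distance of Eq. 10 and d^{(2)} is the cosine distance of
Eq. 12. An association is admissible if it is within the gating region of both metrics
(Eq. 15):
\[
b_{i,j} = \prod_{m=1}^{2} b^{(m)}_{i,j} \tag{15}
\]
The influence of each metric on the combined association cost can be controlled
through the hyperparameter λ.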
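The association step described above can be sketched in a few lines. The sketch assumes that the Mahalanobis distances are already available as a matrix d1, that appearance descriptors are unit-normalized so that one minus the dot product is the cosine distance, and that there are at most as many tracks as detections. For brevity the minimum-cost assignment is found by brute-force enumeration, which is a stand-in for the Hungarian algorithm and only practical for a handful of objects. All function names and numeric values are hypothetical.

```python
from itertools import permutations


def cosine_distance(gallery, feature):
    """Smallest cosine distance between a detection and a track's stored descriptors."""
    return min(1.0 - sum(a * b for a, b in zip(g, feature)) for g in gallery)


def combined_cost(d1, galleries, features, lam, t1, t2):
    """Weighted cost c[i][j] and gate b[i][j] for track i and detection j."""
    cost, admissible = [], []
    for i, gallery in enumerate(galleries):
        cost_row, gate_row = [], []
        for j, feature in enumerate(features):
            d2 = cosine_distance(gallery, feature)
            cost_row.append(lam * d1[i][j] + (1.0 - lam) * d2)
            gate_row.append(d1[i][j] <= t1 and d2 <= t2)   # inside both gates
        cost.append(cost_row)
        admissible.append(gate_row)
    return cost, admissible


def assign(cost, admissible):
    """Brute-force minimum-cost matching; inadmissible pairs are dropped afterwards."""
    n_tracks, n_dets = len(cost), len(cost[0])
    if n_tracks > n_dets:
        raise ValueError("this sketch assumes n_tracks <= n_dets")
    best_total, best_perm = None, None
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return [(i, j) for i, j in enumerate(best_perm) if admissible[i][j]]


d1 = [[5.0, 1.0], [1.0, 5.0]]                      # assumed Mahalanobis distances
galleries = [[(1.0, 0.0)], [(0.0, 1.0)]]           # one stored descriptor per track
features = [(0.0, 1.0), (1.0, 0.0)]                # descriptors of new detections
cost, gate = combined_cost(d1, galleries, features, lam=0.5, t1=9.4877, t2=0.2)
print(assign(cost, gate))                          # pairs of (track, detection)
```

A production tracker would replace the enumeration with a polynomial-time solver; the gating and weighting logic stays the same.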
                                  CHAPTER-4
                        PROPOSED APPROACH
       The emergence of deep learning has brought the best performing techniques
for a wide variety of tasks and challenges, including medical diagnosis, machine
translation, speech recognition, and a lot more. Most of these tasks are centered
around object classification, detection, segmentation, tracking, and recognition. In
recent years, convolutional neural network (CNN) based architectures have shown
significant performance improvements that are leading towards the high quality of
object detection, as shown in Fig. 2, which presents the performance of such models
in terms of mAP and FPS on standard benchmark datasets, PASCAL-VOC and MS-COCO,
and similar hardware resources.
       In the present article, a deep learning-based framework is proposed that
utilizes object detection and tracking models to aid in the social distancing remedy for
dealing with the escalation of COVID-19 cases.
In order to maintain the balance of speed and accuracy, YOLO v3 alongside Deep Sort
is utilized for object detection and tracking, surrounding each
detected object with a bounding box. Later, these bounding boxes are utilized to
compute the pairwise L2 norm with computationally efficient vectorized
representation for identifying the clusters of people not obeying the order of social
distancing. Furthermore, to visualize the clusters in the live stream, each bounding
box is color-coded based on its association with the group where people belonging to
the same group are represented with the same color. Each surveillance frame is also
accompanied by the streamline plot depicting the statistical count of the number of
social groups and an index term (violation index) representing the ratio of the number
of people to the number of groups. Furthermore, estimated violations can be
computed by multiplying the violation index with the total number of social groups.
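The statistics described above reduce to a few arithmetic steps. The sketch below assumes that the groups have already been formed and are passed as lists of person IDs; the function name and the example IDs are hypothetical.

```python
def violation_stats(groups):
    """Return (number of groups, violation index, estimated violations).

    groups: one list of person IDs per detected social group.
    """
    n_groups = len(groups)
    if n_groups == 0:
        return 0, 0.0, 0.0
    n_people = sum(len(group) for group in groups)
    index = n_people / n_groups            # people per group
    return n_groups, index, index * n_groups


n_groups, index, violations = violation_stats([[1, 2, 3], [4, 5], [6, 7]])
print(n_groups, round(index, 2), round(violations))
```

For seven people in three groups the index is 7/3, which rounds to the value 2.33 reported for one of the sample frames.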
4.1 WORKFLOW
This section includes the necessary steps undertaken to compose a framework for
monitoring social distancing.
1. Fine-tune the pre-trained object detection model to identify and track persons in the
footage.
2. The trained model is fed with the surveillance footage. The model generates a set of
bounding boxes and an ID for each identified person.
3. Each individual is associated with three-dimensional feature space (x, y, d), where
(x, y) corresponds to the centroid coordinates of the bounding box, and d defines the
depth of the individual as observed from the camera.
\[ d = \frac{2 \cdot 3.14 \cdot 180}{w + h \cdot 360} \cdot 1000 + 3 \tag{16} \]
where w is the width of the bounding box and h is the height of the bounding box.
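4. For the set of bounding boxes, the pairwise L2 norm is computed as given by the
following equation,
\[ \lVert D \rVert_2 = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]
where p and q are the feature vectors of two individuals and, in this work, n = 3.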
5. The dense matrix of L2 norm is then utilized to assign the neighbors for each
individual that satisfies the closeness sensitivity. With extensive trials, the closeness
threshold is updated dynamically based on the spatial location of the person in a given
frame ranging between (90, 170) pixels.
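The steps above can be sketched as follows. The sketch computes the depth term of Eq. 16, the dense pairwise L2 matrix over (x, y, d) features, and groups people whose distance falls below a threshold. Two simplifications are assumptions of this sketch: a single fixed threshold is used instead of the dynamically updated one, and groups are formed as connected components of the closeness graph. Plain Python loops are used for clarity, whereas the text refers to a vectorized computation. All function names and sample coordinates are hypothetical.

```python
import math


def depth(w, h):
    """Depth term of Eq. 16 from bounding-box width w and height h (pixels)."""
    return (2 * 3.14 * 180) / (w + h * 360) * 1000 + 3


def pairwise_l2(points):
    """Dense matrix of Euclidean distances between feature vectors."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def group_people(points, threshold):
    """Connected components of the 'closer than threshold' graph, singletons dropped."""
    dist = pairwise_l2(points)
    n = len(points)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if v not in seen and dist[u][v] < threshold:
                    seen.add(v)
                    stack.append(v)
        if len(component) > 1:       # a lone person is not a violation
            groups.append(sorted(component))
    return groups


# (centroid x, centroid y, depth) for five tracked people; boxes are (w, h).
boxes = [(50, 100), (52, 104), (40, 90), (41, 92), (60, 120)]
centroids = [(100, 200), (150, 205), (400, 400), (430, 400), (900, 300)]
features = [(x, y, depth(w, h)) for (x, y), (w, h) in zip(centroids, boxes)]
print(group_people(features, threshold=90))
```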
                                 CHAPTER-5
                   EXPERIMENTS AND RESULTS
       The previously discussed object detection models are fine-tuned for binary
classification (person or not a person) with Inception v2 as a backbone network on the
Nvidia GTX 1060 GPU. The training is conducted using a dataset acquired from the
Open Images Dataset (OID) repository, maintained by the Google open source
community. Diverse images with the class label "Person" are obtained through the
OIDv4 toolkit, along with annotations.
   Figure 6 Data Samples Showing (A) True Samples And (B) False Samples Of A
                   "Person" Class From The Open Image Dataset.
 Figure 7 Losses Per Iteration of The Object Detection Models During The Training
     Phase On The OID Validation Set For Detecting The Person In An Image.
The dataset consists of 800 images, filtered manually to only contain true samples. It
is then divided into training and testing sets in an 8:2 ratio. To ensure testing
robustness, the testing set is accompanied by frames of surveillance footage from the
Oxford town center [23]. This footage is later used to simulate the overall approach
for monitoring social distancing.
For Faster R-CNN, images are resized so that the shorter edge is P pixels, with P = 600
and P = 1024 for low and high resolution, respectively. In SSD and YOLO, images are
scaled to the fixed dimension P × P, with P set to 416. Throughout the training phase,
model performance is continuously monitored using mAP, along with localization,
classification, and overall loss in detecting a person, as shown in Fig. 7.
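The shorter-edge resizing used for Faster R-CNN can be illustrated with a small helper that computes the target dimensions while preserving the aspect ratio. The function name and the example frame size are assumptions for illustration only.

```python
def resize_shorter_edge(width, height, p):
    """Target (width, height) so that the shorter edge equals p pixels."""
    scale = p / min(width, height)
    return round(width * scale), round(height * scale)


print(resize_shorter_edge(1920, 1080, 600))   # low-resolution setting
print(resize_shorter_edge(1920, 1080, 1024))  # high-resolution setting
```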
Table 3 summarizes the results of each model at the end of the training phase,
including training time (TT), number of iterations (NOI), mAP, and total loss (TL)
value. The Faster R-CNN model achieved minimal loss with maximum mAP but had
the lowest FPS, making it unsuitable for real-time applications. In comparison, YOLO
v3 achieved better results with a balanced mAP, training time, and FPS score. The
trained YOLO v3 model is then utilized for monitoring social distancing in the
surveillance video.
CHAPTER-6
EXECUTION
                                 CHAPTER-7
                                   OUTPUT
       The proposed framework outputs (as shown in Fig. 8) the processed frame
with the identified people confined in the bounding boxes while also simulating the
statistical analysis showing the total number of social groups displayed by the same
color encoding and a violation index term computed as the ratio of the number of
people to the number of groups. The frames shown in Fig. 8 display violation indices
as 3, 2, 2, and 2.33. The frames with detected violations are recorded with the
timestamp for future analysis.
     Figure 8 Sample Output of The Proposed Framework For Monitoring Social
             Distancing on Surveillance Footage of Oxford Town Center.
                                  CHAPTER-8
                FUTURE SCOPE AND CHALLENGES
         Given that this application is intended for use in any working environment,
accuracy and precision are crucial to fulfill its purpose. However, a higher number of
false positives may lead to discomfort and panic among the people being observed.
Genuine concerns about privacy and individual rights may also arise. These concerns
can be addressed through additional measures such as obtaining prior consent for the
use of such technology in working environments, implementing features to hide a
person's identity in general, and maintaining transparency about the fair and limited
use of the technology among stakeholders.
Challenges and Considerations:
1. False Positives: Balancing accuracy to avoid false positives is essential to prevent
unnecessary alarm or panic among individuals.
2. Privacy Concerns: Striking a balance between effective monitoring and respecting
privacy rights is critical. Implementing measures to anonymize individuals or seeking
consent can address these concerns.
3. Ethical Use: Ensuring the ethical use of the technology and transparent
communication about its purposes and limitations are vital for public acceptance and
trust.
4. Legislation and Regulation: Developing and adhering to regulations that govern the
use of such surveillance technologies is necessary to protect individual rights and
maintain accountability.
5. Continuous Improvement: The system should be subject to continuous
improvement based on feedback, technological advancements, and changes in societal
expectations.
By addressing these challenges and considering ethical implications, the future scope
of the application lies in creating a balance between effective monitoring for public
safety and respecting individual privacy rights.
                                  CHAPTER-9
                                CONCLUSION
       This article introduces an efficient real-time deep learning-based framework
designed to automate the monitoring of social distancing through object detection and
tracking approaches. In this framework, individuals are identified in real-time using
bounding boxes, which further facilitate the identification of clusters or groups of
people adhering to the closeness property, computed through a pairwise vectorized
approach. The number of violations is determined by computing the number of
formed groups and a violation index term, representing the ratio of the number of
people to the number of groups.
Extensive trials were conducted with popular state-of-the-art object detection models,
namely Faster R-CNN, SSD, and YOLO v3. Among these, YOLO v3 demonstrated
efficient performance with a balanced Frames Per Second (FPS) and Mean Average
Precision (mAP) score. Given the sensitivity of this approach to the spatial location of
the camera, further fine-tuning can be conducted to better adjust to the corresponding
field of view.
In conclusion, the proposed framework offers an effective solution for real-time
monitoring of social distancing, leveraging the capabilities of deep learning and
object detection technologies. The choice of YOLO v3, with its balanced performance
metrics, makes it a suitable candidate for practical applications. However, ongoing
efforts should focus on addressing challenges related to privacy, false positives, and
ethical considerations to ensure the responsible and effective deployment of such
technologies in various working environments.
                             CHAPTER-10
                              REFERENCES
1. World Health Organization, "WHO corona-viruses (COVID-19)," [Online;
   accessed May 02, 2020]. https://www.who.int/emergencies/diseases/novel-
   corona-virus-2019, 2020.
2. WHO, "WHO Director-General's opening remarks at the media briefing on
   COVID-19 - 11 March 2020." [Online; accessed March 12, 2020].
   https://www.who.int/dg/speeches/detail/, 2020.
3. L. Hensley, "Social distancing is out, physical distancing is in - here's how to
   do it," Global News–Canada (27 March 2020), 2020.
4. ECDPC, "Considerations relating to social distancing measures in response to
   COVID-19 second update." [Online; accessed March 23, 2020].
   https://www.ecdc.europa.eu/en/publications-data/considerations, 2020.
5. M. W. Fong, H. Gao, J. Y. Wong, J. Xiao, E. Y. Shiu, S. Ryu, and B. J.
   Cowling, "Nonpharmaceutical measures for pandemic influenza in non-
   healthcare settings - social distancing measures," 2020.
6. F. Ahmed, N. Zviedrite, and A. Uzicanin, "Effectiveness of workplace social
   distancing measures in reducing influenza transmission: a systematic review,"
   BMC public health, vol. 18, no. 1, p. 518, 2018.
7. W. O. Kermack and A. G. McKendrick, "Contributions to the mathematical
   theory of epidemics–i. 1927," 1991.
8. C. Eksin, K. Paarporn, and J. S. Weitz, "Systematic biases in disease
   forecasting–the role of behavior change," Epidemics, vol. 27, pp. 96–105,
   2019.