Detection of Violence: Intelligent Video Surveillance in Real-Time
Using Machine Learning, Deep Learning, Computer Vision and
Transmission in Real-Time to the Closest Local Authority.
Md. Samiur Rahman, Md. Towfiqul Islam Mozumder, Md. Eklasur Rahman, Shahariar Rashid Fahim
Emails: samiur.rahman09@northsouth.edu (Md. Samiur Rahman)
towfiqul.mozumder@northsouth.edu (Md. Towfiqul Islam Mozumder)
eklasur.rahman@northsouth.edu (Md. Eklasur Rahman)
shahariar.fahim@northsouth.edu (Shahariar Rashid Fahim)
Violence is one of the most alarming activities of recent times. No society or civilization in the contemporary world is free of violence. If such irregular activities are flagged in real time, it would at least act as a stepping stone toward minimizing violence in any civilized population. Out of altruism, we set out to identify violent videos and different categories of violence (street fighting, hand-to-hand combat, brawl, aggression, anarchy, etc.) instantaneously, and to dispatch the flagged video to the nearest local authority. A model has been built to detect violent videos; for that purpose, ResNet-50 was trained on the dataset. Around 93% accuracy was achieved in the testing phase. In the future, other well-known algorithms, namely SSD (Single Shot Detector), YOLO (You Only Look Once) and Faster R-CNN (Faster Region-based Convolutional Neural Network), will be used to build a model that can identify violent outbreaks in crowds. The following sections discuss the overall procedure in detail.
I. INTRODUCTION
It is quite arduous to monitor an environment in practice, even though video surveillance equipment can be deployed cheaply. The value of such surveillance is, nevertheless, often questioned [1]. Video surveillance systems are frequently considered futile because too few skilled operators observe the footage and because human attention has natural limits [2]. This is understandable considering the enormous number of cameras that need supervision, the monotonous nature of the footage, and the vigilance required to identify an event and provide a swift response. Indeed, even the apparently simpler task of searching recorded videos off-line for events that are known to have happened requires the aid of Computer Vision methods for video retrieval (e.g. [3]) and summarization [4].
Here, we concentrate on identifying outbreaks of crowd violence as they take place, in footage captured by surveillance video cameras. Such videos typically lack audio tracks, and descriptions and other contextual sources of information are, of course, non-existent. The footage is frequently far below motion-picture quality, so color signals are not dependable, and neither is the information required for fine-scale action detection. The videos depict crowds, and since it is not known from the video who will take part in the violence, identifying the violent activity is often challenging. This may not only diminish the effectiveness of operators observing the videos over long periods of time, it can also flood a Computer Vision system with large amounts of information, making methods that rely on interest points too time-consuming. Fig. 1 exemplifies violent crowd behavior.
We forgo high-level contour and motion analysis [5] and intensive processing [6], instead following the example of methods for dynamic texture recognition [7], namely gathering statistics of densely sampled, low-level features, because the system is designed to be capable of functioning in real time. To detect violence in crowded scenes, accuracy is obtained without compromising processing speed by considering how flow-vector magnitudes change through time. We accumulate this information over short frame sequences in a representation which we call the Violent Flows (ViF) descriptor. ViF descriptors are then efficiently labeled as violent or non-violent using a standard linear Support Vector Machine (SVM).

II. RELATED RESEARCH WORK
In this work, the authors introduced CCTV-Fights, a new and challenging dataset of fights in real-world surveillance scenarios, and then proposed a pipeline for detection and localization. Their findings showed that using explicit motion information (e.g. optical flow) is considerably better than RGB-only methods. Data coming from non-CCTV fight footage can also be leveraged by a two-tier model to generalize better to the CCTV source. Possible future directions include improving the spatial features, which did not complement the temporal information with beneficial results. Another interesting direction is to make better use of sequence information at the prediction stage, since the LSTM (Long Short-Term Memory) was not able to do so. In addition, early detection techniques can be designed for this scenario, given the importance of rapidly identifying the beginning of a fight [8].
In this work, a new real-time video monitoring scheme was created, mainly for fall detection. The authors argue that the proposed scheme is not only an ordinary fall-tracking scheme; it has many relevant functions and can be used in different monitoring schemes. Furthermore,
while current fall-tracking technologies detect only the falling behavior, the proposed scheme can identify the type of fall (forward, backward or sideways). Because of the temporal changes in the human shape, they use as a feature vector the combination of an approximated ellipse around the human body, horizontal and vertical projection histograms, and temporal changes in the head position. Furthermore, their studies show that an MLP neural network is well suited to recognizing human motion. Reliable average recognition rates (91.12%) underline the system's adequate performance and effectiveness [9].
Fig. 1. Violent Crowd Behavior
This research addressed intelligent video surveillance systems, centered on moving-image segmentation, with a view to achieving a more accurate and practical algorithm for object detection in such systems. The paper focuses on HOG technology and reports some findings on target detection for intelligent video-monitoring systems [10].
They proposed an extension of Improved Fisher Vectors (IFV) to detect violence in videos, which allows a clip to be represented using both local features and their spatio-temporal positions. In comparison with IFV with spatio-temporal grids, the proposed extension demonstrated greater or comparable precision (and a more robust representation). In addition, their approach considerably outperformed existing methods on three violence recognition datasets. They then studied the popular sliding-window approach for violence detection. The IFV was reformulated, and the summed-area table data structure was used to considerably accelerate the violence detection framework. The evaluations were conducted on 4 state-of-the-art datasets [11].
In this work, the authors make several contributions to the monitoring of such events. They define a new method for the effective detection of crowd violence. They collect a challenging dataset of relevant clips, along with standard benchmarks, to test their scheme as well as existing and future techniques. Finally, they demonstrate the performance of both their own and existing techniques on their own benchmark and on other benchmarks. Interestingly, ViF outperforms existing techniques by relying only on the magnitudes of the optical-flow fields, despite the fact that flow-field-based methods had been developed in the past [12].
In this research, the authors tested ViF with different optical flow techniques, namely IRLS, Horn-Schunck and Lucas-Kanade, and assessed its performance on various datasets and with image sub-sampling. The assessment showed that ViF with its original optical flow algorithm (IRLS) gave better precision overall, but that ViF with Horn-Schunck performed better on the Hockey dataset. One cannot say which sub-sampled
images are best to use, because this depends strongly on the dataset and the optical flow method used. In terms of the computational cost of the optical flow algorithms, Horn-Schunck was the best performer, processing two images in 0.25 seconds, as opposed to 16.95 and 7.80 seconds for Lucas-Kanade and IRLS. The use of ViF with Horn-Schunck is therefore very appropriate because of its low computational cost and good results on some datasets such as Hockey, which enables its use in real time [13].
This article presents an intelligent video analysis technique for detecting abnormalities in elevator cages using computer vision. A three-dimensional feature matrix is obtained through codebook construction, background subtraction and morphological processing. After clustering with K-Means and training with Baum-Welch, an HMM describing normal behavior can be obtained in 4~5 seconds. Anomalies can then be identified merely by examining the log-likelihood output. Experimental findings show that abnormal behavior in the elevator cage can be efficiently detected and recognized by behavior recognition and HMM-based identification. Their future work will concentrate on making the technique self-adaptive through unsupervised learning methods that automatically obtain the normal model of activities [14].
This article describes an online method for maintaining a temporal decomposition of a video sequence, which highlights changes at different time scales. The authors demonstrate that this decomposition provides new tools for building visualizations and provides contextual information for the evaluation of surveillance imagery. This contextual information will become more and more essential as individual security staff take charge of vast numbers of monitors. Their system includes several parameters; finding behaviors that are comparatively robust to small changes in these parameters is a key next step. Future work will also include richer models (including the distribution rather than the value of the context) and extending the concept to maintain context models at various time scales [15].
In this paper, they propose SurFi, a real-time scheme for detecting looping attacks on surveillance cameras. SurFi uses existing WiFi infrastructure (with no extra hardware or deployment expenses) to obtain Channel State Information (CSI), and it processes and correlates the video and CSI signals to identify any discrepancies. SurFi improves detection confidence because the two heterogeneous sensing modalities perceive and correlate more events. Their SurFi prototype achieves 98.8% precision and a 0.1% false-positive rate in attack identification [16].
In this article, they propose a way to easily manipulate and display surveillance video. The region of interest (ROI) is zoomed in on and tracked locally. Their research combines ideas and techniques from computer vision, computer graphics and perceptual psychology. First, tracking technology detects the ROI. The texture-mapping mesh is then deformed according to the tracking results. In this way, video frames are shown in a distorted way, as opposed to the traditional (flat) way, so that the ROI is zoomed in and attracts more attention. This method can help the operator to identify and follow the ROI efficiently. Through examples and discussions, they show that their proposed technique is efficient and effective [17].
They proposed an intelligent visual assistant on a handheld device for remote surveillance. To fit a small screen and save bandwidth, a high-resolution video is delivered at a lower resolution. Regions of interest in the video are first identified and then zoomed into one by one. The entire scene is periodically shown in overview to provide the operator with the necessary context. The results of their user study show that the proposed method covers the scene in full while also showing the details required for understanding the activities in it. Users generally noted that the intelligent assistant is useful for monitoring a situation remotely [18].
In this article, they proposed a background subtraction technique for video surveillance, observing that simple background modelling reduces the precision of foreground identification, while complex modelling methods reduce execution speed. The proposed technique is only applicable to static-background scenes, but future research will address dynamic-background clips. They created a simple, fast procedure for updating the background model while precisely identifying the foreground and separating noise, including the cloud area. The model achieves both fast execution time and high precision of foreground detection. In addition, the proposed technique was assessed on a new benchmark dataset to demonstrate robustness against large variations in lighting [19].
This paper introduces a solution for real-time human action recognition based on luminance field trajectory analysis and learning. The simulation results demonstrated that the proposed solution achieves effective recognition precision [20].
The ViF Representation: The Violent Flows (ViF) descriptor is computed from a sequence of frames, $S$, by first estimating the optical flow between pairs of successive frames. This yields for each pixel $p_{x,y,t}$, where $t$ is the frame index, a flow vector $(u_{x,y,t}, v_{x,y,t})$ matching it to a pixel in the next frame $t+1$. Here we consider only the magnitudes of these vectors:
$$ m_{x,y,t} = \sqrt{(u_{x,y,t})^2 + (v_{x,y,t})^2} $$
Doing so resembles, in some sense, some early action recognition techniques which also relied on flow-vector magnitudes for processing actions.
Our own work nevertheless differs subtly from those earlier approaches. Unlike the aforementioned methods, we do not consider the magnitudes themselves, but rather how they change over time.
Our reasoning is that although flow vectors carry meaningful temporal information, their magnitudes are most often arbitrary quantities: they depend on frame resolution, different motions at different spatio-temporal locations, etc. By comparing magnitudes, we obtain meaningful measures of the significance of the motion magnitudes observed in each frame compared to its predecessor. This is fairly related to the self-similarity descriptor of [31] and
its extension to action recognition using the LTP descriptors [32]. Unlike them, however, we consider similarities of flow magnitudes over time, rather than local appearances.
Specifically, for each pixel in each frame we obtain a binary indicator $b_{x,y,t}$ reflecting the significance of the change of magnitude between frames:
$$ b_{x,y,t} = \begin{cases} 1 & \text{if } |m_{x,y,t} - m_{x,y,t-1}| \ge \theta \\ 0 & \text{otherwise} \end{cases} $$
where $\theta$ is a threshold adaptively set in each frame to the average value of $|m_{x,y,t} - m_{x,y,t-1}|$. This yields a binary, magnitude-change significance map $b_t$ for each frame $f_t$. We next calculate an average magnitude-change map by simply averaging these binary values, for each pixel, over all the frames $f_t \in S$:
$$ \bar{b}_{x,y} = \frac{1}{T} \sum_{t} b_{x,y,t} $$
Put simply, the ViF descriptor is a vector of frequencies of quantized values $\bar{b}_{x,y}$. If the crowd motion patterns were indeed spatially stationary, this might suffice. In practice, however, we found that different spatial regions have different characteristic behaviors. The ViF descriptor is therefore formed by dividing $\bar{b}$ into $M \times N$ non-overlapping cells and collecting magnitude-change frequencies in each cell separately. The distribution of magnitude changes in each such cell is represented by a fixed-size histogram. These histograms are then concatenated into a single descriptor vector.

III. METHODOLOGY
Task: Our main objective is video classification of different distinguished events; the goal is to detect violence using video classification. Because of the lack of sufficient datasets on violence, we first built an experimental machine learning model using a sports classification dataset involving three events: football, weight lifting and tennis. Although violence datasets are scarce, we found one dataset of 246 videos containing both violent and non-violent footage, and we used it to classify violent crowd behavior.
We compared the ViF descriptor with other feature descriptors, namely Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Local Temporal Pattern (LTP). The objective is to identify the violence. Methods are required to process the videos as sequences of frames. Once testing was done and the results obtained, we report the percentage of videos in which violence was correctly detected for growing time delays from the violent outbreak. We compared the different methods by their accuracy versus the time they require to identify the violence.
Data Collection: We collected our data from GitHub; it contains images of different kinds of sports, such as football, cricket, tennis, weight lifting and basketball. The dataset we acquired was completely labeled and preprocessed, so we were relieved from data preprocessing.
Analyzing Process: We created a machine learning model based on a special kind of Convolutional Neural Network (CNN), namely ResNet-50.
All the aforementioned tasks for creating the experimental model were completed with the help of the ResNet-50 algorithm. Every sample was passed in grayscale only, at 224*224 pixels. Due to computational insufficiency (more specifically, the lack of high-power computational resources), we fixed our batch size to 32. We used 50 epochs for best fitting of the model.
Accuracy
Tools and Materials: To build our training model we used Python, Anaconda, OpenCV, and the built-in ResNet-50 model from TensorFlow's Keras library. We used Visual Studio as our Integrated Development Environment (IDE).

IV. WHY WE HAVE USED RESNET-50
ResNet is short for Residual Network. Deep Convolutional Neural Networks have led to a series of breakthroughs in image classification. Many other visual recognition tasks have also greatly benefited from very deep models. So, over the years there has been a trend to go deeper, to solve more complex tasks and to improve classification and recognition accuracy. But as networks go deeper, training becomes difficult, and the accuracy starts saturating and then eventually declines. Residual learning has gained significant popularity for solving both of these problems.

V. WORK THAT HAS BEEN DONE UNTIL NOW
Our main objective is to detect violence in real time and to transfer the detected violent video to the closest local authority in real time to minimize the damage. To achieve this goal, we worked on collecting datasets of violent crowd outbreaks, but because such datasets are insufficient, we moved on to build some experimental deep learning models involving different sports such as weight lifting, football and basketball, simply to refine our approach and techniques for building a formidable model that detects distinguished activities and features, be it sports or violence. In the methodology section, we discussed those experimental models in detail, including the training of the datasets, testing, loss functions and the accuracy obtained. Although violence datasets are scarce, we found one dataset of 246 videos containing both violent and non-violent footage, and we used it to classify violent crowd behavior. We compared the ViF descriptor with other feature descriptors, namely Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Local Temporal Pattern (LTP). This is discussed in detail in the methodology section of this paper.

VI. FUTURE WORKS TO BE DONE
SSD (Single Shot Detector) needs only a single shot to detect multiple objects within an image, while Region Proposal Network (RPN) based approaches such as the
TABLE I: CLASSIFICATION OUTCOME ON THE VIOLENCE DATASET, MEAN OVER 4-FOLD CROSS-VALIDATION.
Method    Accuracy (± SD)
LTP       74.39 ± 0.15 %
HOG       59.36 ± 0.33 %
HOF       59.69 ± 0.31 %
ViF       85.30 ± 0.21 %
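Table I reports each accuracy as a mean over four cross-validation folds with a standard deviation. The sketch below shows one way such a summary could be produced; it is an illustration under stated assumptions, not the evaluation code behind the table. The helper names `k_fold_indices` and `summarize_folds` are hypothetical, the folds are assumed to be contiguous and unshuffled, the sample (n-1) standard deviation is assumed, and the per-fold accuracies are made-up placeholder values.

```python
from statistics import mean, stdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds.

    Assumption: no shuffling or stratification; a real evaluation would
    normally shuffle and balance violent / non-violent videos per fold.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def summarize_folds(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    if len(fold_accuracies) < 2:
        raise ValueError("need at least two folds for a standard deviation")
    return mean(fold_accuracies), stdev(fold_accuracies)


# 246 videos split into 4 folds, as in the dataset described in the text.
folds = list(k_fold_indices(246, 4))

# Placeholder per-fold accuracies in percent (NOT the paper's results).
example_accuracies = [85.0, 85.5, 85.2, 85.5]
avg, sd = summarize_folds(example_accuracies)
print(f"{avg:.2f} ± {sd:.2f} %")
```

In use, a classifier would be trained on `train_idx` and scored on `test_idx` for each fold, and the four scores passed to `summarize_folds`.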
Fig. 2. Procedure of ResNet-50 Algorithm
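The idea behind residual learning, on which ResNet-50 is built, can be illustrated with a very small NumPy sketch. This is a simplification under stated assumptions: dense matrix products stand in for the convolutions and batch normalization of a real ResNet-50 bottleneck block, and the function name `residual_block` and the weight shapes are hypothetical. It only shows the shortcut connection, where the block outputs `relu(F(x) + x)` rather than `F(x)` alone.

```python
import numpy as np


def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Return relu(F(x) + x) with F(x) = relu(x @ w1) @ w2.

    The "+ x" term is the identity shortcut: the layers only need to learn
    the residual F(x), so a block can fall back to the identity mapping.
    """
    fx = relu(x @ w1) @ w2
    return relu(fx + x)


rng = np.random.default_rng(0)
# Non-negative activations, as they would be after a preceding ReLU.
x = relu(rng.normal(size=(4, 8)))

# With all-zero weights F(x) = 0, so the block passes its input through.
w_zero = np.zeros((8, 8))
y_identity = residual_block(x, w_zero, w_zero)

# With random weights the output keeps the input shape.
y_random = residual_block(x, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```

The zero-weight case shows why adding more such blocks need not hurt accuracy: a block that learns nothing still passes its input through unchanged.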
Fig. 3. Training Loss and Accuracy on Datasets.
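As a concrete illustration of the ViF representation described earlier (binary magnitude-change maps, averaged over time and then histogrammed per cell), the following NumPy sketch computes a ViF-style descriptor. It is a minimal sketch under stated assumptions rather than the implementation evaluated in this paper: the flow magnitudes are assumed to be precomputed by some optical flow method, the function name `vif_descriptor` and the defaults `grid=(4, 4)` and `bins=20` are hypothetical, and the average is taken over the T-1 consecutive frame pairs available in the input.

```python
import numpy as np


def vif_descriptor(magnitudes, grid=(4, 4), bins=20):
    """Compute a ViF-style descriptor from optical-flow magnitudes.

    magnitudes: array of shape (T, H, W) holding m_{x,y,t} for T frames.
    grid: (M, N) number of non-overlapping cells (assumed default).
    bins: histogram size per cell (assumed default).
    """
    m = np.asarray(magnitudes, dtype=float)
    if m.ndim != 3 or m.shape[0] < 2:
        raise ValueError("expected magnitudes of shape (T, H, W) with T >= 2")
    if m.shape[1] < grid[0] or m.shape[2] < grid[1]:
        raise ValueError("grid has more cells than pixels along an axis")

    # |m_t - m_{t-1}| for every pair of consecutive frames.
    change = np.abs(m[1:] - m[:-1])
    # Threshold theta: per-frame average magnitude change.
    theta = change.mean(axis=(1, 2), keepdims=True)
    # Binary significance maps b_t, then their average over time.
    b = (change >= theta).astype(float)
    b_bar = b.mean(axis=0)

    # Split the average map into M x N cells and histogram each cell.
    rows = np.array_split(np.arange(b_bar.shape[0]), grid[0])
    cols = np.array_split(np.arange(b_bar.shape[1]), grid[1])
    histograms = []
    for r in rows:
        for c in cols:
            cell = b_bar[np.ix_(r, c)]
            counts, _ = np.histogram(cell, bins=bins, range=(0.0, 1.0))
            histograms.append(counts / cell.size)  # frequencies per cell
    return np.concatenate(histograms)


# Usage with random stand-in magnitudes (6 frames of 16 x 16 pixels).
rng = np.random.default_rng(0)
demo_magnitudes = rng.random((6, 16, 16))
descriptor = vif_descriptor(demo_magnitudes)
```

The resulting vector has `M * N * bins` entries and could then be passed to a linear SVM, as the text describes.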
R-CNN series need two shots, one for generating region proposals and the other for detecting the object in each proposal. Hence, SSD is much faster than two-shot RPN-based approaches. We will concentrate on using SSD to develop a model capable of identifying multiple features (violent outbreaks, brawls, combat, striking someone with a short heavy blow, etc.) in real time, and there will be a dedicated mechanism for transferring the identified violent footage to the corresponding closest local authority (nearest local police station, nearest law enforcement agencies). If the relevant datasets prove insufficient, we will build our own. Even though creating a complete, workable dataset would require painstaking effort, we are determined to undertake the challenge to fulfill the objective.

VII. CONCLUSION
Since violence is rapidly increasing and is one of the fundamental causes of instability in a society, it needs to be flagged and stopped in real time. If violent activities can be flagged in real time, it may make the difference between life and death. In spite of the high importance of this task, it has not received much attention in the past. To help uproot violence from society, we put forward a solution that identifies violence in real time and directs the flagged footage to the nearest local authority in real time, so that injury can be substantially lessened. The dataset collection phase was the most painful part of the work. After the training phase, when we moved on to testing, we observed that ViF outperforms some of the existing state-of-the-art techniques by relying on the magnitudes of the optical-flow fields alone.

REFERENCES
[1] R. Akers and C. Sellers, 2008.
[2] H. ., 2009, PhD thesis.
[3] N. Petrovic, N. Jojic, and T. Huang, “Adaptive video fast forward,” Multimedia Tools and Applications, vol. 26, no. 3, pp. 327–344, 2005.
[4] Y. Pritch, S. Ratovitch, A. Hendel, and S. Peleg, “Clustered synopsis of surveillance video,” Advanced Video and Signal Based Surveillance, pp. 195–200, 2009.
[5] M. Abdelkader, W. Abd-Almageed, A. Srivastava, and R. Chellappa, “Silhouette-based gesture and action recognition via modeling trajectories on riemannian shape manifolds,” CVIU, vol. 115, no. 3, pp. 439–455, 2011.
[6] O. Kliper-Gross, T. Hassner, and L. Wolf, pp. 31–45, 2011.
[7] V. Kellokumpu, G. Zhao, and M. Pietikainen, “Human activity recognition using a dynamic texture based method,” BMVC, pp. 1–10, 2008.
[8] M. Perez, A. C. Kot, and A. Rocha, “Detection of Real-world Fights in Surveillance Videos,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[9] H. Foroughi, B. S. Aski, and H. Pourreza, “Intelligent video surveillance for monitoring fall detection of elderly in home environments,” 11th International Conference on Computer and Information Technology, 2008.
[10] G. Sreenu and M. A. S. Durai, “Intelligent video surveillance: a review through deep learning techniques for crowd analysis,” Journal of Big Data, vol. 6, no. 1, 2019.
[11] Y. Chen, “Study of moving object detection in intelligent video surveillance system,” 2010 2nd International Conference on Computer Engineering and Technology, 2010.
[12] P. Bilinski and F. Bremond, “Human violence recognition and detection in surveillance videos,” 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2016.
[13] V. M. Arceda, J. G. A. Errez, and K. F. ., “Real Time Violence Detection in Video,” International Conference on Pattern Recognition Systems (ICPRS-16), 2016.
[14] T. Hassner, Y. Itcher, and O. Kliper-Gross, “Violent flows: Real-time detection of violent crowd behavior,” 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012.
[15] Y. P. Tang, X. J. Wang, and H. F. Lu, “Intelligent Video Analysis Technology for Elevator Cage Abnormality Detection in Computer Vision,” Fourth International Conference on Computer Sciences and Convergence Information Technology, 2009.
[16] N. Jacobs and R. Pless, “Real-time constant memory visual summaries for surveillance,” Proceedings of the 4th ACM international workshop on Video surveillance and sensor networks - VSSN 06, 2006.
[17] N. Lakshmanan, I. Bang, M. S. Kang, J. Han, J. T. Lee, and Surfi, 2019.
[18] G. Wang, T. T. Wong, and P. A. Heng, “Real-time surveillance video display with salience,” Proceedings of the third ACM international workshop on Video surveillance & sensor networks - VSSN 05, 2005.
[19] H. Kuang, B. Guthier, M. Saini, D. Mahapatra, and A. E. Saddik, “A Real-Time Smart Assistant for Video Surveillance Through Handheld Devices,” Proceedings of the ACM International Conference on Multimedia - MM 14, 2014.
[20] S. Hwang, Y. Uh, M. Ki, K. Lim, D. Park, and H. Byun, “Real-time background subtraction based on GPGPU for high-resolution video surveillance,” Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication - IMCOM 17, 2017.