Violence Detection in Real Videos using Deep Learning Techniques
JULY, 2025
B.Tech(H)
by
K C NARENDRA
School of Computer Science and Engineering
RV University
RV Vidyaniketan, 8th Mile, Mysore Road, Bengaluru, Karnataka, India - 560059
CERTIFICATE
This is to certify that the project work titled “Violence Detection in Real Videos using Deep Learning
Techniques” is performed by Harish P C (1RVU23CSE181), Shiven Yadav S (1RVU23CSE433), Shiven Yadav
S (1RVU23CSE456), bonafide students of B.Tech(H) at the School of Computer Science and Engineering, RV
University, Bengaluru, in partial fulfillment for the award of the degree B.Tech(H) in Computer Science & Engineering,
during the academic year 2025-2026.
DECLARATION
I hereby declare that the thesis entitled “Violence Detection in Real Videos using Deep
Learning Techniques”, submitted by me for the award of the degree of B.Tech(H) at RV University,
is a record of bonafide work carried out by me under the supervision of K C Narendra.
I further declare that the work reported in this thesis has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.
Place: Bangalore
ABSTRACT
In an era where video surveillance is ubiquitous, the manual monitoring of extensive footage for violent
incidents is impractical and inefficient. This project presents a system designed to automatically classify
real-life video segments as either "violent" or "non-violent" using machine learning techniques. The
system leverages spatial and temporal features extracted from video frames to distinguish between
aggressive actions and normal activities. This project aims to enhance public safety by providing an
automated solution for early detection and alerting in various environments.
ACKNOWLEDGEMENT
It is my pleasure to express my deep sense of gratitude to Prof. K C Narendra, Professor,
School of Computer Science and Engineering, RV University, for his constant guidance,
continual encouragement, and understanding; more than all, he taught me patience in my endeavor.
My association with him is not confined to academics only; it has also been a great opportunity on my
part to work with an intellectual and an expert in the field of ML models and deep learning
techniques.
I would like to express my gratitude to Dr Dwarika Prasad Uniyal, in-charge Vice Chancellor, RV
University, and Dr Shobha G, Dean, SoCSE, RV University, for providing an environment
to work in and for their inspiration during the tenure of the course.
I would like to extend my sincere thanks to my mentor Prof. Harish K R, Professor, School of
Computer Science and Engineering, RV University, for their unwavering support, thoughtful
guidance, and motivational presence throughout the course of this project. Their mentorship not
only enriched my learning but also inspired me to approach challenges with confidence and
clarity. Working with them has been a meaningful experience, and I am truly grateful
for the valuable time and insights shared with me during this journey.
It is indeed a pleasure to thank my friends who persuaded and encouraged me to take up and
complete this task. Last but not least, I express my gratitude and appreciation to all those who
have helped me directly or indirectly toward the successful completion of this project.
CONTENTS
TITLE PAGE
CERTIFICATE
DECLARATION
ABSTRACT
ACKNOWLEDGEMENT
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
1.2 OVERVIEW
1.3 PROJECT STATEMENT
1.4 OBJECTIVE
CHAPTER 2: BACKGROUND
2.1 LITERATURE SURVEY
2.2 REQUIREMENTS
2.3 FUNCTIONAL REQUIREMENTS
CHAPTER 3: METHODOLOGY
CHAPTER 4: IMPLEMENTATION
CHAPTER 5: RESULTS
CHAPTER 6: INFERENCES
CHAPTER 7: IMPRESSION REPORT
7.1 LEARNING EXPERIENCE
7.2 SKILLS ACQUIRED
7.3 INDUSTRY/ACADEMIC RELEVANCE
CHAPTER 8: CONCLUSION & FUTURE WORK
REFERENCES
Chapter 1: Introduction
1.1 INTRODUCTION
Social media companies and public safety organizations are now very concerned about the proliferation of
violent videos on the internet due to the quick development of digital platforms and user-generated
content. Given the enormous number of videos submitted every day, manually policing such content is not
only time-consuming but also ineffective. As a result, automated, intelligent systems that can swiftly and
precisely identify violent content are now required.
Using deep learning and computer vision techniques, this project aims to create a machine learning-based solution
for identifying violence in video. EfficientNetB0 is used to extract spatial characteristics from individual video
frames, while frame-wise classification is used to capture the patterns that indicate aggressive behavior
over time.
The goal is to create a strong, real-time system that can help with public safety, content moderation, and
surveillance by automatically finding and flagging violent video material. This will cut down on human
error and speed up response times.
1.2 OVERVIEW
This project's main goal is to develop an automated system that uses machine learning and deep learning
techniques to identify violent material in video streams. By correctly identifying videos as violent or non-
violent, the system seeks to lessen the amount of human labor required for video content monitoring and
moderation. To learn how to distinguish between violent and non-violent videos, the models are trained
using a labeled dataset.
To increase effectiveness and real-time speed, various architectures, such as EfficientNet and custom CNN
architectures, are also investigated. For data preprocessing, interface development, and model implementation, the project makes
use of tools such as OpenCV and TensorFlow/Keras. Performance indicators including accuracy, precision, recall,
and F1-score are used to evaluate the model.
Faster response times and safer online environments are made possible by the prospective uses of this
technology in content moderation, surveillance, public safety, and law enforcement.
Developing an accurate and reliable violence detection system presents several significant challenges:
● Real-Time Processing: The system must process video streams in real-time to be effective for
surveillance applications, requiring efficient algorithms and computational resources.
● Variability of Violence: Violent actions can manifest in numerous ways, varying in intensity,
speed, and context. Differentiating subtle violent cues from normal, vigorous activities (e.g.,
sports, dancing) is complex.
● Environmental Factors: Lighting conditions, camera angles, occlusions, and crowded
environments can severely impact the quality of video data and the accuracy of detection.
● Dataset Limitations: Publicly available datasets for violence detection often lack diversity in
terms of scenarios, subject demographics, and environmental conditions, which can limit the
generalizability of trained models.
● False Positives/Negatives: A high rate of false positives can lead to alert fatigue, while false
negatives can result in missed incidents, both undermining the system's utility.
1.3 PROJECT STATEMENT
To develop an automated system that accurately detects violent content in videos by leveraging deep
learning techniques, specifically EfficientNetB0 for spatial feature extraction combined with frame-wise
classification for temporal analysis, with the goal of enabling real-time content moderation and enhancing
public safety.
1.4 OBJECTIVE
Designing and implementing a deep learning-based system that can automatically identify violent content in videos
is the aim of this project. The system's goal is to correctly classify videos as violent or non-violent by using
EfficientNetB0 for spatial feature extraction and frame-wise classification for temporal pattern recognition. This
will speed up content moderation, enhance public safety protocols, and lessen the need for manual video
monitoring.
CHAPTER 2
BACKGROUND
2.1 LITERATURE SURVEY
1. Human Violence Detection Using Deep Learning Techniques
Arun Akash S. A. et al., 2022 – Journal of Physics: Conference Series
The research paper “Human Violence Detection Using Deep Learning Techniques” by Arun Akash S. A.
et al. (2022) presents a novel approach for detecting violent behavior in real-time video streams using a
combination of deep learning and object detection models. The primary objective of the study is to
enable automated detection of human violence by analyzing visual features and identifying critical
objects (such as weapons or aggressive actions) through computer vision.
The system integrates two major models:
● Inception V3 is used for classifying individual video frames as violent or non-violent based on
extracted spatial features.
● YOLOv5, a state-of-the-art real-time object detection algorithm, is employed to identify key
elements such as humans and weapons that are commonly involved in violent scenarios.
For training and evaluation, the authors compiled a dataset of over 10,000 images sourced from real-
world videos, movie scenes, and CCTV footage. The videos were preprocessed by converting them into
individual frames, followed by manual object labeling using XML files converted to YOLO’s TXT
format. This ensured accurate bounding box annotations for supervised learning.
The system demonstrated a violence detection accuracy of 74%, which is a notable result given the
complexity of interpreting aggressive human behavior in varying environments. The model’s output was
also integrated into a live detection system using FastAPI for backend processing and a simple
HTML/CSS frontend for visualization, enabling real-time deployment and practical usability.
The key contribution of the paper lies in the hybrid model design, where the power of CNN-based
classification (Inception V3) is combined with YOLOv5’s real-time object detection. This dual-model
architecture enhances both recognition precision and contextual understanding of violent scenes.
This study significantly influenced the current project by validating the effectiveness of combining spatial analysis
(image classification) with object-level understanding (detection), and inspired the use of EfficientNetB0 + frame-
wise classification in our work to capture both spatial and temporal violence patterns more effectively.
2. Literature Review of Deep-Learning-Based Detection of Violence in Video
Pablo Negre, Ricardo S. Alonso, et al., 2024 – Sensors
This systematic review surveys deep-learning approaches for detecting violence in video, grouping the
literature into several families of methods:
● Transformer-based models
● Audio-based methods
● Skeleton-based and pose estimation models
● Hybrid approaches
● Other emerging deep learning techniques
As part of their review, the researchers identified 28 publicly available datasets, providing insights into
their structure, limitations, and real-world applicability. They also discussed 21 keyframe extraction
techniques and 16 types of input modalities, such as RGB frames, optical flow, pose sequences, and
spectrograms, showing the diversity in how violence can be represented and processed.
The paper highlights models that achieved up to 95% accuracy in violence detection under controlled
settings. However, it critically evaluates the challenges faced by these systems in real-world scenarios,
such as:
● Poor lighting conditions
● Camera motion and shake
● High-density (crowded) scenes
● Partial occlusions
● Latency and processing costs for real-time deployment
One of the most impactful contributions of this paper is its role as a systematic map of the research
landscape. It not only compares the performance of existing architectures but also discusses practical
constraints, such as hardware limitations, dataset imbalance, and generalizability across domains (e.g.,
sports violence vs. street surveillance).
This literature survey served as a foundational reference for the current project by showcasing the strengths and
trade-offs of various model types, justifying the use of EfficientNetB0 for spatial processing and frame-wise
classification for temporal analysis. It also emphasizes the importance of carefully selecting datasets and
evaluation metrics for building effective and real-time violence detection systems.
2.2 REQUIREMENTS
Implementing a deep learning-based violence detection system requires specific software and hardware
components:
● Software Requirements:
○ Programming Language: Python (for its extensive libraries and community support).
○ Deep Learning Frameworks: Libraries such as TensorFlow or Keras (as Keras offers an
easy-to-use API for building neural networks).
○ Data Manipulation: NumPy and Pandas (for data loading, preprocessing, and
manipulation).
○ Computer Vision Libraries: OpenCV (for video/image processing tasks like frame
extraction).
● Hardware Requirements:
○ GPU: A powerful Graphics Processing Unit (GPU) is essential for accelerating the training
and inference of deep learning models, especially for real-time applications and
large datasets; the project notebook was configured with GPU acceleration.
○ Sufficient RAM and Storage: For handling large video datasets and model checkpoints.
2.3 FUNCTIONAL REQUIREMENTS
● Load and preprocess video datasets by converting videos into frames.
● Implement a deep learning model using EfficientNetB0 for spatial feature extraction.
● Train the model on a labeled dataset (violent vs. non-violent) and validate using accuracy,
precision, recall, and F1-score.
● Evaluate the model's performance on test videos to ensure generalizability.
● Save trained model checkpoints and allow for model loading during testing (a minimal sketch follows this list).
● Experiment with other architectures like YOLOv8, MobileNetV2, or X3D for comparative
analysis.
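To make the checkpointing requirement above concrete, here is a minimal TensorFlow/Keras sketch of saving the best model during training and restoring it for testing. The stand-in model, the filename, and the commented training call are illustrative assumptions, not values taken from the project notebook.

```python
import tensorflow as tf

# Tiny stand-in model so the sketch is self-contained; the real project
# model (EfficientNetB0-based) would be used in its place.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Keep only the checkpoint with the best validation accuracy.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras",              # hypothetical filename
    monitor="val_accuracy",
    save_best_only=True,
)
# model.fit(train_ds, validation_data=val_ds, epochs=20,
#           callbacks=[checkpoint_cb])   # train_ds/val_ds are placeholders

# At test time, restore the checkpointed model:
# best_model = tf.keras.models.load_model("best_model.keras")
```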
Chapter 3: METHODOLOGY
The proposed methodology for real-life violence and non-violence classification follows a standard deep
learning pipeline, adapted for video analysis:
1. Data Collection:
○ A labeled video dataset, organized into "violence" and "non_violence" categories, is assembled.
2. Preprocessing:
○ Frames are extracted from each video clip and resized to a consistent dimension (e.g., 128x128
pixels) to standardize input for the CNN.
3. Spatial Feature Extraction:
○ EfficientNetB0 extracts spatial features from each preprocessed frame.
4. Model Training:
○ A frame-wise classifier is trained on the labeled frames, with checkpoints saved for later testing.
5. Evaluation:
○ The model's performance is rigorously evaluated using standard metrics on a separate test
dataset.
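As an illustration of the feature-extraction and training steps, the following is a minimal Keras sketch of the frame-wise classifier described above: an ImageNet-pretrained EfficientNetB0 backbone extracts spatial features from each 128x128 frame, and a small dense head outputs a per-frame violence probability. The head layers, dropout rate, and frozen backbone are assumptions made for the sketch, not settings from the project notebook.

```python
import tensorflow as tf

IMG_SIZE = 128  # frames resized to 128x128, as described in Chapter 5

# ImageNet-pretrained EfficientNetB0 as the spatial feature extractor.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
)
backbone.trainable = False  # transfer learning: freeze pretrained weights

# Frame-wise classifier head: one sigmoid unit gives P(violent) per frame.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),                    # assumed dropout rate
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```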
Chapter 5: RESULTS
The experimental setup involved training and evaluating the deep learning model on a dataset containing
real-life violence and non-violence data. The dataset was organized into "violence" and "non_violence"
categories, consistent with typical violence detection datasets. A standard split strategy was employed,
allocating a significant portion of the data for training and a separate, unseen portion for testing to ensure
unbiased evaluation of the model's generalization capabilities.
For video processing, frames were extracted from the video clips and resized to a consistent dimension
(e.g., 128x128 pixels), which is a common practice to standardize input for CNNs while maintaining
computational efficiency. The dataset paths in the provided Jupyter notebook suggest that images (frames)
were pre-extracted and organized into directories.
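A minimal OpenCV sketch of this frame-extraction step is shown below; the function name and the frame-sampling rate are illustrative assumptions.

```python
import cv2

def extract_frames(video_path, size=(128, 128), every_n=5):
    """Return resized RGB frames from a video, keeping every n-th frame
    (the sampling rate is an assumption, not taken from the notebook)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                 # end of video
            break
        if idx % every_n == 0:
            frame = cv2.resize(frame, size)
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # BGR -> RGB
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```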
The model was evaluated using standard classification metrics to assess its performance in distinguishing
between violent and non-violent actions. The key metrics obtained were:
● Accuracy: 90.42%
● Precision: 90.06%
● Recall: 92.60%
● F1 Score: 91.31%
The results indicate that the developed deep learning model achieved strong performance in classifying
real-life violence and non-violence. An accuracy of 90.42% suggests that the model is generally effective
across the dataset. The relatively high recall (92.60%) is particularly important for violence detection
systems, as it signifies that the model is good at identifying actual violent incidents, minimizing the risk of
missing critical events. The precision of 90.06% also indicates a low rate of false alarms, which is crucial
for practical deployment to avoid alert fatigue. The F1-score of 91.31% further confirms a good balance
between precision and recall, demonstrating the model's robustness.
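As a sanity check, the reported scores are internally consistent: F1 = 2PR/(P + R) = 2(0.9006)(0.9260)/(0.9006 + 0.9260) ≈ 0.9131, matching the reported 91.31%. A minimal scikit-learn sketch of computing these metrics is shown below; the toy label arrays stand in for the actual test split.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy stand-ins: ground truth (1 = violent) and thresholded predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```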
These results align with expectations for deep learning models in similar computer vision tasks,
demonstrating their capability to learn complex spatio-temporal patterns from video data. The use of
modern neural network architectures and proper regularization techniques likely contributed to these
favorable outcomes.
Chapter 6: INFERENCES
The key findings from this project reinforce the efficacy of deep learning for automated violence detection
in real-life scenarios. The high accuracy, precision, and recall metrics demonstrate that models can
effectively distinguish between violent and non-violent activities even with the inherent complexities of
real-world video data. This suggests that features learned by the model are robust enough to capture the
subtle cues associated with aggressive behaviors.
● Effective Data Processing: The project successfully implemented a pipeline to prepare raw video
frames for deep learning models.
● Robust Feature Learning: The deep learning model demonstrated its ability to learn and extract
discriminative spatio-temporal features necessary for violence classification.
● High Classification Performance: The model achieved commendable accuracy, precision, and
recall, indicating its potential for practical application.
● Visualization of Results: The ability to visualize predictions on individual frames using OpenCV
(cv2.imshow) provided clear insight into the model's decision-making process (a minimal sketch follows).
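A minimal sketch of this kind of per-frame overlay is given below; `predict_frame` is a hypothetical placeholder for the trained model's inference call, and the video source, decision threshold, and styling are assumptions.

```python
import cv2

def predict_frame(frame):
    """Hypothetical placeholder: the real system would run the trained
    model on the resized frame and return P(violent)."""
    return 0.0

cap = cv2.VideoCapture("input_video.mp4")   # hypothetical video file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    p = predict_frame(cv2.resize(frame, (128, 128)))
    label = f"VIOLENT ({p:.2f})" if p > 0.5 else f"non-violent ({p:.2f})"
    color = (0, 0, 255) if p > 0.5 else (0, 255, 0)   # red / green (BGR)
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, color, 2)
    cv2.imshow("Violence detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):             # quit on 'q'
        break
cap.release()
cv2.destroyAllWindows()
```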
Chapter 7: IMPRESSION REPORT
7.1 LEARNING EXPERIENCE
The development of this real-life violence and non-violence classification system was a highly enriching
experience. Throughout the internship, our team gained practical insights into the complexities of building
AI-powered solutions for societal problems, particularly in areas involving machine learning, computer
vision, data privacy, and real-time processing, and we learned to bridge the gap between theoretical
knowledge and practical implementation.
7.2 SKILLS ACQUIRED
● Python & Deep Learning Frameworks: Deepened expertise in Python programming and
practical application of TensorFlow/Keras for complex model development.
● Computer Vision (CV) Techniques: Gained hands-on experience with video and image
processing using OpenCV, including frame extraction and feature understanding.
● Machine Learning Model Development: Developed a comprehensive understanding of
designing, training, and evaluating deep learning models for classification tasks.
● Data Preprocessing and Handling: Improved skills in preparing large, unstructured video
datasets for machine learning consumption.
● Performance Metrics Analysis: Enhanced ability to interpret and utilize classification metrics
(accuracy, precision, recall, F1-score) for model assessment.
● Problem-Solving and Debugging: Strengthened problem-solving skills through identifying and
resolving challenges inherent in complex AI projects.
7.3 INDUSTRY/ACADEMIC RELEVANCE
In today’s data-centric world, particularly in India with its growing smart city initiatives, secure and
intelligent surveillance systems are of paramount importance. Our project closely aligns with industry
practices in domains such as:
● Public Safety and Security: Providing automated tools to monitor public spaces, reducing
response times to incidents.
● Content Moderation: Automatically identifying and filtering inappropriate or violent content in
online streaming services or social media.
● Smart City Initiatives: Contributing to the development of intelligent urban infrastructures that
leverage AI for citizen safety.
● Computer Vision Research: Applying and extending state-of-the-art deep learning techniques to
a challenging real-world problem.
While the current implementation serves as a robust prototype for violence classification, several
enhancements can be considered for future versions to improve its utility and real-world applicability:
● Dataset Expansion and Diversity: Incorporate more diverse datasets covering a wider range of
violent actions, environmental conditions (e.g., varying lighting, weather), and camera
perspectives to improve generalization.
● Integration of Audio Cues: Explore multimodal approaches by integrating audio analysis (e.g.,
detection of screams, gunshots) to complement visual information, which could significantly
improve detection accuracy, especially in low-visibility conditions.
● Real-Time Deployment Optimization: Optimize the model for faster inference speeds and
efficient deployment on edge devices or cloud platforms to enable true real-time surveillance
capabilities. This could involve using lightweight models or model quantization techniques (a minimal sketch follows this list).
● Alert System Enhancement: Develop a sophisticated alert system that provides more contextual
information (e.g., location, severity, probability) and integrates with existing security
infrastructures for rapid response.
● User Interface Development: Create a user-friendly interface for easy interaction with the
system, including video upload, analysis initiation, and visualization of results.
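As one concrete way to pursue the quantization idea mentioned above, the following is a minimal sketch of post-training dynamic-range quantization with TensorFlow Lite; the checkpoint filename is a hypothetical placeholder.

```python
import tensorflow as tf

# Load the trained model (hypothetical checkpoint path).
model = tf.keras.models.load_model("best_model.keras")

# Post-training dynamic-range quantization: weights are stored in 8-bit,
# shrinking the model and speeding up CPU/edge inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("violence_detector.tflite", "wb") as f:
    f.write(tflite_model)
```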
Chapter 8: Conclusion & Future Work
Conclusion:
This project successfully demonstrates the practical application of deep learning for real-time violence detection.
Using EfficientNetB0 for frame-wise classification and OpenCV for real-time display, the system effectively
distinguishes between violent and non-violent scenes. The simplicity and speed of the system make it suitable for
practical deployment in surveillance and content moderation scenarios.
Future Work:
To enhance this system's effectiveness and applicability, several improvements can be pursued:
1. Incorporating temporal models like 3D CNNs or transformers for better context.
2. Adding audio analysis for multimodal detection (e.g., gunshots, screams).
3. Expanding the dataset to improve robustness in diverse real-world conditions.
4. Deploying the model on edge devices or in cloud environments.
5. Developing a Streamlit or web-based UI for user interaction and alert notifications.
REFERENCES
Arun Akash S. A. et al. (2022). Human Violence Detection Using Deep Learning Techniques. Journal of Physics:
Conference Series.
Pablo Negre, Ricardo S. Alonso, et al. (2024). Literature Review of Deep-Learning-Based Detection of Violence in
Video. Sensors.
Veltmeijer et al. (2024). Real-Time Violence Detection and Localization through Subgroup Analysis. Multimedia
Tools and Applications.
TensorFlow Keras Documentation. https://www.tensorflow.org/api_docs/python/tf/keras
OpenCV-Python Tutorials. https://docs.opencv.org/