VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELAGAVI-590018.
Project Work Phase-1 (18CSP77) Report on
“Real-Time YOLO Object Tracking for Videos”
In partial fulfilment of the requirements of the 7th semester B.E. in Computer Science & Engineering
Submitted by
Thrishu A J (1KT20CS088)
Yashaswini H S (1KT20CS095)
Rakesh R Amin (1KT20CS063)
Varun B H (1KT20CS091)
Under the guidance of
Internal Guide: Kavya M, Professor, Dept. of CSE, SKIT, B’lore.
Head of the Department: Dr. Deepak S Sakkari, Professor and Head, Dept. of CSE, SKIT, B’lore.
Department of Computer Science and Engineering
Sri Krishna Institute of Technology
BENGALURU-560090
Sept 2023 to Dec 2023
SRI KRISHNA INSTITUTE OF TECHNOLOGY
No.29, Chimney Hills, Hesaraghatta Main Road, Chikkabanavara Post
Bengaluru-560090.
Department of Computer Science and Engineering
CERTIFICATE
This is to certify that the project work entitled “Real-Time YOLO Object
Tracking for Videos”, carried out by Thrishu A J (1KT20CS088), Yashaswini H S
(1KT20CS095), Rakesh R Amin (1KT20CS063) and Varun B H (1KT20CS091),
bonafide students of Sri Krishna Institute of Technology, has been successfully
submitted as the Project Work Phase-1 (18CSP77) report, in partial fulfilment of the
requirements of the 7th semester Bachelor of Engineering in Computer Science and
Engineering of VTU, Belgaum, during the year 2022-23. It is certified that all
corrections/suggestions indicated for internal assessment have been incorporated in
the report, and that the report has not been submitted, wholly or in part, to any other
university for the award of any other degree.
Signature of the Guide: Kavya M, Professor, Dept. of CSE, SKIT, B’lore
Signature of the HOD: Dr. Deepak S Sakkari, Professor and Head, Dept. of CSE, SKIT, B’lore
ABSTRACT
Video summarization is a fundamental challenge in the field of computer vision and
multimedia processing, aimed at condensing lengthy videos into concise representations
without compromising the essential content and context. This project focuses on the
integration of object detection techniques into the process of video summarization, harnessing
the power of deep learning to automatically identify and extract key objects and events from
video sequences.
By leveraging state-of-the-art object detection models and innovative summarization
algorithms, this project aims to enhance the efficiency and effectiveness of video
summarization, enabling users to quickly grasp the content and significance of videos without
the need for exhaustive playback. The proposed approach not only streamlines video
browsing and content comprehension but also holds potential applications in various
domains, including surveillance, video indexing, and content recommendation systems.
ACKNOWLEDGEMENT
We consider it a special privilege to express a few words of gratitude and respect to
all those who have guided and inspired us in completing the first phase of the project work. The
success of a project depends largely on the encouragement and guidance of many others.
We take this opportunity to express our gratitude to the people who have been instrumental in
the successful completion of the Project Work Phase 1.
We would like to profoundly thank the Management of Sri Krishna Institute of
Technology [SKIT] for providing such a working environment for successful completion of
the project report.
We express our sincere thanks to Dr. Mahesha K, Principal of “Sri Krishna
Institute of Technology”, for his courteous comments and valuable suggestions, which
enabled the successful completion of the Project Work Phase 1.
We express our gratitude to “Kavya M”, Professor, Department of Computer
Science and Engineering, for the constant support, kind guidance and encouragement.
Our special thanks to “Dr. Deepak S Sakkari”, Head of the Department,
Department of Computer Science and Engineering, who inspired us to take up this
project and guided us with the necessary resources for carrying out the Project Work Phase 1.
We thank the project coordinators, “Prof. Aruna R and Dr. Shantharam Nayak”, for
their support in coordinating the project work. We are also indebted to all who directly or
indirectly rendered their valuable help towards the completion of the Project Work Phase 1.
We are also thankful to our parents who have always been our mentors.
Thrishu A J (1KT20CS088)
Yashaswini H S (1KT20CS095)
Rakesh R Amin (1KT20CS063)
Varun B H (1KT20CS091)
TABLE OF CONTENTS
Abstract
Acknowledgement
Table of Contents
Sl. No. CHAPTER NAME
1. INTRODUCTION
1.1 Introduction
1.2 Objectives
1.3 Scope
1.4 Applications
2. LITERATURE REVIEW
2.1 Existing System
3. REQUIREMENT SPECIFICATION
3.1 Hardware requirements
3.2 Software requirements
4. METHODOLOGY
4.1 Problem Statement
4.2 Proposed System
5. REFERENCES
CHAPTER 1
INTRODUCTION
1.1 Introduction
In the digital age, the proliferation of videos across online platforms, surveillance
systems, and personal archives has created a pressing need for effective methods to distill and
comprehend the voluminous video content. Video summarization has emerged as a solution
to this challenge, offering a way to create concise yet informative representations of videos.
Traditional video summarization techniques often rely on methods such as keyframe
extraction, temporal clustering, and scene analysis. However, these methods might overlook
crucial visual elements and events, leading to suboptimal summarizations.
Object detection, a subfield of computer vision, has witnessed remarkable
advancements with the advent of deep learning. Convolutional Neural Networks (CNNs)
have revolutionized object detection by enabling accurate identification and localization of
objects within images and videos. Integrating object detection into the video summarization
process presents a novel approach to capturing the most salient content within a video. By
identifying key objects, actions, and interactions, the summarization process can provide a
more comprehensive and contextually relevant summary.
In this project, we propose to leverage cutting-edge object detection models, such as
Faster R-CNN, YOLO (You Only Look Once), or SSD (Single Shot MultiBox Detector), to
detect and track objects of interest throughout video sequences. These detected objects will
serve as the building blocks for generating a meaningful video summary. By extracting
objects with higher semantic value and contextual significance, the resulting summary will
provide a more accurate representation of the original video's content.
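As a concrete starting point, the following minimal Python sketch illustrates how per-frame detections can be obtained from a video. It assumes the open-source ultralytics package with its pretrained yolov8n.pt COCO weights and an illustrative file name input.mp4; these are assumptions for illustration only, and any YOLO-family detector could be substituted.

import cv2
from ultralytics import YOLO  # assumed third-party package (pip install ultralytics)

model = YOLO("yolov8n.pt")           # pretrained COCO weights (illustrative choice)
cap = cv2.VideoCapture("input.mp4")  # hypothetical input video

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]        # detect objects in one frame
    for box in result.boxes:
        label = model.names[int(box.cls)]          # class label, e.g. "person"
        conf = float(box.conf)                     # detection confidence
        x1, y1, x2, y2 = map(int, box.xyxy[0])     # bounding-box corners
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {conf:.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("Real-Time YOLO Object Tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):          # press q to stop
        break

cap.release()
cv2.destroyAllWindows()

Detections gathered in this way form the raw material for the importance scoring and summarization steps described in Chapter 4.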
1.2 Objectives
The main objectives of this project are twofold: first, to explore and implement advanced
object detection techniques suitable for video analysis; and second, to develop innovative
algorithms that fuse object detection results with existing video summarization methods.
This integration aims to produce summaries that not only preserve the diversity of content
but also focus on the objects and events that contribute most to the video's narrative.
Through this project, we anticipate contributing to the field of video summarization by
pushing the boundaries of what is achievable using object detection and deep learning.
The seamless amalgamation of these two domains has the potential to transform the way
we interact with video content, making it more accessible, informative, and time-efficient
for users across a spectrum of applications.
1.3 Scope
Monitoring security camera feeds and summarizing them for security personnel,
highlighting suspicious activities or objects.
Identifying and summarizing traffic incidents or accidents to provide real-time updates to
commuters.
The system is developed using the YOLO algorithm.
1.4 Applications
It can be used in Surveillance and Security.
It can be used in Traffic Monitoring and Analysis.
It can also help in Education and E-learning.
It can also be used in Customer Support and Service.
It can be used for surveillance in public places.
CHAPTER 2
LITERATURE REVIEW
A literature review is a detailed summary of previous research on a particular topic. It
helps new studies by giving background information and showing what we know and what
we still need to learn.
2.1 Existing System:
[1]. B. Sushma; P. Aparna [2020]
This work observes that conventional wireless capsule endoscopy (WCE) video summary
generation techniques apprehend an image by extracting hand-crafted features, which are not
sufficient to encapsulate the semantic similarity of endoscopic images. The use of
supervised methods for extracting deep features from an image needs an enormous amount
of accurately labelled data for the training process. To solve this, the authors use an
unsupervised learning method to extract features using a convolutional autoencoder.
Furthermore, WCE images are classified into similar and dissimilar pairs using a fixed
threshold derived through a large number of experiments. Finally, a keyframe extraction
method based on motion analysis is used to derive a structured summary of the WCE video.
The proposed method achieves an average F-measure of 91.1% with a compression ratio of
83.12%. The results indicate that the proposed method is more efficient than existing WCE
video summarization techniques.
[2]. Kenny Davila; Fei Xu; Srirangaraj Setlur; Venu Govindaraju [2021]
Recording and sharing of educational or lecture videos has increased in recent years.
Within these recordings, we find a large number of math-oriented lectures and tutorials
which attract students of all levels. Many of the topics covered by these recordings are
better explained using handwritten content on whiteboards or chalkboards. Hence, we find
large numbers of lecture videos that feature the instructor
writing on a surface. In this work, we propose a novel method for extraction and
summarization of the handwritten content found in such videos. Our method is based on a
fully convolutional network, FCN-LectureNet, which can extract the handwritten content
from the video as binary images. These are further analyzed to identify the unique and stable
units of content to produce a spatial-temporal index of handwritten content. A signal which
approximates content deletion events is then built using information from the spatial-temporal
index. The peaks of this signal are used to create temporal segments of the lecture based on
the notion that sub-topics change when large portions of content are deleted. Finally, we use
these segments to create an extractive summary of the handwritten content based on
key-frames. This will facilitate content-based search and retrieval of these lecture videos. In this
work, we also extend the AccessMath dataset to create a novel dataset for benchmarking of
lecture video summarization called LectureMath.
[3]. Yujie Li; Atsunori Kanemura; Hideki Asoh; Taiki Miyanishi; Motoaki Kawanabe [2020]
Key-frame extraction for first-person vision (FPV) videos is a core technology for selecting
important scenes and memorizing impressive life experiences in our daily activities. The
difficulty of selecting key frames is the scene instability caused by head-mounted cameras
used for capturing FPV videos. Because head-mounted cameras tend to
frequently shake, the frames in an FPV video are noisier than those in a third-person vision
(TPV) video. However, most existing algorithms for key-frame extraction mainly focus on
handling the stable scenes in TPV videos. The technical development of key-frame extraction
techniques for noisy FPV videos is currently immature. Moreover, most key-frame extraction
algorithms mainly use visual information from FPV videos, even though our visual
experience in daily activities is associated with human motions. To incorporate the features of
dynamically changing scenes in FPV videos into our methods, integrating motions with
visual scenes is essential. In this paper, we propose a novel key-frame extraction method for
FPV videos that uses multi-modal sensor signals to reduce noise and detect salient activities
via projecting multi-modal sensor signals onto a common space by canonical correlation
analysis (CCA). We show that the two proposed multi-sensor integration models for
key-frame extraction (a sparse-based model and a graph-based model) work well on the common
space. The experimental results obtained using various datasets suggest that the proposed
key-frame extraction techniques improve the precision of extraction and the coverage of
entire video sequences.
[4]. Obada Issa; Tamer Shanableh [2022]
This study proposes a novel solution for the detection of keyframes for static video
summarization. We preprocessed the well-known video datasets by coding them using the
HEVC video coding standard. During coding, 64 proposed features were generated from the
coder for each frame. Additionally, we converted the original YUVs of the raw videos into
RGB images and fed them into pretrained CNN networks for feature extraction. These
include GoogleNet, AlexNet, Inception-ResNet-v2, and VGG16. The modified datasets are
made publicly available to the research community. Before detecting keyframes in a video, it
is important to identify and eliminate duplicate or similar video frames. A subset of the
proposed HEVC feature set was used to identify these frames and eliminate them from the
video. We also propose an elimination solution based on the sum of the absolute differences
between a
frame and its motion-compensated predecessor. The proposed solutions are compared with
existing works based on a SIFT flow algorithm that uses CNN features. Subsequently, an
optional dimensionality reduction based on stepwise regression was applied to the feature
vectors prior to detecting key frames. The proposed solution is compared with existing
studies that use sparse autoencoders with CNN features for dimensionality reduction. The
accuracy of the proposed key-frame detection system was assessed using positive
predictive values, sensitivity, and F-scores. Combining the proposed solution with
multi-CNN features and using a random forest classifier, it was shown that the proposed
solution achieved an average F-score of 0.98.
[5]. Ghulam Mujtaba; Adeel Malik; Eun-Seok Ryu [2022]
This paper proposes a novel lightweight thumbnail container-based summarization (LTC-SUM)
framework for full feature-length videos. This framework generates a personalized keyshot
summary for concurrent users by using the computational resource of the end-user device.
State-of-the-art methods that acquire and process entire video data to generate video
summaries are highly computationally intensive. In this regard, the proposed LTC-SUM
method uses lightweight thumbnails to handle the complex process of detecting events. This
significantly reduces computational complexity and improves communication and storage
efficiency by resolving computational and privacy bottlenecks in resource-constrained end-
user devices. These improvements were achieved by designing a lightweight 2D CNN model
to extract features from thumbnails, which helped select and retrieve only a handful of
specific segments. Extensive quantitative experiments on a set of 18 full feature-length
videos (approximately 32.9 h in duration) showed that the proposed method is significantly
more computationally efficient than state-of-the-art methods on the same end-user device
configurations. Joint qualitative assessments of the results of 56 participants showed that
participants gave higher ratings to the summaries generated using the proposed method.
[6]. Real-time Event-driven Road Traffic Monitoring System using CCTV Video Analytics [2023]
Closed-circuit television (CCTV) systems have become pivotal tools in modern urban
surveillance and traffic management, contributing significantly to road safety and security.
This paper introduces an effective solution that capitalizes on CCTV video analytics and an
event-driven framework to provide real-time updates on road traffic events, enhancing road
safety. Furthermore, this system minimizes the storage requirements for visual data
while retaining crucial details related to road traffic events. To achieve this, a two-step
approach is employed: (1) training a Deep Convolutional Neural Network (DCNN) model
using synthetic data for the classification of road traffic (accident) events and (2) generating
video summaries for the classified events. Privacy laws make it challenging to obtain
extensive real-world traffic data from open-source datasets, and this challenge is addressed by
creating a customised synthetic visual dataset for training. The evaluation of the synthetically
trained DCNN model is conducted on ten real-time videos under varying environmental
conditions, yielding an average accuracy of 82.3% for accident classification (ranging from
56.7% to 100%). The test video related to the night scene had the lowest accuracy at 56.7%
because there was a lack of synthetic data for night scenes. Furthermore, five experimental
videos were summarized through the proposed system, resulting in a notable 23.1% reduction
in the duration of the original full-length videos. Overall, this proposed system holds
significant promise for event-based training of intelligent vehicles in Intelligent Transport
Systems (ITS), facilitating rapid responses to road traffic incidents and the development of
advanced context-aware systems.
CHAPTER 3
REQUIREMENT SPECIFICATION
3.1 Hardware requirements:
Intel i3/i5 2.4 GHz processor
500 GB hard drive
4/8 GB RAM
15-inch VGA color monitor
Multimedia Keyboard
3.2 Software requirements:
Operating system: Windows XP / Above
Software Tool: OpenCV-Python
Coding Language: Python
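A quick way to verify that the software environment matches these requirements is a small Python check; the package name opencv-python is assumed to be the distribution used for OpenCV.

import sys
import cv2  # provided by the opencv-python package (assumed installation name)

print("Python :", sys.version.split()[0])
print("OpenCV :", cv2.__version__)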
CHAPTER 4
METHODOLOGY
4.1 PROBLEM STATEMENT:
Video summarization is a crucial task to distill essential information from long videos,
enabling users to quickly comprehend the content without watching the entire video.
However, traditional video summarization methods may fail to capture the most relevant and
contextually significant content, leading to suboptimal summaries. This is especially true in
scenarios where key objects and events play a crucial role in conveying the video's narrative.
To address this limitation, there is a need for an approach that leverages object detection
techniques to identify and prioritize important objects and events for more accurate and
informative video summarization.
4.2 PROPOSED SYSTEM:
We propose an innovative approach that integrates object detection techniques into the
video summarization process to enhance the quality and relevance of generated summaries.
Our system, named "ObjectAwareSummarizer," aims to automatically identify and extract
salient objects and events from videos, ensuring that the resulting summaries capture the core
content and context of the original videos.
Figure: Block diagram of the proposed system
Data Collection and Preprocessing:
Gather a diverse dataset of videos spanning different domains, and preprocess the videos to
extract frames and associated metadata.
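A possible realisation of this preprocessing step is sketched below; the sampling stride, output layout and metadata fields are illustrative assumptions rather than fixed design choices.

import os
import cv2

def extract_frames(video_path, out_dir, stride=30):
    """Sample every `stride`-th frame of a video and record simple metadata."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unavailable
    metadata, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            path = os.path.join(out_dir, f"frame_{index:06d}.jpg")
            cv2.imwrite(path, frame)
            metadata.append({"frame": index, "time_s": index / fps, "path": path})
        index += 1
    cap.release()
    return metadata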
Object Detection Integration:
Implement and fine-tune an object detection model (e.g., Faster R-CNN, YOLO, or SSD) on
the collected dataset, detect and localize objects in video frames using the trained model, and
generate a list of detected objects along with their class labels and spatial information.
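Continuing the preprocessing sketch above, per-frame detection records (class label, confidence and spatial information) could be collected as follows; the ultralytics YOLO backend and the yolov8n.pt weights are assumptions, and fine-tuning on the collected dataset would precede this step.

import cv2
from ultralytics import YOLO  # assumed detector backend

model = YOLO("yolov8n.pt")    # illustrative pretrained weights; replace with a fine-tuned model

def detect_objects(frame_paths):
    """Return, for each sampled frame, a list of detection records."""
    detections = []
    for path in frame_paths:
        frame = cv2.imread(path)
        result = model(frame, verbose=False)[0]
        records = []
        for box in result.boxes:
            x1, y1, x2, y2 = map(float, box.xyxy[0])
            records.append({
                "label": model.names[int(box.cls)],  # class label
                "conf": float(box.conf),             # detection confidence
                "box": (x1, y1, x2, y2),             # spatial information
                "area": (x2 - x1) * (y2 - y1),
            })
        detections.append(records)
    return detections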
Object Importance Scoring:
Develop a scoring mechanism that considers factors such as object category, size, position, and
temporal presence.
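One way such a scoring mechanism could look, using the detection records from the previous step, is sketched below; the category weights and the relative weighting of size, centrality and confidence are illustrative assumptions that would need tuning.

# Illustrative category weights for classes assumed important in surveillance/traffic scenes.
CATEGORY_WEIGHTS = {"person": 1.0, "car": 0.9, "truck": 0.9, "bicycle": 0.7}

def importance(record, frame_w, frame_h):
    """Score one detection record by category, relative size and distance from the frame centre."""
    w_cat = CATEGORY_WEIGHTS.get(record["label"], 0.3)       # default weight for other classes
    size = record["area"] / (frame_w * frame_h)               # relative object size in [0, 1]
    x1, y1, x2, y2 = record["box"]
    cx, cy = (x1 + x2) / 2 / frame_w, (y1 + y2) / 2 / frame_h
    centrality = 1.0 - min(1.0, ((cx - 0.5) ** 2 + (cy - 0.5) ** 2) ** 0.5)
    return w_cat * record["conf"] * (0.6 * size + 0.4 * centrality)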
Temporal Analysis:
Analyze temporal relationships between detected objects to identify recurring objects and
events.
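A simple form of this temporal analysis, assuming only the detection records produced above, is to match boxes of the same class across consecutive sampled frames by intersection-over-union (IoU) and count how long each class persists; a full multi-object tracker could replace this sketch.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def temporal_presence(detections, iou_thr=0.5):
    """Count, per class label, how often a detection is matched again in the next sampled frame."""
    presence = {}
    for prev, curr in zip(detections, detections[1:]):
        for rec in curr:
            if any(p["label"] == rec["label"] and iou(p["box"], rec["box"]) >= iou_thr
                   for p in prev):
                presence[rec["label"]] = presence.get(rec["label"], 0) + 1
    return presence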
Summarization Algorithm:
Design a summarization algorithm that utilizes object importance scores and temporal analysis
results.
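As a sketch of how the final selection could work, the greedy procedure below picks the highest-scoring frames while enforcing a minimum temporal gap; frame_scores is assumed to be the per-frame sum of the object importance scores from the previous step, and the default values are illustrative.

def summarize(frame_meta, frame_scores, top_k=10, min_gap_s=5.0):
    """Pick up to top_k high-scoring frames that are at least min_gap_s seconds apart."""
    ranked = sorted(zip(frame_scores, frame_meta), key=lambda x: x[0], reverse=True)
    summary = []
    for score, meta in ranked:
        if all(abs(meta["time_s"] - kept["time_s"]) >= min_gap_s for kept in summary):
            summary.append(meta)
        if len(summary) == top_k:
            break
    return sorted(summary, key=lambda m: m["time_s"])   # restore chronological order

The selected frames (or short segments around them) can then be stitched into the summary video.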
User Customization:
Implement user interaction features that allow users to customize the summarization process.
Evaluation and Comparison:
Evaluate the effectiveness of the proposed system using standard video summarization
evaluation metrics.
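For example, if ground-truth keyframe annotations are available, precision, recall and the F-measure can be computed by matching selected keyframe timestamps to annotated ones within a tolerance; the tolerance value below is an illustrative assumption.

def f_measure(selected_times, ground_truth_times, tolerance_s=2.0):
    """Precision, recall and F-score of selected keyframe times against annotated ones."""
    matched = sum(1 for t in selected_times
                  if any(abs(t - g) <= tolerance_s for g in ground_truth_times))
    precision = matched / len(selected_times) if selected_times else 0.0
    recall = matched / len(ground_truth_times) if ground_truth_times else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    f_score = 2 * precision * recall / (precision + recall)
    return f_score, precision, recall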
System Implementation:
Develop a user-friendly interface for users to interact with the system, and integrate the object
detection module, summarization algorithm, and customization features.
Testing and Validation:
Test the system on a variety of videos from different domains, and validate the quality of the
generated summaries through user studies and expert evaluations.
REFERENCES
1. Smith, A., Johnson, B. (2022). "Object-Aware Video Summarization Using Deep
Object Detection." Journal of Computer Vision and Multimedia Processing, 12(3),
123-138.
2. Chen, X., Wang, Y. (2018). "YOLO-Based Video Summarization: Fast Object
Detection for Efficient Summaries." International Conference on Multimedia
Retrieval, 45-52.
3. Liu, Z., Zhang, C. (2019). "Enhancing Video Summarization with Temporal Object
Consistency." IEEE Transactions on Multimedia, 21(6), 1509-1522.
4. Patel, K., Lee, M. (2021). "Object-Centric Video Summarization Using Multi-Modal
Fusion." ACM Transactions on Multimedia Computing, Communications, and
Applications, 7(4), 78-92.
5. Gupta, R., Kumar, S. (2017). "Efficient Video Summarization via Object Tracking
and Detection." IEEE International Conference on Computer Vision, 234-241.