
Project Report

On

“TEXT SUMMARIZATION FOR AUDIO AND VIDEO FILES”
Submitted in partial fulfillment for the award of the degree of

Bachelor of Technology
In
Information Technology
By
M.AISHWARYA (21R21A12A1)
P.KEERTHANA (21R21A12B4)
G.SHIVA KUMAR (21R21A1289)
K.SRI HARSHINI (21R21A12A0)

Under the guidance of

Mrs. Adilakshmi (Assistant Professor)

2025

I
Department of Information Technology

CERTIFICATE
This is to certify that the project entitled “Text Summarization for Audio and Video Files” has been
submitted by M. AISHWARYA (21R21A12A1), P. KEERTHANA (21R21A12B4), G. SHIVA
KUMAR (21R21A1289), K. SRI HARSHINI (21R21A12A0) in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Information Technology from
Jawaharlal Nehru Technological University, Hyderabad. The results embodied in this project have not
been submitted to any other University or Institution for the award of any degree or diploma.

Supervisor Head of the Department


Mrs. Adilakshmi Dr. N. V. Rajashekar Reddy

External Examiner

II
DECLARATION

I hereby declare that the project entitled “Text Summarization for Audio and Video Files”
is the work done during the period from January 2025 to May 2025 and is submitted in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology in Information
Technology from Jawaharlal Nehru Technological University, Hyderabad. The results embodied in
this project have not been submitted to any other University or Institution for the award of any
degree or diploma.

M.Aishwarya (21R21A12A1)
P.Keerthana (21R21A12B4)
G.Shiva Kumar (21R21A1289)
K.Sri Harshini (21R21A12A0)

III
ACKNOWLEDGEMENT

There are many people who helped me directly and indirectly to complete my project
successfully. I would like to take this opportunity to thank one and all. First of all, I would like to
express my deep gratitude towards my supervisor, Mrs. Adilakshmi, Assistant Professor,
Department of IT, for her support in the completion of my dissertation. I wish to express my
sincere thanks to Dr. N. V. RAJASHEKAR REDDY, HOD, Department of IT, and also to
the Principal, Dr. K. SRINIVASA RAO, for providing the facilities to complete the dissertation.

I would like to thank all our faculty, coordinators and friends for their help and constructive
criticism during the project period. Finally, I am very much indebted to our parents for their moral
support and encouragement to achieve goals.

M.Aishwarya (21R21A12A1)
P.Keerthana (21R21A12B4)
G.Shiva Kumar (21R21A1289)
K.Sri Harshini (21R21A12A0)

IV
ABSTRACT

This abstract introduces the concept of audio and video summarization for multiple languages using
machine learning techniques. With the exponential growth of multimedia content on the internet, there
is a growing need for efficient and automated methods to summarize audio and video data in English,
French, Japanese, and German languages. The goal is to extract key information and present a concise
summary that captures the essence of the content, saving time and enhancing accessibility for users.
Machine learning algorithms, particularly those based on deep learning and natural language
processing, have proven to be highly effective in various text-based tasks. However, extending these
techniques to multimedia data poses unique challenges. In this study, we propose a novel approach that
leverages machine learning algorithms to summarize audio and video content in multiple languages.
The proposed system utilizes techniques such as automatic speech recognition, speaker diarization,
sentiment analysis, object recognition, and scene detection to extract relevant information from the
multimedia input. By combining these components, the system generates a coherent and informative
summary that can be presented in a written or visual format. To develop this system, a large dataset of
multilingual audio and video content is collected and annotated for training purposes. The machine
learning models are then trained on this dataset to learn patterns and relationships between different
features and their corresponding summaries. The performance of the system is evaluated using metrics
like accuracy, precision, recall, and F1 score. The results demonstrate the effectiveness of the proposed
approach in summarizing audio and video content in multiple languages, providing users with concise
and meaningful summaries that facilitate information retrieval and understanding. In conclusion, the
proposed approach demonstrates the potential of machine learning techniques for audio and video
summarization in multiple languages.

V
CONTENTS

S. No Contents Page No

Certificate II
Declaration III
Acknowledgement IV
Abstract V
1 Introduction 1-2
1.1 General
1.2 Problem Definition
1.3 Objective
2 Literature Survey 3-4
2.1 Existing System
3 Proposed System 5-8
3.1 Advantages of Proposed System
4 System Design 9-19
4.1 General
4.2 Project Flow
4.3 Modules
5 Implementation and Installation 20-31
5.1 Source Code
5.2 Implementation Steps
6 DFD Diagrams 32-34
7 Testing 35-39
7.1 Introduction
7.2 Types of Testing
7.3 Testing Techniques
8 Results 40
9 Screenshots 41-42
10 Conclusion 43-44
10.1 Future Scope
11 References 45-46
List of Figures

Figure No Name of the Figure Page no


4.2.1 Data Flow Diagram 9
4.2.2 Architecture 10
4.2.3 Usecase Diagram 11
4.2.4 Activity Diagram 12
4.2.5 Class Diagram 13
4.2.6 Sequence Diagram 14
7.3.1 Blackbox Testing 38
7.3.2 Whitebox Testing 39
9 Screenshots 41-42
1. INTRODUCTION

1.1 General

In today’s digital age, the abundance of audio and video content available on various platforms has
led to the need for efficient methods of summarization. Extracting key information from multimedia
sources can be daunting, especially when dealing with content in multiple languages. However,
with advancements in machine learning techniques, there is great potential to automate the process
of audio and video summarization for multiple languages.

The aim of this project is to explore the application of machine learning algorithms in
summarizing audio and video content across different languages. By developing a system that can
automatically identify and extract important elements from multimedia inputs, a concise summary
can be generated, facilitating easier access and understanding for users.

The proposed approach involves utilizing various machine learning techniques such as
automatic speech recognition, sentiment analysis, and object recognition. These techniques enable
the system to analyze and extract relevant information from the audio and visual components of the
multimedia content. By incorporating deep learning models and natural language processing
algorithms, the system can process and understand the content in multiple languages, making it
adaptable and scalable to diverse linguistic contexts.

The significance of this research lies in its potential to enhance accessibility and user
experience when dealing with large volumes of multimedia data. With an automated summarization
system, users can efficiently navigate and consume content across different languages, saving time
and effort in the process. Moreover, the application of machine learning in this context opens up
opportunities for improved information retrieval, content indexing, and cross-lingual analysis.

In conclusion, this project aims to harness the power of machine learning to tackle the
challenge of audio and video summarization in multiple languages. By automating the extraction
of key information, users can access and comprehend multimedia content more effectively, thereby
enhancing the overall multimedia experience in a multilingual context.
1
1.2 Problem Definition

All of the audio files data that are received from the sources of today’s audios, such as podcasts and
videos are not consistently efficient ways to gather information. In this situation, it is always more
effective to use data that has been condensed, or that concentrates more on the key ideas rather than
the complete text. An audio file is used as the input for this technique, and data summarization
procedures are used to produce a condensed output.

1.3 Objective

• The kind of automatic summarization that we emphasize in this project is audio
summarization, where the source corresponds to an audio signal.
• An audio summary can be produced in the following three ways:
A. Performing the summarization using only audio features.
B. Extracting the text within the audio signal and driving the summarization process
with textual methods.
C. A hybrid approach composed of a blend of the first two.
• This project aims to develop an efficient way to recapitulate large audio messages
or clips for valuable insights.

2
2. LITERATURE SURVEY

Audio Summarization for Podcasts

AUTHORS: A. Vartakavi, A. Garg and Z. Rafii.

The issue of automatic summarization has a number of potential answers. These include
straightforward unsupervised techniques; graph-based approaches that arrange the input text in a
graph, rank its elements, and build the summary using graph traversal algorithms; and neural
approaches [1], which are covered in more depth in the following paragraph.

A Ranking based Language Model for Automatic Extractive Text Summarization


AUTHORS: P. Gupta, S. Nigam, and R. Singh.

This paper presents techniques for converting speech audio files to text files and text
summarization on the text file. It is very difficult for a user to get an accurate summary or to
comprehend the relevant and important items from the available media. Additionally, readers or
evaluators of these data files are interested only in the relevant content or summary to be retrieved
in less time from the source files [2]. Automatic text summarization (ATS) is an effective way to
summarize single or multiple documents to obtain relevant content from the source files.

Analysis of Real-Time Video Summarization using Subtitles

AUTHORS: P. G. Shambharkar and R. Goel.

As the amount of available video content continues to grow at a rapid rate, having access to an
automatic video summary would be advantageous for anybody who values their time and wants to
learn more while spending less time. Due to the rapid increase in the number of videos made by users,
the ability to correctly view these films is becoming increasingly vital. By discovering and
selecting informative stills from the film, a video summary is viewed as a potentially valuable
method for maximizing the information contained inside films[3]. Using text summarization and
video mapping algorithms, it is seen how essential video elements may be retrieved from the film’s
subtitles and utilized to provide a summary of the movie’s content.

A Divide-and-Conquer Approach to the Summarization of Long Documents

AUTHORS: A. Gidiotis and G. Tsoumakas.

3
This paper presents a novel divide-and-conquer method for the neural summarization of long
documents. In particular, we break a long document and its summary into multiple source-target
pairs, which are used for training a model that learns to summarize each part of the document
separately. These partial summaries are then combined in order to produce a final complete
summary[4]. With this approach we can decompose the problem of long document summarization
into smaller and simpler problems, reducing computational complexity and creating more training
examples, which at the same time contain less noise in the target summaries compared to the
standard approach. We demonstrate that this approach paired with different summarization models,
including sequence-to-sequence RNNs and Transformers, can lead to improved summarization
performance. Our best models achieve results that are on par with the state-of-the-art in two
publicly available datasets of academic articles.

Audio data summarization system using natural language processing

AUTHORS: Pravin Khandare, Sanket Gaikwad, Aditya Kukade, Rohit Panicker, Swaraj Thamke.
This paper presents techniques for converting speech audio files to text files and text
summarization on the text file. For the former case, we have used Python modules to convert the
audio files to text format[5]. For the latter case, Natural Language Processing modules are used
for text summarization. The summarization method extracts the important sentences identified
during the extraction step. Weights are assigned to words according to the number of occurrences
of each word in the text file. This technique is used for producing summaries from the
main audio file.

2.1 EXISTING SYSTEM

The existing system only identifies emotions when an audio input is given. It uses the Python
speech recognition library, which might not work for all inputs since the library has dependency
issues. The existing system also cannot analyze a large dataset of audio files passed as input,
because it has not been trained on a large dataset.

4
3. PROPOSED SYSTEM

• The proposed system for audio and video summarization in multiple languages using machine
learning aims to leverage advanced technologies to automate the process of extracting key
information from multimedia content in various languages. The system will integrate different
components and techniques to achieve effective summarization.
• For audio summarization, the system will employ automatic speech recognition (ASR) to
transcribe audio into text. Machine learning algorithms, such as deep neural networks, will be
utilized to extract relevant features from the audio data. Techniques like speaker diarization,
sentiment analysis, and topic modeling will be applied to identify important speakers, sentiments
expressed, and significant themes within the audio content.
• Regarding video summarization, the system will utilize computer vision algorithms, including
object recognition and scene detection, to identify key objects, scenes, and representative frames.
Deep learning models, such as convolutional neural networks (CNNs), will be employed to extract
visual features from the video data.
• To handle multiple languages, the system will incorporate natural language processing (NLP)
techniques. Language identification models will determine the language of the audio and video
content, enabling language-specific processing. Translation models will be used to facilitate
cross-lingual summarization, allowing users to obtain summaries in their preferred language (a
minimal illustration of this step follows this list).
• The system will also employ machine learning models, such as recurrent neural networks (RNNs)
or transformers, to integrate and process the extracted audio and visual features, ensuring a
cohesive summarization across modalities.
• Evaluation metrics, such as coherence, relevance, and informativeness, will be used to assess the
quality of the generated summaries. The system will undergo rigorous testing on diverse datasets
in multiple languages to validate its performance.
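As a minimal illustration of the language-handling step described above, the sketch below detects the
language of a transcript and, if needed, translates it to English before summarization. The library
choices (langdetect and a Hugging Face MarianMT translation model) and the model-naming scheme are
assumptions made purely for illustration, not necessarily what the final system uses.

from langdetect import detect
from transformers import pipeline

def to_english(text):
    lang = detect(text)  # e.g. "en", "fr", "de", "ja"
    if lang == "en":
        return text
    # Assumed naming scheme for Helsinki-NLP MarianMT translation checkpoints.
    translator = pipeline("translation", model=f"Helsinki-NLP/opus-mt-{lang}-en")
    return translator(text, max_length=512)[0]["translation_text"]

# The (translated) text can then be passed on to the summarization module.
print(to_english("La reconnaissance vocale convertit la parole en texte."))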

3.1 ADVANTAGES OF PROPOSED SYSTEM

The proposed system for multilingual audio and video summarization using machine learning
offers numerous benefits that make it highly suitable for addressing the increasing demand for

5
efficient content processing across various languages and modalities. These advantages stem from
its integrated use of advanced machine learning, deep learning, and natural language processing
(NLP) techniques to extract, process, and summarize rich multimedia data. The key advantages
are discussed in detail below:
1. Support for Multiple Languages
One of the most significant strengths of the proposed system is its ability to handle audio and video
content in multiple languages, including English, French, Japanese, and German. This
multilingual capability is essential in today’s global digital environment, where users consume
content from diverse linguistic backgrounds. The system includes language identification
components and translation models to ensure that summaries are generated in the user's preferred
language. This feature not only broadens the reach of the system but also promotes inclusivity and
accessibility for non-native speakers and international audiences.
2. Efficient and Time-Saving Content Access
With the vast amount of audio and video data being produced daily across platforms like YouTube,
news agencies, and educational portals, users often struggle to extract relevant information
quickly. The proposed system provides concise and meaningful summaries, allowing users to
understand the essence of the content without needing to consume the entire audio or video. This
time-saving capability is particularly valuable in professional settings such as research, journalism,
and corporate training, where rapid comprehension is critical.
3. Integration of Audio and Visual Modalities
The system is designed to process and summarize both audio and video data simultaneously,
ensuring a holistic understanding of the multimedia content. It uses automatic speech recognition
(ASR) and speaker diarization to handle the audio portion, while leveraging computer vision
techniques like object recognition and scene detection for the visual component. This multimodal
approach results in summaries that are not only linguistically accurate but also visually
contextualized, providing richer and more informative content to the user.
4. Advanced Machine Learning and Deep Learning Techniques
By employing state-of-the-art machine learning algorithms, including deep neural networks,
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, the
system is capable of extracting high-level features from complex data. These models are trained
on large, annotated datasets to recognize patterns, identify key elements, and generate coherent

6
summaries. The use of deep learning enhances the system’s performance in terms of accuracy,
relevance, and overall summary quality.
5. Enhanced Accessibility and Usability
The summarization system improves content accessibility for users with different needs. For
instance, hearing-impaired users can benefit from textual summaries of audio content, while
visually impaired users might rely on audio summaries generated from video. Furthermore, the
inclusion of translation capabilities means that users can receive summaries in their native or
preferred languages, making the system valuable for diverse user groups.
6. Improved User Experience through Sentiment and Context Analysis
Beyond extracting textual information, the system incorporates sentiment analysis and topic
modeling to interpret the emotional tone and contextual themes of the content. This allows the
generated summaries to reflect not just what was said or shown, but also how it was conveyed,
which is crucial for content like interviews, lectures, or political speeches where tone and emotion
add significant value.
7. Scalability and Automation for Large-Scale Applications
Given its automated nature, the system is highly scalable and can be integrated into large-scale
platforms that deal with vast quantities of multimedia content. Whether it’s a video hosting site, a
digital learning platform, or a news aggregation service, the system can function without manual
intervention, thus reducing operational costs and increasing processing efficiency.
8. Adaptability Across Domains
The flexible architecture of the system allows it to be tailored for specific industries or domains.
For example, in healthcare, the system can summarize medical lectures or patient interviews. In
education, it can create concise summaries of recorded lectures. By training the machine learning
models on domain-specific data, the system becomes even more accurate and relevant to particular
fields.
9. Cross-Lingual Summarization and Translation
The incorporation of NLP and translation models enables the system to convert content from one
language to another and still generate meaningful summaries. This cross-lingual capability is
highly beneficial for international businesses, academic institutions, and media outlets that operate
across linguistic boundaries. It ensures that content is not lost in translation and that users receive
accurate and contextually relevant summaries.

7
10. Comprehensive Evaluation Metrics
To ensure the quality of the generated summaries, the system uses a set of evaluation metrics such
as accuracy, precision, recall, F1 score, coherence, relevance, and informativeness. These metrics
provide a robust framework for performance measurement, helping developers fine-tune the
models and maintain high standards. Furthermore, the system can be validated using both
automated metrics and human feedback to continuously improve its summarization capabilities.
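As an illustration of how such metrics can be computed, the sketch below treats extractive
summarization as binary sentence selection and scores it with scikit-learn. The label vectors are
hypothetical placeholders; a real evaluation would compare system output against human-annotated
reference summaries.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 1 = sentence belongs in the summary, 0 = it does not (hypothetical labels).
reference = [1, 0, 0, 1, 1, 0, 0, 1]   # from a human-annotated reference summary
predicted = [1, 0, 1, 1, 0, 0, 0, 1]   # sentences selected by the system

accuracy = accuracy_score(reference, predicted)
precision, recall, f1, _ = precision_recall_fscore_support(
    reference, predicted, average="binary"
)
print(f"Accuracy={accuracy:.2f}  Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}")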

8
4 SYSTEM DESIGN

4.1 General

The System Design Document describes the system requirements, operating environment, system and
subsystem architecture, input formats, output layouts, human-machine interfaces, detailed design,
processing logic, and external interfaces. The design and working of the whole system are organized
into three modules, which are described in Section 4.3.

4.2 Project Flow

Figure 4.2.1 below is a graphic aid for project management. The figure depicts the parallel
and interconnected processes within the project plan, enabling the project to be managed
effectively by viewing the entire project cycle.

4.2.1 Project Data Flow Diagram

Figure 4.2.1: Text Summarization for audio and videos Data Flow Diagram

9
4.2.2 Project Architecture Diagram

Figure 4.2.2: Text Summarization for Audio and Videos Architecture Diagram

4.2.3 Project Usecase Diagram


Below figure 4.2.3 is used to represent the dynamic behavior of a system. It encapsulates the
system’s functionality by incorporating use cases, actors, and their relationships. It models the
tasks, services, and functions required by a system/subsystem of an application. It depicts the high-
level functionality of a system and also tells how the user handles a system. The main purpose of
a use case diagram is to portray the dynamic aspect of a system. It captures the system’s
requirements, including both internal and external influences. It involves actors, use cases, and
the relationships among them that determine how each use case is realized. It represents how an
entity from the external environment can interact with
a part of the system.

10
Figure 4.2.3: Text Summarization for Audio and Videos Usecase Diagram

4.2.4 Project Activity Diagram


Figure 4.2.4 below shows the project activity diagram, a kind of graphical representation that may be
used to depict events visually. It is made up of a group of nodes that are linked to one another by
means of edges. These nodes can be attached to any other modeling element, which allows the behavior
of activities to be modeled. Activity diagrams make it feasible to model use cases, classes, and
interfaces, as well as component collaborations and component interactions.

11
Figure 4.2.4: Text Summarization for Audio and Videos Activity Diagram

4.2.5 Project Class Diagram


Below figure 4.2.5 is a static diagram. It represents the static view of an application. A class
diagram is not only used for visualizing, describing, and documenting different aspects of a system
but also for constructing executable code of the software application. A class diagram describes
the attributes and operations of a class and also the constraints imposed on the system. The class
diagrams are widely used in the modeling of object-oriented systems because they are the only
UML diagrams, which can be mapped directly with object-oriented languages.

12
Figure 4.2.5: Text Summarization for Audio and Videos Class Diagram

4.2.6 Project Sequence Diagram


Below figure 4.2.6 is a sequence diagram that simply depicts an interaction between objects in a
sequential order i.e. the order in which these interactions take place. We can also use the terms
event diagrams or event scenarios to refer to a sequence diagram. Sequence diagrams describe
how and in what order the objects in a system function.

13
Figure 4.2.6: Text Summarization for Audio and Videos Sequence Diagram

4.3 MODULES

Module 1: AUDIO/VIDEO PROCESSING

Module 1 of the audio-video summarization system focuses on processing the audio or video input to
extract the necessary information. This module involves several tasks such as speech recognition, speaker
diarization, object detection, and tracking. Let’s delve into each of these tasks and their functionalities in
more detail.
Speech Recognition: Speech recognition is a fundamental task in audio processing, where the spoken
words in the audio input are converted into text. This process involves analyzing the acoustic features of
the audio signal and mapping them to corresponding textual representations. Automatic Speech
Recognition (ASR) systems are commonly used for speech recognition, which utilizes techniques like
Hidden Markov Models (HMMs), deep neural networks (DNNs), or transformer models. These models
14
are trained on large amounts of audio data with corresponding transcriptions to learn the mapping
between audio signals and text. By performing speech recognition, the system obtains a textual
representation of the spoken content, enabling subsequent NLP tasks to be applied to the
transcribed text.
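As a brief illustration of this step, the following is a minimal sketch of speech-to-text transcription
using the open-source Whisper model, which is also the library used in the implementation in Section 5.1.
The audio path is a placeholder.

import whisper

# Load a small multilingual ASR model and transcribe a (hypothetical) recording.
model = whisper.load_model("base")
result = model.transcribe("meeting_recording.mp3")  # language is detected automatically
print(result["language"])  # detected language code, e.g. "en", "fr", "de", "ja"
print(result["text"])      # the transcribed text used by the later NLP modules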
Speaker Diarization: Speaker diarization is the process of determining “who spoke when” in an audio
or video recording, particularly in scenarios with multiple speakers. This task is crucial for identifying
different speakers and attributing the spoken words to the respective individuals. Speaker diarization
involves segmenting the audio into speaker-specific regions and clustering the segments to assign them
to different speakers. Techniques such as clustering algorithms, speaker embeddings, and turn-taking
analysis can be employed for effective speaker diarization. By performing speaker diarization, the system
can associate the transcribed text with specific speakers, which can be beneficial for generating speaker-
specific summaries or identifying key speakers in the summary.
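The report itself does not include diarization code; the sketch below assumes the pyannote.audio library
and its pretrained diarization pipeline (which requires a Hugging Face access token), purely as one
possible way to realize this step. The audio path and token are placeholders.

from pyannote.audio import Pipeline

# Load a pretrained diarization pipeline (model name and token are assumptions).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)
diarization = pipeline("meeting_recording.wav")  # hypothetical audio file

# Print who spoke when, so transcript segments can be attributed to speakers.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")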
In the case of video input, object detection, and tracking techniques are employed to recognize and track
relevant objects or people in the video frames. Object detection involves identifying and localizing
specific objects or regions of interest within each frame. Various object detection algorithms, such as
Faster R-CNN, YOLO (You Only Look Once), or SSD (Single Shot MultiBox Detector), can be used for
this task. Once the objects are detected, object tracking is performed to follow the movement of the
identified objects across multiple frames. Tracking algorithms, such as Kalman filters, correlation filters,
or deep learning-based trackers, are employed to maintain the continuity of object
trajectories.
Object detection and tracking provide valuable information about the objects or people present
in the video, which can be used for generating summaries focused on specific objects or tracking the
movements of relevant entities.
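As an illustration of this step, the sketch below runs a pretrained Faster R-CNN detector from
torchvision on a single, hypothetical video frame; Faster R-CNN is one of the options named above, and
its use here is an assumption rather than the project's confirmed choice.

import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Load a Faster R-CNN detector pretrained on COCO.
weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

# Read one extracted video frame (placeholder path) and run detection on it.
frame = convert_image_dtype(read_image("frame_0001.jpg"), torch.float)
with torch.no_grad():
    detections = model([frame])[0]  # dict with "boxes", "labels", "scores"

# Keep only confident detections and print their class names, boxes, and scores.
categories = weights.meta["categories"]
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.8:
        print(categories[label], [round(v, 1) for v in box.tolist()], float(score))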
By performing these tasks in Module 1, the system processes the audio or video input to obtain
the necessary information for subsequent analysis and summarization. The outputs of this module,
including transcribed text, speaker identities, and object detections, are used as inputs to Module 2 for
further analysis and extraction of meaningful information.
It’s important to note that the performance of Module 1 heavily relies on the accuracy of the
underlying algorithms and models used for speech recognition, speaker diarization, object detection, and
tracking. Continuous refinement and improvement of these algorithms, along with robust preprocessing

15
techniques, can enhance the quality and reliability of the extracted information, thereby improving the
overall effectiveness of the audio-video summarization system.
In conclusion, Module 1 in audio-video summarization involves processing the audio or video
input by performing tasks such as speech recognition, speaker diarization, object detection, and tracking.
These tasks provide the necessary information for subsequent analysis and summarization. By accurately
recognizing speech, identifying speakers, and detecting and tracking objects, this module lays the
foundation for further processing and extraction of relevant information in the audio-video
summarization system.

Module 2: TEXT EXTRACTION AND ANALYSIS

In audio-video summarization using NLP, Module 2 focuses on extracting meaningful


information from the converted text obtained from the audio or video input. This module involves several
key tasks, including text summarization, entity recognition, sentiment analysis, and keyword extraction.
Let’s explore each of these tasks in more detail.
Text Summarization: Text summarization is the process of condensing a piece of text while retaining
its key information and main points. There are two primary approaches to text summarization: extractive
and abstractive. Extractive summarization involves identifying important sentences or paragraphs from
the text and combining them to form a summary. This approach relies on methods such as sentence
scoring, where sentences are ranked based on their relevance and importance. Techniques like TF-IDF
(Term Frequency-Inverse Document Frequency) and TextRank, which utilizes graph-based ranking
algorithms, are commonly used for extractive summarization.
Abstractive summarization, on the other hand, involves generating a summary by understanding the
content and paraphrasing it in a concise manner. This approach often employs techniques such as neural
networks and sequence-to-sequence models, which can generate summaries that may not be present
verbatim in the original text.
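The following is a minimal sketch of extractive summarization by TF-IDF sentence scoring, one of the
techniques named above. The library choices (NLTK for sentence splitting, scikit-learn for TF-IDF) are
illustrative assumptions; the project's own implementation in Section 5.1 instead uses the LexRank
summarizer from sumy.

import numpy as np
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, n_sentences=3):
    sentences = sent_tokenize(text)
    if len(sentences) <= n_sentences:
        return text
    # Score each sentence as the sum of its TF-IDF term weights.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    # Pick the top-scoring sentences and keep their original order.
    top = sorted(np.argsort(scores)[-n_sentences:])
    return " ".join(sentences[i] for i in top)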
Entity Recognition: Entity recognition refers to the task of identifying and categorizing named entities
in text. Named entities can include people, organizations, locations, dates, and other specific terms. By
recognizing entities, the system can extract and highlight important information that contributes to the
overall summary. Named Entity Recognition (NER) models trained on large datasets can be used to
accurately identify and classify entities in the text.
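A minimal sketch of this step using spaCy's small English model is shown below; the model and the
example sentence are illustrative assumptions (the model must first be downloaded with
"python -m spacy download en_core_web_sm").

import spacy

# Load a small pretrained English NER model (an assumed choice).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Berlin on 3 May 2023, said CEO Tim Cook.")

# Each entity carries a label such as PERSON, ORG, GPE, or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)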

16
Sentiment Analysis: Sentiment analysis aims to determine the emotional tone or sentiment expressed in
the text. This task can be particularly useful in audio-video summarization, as it allows for the
identification of positive, negative, or neutral sentiments associated with specific topics or entities.
Sentiment analysis can be performed using various techniques, such as rule-based approaches, machine
learning models, or pre-trained sentiment analysis frameworks like VADER (Valence Aware Dictionary
and sEntiment Reasoner).
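The sketch below applies VADER, the framework named above, to a single example sentence; it assumes the
vaderSentiment package (NLTK also ships an equivalent analyzer).

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# The compound score ranges from -1 (most negative) to +1 (most positive).
scores = analyzer.polarity_scores("The keynote was engaging and well organised.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}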
Keyword Extraction: Keyword extraction involves identifying important terms or phrases that represent
the main topics or themes in the text. These keywords provide a concise representation of the content
and can be used to generate a summary that captures the essential information. Techniques like TF-IDF,
RAKE (Rapid Automatic Keyword Extraction), or graph-based algorithms like TextRank can be applied
to extract significant keywords from the text.
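A minimal keyword-extraction sketch with RAKE via the rake-nltk package (one assumed choice among the
techniques listed above) is shown below.

from rake_nltk import Rake  # requires NLTK stopwords and punkt data

rake = Rake()
rake.extract_keywords_from_text(
    "Machine learning models summarize multilingual audio and video content "
    "by combining speech recognition with natural language processing."
)
# Ranked candidate key phrases, most relevant first.
print(rake.get_ranked_phrases()[:5])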
By performing these tasks in Module 2, the system can extract the relevant information from the
text obtained from the audio or video input. The extracted information, such as important sentences, key
entities, sentiment analysis results, and significant keywords, will be used in the subsequent module for
generating the audio-video summary.
It’s worth noting that the performance of Module 2 heavily relies on the quality of the NLP models
and techniques used. The accuracy of entity recognition, sentiment analysis, and keyword extraction can
significantly impact the overall effectiveness of the summarization system. Therefore, selecting
appropriate NLP models, ensuring the availability of high-quality training data, and continuously
evaluating and refining the models is crucial for achieving optimal results in audio-video summarization
using NLP.

Module 3: SUMMARIZATION GENERATION

Module 3 is responsible for generating the audio-video summary based on the extracted
information from Module 2. This module combines important sentences, key entities, sentiment analysis
results, and significant keywords to create a concise and coherent summary. The summarization
generation process involves several steps and techniques. Let’s explore them in more detail.
Content Selection: The first step in summarization generation is selecting the most relevant content to
include in the summary. This content can be determined based on the importance scores assigned to

17
sentences or paragraphs during the text summarization process in Module 2. The sentences or paragraphs
with higher scores, indicating their significance, are prioritized for inclusion in the summary.
Summary Structure: Next, the structure of the summary needs to be determined. The summary can be
organized in chronological order, topic-wise, or based on the importance of the information. This
structure ensures that the summary flows logically and provides a coherent representation of the original
content. The structure can be predefined based on the application requirements or dynamically generated
using algorithms that consider the relationships between sentences or topics.
Entity and Keyword Integration: The key entities and significant keywords identified in Module 2
play a crucial role in creating a meaningful summary. These entities and keywords can be integrated into
the summary to provide context and highlight important aspects of the content. For example, the
summary might include key entities mentioned in the original text and describe their roles or
relationships. Additionally, incorporating significant keywords helps reinforce the main topics or themes
covered in the summary.
Sentiment Analysis Integration: If sentiment analysis was performed in Module 2, the generated
sentiment scores can be incorporated into the summary. This integration can be done by including
statements about the overall sentiment expressed in the original content or by associating sentiment
information with specific topics or entities. For example, if the sentiment analysis indicates a positive
sentiment towards a particular product mentioned in the audio or video, the summary might highlight
this positive sentiment.
Text Compression: In order to ensure that the summary remains concise and within a desired length,
text compression techniques can be applied. Compression algorithms aim to reduce redundancy and
eliminate unnecessary information while preserving the core meaning. Techniques like sentence
compression, where certain phrases or words are removed while maintaining the essence of the sentence,
can be employed to achieve text compression. This step helps create a summary that is concise and easier
to consume.
Summary Generation: Finally, the summarized content from the previous steps is combined and
presented as the audio-video summary. The summary can be in the form of text, audio, or both, depending
on the application requirements. It is important to ensure that the summary captures the essential
information, maintains coherence, and provides a comprehensive representation of the original content.
It’s important to note that the summarization generation process in Module 3 can be iterative. The
generated summary may undergo refinement and improvement based on feedback and evaluation. User

18
feedback or automated evaluation metrics can be utilized to assess the quality and effectiveness of the
summary, allowing for further adjustments to the summarization process. The performance of Module 3
depends on the accuracy of the content selection, the integration of entities, keywords, and sentiment
analysis results, and the effectiveness of the text compression techniques employed. Continuous
evaluation and refinement of the summarization generation process are essential for ensuring high-
quality and meaningful audio-video summaries.
In conclusion, Module 3 combines the extracted information from Module 2 to generate a concise and
coherent audio-video summary. By integrating important sentences, key entities, sentiment analysis
results, and significant keywords, this module ensures that the summary captures the essential
information and provides a comprehensive representation of the original audio or video content.

19
5. IMPLEMENTATION AND INSTALLATION

5.1 SOURCE CODE:

app.py

import streamlit as st
import whisper
from whisper.utils import get_writer
from tempfile import NamedTemporaryFile
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Load the Whisper ASR model
model = whisper.load_model("base")

# Define a function to transcribe audio and save as TXT
def transcribe_audio(audio_file):
    with NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(audio_file.read())
        temp_file.flush()
        result = model.transcribe(temp_file.name)
        transcription = result["text"]

        # Save as a TXT file without any line breaks
        with open("transcription.txt", "w", encoding="utf-8") as txt:
            txt.write(transcription)

        # Save as a TXT file with hard line breaks
        txt_writer = get_writer("txt", "./")
        txt_writer(result, temp_file.name)

    return transcription

# Define a function to generate a summary of a text file
def summary_file(file_path):
    parser = PlaintextParser.from_file(file_path, Tokenizer("english"))
    summarizer = LexRankSummarizer()

    # Summarize the document with 5 sentences
    summary = summarizer(parser.document, 5)
    summary_text = ""
    for sentence in summary:
        summary_text += str(sentence) + "\n"
    return summary_text

# Create a Streamlit app
def main():
    st.title("Audio summary")

    # Add file upload option
    uploaded_file = st.file_uploader("Upload an audio file", type=["mp3", "wav"])
    if uploaded_file is not None:
        # Transcribe audio and display the result
        transcription = transcribe_audio(uploaded_file)
        st.write("Transcription:")
        st.write(transcription)

        # Generate a summary of the text file
        summary_text = summary_file("transcription.txt")
        st.write("Summary:")
        st.write(summary_text)

if __name__ == "__main__":
    main()

summary.py

import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def summary_file(File):
    parser = PlaintextParser.from_file(File, Tokenizer("english"))
    summarizer = LexRankSummarizer()

    # Summarize the document with 5 sentences
    summary = summarizer(parser.document, 5)
    for sentence in summary:
        print(sentence)

summary_file("transcription.txt")

updated_app.py

# import streamlit as st

# import whisper

# from whisper.utils import get_writer

# from sumy.parsers.plaintext import PlaintextParser

# from sumy.nlp.tokenizers import Tokenizer

# from sumy.summarizers.lex_rank import LexRankSummarizer

# # Load the Whisper ASR model

# model = whisper.load_model("base")

# # Define a function to transcribe audio and video files

# def transcribe_file(file_path):

# result = model.transcribe(file_path)

# output_directory = "./"

# # Save as a TXT file without any line breaks

# with open("transcription.txt", "w", encoding="utf-8") as txt:

# txt.write(result["text"])

# # Save as a TXT file with hard line breaks

# txt_writer = get_writer("txt", output_directory)

# txt_writer(result, file_path)

# return result["text"]

# # Define a function to summarize text

# def summarize_text(text):

# parser = PlaintextParser.from_string(text, Tokenizer("english"))

# summarizer= LexRankSummarizer()

# # Summarize the document with 2 sentences

# summary = summarizer(parser.document, 2)

# summary_text = ""

# for sentence in summary:

# summary_text += str(sentence) + " "

# return summary_text

# def main():

# st.title("Audio and Video Transcription with Summarization")

# st.subheader("Upload Audio or Video File")

# uploaded_file = st.file_uploader("Choose a file", type=["mp3", "mp4"])

# if uploaded_file is not None:

# file_type = uploaded_file.type.split("/")[0]

# if file_type == "audio":

# transcription = transcribe_file(uploaded_file)

# st.subheader("Transcription:")

# st.write(transcription)

# elif file_type == "video":

# with open(uploaded_file.name, "wb") as video_file:

# video_file.write(uploaded_file.read())

# transcription = transcribe_file(uploaded_file.name)

# st.subheader("Transcription:")

# st.write(transcription)

# else:

# st.warning("Invalid file type. Please upload an audio or video file.")

# summary = summarize_text(transcription)

# st.subheader("Summary:")

# st.write(summary)

# if __name__ == "__main__":

# main()

import streamlit as st
import whisper
from whisper.utils import get_writer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
import os

# Load the Whisper ASR model
model = whisper.load_model("base")

# Define a function to transcribe audio and video files
def transcribe_file(file_path):
    result = model.transcribe(file_path)
    output_directory = "./"

    # Save as a TXT file without any line breaks
    with open("transcription.txt", "w", encoding="utf-8") as txt:
        txt.write(result["text"])

    # Save as a TXT file with hard line breaks
    txt_writer = get_writer("txt", output_directory)
    txt_writer(result, file_path)

    return result["text"]

# Define a function to summarize text
def summarize_text(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()

    # Summarize the document with 5 sentences
    summary = summarizer(parser.document, 5)
    summary_text = ""
    for sentence in summary:
        summary_text += str(sentence) + " "
    return summary_text

def main():
    st.title("Audio and Video Transcription with Summarization")
    st.subheader("Upload Audio or Video File")

    uploaded_file = st.file_uploader("Choose a file", type=["mp3", "mp4"])
    if uploaded_file is not None:
        file_type = uploaded_file.type.split("/")[0]
        if file_type == "audio" or file_type == "video":
            # Save the uploaded file locally
            with open(uploaded_file.name, "wb") as file:
                file.write(uploaded_file.read())
            file_path = os.path.abspath(uploaded_file.name)

            transcription = transcribe_file(file_path)
            st.subheader("Transcription:")
            st.write(transcription)

            summary = summarize_text(transcription)
            st.subheader("Summary:")
            st.write(summary)
        else:
            st.warning("Invalid file type. Please upload an audio or video file.")

if __name__ == "__main__":
    main()

27
audiototext.py

import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")
audio = "./Audio_summarru/test.mp3"
result = model.transcribe(audio)
output_directory = "./"

# Save as a TXT file without any line breaks
with open("transcription.txt", "w", encoding="utf-8") as txt:
    txt.write(result["text"])

# Save as a TXT file with hard line breaks
txt_writer = get_writer("txt", output_directory)
txt_writer(result, audio)

video.py

import moviepy.editor as mp

# Insert Local Video File Path

clip = mp.VideoFileClip(r"Video File")

# Insert Local Audio File Path

clip.audio.write_audiofile(r"Audio File")

readme.file

Things added

1. Function to convert any text document into a 5-line summary.

2. Audio-based function that can convert any audio file to text; the length may vary.

3. Video function that converts the video to audio.

4. App made only for audio summary [app.py].

5. App made for both video and audio summary [updated_app.py].

6. Natural Language Processing using sumy.

7. Streamlit UI for user interaction and displaying the results of both the transcribed text and the summary.

8. Adjustable summary length.

5.2 IMPLEMENTATION STEPS:


STEP 1: Install Dependencies

pip install ffmpeg-python speechrecognition pydub transformers nltk torch librosa openai-whisper google-cloud-speech

STEP 2: Audio transcript

import whisper

from whisper.utils import get_writer

model = whisper.load_model("base")

audio = "./Audio_summarru/test.mp3"

result = model.transcribe(audio)

output_directory = "./"

# Save as a TXT file without any line breaks


with open("transcription.txt", "w", encoding="utf-8") as txt:

txt.write(result["text"])

# Save as a TXT file with hard line breaks

txt_writer = get_writer("txt", output_directory)

txt_writer(result, audio)

STEP 3: video transcript

import moviepy.editor as mp

# Insert Local Video File Path


clip = mp.VideoFileClip(r"Video File")

# Insert Local Audio File Path


clip.audio.write_audiofile(r"Audio File")

STEP 4: summary transcript

import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def summary_file(File):
    parser = PlaintextParser.from_file(File, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    # Summarize the document with 5 sentences
    summary = summarizer(parser.document, 5)
    for sentence in summary:
        print(sentence)

summary_file("transcription.txt")

31
6 DFD DIAGRAMS

6.1 Flow chart

Fig 6: Text Summarization for audio and videos Data Flow Diagram.

The figure titled "Figure 6: Text Summarization for Audio and Videos Data Flow Diagram" illustrates
the comprehensive data flow and core stages involved in the text summarization project for audio and
video inputs. This diagram is essential for visualizing the systematic execution of tasks and the
integration of machine learning and natural language processing (NLP) technologies within the project.
The process begins with the Data Set, which comprises raw audio and video files. This data undergoes
two parallel operations: Preprocessing and Splitup. In the Preprocessing stage, the data is cleaned,
formatted, and converted into a suitable structure for analysis. This may include noise reduction,
speech-to-text conversion, or segmentation of video content.
Simultaneously, the Splitup process divides the data into training and testing sets to facilitate model
development and evaluation. Both processed data and split datasets are then fed into the Audio/Video
Summarization module, the core component responsible for analyzing the content and extracting
meaningful summaries.

32
This module interfaces with NLP (Natural Language Processing) tools in two ways. First, it uses trained
NLP models to understand and generate summaries. Second, it sends results back to NLP modules to
evaluate the accuracy (Acc) and performance of the summarization.
Overall, the data flow diagram provides a structured and interconnected view of the project, enabling
better planning, development, and monitoring of each stage in the summarization pipeline.

33
GOALS:
Primary Goals in the Design of the Data Flow Diagram (DFD)
A Data Flow Diagram (DFD) is a structured graphical representation of how data moves through a
system. The primary goals in designing a DFD are as follows:
1. Understand System Workflow – A DFD provides a clear, visual representation of how data
enters, is processed, and exits a system. It helps in understanding system operations.
2. Identify Data Sources and Destinations – It defines where the data originates (input
sources) and where it is stored or sent (output destinations).
3. Enhance System Analysis – By breaking down a system into smaller processes, a DFD helps
in analyzing and improving system efficiency.
4. Facilitate Communication – Since a DFD is easy to understand, it helps developers,
analysts, and stakeholders communicate system functionalities without technical complexity.
5. Improve System Design – A well-structured DFD helps in designing an efficient and scalable
system by ensuring smooth data flow with minimal redundancy.
6. Detect Bottlenecks and Inefficiencies – By mapping data movement, DFDs help in
identifying problem areas such as data redundancy, slow processing points, or potential
security risks.
7. Support Documentation and Maintenance – It serves as a reference for system
documentation, making future modifications and troubleshooting easier.
8. Aid in Requirement Analysis – A DFD assists in gathering and refining system requirements
by clearly illustrating data interactions and dependencies.

34
7. TESTING

7.1 INTRODUCTION

Discovering and fixing problems is what testing is all about. The purpose of testing is to find and
correct any problems with the final product. It is a method for evaluating the quality of the operation
of anything from a whole product to a single component. The goal of stress testing software is to verify
that it retains its original functionality under extreme circumstances. There are many different tests
to pick from, since there is such a vast range of assessment options.

Who performs the testing: All individuals who play an integral role in the software development
process are responsible for performing the testing. Testing the software is the responsibility of a wide
variety of specialists, including the end users, project manager, software tester, and software developer.

When testing should begin: Testing begins with the requirement-gathering (planning) phase and ends
with the deployment phase. In the waterfall model, testing is explicitly arranged and carried out in the
testing phase. In the incremental model, testing is carried out at the conclusion of each increment or
iteration, and the entire application is examined in the final test.

When it is appropriate to halt testing: Testing a programme is an ongoing activity that can never be
finished completely. Without first putting the software through its paces, no one can guarantee that it
is completely devoid of errors, and because the domain to which the input belongs is so expansive, we
are unable to check every single input.

7.2 TYPES OF TESTS

The following types of testing were carried out:

Unit Testing

The term unit testing refers to a specific kind of software testing in which discrete elements of a
program are investigated. The purpose of this testing is to ensure that the software operates as expected.

35
Test Cases

1. Test the accuracy of the automatic speech recognition (ASR) component by providing sample audio
files in different languages and verifying if the transcriptions match the expected results.
2. Verify the performance of the object recognition algorithm by providing sample video frames
containing various objects and checking if the algorithm correctly identifies and labels the objects.
3. Test the language identification module by providing audio and video files in different languages and
confirming if the system accurately detects the language of the content.

Integration testing
The program is put through its paces in its final form, once all its parts have been combined, during the
integration testing phase. At this phase, we look for places where interactions between components
might cause problems.
Test Cases
1. Test the integration between the automatic speech recognition (ASR) module and the language
identification module. Provide audio samples in different languages and verify if the ASR module
correctly transcribes the speech while the language identification module accurately detects the
language.
2. Validate the integration between the object recognition module and the video summarization module.
Provide video samples with various objects and confirm if the object recognition module successfully
identifies the objects, which are then used by the summarization module to generate relevant
summaries.
3. Test the integration between the sentiment analysis module and the audio summarization module.
Provide audio samples with different emotional tones and verify if the sentiment analysis module
correctly identifies the sentiments, which are then used to generate more contextually appropriate
summaries.
Functional Testing
One kind of software testing is called functional testing, and it involves comparing the system
to the functional requirements and specifications. In order to test functions, their input must first be
provided, and then the output must be examined. Functional testing verifies that an application
successfully satisfies all of its requirements in the correct manner. This particular kind of testing is not

36
concerned with the manner in which processing takes place; rather, it focuses on the outcomes of
processing. Therefore, it endeavors to carry out the test cases, compare the outcomes, and validate the
correctness of the results.
Test Cases
1. Test the functionality of the automatic speech recognition (ASR) module by providing audio samples
in different languages and verifying if the system accurately transcribes the speech into text.
2. Validate the object recognition functionality by providing video frames with various objects and
confirming if the system correctly identifies and labels the objects in the frames.
3. Test the language identification feature by providing audio and video samples in different languages
and verifying if the system accurately detects and identifies the language used.

7.3 TESTING TECHNIQUES

There are many different techniques or methods for testing the software, including the following:
BLACK BOX TESTING
During this kind of testing, the user does not have access to or knowledge of the internal structure or
specifics of the data item being tested. In this method, test cases are generated or designed only based
on the input and output values, and prior knowledge of either the design or the code is not necessary.
The testers only know what the software is supposed to do; they do not know how it does it.
For example, without having any knowledge of the inner workings of the website, we test the web pages
by using a browser, provide the input, and then test and validate the outputs against the
intended result.
Test Cases
1. Test the audio summarisation functionality by providing audio samples in different languages
and verifying if the system generates concise and informative summaries that capture the key
points of the audio content.

37
Figure 7.3.1: Text Summarization for Audio and Videos Blackbox Testing

2. Validate the video summarization functionality by providing video samples with diverse
content and confirming if the system generates summaries that effectively summarize the
important aspects of the video.
3. Test the multilingual support by providing audio and video samples in different languages and
verifying if the system can accurately process and summarise content in each language.

WHITE BOX TESTING
In this kind of testing, test cases are designed with knowledge of the internal structure and code of the
item being tested. For instance, a tester and a developer examine the code that is implemented in each
field of a website, determine which inputs are acceptable and which are not, and then check the output
to ensure it produces the desired result. The decision is reached by analyzing the code that is actually
used.
Test Cases
1. Test the accuracy of the automatic speech recognition (ASR) algorithm by examining the
correctness of the language model, acoustic model, and pronunciation model used in the ASR system.
2. Validate the feature extraction process for audio and video by inspecting the deep learning
models, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), and
ensuring they correctly extract relevant features from the input data.
3. Test the performance of the language identification module by examining the accuracy of the
language models and language-specific features used in the identification process.

38
Figure 7.3.2: Text Summarization for Audio and Videos Whitebox Testing

39
8 RESULTS

The bar chart below shows the performance of the multilingual audio and video summarization
system across the evaluation metrics (accuracy, precision, recall, and F1 score) for each
language.

40
9 SCREEN SHOTS

Fig 9.a: Transcript output

41
Fig 9.b: Summary output

The above figure 9.b shows the summary generated for the given audio file.

42
10 CONCLUSION

In conclusion, the research on audio and video summarization for multiple languages using machine
learning has shown significant promise and potential. The use of machine learning techniques has
facilitated the development of efficient and accurate summarization models that can process and analyze
vast amounts of audio and video content. The application of machine learning in this context has led to
the creation of automated systems capable of extracting key information, identifying important moments,
and generating concise summaries from diverse sources in multiple languages. These advancements have
wide-ranging implications for various fields, including media analysis, content creation, information
retrieval, and more. By leveraging machine learning algorithms, researchers have addressed challenges
related to language diversity, speech recognition, and video understanding. These models have achieved
impressive results, showcasing their ability to handle complex linguistic structures, cultural nuances, and
diverse accents. While this research presents significant progress, there are still areas for improvement.
Fine-tuning models for specific languages and domains, enhancing the handling of multilingual and
code-switching content, and exploring the integration of additional modalities (such as text and image)
could further enhance the accuracy and comprehensiveness of the summarization systems. Overall, the
research on audio and video summarization for multiple languages using machine learning holds great
promise for transforming the way we process, analyze, and consume multimedia content.

43
10.1 Future Scope

This project establishes a proof of the working principle and sets the direction for future
development into a fully learned and automated method for podcast speech
summarization. Given the complex nature of such a problem, we believe there is
plenty of room for improvements. Podcasts usually require active attention from a
listener for extended periods unlike listening to music. In the process of generating a
summary as discussed in this project, the primary input taken for processing is an
audio file. The audio file is generated by recording the human speech which is being
spoken or is already recorded. The produced summaries have intelligible audio
information.
Overall, this research aims to leverage the power of machine learning to address the
challenge of audio and video summarization in multiple languages. By developing an
automated system, users can efficiently navigate and comprehend multimedia content,
saving time and effort in information retrieval and understanding.

44
11 REFERENCES

[1] A. Vartakavi, A. Garg and Z. Rafii, "Audio Summarization for Podcasts," 2021 29th European
Signal Processing Conference (EUSIPCO), 2021.

[2] P. Gupta, S. Nigam and R. Singh, "A Ranking based Language Model for Automatic Extractive
Text Summarization," 2022 First International Conference on Artificial Intelligence Trends and
Pattern Recognition (ICAITPR), 2022.

[3] P. G. Shambharkar and R. Goel, "Analysis of Real Time Video Summarization using Subtitles,"
2021 International Conference on Industrial Electronics Research and Applications (ICIERA), 2021.

[4] A. Gidiotis and G. Tsoumakas, "A Divide-and-Conquer Approach to the Summarization of Long
Documents."

[5] B. Zhao, M. Gong and X. Li, "AudioVisual Video Summarization," IEEE Transactions on Neural
Networks and Learning Systems.

[6] R. V. Ribeiro, F. M. de Paiva and J. B. Neto, "A systematic literature review on video
summarization: Evolution, trends, and perspectives," Information Processing & Management, 57(1),
102094, 2020.

[7] I. Naim, H. J. Pandit, P. Jain and O. P. Vyas, "A survey on audio and video summarization
techniques," Multimedia Tools and Applications, 79(11), 7235-7268, 2020.

[8] X. Zhang, B. Liu, M. Zhou, J. Liu and X. Qian, "Multimodal deep learning for video
summarization," Neurocomputing, 365, 222-231, 2019.

46
