"Text Summarization For Audio and Video Files": Bachelor of Technology in Information Technology by
"Text Summarization For Audio and Video Files": Bachelor of Technology in Information Technology by
On
Bachelor of Technology
In
Information Technology
By
M.AISHWARYA (21R21A12A1)
P.KEERTHANA (21R21A12B4)
G.SHIVA KUMAR(21R21A1289)
K.SRI HARSHINI (21R21A12A0)
2025
Department of Information Technology
CERTIFICATE
This is to certify that the project entitled “Text Summarization for Audio and Video Files” has been
submitted by M. AISHWARYA (21R21A12A1), P. KEERTHANA (21R21A12B4), G. SHIVA
KUMAR (21R21A1289), and K. SRI HARSHINI (21R21A12A0) in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Information Technology from
Jawaharlal Nehru Technological University, Hyderabad. The results embodied in this project have not
been submitted to any other University or Institution for the award of any degree or diploma.
External Examiner
DECLARATION
We hereby declare that the project entitled “Text Summarization for Audio and Video Files”
is the work done during the period from January 2023 to May 2023 and is submitted in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology in Information
Technology from Jawaharlal Nehru Technological University, Hyderabad. The results embodied in
this project have not been submitted to any other University or Institution for the award of any
degree or diploma.
M.Aishwarya (21R21A12A1)
P.Keerthana (21R21A12B4)
G.Shiva Kumar(21R21A1289)
K.Sri Harshini (21R21A12A0)
ACKNOWLEDGEMENT
There are many people who helped us directly and indirectly to complete our project
successfully. We would like to take this opportunity to thank one and all. First of all, we would like to
express our deep gratitude towards our supervisor, Mrs. Adilakshmi, Assistant Professor,
Department of IT, for her support in the completion of our dissertation. We wish to express our
sincere thanks to Dr. N. V. RAJASHEKAR REDDY, HOD, Department of IT, and also to the
Principal, Dr. K. SRINIVASA RAO, for providing the facilities to complete the dissertation.
We would like to thank all our faculty, coordinators and friends for their help and constructive
criticism during the project period. Finally, we are very much indebted to our parents for their moral
support and encouragement to achieve our goals.
M.Aishwarya (21R21A12A1)
P.Keerthana (21R21A12B4)
G.Shiva Kumar (21R21A1289)
K.Sri Harshini (21R21A12A0)
ABSTRACT
This abstract introduces the concept of audio and video summarization for multiple languages using
machine learning techniques. With the exponential growth of multimedia content on the internet, there
is a growing need for efficient and automated methods to summarize audio and video data in English,
French, Japanese, and German languages. The goal is to extract key information and present a concise
summary that captures the essence of the content, saving time and enhancing accessibility for users.
Machine learning algorithms, particularly those based on deep learning and natural language
processing, have proven to be highly effective in various text-based tasks. However, extending these
techniques to multimedia data poses unique challenges. In this study, we propose a novel approach that
leverages machine learning algorithms to summarize audio and video content in multiple languages.
The proposed system utilizes techniques such as automatic speech recognition, speaker diarization,
sentiment analysis, object recognition, and scene detection to extract relevant information from the
multimedia input. By combining these components, the system generates a coherent and informative
summary that can be presented in a written or visual format. To develop this system, a large dataset of
multilingual audio and video content is collected and annotated for training purposes. The machine
learning models are then trained on this dataset to learn patterns and relationships between different
features and their corresponding summaries. The performance of the system is evaluated using metrics
like accuracy, precision, recall, and F1 score. The results demonstrate the effectiveness of the proposed
approach in summarizing audio and video content in multiple languages, providing users with concise
and meaningful summaries that facilitate information retrieval and understanding. In conclusion, the
proposed approach demonstrates the potential of machine learning techniques for audio and video
summarization in multiple languages.
CONTENTS
S. No Contents Page No
Certificate II
Declaration III
Acknowledgement IV
Abstract V
1 Introduction 1-2
1.1 General
1.2 Problem Definition
1.3 Objective
2 Literature Survey 3-4
2.1 Existing System
3 Proposed System 5-8
3.1 Advantages of Proposed System
4 System Design 9-19
4.1 General
4.2 Project Flow
4.3 Modules
5 Implementation and Installation 20-31
5.1 Source Code
5.2 Implementation Steps
6 DFD Diagrams 32-34
7 Testing 35-39
7.1 Introduction
7.2 Types of Testing
7.3 Testing Techniques
8 Results 40
9 Screenshots 41-42
10 Conclusion 43-44
10.1 Future Scope
11 References 45-46
List of Figures
1. INTRODUCTION
1.1 General
In today’s digital age, the abundance of audio and video content available on various platforms has
led to the need for efficient methods of summarization. Extracting key information from multimedia
sources can be daunting, especially when dealing with content in multiple languages. However,
with advancements in machine learning techniques, there is great potential to automate the process
of audio and video summarization for multiple languages.
The aim of this project is to explore the application of machine learning algorithms in
summarizing audio and video content across different languages. By developing a system that can
automatically identify and extract important elements from multimedia inputs, a concise summary
can be generated, facilitating easier access and understanding for users.
The proposed approach involves utilizing various machine learning techniques such as
automatic speech recognition, sentiment analysis, and object recognition. These techniques enable
the system to analyze and extract relevant information from the audio and visual components of the
multimedia content. By incorporating deep learning models and natural language processing
algorithms, the system can process and understand the content in multiple languages, making it
adaptable and scalable to diverse linguistic contexts.
The significance of this research lies in its potential to enhance accessibility and user
experience when dealing with large volumes of multimedia data. With an automated summarization
system, users can efficiently navigate and consume content across different languages, saving time
and effort in the process. Moreover, the application of machine learning in this context opens up
opportunities for improved information retrieval, content indexing, and cross-lingual analysis.
In conclusion, this project aims to harness the power of machine learning to tackle the
challenge of audio and video summarization in multiple languages. By automating the extraction
of key information, users can access and comprehend multimedia content more effectively, thereby
enhancing the overall multimedia experience in a multilingual context.
1.2 Problem Definition
Audio data obtained from today’s common sources, such as podcasts and videos, is not always an
efficient way to gather information. In such situations, it is more effective to work with condensed
data that concentrates on the key ideas rather than the complete text. In this technique, an audio file
is used as the input, and data summarization procedures are applied to produce a condensed output.
1.3 Objective
2. LITERATURE SURVEY
The issue of automatic summarization has a number of potential solutions. These include
straightforward unsupervised techniques; graph-based approaches, which arrange the input text in a
graph and rank its elements using graph traversal algorithms; and neural approaches[1], which are
covered in more depth in the following paragraphs.
This paper presents techniques for converting speech audio files to text files and text
summarization on the text file. It is very difficult for a user to get an accurate summary or to
comprehend the relevant and important items from the available media. Additionally, readers or
evaluators of these data files are interested only in the relevant content or summary to be retrieved
in less time from the source files[2]. Automatic text summarization (ATS) is the only way to
summarize single or multiple documents to obtain relevant content from the source files.
As the amount of available video content continues to grow at a rapid rate, having access to an
automatic video summary would be advantageous for anybody who values their time and wants to
learn more while spending less. Due to the rapid increase in the number of videos made by users,
the ability to correctly view these films is becoming increasingly vital. By discovering and
selecting informative stills from the film, a video summary is viewed as a potentially valuable
method for maximizing the information contained inside films[3]. Using text summarization and
video mapping algorithms, it is seen how essential video elements may be retrieved from the film’s
subtitles and utilized to provide a summary of the movie’s content.
This paper presents a novel divide-and-conquer method for the neural summarization of long
documents. In particular, we break a long document and its summary into multiple source-target
pairs, which are used for training a model that learns to summarize each part of the document
separately. These partial summaries are then combined in order to produce a final complete
summary[4]. With this approach we can decompose the problem of long document summarization
into smaller and simpler problems, reducing computational complexity and creating more training
examples, which at the same time contain less noise in the target summaries compared to the
standard approach. We demonstrate that this approach paired with different summarization models,
including sequence-to-sequence RNNs and Transformers, can lead to improved summarization
performance. Our best models achieve results that are on par with the state-of-the-art in two
publicly available datasets of academic articles.
AUTHORS: Pravin Khandare, Sanket Gaikwad, Aditya Kukade, Rohit Panicker, Swaraj Thamke.
This paper presents techniques for converting speech audio files to text files and text
summarization on the text file. For the former case, we have used Python modules to convert the
audio files to text format[5]. For the latter case, Natural Language Processing modules are used
for text summarization. The summarization method is extractive: important sentences are obtained
from the text. Weights are assigned to words according to the number of
occurrences of each word in the text file. This technique is used for producing summaries from the
main audio file.
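As a rough, illustrative sketch of this frequency-weighting idea (not the exact code of the cited paper; the file name transcription.txt is a placeholder), sentences can be scored by the occurrence counts of their words and the highest-scoring ones kept:

import re
from collections import Counter

def frequency_summary(text, num_sentences=3):
    # Split the transcription into sentences and weight each word by its frequency.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    weights = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence as the sum of its word weights and keep the top ones.
    ranked = sorted(sentences, key=lambda s: sum(weights[w] for w in re.findall(r"\w+", s.lower())), reverse=True)
    top = set(ranked[:num_sentences])
    # Preserve the original sentence order in the output summary.
    return " ".join(s for s in sentences if s in top)

with open("transcription.txt") as f:  # placeholder transcription file
    print(frequency_summary(f.read()))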
2.1 Existing System
The existing system only identifies emotions when an audio input is given. It uses the Python speech
recognition library, which might not work for all inputs since the library has dependency issues. The
existing system cannot analyze a big dataset of audio files passed as input because it has not been
trained with a large dataset.
3. PROPOSED SYSTEM
• The proposed system for audio and video summarization in multiple languages using machine
learning aims to leverage advanced technologies to automate the process of extracting key
information from multimedia content in various languages. The system will integrate different
components and techniques to achieve effective summarization.
• For audio summarization, the system will employ automatic speech recognition (ASR) to
transcribe audio into text. Machine learning algorithms, such as deep neural networks, will be
utilized to extract relevant features from the audio data. Techniques like speaker diarization,
sentiment analysis, and topic modeling will be applied to identify important speakers, sentiments
expressed, and significant themes within the audio content.
• Regarding video summarization, the system will utilize computer vision algorithms, including
object recognition and scene detection, to identify key objects, scenes, and representative frames.
Deep learning models, such as convolutional neural networks (CNNs), will be employed to extract
visual features from the video data.
• To handle multiple languages, the system will incorporate natural language processing (NLP)
techniques. Language identification models will determine the language of the audio and video
content, enabling language-specific processing. Translation models will be used to facilitate cross-
lingual summarization, allowing users to obtain summaries in their preferred language.
• The system will also employ machine learning models, such as recurrent neural networks (RNNs)
or transformers, to integrate and process the extracted audio and visual features, ensuring a
cohesive summarization across modalities.
• Evaluation metrics, such as coherence, relevance, and informativeness, will be used to assess the
quality of the generated summaries. The system will undergo rigorous testing on diverse datasets
in multiple languages to validate its performance.
The proposed system for multilingual audio and video summarization using machine learning
offers numerous benefits that make it highly suitable for addressing the increasing demand for
efficient content processing across various languages and modalities. These advantages stem from
its integrated use of advanced machine learning, deep learning, and natural language processing
(NLP) techniques to extract, process, and summarize rich multimedia data. The key advantages
are discussed in detail below:
1. Support for Multiple Languages
One of the most significant strengths of the proposed system is its ability to handle audio and video
content in multiple languages, including English, French, Japanese, and German. This
multilingual capability is essential in today’s global digital environment, where users consume
content from diverse linguistic backgrounds. The system includes language identification
components and translation models to ensure that summaries are generated in the user's preferred
language. This feature not only broadens the reach of the system but also promotes inclusivity and
accessibility for non-native speakers and international audiences.
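As a minimal sketch of how such language identification could be wired in with the Whisper model used elsewhere in this report (the audio path sample.mp3 is a placeholder and this is only one possible realization), Whisper's built-in detector reports the most likely language before full transcription:

import whisper

model = whisper.load_model("base")

# Load up to 30 seconds of audio and compute the log-Mel spectrogram Whisper expects.
audio = whisper.load_audio("sample.mp3")  # placeholder path
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language returns per-language probabilities; keep the most likely code.
_, probs = model.detect_language(mel)
language = max(probs, key=probs.get)
print("Detected language:", language)

# The detected code can then drive language-specific transcription and summarization.
result = model.transcribe("sample.mp3", language=language)
print(result["text"])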
2. Efficient and Time-Saving Content Access
With the vast amount of audio and video data being produced daily across platforms like YouTube,
news agencies, and educational portals, users often struggle to extract relevant information
quickly. The proposed system provides concise and meaningful summaries, allowing users to
understand the essence of the content without needing to consume the entire audio or video. This
time-saving capability is particularly valuable in professional settings such as research, journalism,
and corporate training, where rapid comprehension is critical.
3. Integration of Audio and Visual Modalities
The system is designed to process and summarize both audio and video data simultaneously,
ensuring a holistic understanding of the multimedia content. It uses automatic speech recognition
(ASR) and speaker diarization to handle the audio portion, while leveraging computer vision
techniques like object recognition and scene detection for the visual component. This multimodal
approach results in summaries that are not only linguistically accurate but also visually
contextualized, providing richer and more informative content to the user.
4. Advanced Machine Learning and Deep Learning Techniques
By employing state-of-the-art machine learning algorithms, including deep neural networks,
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, the
system is capable of extracting high-level features from complex data. These models are trained
on large, annotated datasets to recognize patterns, identify key elements, and generate coherent
summaries. The use of deep learning enhances the system’s performance in terms of accuracy,
relevance, and overall summary quality.
5. Enhanced Accessibility and Usability
The summarization system improves content accessibility for users with different needs. For
instance, hearing-impaired users can benefit from textual summaries of audio content, while
visually impaired users might rely on audio summaries generated from video. Furthermore, the
inclusion of translation capabilities means that users can receive summaries in their native or
preferred languages, making the system valuable for diverse user groups.
6. Improved User Experience through Sentiment and Context Analysis
Beyond extracting textual information, the system incorporates sentiment analysis and topic
modeling to interpret the emotional tone and contextual themes of the content. This allows the
generated summaries to reflect not just what was said or shown, but also how it was conveyed,
which is crucial for content like interviews, lectures, or political speeches where tone and emotion
add significant value.
7. Scalability and Automation for Large-Scale Applications
Given its automated nature, the system is highly scalable and can be integrated into large-scale
platforms that deal with vast quantities of multimedia content. Whether it’s a video hosting site, a
digital learning platform, or a news aggregation service, the system can function without manual
intervention, thus reducing operational costs and increasing processing efficiency.
8. Adaptability Across Domains
The flexible architecture of the system allows it to be tailored for specific industries or domains.
For example, in healthcare, the system can summarize medical lectures or patient interviews. In
education, it can create concise summaries of recorded lectures. By training the machine learning
models on domain-specific data, the system becomes even more accurate and relevant to particular
fields.
9. Cross-Lingual Summarization and Translation
The incorporation of NLP and translation models enables the system to convert content from one
language to another and still generate meaningful summaries. This cross-lingual capability is
highly beneficial for international businesses, academic institutions, and media outlets that operate
across linguistic boundaries. It ensures that content is not lost in translation and that users receive
accurate and contextually relevant summaries.
10. Comprehensive Evaluation Metrics
To ensure the quality of the generated summaries, the system uses a set of evaluation metrics such
as accuracy, precision, recall, F1 score, coherence, relevance, and informativeness. These metrics
provide a robust framework for performance measurement, helping developers fine-tune the
models and maintain high standards. Furthermore, the system can be validated using both
automated metrics and human feedback to continuously improve its summarization capabilities.
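For the classification-style metrics (accuracy, precision, recall, and F1 score), a minimal scikit-learn sketch is shown below; the label vectors are illustrative placeholders rather than the project's actual evaluation data.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder labels: 1 = sentence kept in the reference summary, 0 = not kept.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))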
4 SYSTEM DESIGN
4.1 General
The System Design Document describes the system requirements, operating environment, system and
subsystem architecture, input formats, output layouts, human-machine interfaces, detailed design,
processing logic, and external interfaces. The design and working of the whole system are organized
into the modules described in Section 4.3.
4.2 Project Flow
Figure 4.2.1 below is a graphical aid for project management. It depicts the parallel and
interconnected processes within the project plan, making it possible to manage the project
effectively by viewing the entire project cycle.
Figure 4.2.1: Text Summarization for audio and videos Data Flow Diagram
4.2.2 Project Architecture Diagram
Figure 4.2.2: Text Summarization for Audio and Videos Architecture Diagram
Figure 4.2.3: Text Summarization for Audio and Videos Usecase Diagram
Figure 4.2.4: Text Summarization for Audio and Videos Activity Diagram
Figure 4.2.5: Text Summarization for Audio and Videos Class Diagram
Figure 4.2.6: Text Summarization for Audio and Videos Sequence Diagram
4.3 MODULES
Module 1 of the audio-video summarization system focuses on processing the audio or video input to
extract the necessary information. This module involves several tasks such as speech recognition, speaker
diarization, object detection, and tracking. Let’s delve into each of these tasks and their functionalities in
more detail.
Speech Recognition: Speech recognition is a fundamental task in audio processing, where the spoken
words in the audio input are converted into text. This process involves analyzing the acoustic features of
the audio signal and mapping them to corresponding textual representations. Automatic Speech
Recognition (ASR) systems are commonly used for speech recognition; these utilize techniques like
Hidden Markov Models (HMMs), deep neural networks (DNNs), or transformer models. These models
are trained on large amounts of audio data with corresponding transcriptions to learn the mapping
between audio signals and text. By performing speech recognition, the system obtains a textual
representation of the spoken content, enabling subsequent NLP tasks to be applied to the
transcribed text.
Speaker Diarization: Speaker diarization is the process of determining "who spoke when" in an audio
or video recording, particularly in scenarios with multiple speakers. This task is crucial for identifying
different speakers and attributing the spoken words to the respective individuals. Speaker diarization
involves segmenting the audio into speaker-specific regions and clustering the segments to assign them
to different speakers. Techniques such as clustering algorithms, speaker embeddings, and turn-taking
analysis can be employed for effective speaker diarization. By performing speaker diarization, the system
can associate the transcribed text with specific speakers, which can be beneficial for generating speaker-
specific summaries or identifying key speakers in the summary.
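A toy sketch of the segmentation-and-clustering step described above is given below; it assumes mean MFCC vectors computed with librosa stand in for proper speaker embeddings, and meeting.wav and the speaker count are placeholders (production systems would use dedicated embedding models):

import librosa
import numpy as np
from sklearn.cluster import KMeans

def toy_diarization(audio_path, segment_sec=2.0, num_speakers=2):
    # Load the audio and cut it into fixed-length segments.
    y, sr = librosa.load(audio_path, sr=16000)
    seg_len = int(segment_sec * sr)
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len, seg_len)]
    # Use mean MFCC vectors as crude per-segment "speaker embeddings".
    embeddings = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=20).mean(axis=1) for s in segments])
    # Cluster the segments into the assumed number of speakers.
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
    return [(i * segment_sec, (i + 1) * segment_sec, int(lab)) for i, lab in enumerate(labels)]

for start, end, speaker in toy_diarization("meeting.wav"):  # placeholder file
    print(f"{start:5.1f}s - {end:5.1f}s  speaker {speaker}")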
In the case of video input, object detection and tracking techniques are employed to recognize and track
relevant objects or people in the video frames. Object detection involves identifying and localizing
specific objects or regions of interest within each frame. Various object detection algorithms, such as
Faster R-CNN, YOLO (You Only Look Once), or SSD (Single Shot MultiBox Detector), can be used for
this task. Once the objects are detected, object tracking is performed to follow the movement of the
identified objects across multiple frames. Tracking algorithms, such as Kalman filters, correlation filters,
or deep learning-based trackers, are employed to maintain the continuity of object
trajectories.
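A hedged sketch of the per-frame detection step, assuming the ultralytics YOLO package and a placeholder video path (neither is mandated by this report; Faster R-CNN or SSD would fit the same slot):

import cv2
from ultralytics import YOLO  # assumed detector package

model = YOLO("yolov8n.pt")  # small pretrained model, downloaded on first use

cap = cv2.VideoCapture("sample_video.mp4")  # placeholder path
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run detection roughly once per second, assuming ~30 fps input.
    if frame_idx % 30 == 0:
        results = model(frame, verbose=False)
        names = [model.names[int(box.cls)] for box in results[0].boxes]
        print(f"frame {frame_idx}: {names}")
    frame_idx += 1
cap.release()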
Object detection and tracking provide valuable information about the objects or people present
in the video, which can be used for generating summaries focused on specific objects or tracking the
movements of relevant entities.
By performing these tasks in Module 1, the system processes the audio or video input to obtain
the necessary information for subsequent analysis and summarization. The outputs of this module,
including transcribed text, speaker identities, and object detections, are used as inputs to Module 2 for
further analysis and extraction of meaningful information.
It’s important to note that the performance of Module 1 heavily relies on the accuracy of the
underlying algorithms and models used for speech recognition, speaker diarization, object detection, and
tracking. Continuous refinement and improvement of these algorithms, along with robust preprocessing
techniques, can enhance the quality and reliability of the extracted information, thereby improving the
overall effectiveness of the audio-video summarization system.
In conclusion, Module 1 in audio-video summarization involves processing the audio or video
input by performing tasks such as speech recognition, speaker diarization, object detection, and tracking.
These tasks provide the necessary information for subsequent analysis and summarization. By accurately
recognizing speech, identifying speakers, and detecting and tracking objects, this module lays the
foundation for further processing and extraction of relevant information in the audio-video
summarization system.
Sentiment Analysis: Sentiment analysis aims to determine the emotional tone or sentiment expressed in
the text. This task can be particularly useful in audio-video summarization, as it allows for the
identification of positive, negative, or neutral sentiments associated with specific topics or entities.
Sentiment analysis can be performed using various techniques, such as rule-based approaches, machine
learning models, or pre-trained sentiment analysis frameworks like VADER (Valence Aware Dictionary
and sEntiment Reasoner).
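A minimal sketch using NLTK's bundled VADER analyzer (one of the frameworks named above); the sample sentence is purely illustrative:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The keynote was engaging and the demos worked flawlessly.")
print(scores)  # neg/neu/pos proportions plus a compound score in [-1, 1]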
Keyword Extraction: Keyword extraction involves identifying important terms or phrases that represent
the main topics or themes in the text. These keywords provide a concise representation of the content
and can be used to generate a summary that captures the essential information. Techniques like TF-IDF,
RAKE (Rapid Automatic Keyword Extraction), or graph-based algorithms like TextRank can be applied
to extract significant keywords from the text.
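As a small sketch of the TF-IDF option (RAKE or TextRank would slot in similarly), scikit-learn can rank the highest-weighted terms of a transcription; the two documents below are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(documents, k=5):
    # Fit TF-IDF over the documents and return the k highest-weighted terms of the first one.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    terms = vectorizer.get_feature_names_out()
    weights = matrix[0].toarray().ravel()
    return [terms[i] for i in weights.argsort()[::-1][:k]]

docs = ["Placeholder transcription about machine learning for video summarization.",
        "A second placeholder document so that inverse document frequency is meaningful."]
print(top_keywords(docs))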
By performing these tasks in Module 2, the system can extract the relevant information from the
text obtained from the audio or video input. The extracted information, such as important sentences, key
entities, sentiment analysis results, and significant keywords, will be used in the subsequent module for
generating the audio-video summary.
It’s worth noting that the performance of Module 2 heavily relies on the quality of the NLP models
and techniques used. The accuracy of entity recognition, sentiment analysis, and keyword extraction can
significantly impact the overall effectiveness of the summarization system. Therefore, selecting
appropriate NLP models, ensuring the availability of high-quality training data, and continuously
evaluating and refining the models is crucial for achieving optimal results in audio-video summarization
using NLP.
Module 3 is responsible for generating the audio-video summary based on the extracted
information from Module 2. This module combines important sentences, key entities, sentiment analysis
results, and significant keywords to create a concise and coherent summary. The summarization
generation process involves several steps and techniques. Let’s explore them in more detail.
Content Selection: The first step in summarization generation is selecting the most relevant content to
include in the summary. This content can be determined based on the importance scores assigned to
sentences or paragraphs during the text summarization process in Module 2. The sentences or paragraphs
with higher scores, indicating their significance, are prioritized for inclusion in the summary.
Summary Structure: Next, the structure of the summary needs to be determined. The summary can be
organized in chronological order, topic-wise, or based on the importance of the information. This
structure ensures that the summary flows logically and provides a coherent representation of the original
content. The structure can be predefined based on the application requirements or dynamically generated
using algorithms that consider the relationships between sentences or topics.
Entity and Keyword Integration: The key entities and significant keywords identified in Module 2
play a crucial role in creating a meaningful summary. These entities and keywords can be integrated into
the summary to provide context and highlight important aspects of the content. For example, the
summary might include key entities mentioned in the original text and describe their roles or
relationships. Additionally, incorporating significant keywords helps reinforce the main topics or themes
covered in the summary.
Sentiment Analysis Integration: If sentiment analysis was performed in Module 2, the generated
sentiment scores can be incorporated into the summary. This integration can be done by including
statements about the overall sentiment expressed in the original content or by associating sentiment
information with specific topics or entities. For example, if the sentiment analysis indicates a positive
sentiment towards a particular product mentioned in the audio or video, the summary might highlight
this positive sentiment.
Text Compression: In order to ensure that the summary remains concise and within a desired length,
text compression techniques can be applied. Compression algorithms aim to reduce redundancy and
eliminate unnecessary information while preserving the core meaning. Techniques like sentence
compression, where certain phrases or words are removed while maintaining the essence of the sentence,
can be employed to achieve text compression. This step helps create a summary that is concise and easier
to consume.
Summary Generation: Finally, the summarized content from the previous steps is combined and
presented as the audio-video summary. The summary can be in the form of text, audio, or both, depending
on the application requirements. It is important to ensure that the summary captures the essential
information, maintains coherence, and provides a comprehensive representation of the original content.
It’s important to note that the summarization generation process in Module 3 can be iterative. The
generated summary may undergo refinement and improvement based on feedback and evaluation. User
feedback or automated evaluation metrics can be utilized to assess the quality and effectiveness of the
summary, allowing for further adjustments to the summarization process. The performance of Module 3
depends on the accuracy of the content selection, the integration of entities, keywords, and sentiment
analysis results, and the effectiveness of the text compression techniques employed. Continuous
evaluation and refinement of the summarization generation process are essential for ensuring high-
quality and meaningful audio-video summaries.
In conclusion, Module 3 combines the extracted information from Module 2 to generate a concise and
coherent audio-video summary. By integrating important sentences, key entities, sentiment analysis
results, and significant keywords, this module ensures that the summary captures the essential
information and provides a comprehensive representation of the original audio or video content.
5. IMPLEMENTATION AND INSTALLATION
5.1 Source Code
app.py
import tempfile
import streamlit as st
import whisper
from whisper.utils import get_writer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

model = whisper.load_model("base")

def transcribe_audio(audio_file):
    # Save the uploaded audio to a temporary file so Whisper can read it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as temp_file:
        temp_file.write(audio_file.read())
        temp_file.flush()
    result = model.transcribe(temp_file.name)
    transcription = result["text"]
    txt_writer = get_writer("txt", "./")
    txt_writer(result, temp_file.name)
    return transcription

def summary_file(file_path):
    # Build a five-sentence LexRank summary of the transcription file.
    parser = PlaintextParser.from_file(file_path, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, 5)
    summary_text = " ".join(str(sentence) for sentence in summary)
    return summary_text

def main():
    st.title("Audio summary")
    # Assumption: the original UI used a Streamlit file uploader for the audio input.
    audio_file = st.file_uploader("Upload an audio file", type=["mp3", "wav", "m4a"])
    if audio_file is not None:
        transcription = transcribe_audio(audio_file)
        st.write(transcription)
        with open("transcription.txt", "w") as f:
            f.write(transcription)
        summary_text = summary_file("transcription.txt")
        st.write(summary_text)

if __name__ == "__main__":
    main()
summary.py
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def summary_file(File):
    parser = PlaintextParser.from_file(File, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, 5)
    for sentence in summary:
        print(sentence)

summary_file("transcription.txt")
update app.py
# import streamlit as st
# import whisper
# model = whisper.load_model("base")
# def transcribe_file(file_path):
# result = model.transcribe(file_path)
# output_directory = "./"
# txt.write(result["text"])
# txt_writer(result, file_path)
# return result["text"]
# # Define a function to summarize text
# def summarize_text(text):
# summarizer= LexRankSummarizer()
# summary = summarizer(parser.document, 2)
# summary_text = ""
# return summary_text
# def main():
# file_type = uploaded_file.type.split("/")[0]
# if file_type == "audio":
# transcription = transcribe_file(uploaded_file)
# st.subheader("Transcription:")
# st.write(transcription)
# elif file_type == "video":
# video_file.write(uploaded_file.read())
# transcription = transcribe_file(uploaded_file.name)
# st.subheader("Transcription:")
# st.write(transcription)
# else:
# summary = summarize_text(transcription)
# st.subheader("Summary:")
# st.write(summary)
# if __name__ == "__main__":
# main()
import os
import streamlit as st
import whisper
from whisper.utils import get_writer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

model = whisper.load_model("base")

def transcribe_file(file_path):
    # Transcribe with Whisper, then save the text and Whisper's .txt output file.
    result = model.transcribe(file_path)
    output_directory = "./"
    with open(os.path.join(output_directory, "transcription.txt"), "w") as txt:
        txt.write(result["text"])
    txt_writer = get_writer("txt", output_directory)
    txt_writer(result, file_path)
    return result["text"]

def summarize_text(text):
    # Two-sentence LexRank summary built directly from the transcription string.
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, 2)
    summary_text = " ".join(str(sentence) for sentence in summary)
    return summary_text

def main():
    st.title("Audio and video summary")
    # Assumption: the original UI accepted both audio and video uploads.
    uploaded_file = st.file_uploader("Upload a file", type=["mp3", "wav", "mp4"])
    if uploaded_file is not None:
        file_type = uploaded_file.type.split("/")[0]  # "audio" or "video"
        # Save the upload to disk so Whisper can read it by path.
        with open(uploaded_file.name, "wb") as file:
            file.write(uploaded_file.read())
        file_path = os.path.abspath(uploaded_file.name)
        transcription = transcribe_file(file_path)
        st.subheader("Transcription:")
        st.write(transcription)
        summary = summarize_text(transcription)
        st.subheader("Summary:")
        st.write(summary)
    else:
        st.info("Please upload an audio or video file.")

if __name__ == "__main__":
    main()
audiototext.py
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")
audio = "./Audio_summarru/test.mp3"
result = model.transcribe(audio)

output_directory = "./"
# transcription.txt is the file name assumed by summary.py
with open(output_directory + "transcription.txt", "w") as txt:
    txt.write(result["text"])
txt_writer = get_writer("txt", output_directory)
txt_writer(result, audio)
video.py
import moviepy.editor as mp
clip = mp.VideoFileClip(r"Video File")  # placeholder: path to the source video
clip.audio.write_audiofile(r"Audio File")
readme.file
Things added
1. Function to convert any text document into a 5-line summary.
2. Audio-based function that can convert any audio file to text; the length may vary till ...
7. Streamlit UI for user interaction and displaying the results of both the transcribed text and the summary.
pip install ffmpeg-python speechrecognition pydub transformers nltk torch librosa openai-whisper google-cloud-speech
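Note: the command above omits a few libraries that the source code in Section 5.1 imports; assuming the standard PyPI package names, they can be added with:
pip install streamlit sumy moviepy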
import whisper
from whisper.utils import get_writer

model = whisper.load_model("base")
audio = "./Audio_summarru/test.mp3"
result = model.transcribe(audio)

output_directory = "./"
with open(output_directory + "transcription.txt", "w") as txt:
    txt.write(result["text"])
txt_writer = get_writer("txt", output_directory)
txt_writer(result, audio)
import moviepy.editor as mp

import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def summary_file(File):
    parser = PlaintextParser.from_file(File, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    # Summarize the document with 5 sentences
    summary = summarizer(parser.document, 5)
    for sentence in summary:
        print(sentence)

summary_file("transcription.txt")
6 DFD DIAGRAMS
Fig 6: Text Summarization for audio and videos Data Flow Diagram.
The figure titled "Figure 6: Text Summarization for Audio and Videos Data Flow Diagram" illustrates
the comprehensive data flow and core stages involved in the text summarization project for audio and
video inputs. This diagram is essential for visualizing the systematic execution of tasks and the
integration of machine learning and natural language processing (NLP) technologies within the project.
The process begins with the Data Set, which comprises raw audio and video files. This data undergoes
two parallel operations: Preprocessing and Splitup. In the Preprocessing stage, the data is cleaned,
formatted, and converted into a suitable structure for analysis. This may include noise reduction,
speech-to-text conversion, or segmentation of video content.
Simultaneously, the Splitup process divides the data into training and testing sets to facilitate model
development and evaluation. Both processed data and split datasets are then fed into the Audio/Video
Summarization module, the core component responsible for analyzing the content and extracting
meaningful summaries.
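A minimal sketch of the Splitup step, assuming a placeholder list of annotated media files and an 80/20 split (the project's actual file list and ratio are not specified here):

from sklearn.model_selection import train_test_split

# Placeholder list of annotated media files; split 80/20 into training and testing sets.
files = ["clip_001.mp4", "clip_002.mp3", "clip_003.mp4", "clip_004.wav", "clip_005.mp4"]
train_files, test_files = train_test_split(files, test_size=0.2, random_state=42)
print("train:", train_files)
print("test:", test_files)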
This module interfaces with NLP (Natural Language Processing) tools in two ways. First, it uses trained
NLP models to understand and generate summaries. Second, it sends results back to NLP modules to
evaluate the accuracy (Acc) and performance of the summarization.
Overall, the data flow diagram provides a structured and interconnected view of the project, enabling
better planning, development, and monitoring of each stage in the summarization pipeline.
GOALS:
Primary Goals in the Design of the Data Flow Diagram (DFD)
A Data Flow Diagram (DFD) is a structured graphical representation of how data moves through a
system. The primary goals in designing a DFD are as follows:
1. Understand System Workflow – A DFD provides a clear, visual representation of how data
enters, is processed, and exits a system. It helps in understanding system operations.
2. Identify Data Sources and Destinations – It defines where the data originates (input
sources) and where it is stored or sent (output destinations).
3. Enhance System Analysis – By breaking down a system into smaller processes, a DFD helps
in analyzing and improving system efficiency.
4. Facilitate Communication – Since a DFD is easy to understand, it helps developers,
analysts, and stakeholders communicate system functionalities without technical complexity.
5. Improve System Design – A well-structured DFD helps in designing an efficient and scalable
system by ensuring smooth data flow with minimal redundancy.
6. Detect Bottlenecks and Inefficiencies – By mapping data movement, DFDs help in
identifying problem areas such as data redundancy, slow processing points, or potential
security risks.
7. Support Documentation and Maintenance – It serves as a reference for system
documentation, making future modifications and troubleshooting easier.
8. Aid in Requirement Analysis – A DFD assists in gathering and refining system requirements
by clearly illustrating data interactions and dependencies.
7. TESTING
7.1 INTRODUCTION
Testing is all about discovering and fixing problems: its purpose is to find and correct any defects in
the final product. It is a method for evaluating the quality and operation of anything from a whole
product to a single component. The goal of stress testing software is to verify that it retains its original
functionality under extreme circumstances. There are several different tests to pick from, since there
is such a vast range of assessment options.
Who performs the testing: All individuals who play an integral role in the software development
process are responsible for performing the testing. Testing the software is the responsibility of a wide
variety of specialists, including end users, the project manager, software testers, and software
developers.
When testing should begin: Testing begins with the requirement-gathering (planning) phase and
continues up to the deployment phase. In the waterfall model, testing is explicitly arranged and carried
out in the testing phase. In the incremental model, testing is carried out at the conclusion of each
increment or iteration, and the entire application is examined in the final test.
When it is appropriate to halt testing: Testing the program is an ongoing activity that can never fully
end. Without first putting the software through its paces, no one can guarantee that it is completely
devoid of errors. Because the domain to which the input belongs is so expansive, we are unable to
check every single input.
7.2 TYPES OF TESTS
Unit Testing
The term unit testing refers to a specific kind of software testing in which discrete elements of a
program are investigated. The purpose of this testing is to ensure that the software operates as expected.
Test Cases
1. Test the accuracy of the automatic speech recognition (ASR) component by providing sample audio
files in different languages and verifying if the transcriptions match the expected results (see the
pytest sketch after this list).
2. Verify the performance of the object recognition algorithm by providing sample video frames
containing various objects and checking if the algorithm correctly identifies and labels the objects.
3. Test the language identification module by providing audio and video files in different languages and
confirming if the system accurately detects the language of the content.
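A sketch of how test case 1 might be automated with pytest, assuming the transcribe_file helper from Section 5 and small fixture files whose expected transcripts are known (the import path and fixture names are placeholders):

import pytest
from app import transcribe_file  # assumed import path for the Whisper helper

# Placeholder fixtures: (audio file, phrase expected to appear in the transcript).
CASES = [
    ("tests/fixtures/hello_english.wav", "hello"),
    ("tests/fixtures/bonjour_french.wav", "bonjour"),
]

@pytest.mark.parametrize("audio_path, expected_phrase", CASES)
def test_transcription_contains_expected_phrase(audio_path, expected_phrase):
    transcript = transcribe_file(audio_path)
    # A loose containment check tolerates small ASR variations in the transcript.
    assert expected_phrase.lower() in transcript.lower()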
Integration testing
The program is put through its paces in its final form, once all its parts have been combined, during the
integration testing phase. At this phase, we look for places where interactions between components
might cause problems.
Test Cases
1. Test the integration between the automatic speech recognition (ASR) module and the language
identification module. Provide audio samples in different languages and verify if the ASR module
correctly transcribes the speech while the language identification module accurately detects the
language.
2. Validate the integration between the object recognition module and the video summarization module.
Provide video samples with various objects and confirm if the object recognition module successfully
identifies the objects, which are then used by the summarization module to generate relevant
summaries.
3. Test the integration between the sentiment analysis module and the audio summarization module.
Provide audio samples with different emotional tones and verify if the sentiment analysis module
correctly identifies the sentiments, which are then used to generate more contextually appropriate
summaries.
Functional Testing
One kind of software testing is called functional testing, and it involves comparing the system
to the functional requirements and specifications. In order to test functions, their input must first be
provided, and then the output must be examined. Functional testing verifies that an application
successfully satisfies all of its requirements in the correct manner. This particular kind of testing is not
concerned with the manner in which processing takes place; rather, it focuses on the outcomes of
processing. Therefore, it endeavors to carry out the test cases, compare the outcomes, and validate the
correctness of the results.
Test Cases
1. Test the functionality of the automatic speech recognition (ASR) module by providing audio samples
in different languages and verifying if the system accurately transcribes the speech into text.
2. Validate the object recognition functionality by providing video frames with various objects and
confirming if the system correctly identifies and labels the objects in the frames.
3. Test the language identification feature by providing audio and video samples in different languages
and verifying if the system accurately detects and identifies the language used.
7.3 TESTING TECHNIQUES
There are many different techniques or methods for testing the software, including the following:
BLACK BOX TESTING
In this kind of testing, the tester does not have access to or knowledge of the internal structure or
details of the item being tested. Test cases are generated or designed based only on the input and
output values, and prior knowledge of either the design or the code is not necessary. The testers only
know what the software is supposed to do; they do not know how it does it.
For example, without having any knowledge of the inner workings of the website, we test the web pages
by using a browser, then we provide the input, and finally we test and validate the outputs against the
intended result.
Test Cases
1. Test the audio summarization functionality by providing audio samples in different languages
and verifying if the system generates concise and informative summaries that capture the key
points of the audio content.
Figure 7.3.1: Text Summarization for Audio and Videos Blackbox Testing
2. Validate the video summarization functionality by providing video samples with diverse
content and confirming if the system generates summaries that effectively summarize the
important aspects of the video.
3. Test the multilingual support by providing audio and video samples in different languages and
verifying if the system can accurately process and summarize content in each language.
WHITE BOX TESTING
In this kind of testing, a tester and a developer examine the code that is implemented in each field of a
website, determine which inputs are acceptable and which are not, and then check the output to ensure
it produces the desired result. The decision is reached by analyzing the code that is actually used.
Test Cases
1. Test the accuracy of the automatic speech recognition (ASR) algorithm by examining the
correctness of the language model, acoustic model, and pronunciation model used in the ASR system.
2. Validate the feature extraction process for audio and video by inspecting the deep learning
models, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), and
ensuring they correctly extract relevant features from the input data.
3. Test the performance of the language identification module by examining the accuracy of the
language models and language-specific features used in the identification process.
Figure 7.3.2: Text Summarization for Audio and Videos Whitebox Testing
8 RESULTS
The bar chart below shows the performance of the proposed multilingual audio and video
summarization system across different evaluation metrics (Accuracy, Precision, Recall, and F1 Score)
for each language.
9 SCREENSHOTS
Fig 9.b: Summary output
The above figure 9.b shows the summary generated for the given audio file.
10 CONCLUSION
In conclusion, the research on audio and video summarization for multiple languages using machine
learning has shown significant promise and potential. The use of machine learning techniques has
facilitated the development of efficient and accurate summarization models that can process and analyze
vast amounts of audio and video content. The application of machine learning in this context has led to
the creation of automated systems capable of extracting key information, identifying important moments,
and generating concise summaries from diverse sources in multiple languages. These advancements have
wide-ranging implications for various fields, including media analysis, content creation, information
retrieval, and more. By leveraging machine learning algorithms, researchers have addressed challenges
related to language diversity, speech recognition, and video understanding. These models have achieved
impressive results, showcasing their ability to handle complex linguistic structures, cultural nuances, and
diverse accents. While this research presents significant progress, there are still areas for improvement.
Fine-tuning models for specific languages and domains, enhancing the handling of multilingual and
code-switching content, and exploring the integration of additional modalities (such as text and image)
could further enhance the accuracy and comprehensiveness of the summarization systems. Overall, the
research on audio and video summarization for multiple languages using machine learning holds great
promise for transforming the way we process, analyze, and consume multimedia content.
10.1 Future Scope
This project establishes the proof of working principle and sets direction for future
development into a fully learned and automated method for podcast speech
summarization. Given the complex nature of such a problem, we believe there is
plenty of room for improvements. Podcasts usually require active attention from a
listener for extended periods unlike listening to music. In the process of generating a
summary as discussed in this project, the primary input taken for processing is an
audio file. The audio file is generated by recording the human speech which is being
spoken or is already recorded. The produced summaries have intelligible audio
information.
Overall, this research aims to leverage the power of machine learning to address the
challenge of audio and video summarization in multiple languages. By developing an
automated system, users can efficiently navigate and comprehend multimedia content,
saving time and effort in information retrieval and understanding.
11 REFERENCES
[1] A. Vartakavi, A. Garg and Z. Rafii, "Audio Summarization for Podcasts," 2021 29th European
Signal Processing Conference (EUSIPCO), 2021.
[3] P. Gupta, S. Nigam and R. Singh, "A Ranking based Language Model for Automatic Extractive
Text Summarization," 2022 First International Conference on Artificial Intelligence Trends and
Pattern Recognition (ICAITPR), 2022.
[7] I. Naim, H. J. Pandit, P. Jain and O. P. Vyas, "A survey on audio and video summarization
techniques," Multimedia Tools and Applications, 79(11), 7235-7268, 2020.
[8] X. Zhang, B. Liu, M. Zhou, J. Liu and X. Qian, "Multimodal deep learning for video
summarization," Neurocomputing, 365, 222-231, 2019.