
National Language Processing (Urdu)

Authors
Muhammad Ahmad 21-SE-21
Haris Bin Shakeel 21-SE-76

Supervisor
Engr. Maria Andleeb
Lecturer

DEPARTMENT OF SOFTWARE ENGINEERING


FACULTY OF TELECOMMUNICATION AND
INFORMATION ENGINEERING
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
TAXILA
May – 2025
ABSTRACT

National Language Processing (Urdu)

This thesis introduces "National Language Processing," an artificial intelligence-powered
system for diarising and transcribing conference audio. The research tackles the difficulty of
efficiently processing and interpreting multi-speaker recordings, especially in Urdu and mixed
Urdu-English settings, where manual transcription and speaker identification are time-consuming.
The approach consisted of building a web application that uses modern speech recognition
(OpenAI’s Whisper) for high-accuracy transcription and speaker embedding models coupled with
clustering algorithms for robust speaker diarization, producing time-stamped outputs. Key
findings show that the system, through a user-friendly interface, can automatically create
accurately attributed, time-stamped transcripts from uploaded audio files. Providing a basis
for future multilingual and summarisation improvements, NLP Diarise offers a practical solution
that simplifies the analysis of multi-speaker audio, greatly reducing manual effort and improving
the accessibility and comprehension of spoken content for professional and academic use.

UNDERTAKING

We, the undersigned, hereby certify that our project work titled National Language Pro-
cessing (Urdu) is the result of our own efforts, research, and original thought. We confirm
that this study has not been submitted elsewhere for assessment, award, or recognition.
Where information or content has been sourced from existing literature or external tools,
proper acknowledgements and citations have been provided as per standard academic con-
ventions. This project has been carried out in accordance with the ethical guidelines
and academic policies set forth by our institution. We have maintained integrity in both
the development and reporting of this research and have taken care to avoid any form
of academic misconduct or plagiarism. We fully understand that any violation of these
statements may result in disciplinary actions by the institution, including but not limited
to the revocation of our degree as well as academic penalties.

Signature of Student

Muhammad Ahmad
21-SE-21

Haris Bin Shakeel


21-SE-76

PLAGIARISM DECLARATION

We take full responsibility for the project work conducted during the Final Year Project
(FYP) titled “National Language Processing (Urdu)”. We solemnly declare that the
project work presented in the FYP report is carried out solely by us, without significant
assistance from any external individual; however, any minor help taken has been duly ac-
knowledged. Furthermore, we confirm that this FYP (or any substantially similar project
work) has not been submitted previously to any other degree-awarding institution, either
within Pakistan or abroad.

We understand that the Department of Software Engineering at the University of Engineering
and Technology, Taxila upholds a zero-tolerance policy towards plagiarism. There-
fore, we as the authors solemnly declare that no portion of our FYP report has been
plagiarized and that all external material used is properly cited and referenced. We also
confirm that our FYP report complies with the similarity index requirement set by the
Higher Education Commission (HEC), which is less than or equal to 19%.

We further acknowledge that if we are found guilty of plagiarism in our FYP report,
the University reserves the right to withdraw our BSc degree. In addition, the University
reserves the right to publish our names on its official website that maintains a record of
students found guilty of plagiarism in their FYP reports.

Signature of Student

Muhammad Ahmad
21-SE-21

Haris Bin Shakeel


21-SE-76

Signature of Supervisor
Engr. Maria Andleeb

ACKNOWLEDGEMENTS

We extend our profound gratitude to Engr. Maria Andleeb, our esteemed supervisor,
for her exceptional guidance, insightful feedback, and unwavering support throughout the
research and development of the project titled National Language Processing (Urdu).
Her expert advice and encouragement played a crucial role in shaping the direction and
outcomes of this project. We would also like to thank our reviewers and all participants
involved in the evaluation and testing phases of the project. Their valuable input and
critical remarks enabled us to enhance the functionality and usability of the system.

TABLE OF CONTENTS

Abstract i

Undertaking ii

Plagiarism Declaration iii

Acknowledgements iv

Table of Contents v

List of Figures vii

List of Tables viii

1 INTRODUCTION 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Project goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Deliverables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 LITERATURE REVIEW 4
2.1 Automatic Speech Recognition . . . . . . . . . . . . . . . . . . 4
2.2 Research Gap in Urdu ASR Technology . . . . . . . . . . . 4
2.3 ASR for Urdu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Language-Switching and Mixed-Language ASR . . . . . . 5
2.5 Review of ASR Techniques . . . . . . . . . . . . . . . . . . . . . . . 5
2.6 Performance of Whisper for Urdu Transcription . . . . . 9
2.7 Previous Research on ASR . . . . . . . . . . . . . . . . . . . . 10
2.8 Word Error Rate (WER) in ASR Evaluation . . . . . . . 10
2.9 Architecture of Whisper and its Application for Urdu . 11
2.10 Market survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 PROPOSED SOLUTION 14
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Project timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Experimental/Simulation setup . . . . . . . . . . . . . . . . . 27
3.4 Evaluation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 RESULT AND DISCUSSION 29


4.1 Utilization (end users/beneficiaries) . . . . . . . . . . . . . . 29
4.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Application demo screen shots . . . . . . . . . . . . . . . . . . 34
4.4 Results discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Detailed work plan . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 Budget requirements . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7 Market Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 CONCLUSION 45

References 46

Abbreviations 48

LIST OF FIGURES

3.1 Whisper ASR Model Variants by OpenAI [11] . . . . . . . . . . . . . . . 19


3.2 Comparison of Word Error Rate (WER) for different Whisper models on
Urdu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Flowchart of National Language Processing(NLP) . . . . . . . . . . . . . 21
3.4 System Architecture of National Language Processing . . . . . . . . . . . 22
3.5 Project Gantt Chart (a) . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6 Project Gantt Chart (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Uppermost Section of NLP Home Page. . . . . . . . . . . . . . . . . . . 35
4.2 Middle Section of NLP Home Page. . . . . . . . . . . . . . . . . . . . . . 36
4.3 Lower Section of NLP Home Page. . . . . . . . . . . . . . . . . . . . . . 37
4.4 Upper Section of NLP About Page. . . . . . . . . . . . . . . . . . . . . . 38
4.5 Lower Section of NLP About Page. . . . . . . . . . . . . . . . . . . . . . 38
4.6 FAQ page of NLP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.7 Initial user interface of the NLP Transcribe Page. . . . . . . . . . . . . . 40
4.8 Transcribe Page during audio processing. . . . . . . . . . . . . . . . . . . 41
4.9 ’Transcribe Page’ displaying the diarized results. . . . . . . . . . . . . . . 41

LIST OF TABLES

2.1 Summary of Previous Research on Automatic Speech Recognition (ASR) . 10


4.1 Test case to check Web Application Loading. . . . . . . . . . . . . . . . . 30
4.2 Test case to check audio upload and transcription. . . . . . . . . . . . . . 30
4.3 Test case to check handling of unsupported file formats. . . . . . . . . . . 31
4.4 Test case to check speaker diarization accuracy. . . . . . . . . . . . . . . 31
4.5 Test case to evaluate system performance on long audio input. . . . . . . 32
4.6 Test case to verify transcript download functionality. . . . . . . . . . . . . 32
4.7 Test case to verify front-end responsiveness on different devices. . . . . . . 33
4.8 Test case to verify frontend behavior during API failure. . . . . . . . . . . 33
4.9 Test case to validate JSON response from API. . . . . . . . . . . . . . . 34
4.10 Budget Requirements Description . . . . . . . . . . . . . . . . . . . . . . 44

CHAPTER 1: INTRODUCTION

1.1 Introduction

The rapid development of digital technologies has greatly improved global communication
by helping to overcome language barriers. In fields including education, industry, and legal
systems, speech-to-text processing has become increasingly important for documentation,
teamwork, and information retrieval. Although speech recognition has advanced, Urdu, a
language spoken by over 230 million people globally, remains underserved in this domain.
Most commercial transcription tools ignore Urdu and code-mixed Urdu-English speech in favor
of English and other widely spoken languages. For Urdu-speaking populations, this technological
divide compromises access, record-keeping, and communication in several important sectors.
Conventional ASR systems struggle in multilingual contexts, especially those involving
intra-sentential code-mixing of Urdu and English. Professional settings such as business
meetings, academic discussions, interviews, and court cases intensify these challenges, since
speaker identification and transcription accuracy are crucial. To close this gap, this project
proposes the Urdu Meeting Transcription Generator, an artificial intelligence-powered
speech-to-text system designed specifically for Urdu and mixed-language speech. It combines
speaker diarization, timestamping, and noise reduction for accuracy and contextual clarity,
using OpenAI’s Whisper ASR model to manage language variability, background noise, and
speaker diversity.
The system is optimized for GPU-accelerated environments, ensuring fast and dependable
performance in professional use cases and supporting near real-time transcription of lengthy,
complex audio. Users interact through a web interface that enables audio upload, transcript
editing, and export in DOCX and PDF formats. This ensures usability across a range of
professional domains without requiring advanced technical skills.
By providing a context-aware, scalable solution that empowers Urdu-speaking profes-
sionals and improves accessibility, inclusivity, and productivity across various industries,
this project ultimately seeks to close the linguistic and technological gap.

1.2 Project goal

This project’s main objective is to create an artificial intelligence-based, high-accuracy
transcription system specifically for meetings held in Urdu and mixed Urdu-English. Using
state-of-the-art Automatic Speech Recognition (ASR) and Natural Language Processing (NLP),
the system seeks to automatically convert spoken content into structured, editable,
exportable text.
Since most transcription systems ignore low-resource languages like Urdu, this project
fills a major gap in current technology. The proposed solution goes beyond simple transcription
by including speaker diarization, to recognize and separate multiple speakers, and
timestamping, to guarantee temporal alignment of speech sections. These qualities are
especially useful in formal, multi-speaker environments including business meetings, academic
discussions, interviews, and legal proceedings, where clarity, accuracy, and context are critical.
Accessibility is a major emphasis of the project. Users will be able to upload recorded
audio, review and edit the produced transcript, and export the final document in formats
including DOCX and PDF from a simple web-based interface. This makes the tool appropriate
for a wide spectrum of users, including researchers, teachers, office staff, and attorneys,
regardless of their technical background.
Its modular and scalable design also enables future improvements, including real-time
transcription, support for additional languages, and automatic summarization of meetings
using advanced NLP techniques.

1.3 Aims and objectives

The major objective of this project is to create an artificial intelligence-based,
high-accuracy system able to transcribe meeting audio in Urdu and code-mixed (Urdu-English)
language for institutional and professional purposes. To accomplish this overall goal, the
following specific objectives were formulated:

1. Develop an ASR system capable of accurately transcribing speech from pre-recorded
and live meeting audio in Urdu and mixed Urdu-English language.

2. Use speaker diarization techniques to isolate and label particular speakers in record-
ings involving multiple speakers, making the resultant transcriptions more readable
and useful.

3. Append timestamp metadata to every section of the transcription to supply temporal
reference, thereby enabling users to accurately place the spoken content within the
audio timeline.

4. Develop an intuitive interface where users can upload audio files, edit and view
transcriptions, and export the completed content in popular formats like .txt and
.pdf.

5. Develop the system on a modular and scalable framework to incorporate future en-
hancements, including multilingual functionality, real-time transcription, and auto-
mated summarization of meetings via natural language processing methods.

1.4 Deliverables

The following are the key deliverables of this project:

1. A highly accurate automatic transcription system capable of transcribing meeting audio
in Urdu and code-mixed (Urdu-English) language. The system integrates speaker
diarization and timestamping functionalities to ensure business-quality outputs with
structured, readable transcriptions.

2. An intuitive and interactive interface that allows users to upload meeting audio
files, edit the generated transcriptions, and export the finalized content in widely
used formats such as .docx and .pdf.

3. A detailed thesis report documenting all aspects of the project, including the sys-
tem architecture, literature review, proposed methodology, technical implementation,
evaluation metrics, and the overall workflow of the transcription and diarization
system.

CHAPTER 2: LITERATURE REVIEW

Speech-to-text technologies have changed the way information is documented and retrieved,
and they play a critical role in education, business, and legal services. Meeting transcription
systems aim to automate the transcription of oral conversations into text. For languages like
Urdu, however, transcription systems face significant challenges arising from limited resources,
linguistic complexity, and the widespread mixing of Urdu and English. This literature review
presents a thorough survey of current technologies, challenges, and opportunities, leading to
the proposed formulation of the Urdu Meeting Transcription System.

2.1 Automatic Speech Recognition

ASR systems are designed to convert spoken language into text by leveraging AI models.
For languages like English and Mandarin, ASR systems achieve high accuracy due to
abundant datasets and robust language models. However, for Urdu, the development of
ASR systems is hindered by limited linguistic datasets and phonetic complexities [5] [14].
Multilingual ASR models, such as the Transformer-based systems, have demonstrated
success in low-resource languages, including Hindi and regional South Asian dialects [5].
For Urdu, these models need adaptation to its phonetic rules and unique script. Ali et
al. (2021) highlighted that integrating phoneme-based recognition systems and leveraging
transfer learning can improve Urdu ASR performance [1].

2.2 Research Gap in Urdu ASR Technology

While substantial progress has been made in Automatic Speech Recognition (ASR) for
widely spoken languages, there exists a notable research gap when it comes to ASR sys-
tems for low-resource languages like Urdu. Despite the availability of large-scale datasets
and models like Whisper, Wav2Vec, and others, Urdu ASR still faces challenges such
as handling diverse accents, code-switching between Urdu and English, and dealing with
noisy environments. The lack of comprehensive and high-quality annotated speech data
for Urdu further limits the development of robust ASR systems. Additionally, existing
ASR systems often fail to provide satisfactory performance for Urdu’s unique linguistic
features, such as complex morphology and phonetic variations. Addressing these gaps
requires further research focused on improving speech data collection, fine-tuning exist-
ing models for Urdu-specific tasks, and exploring novel approaches like phoneme-based or
hybrid models to enhance transcription accuracy.

2.3 ASR for Urdu

The biggest challenge for Urdu ASR is phonetic diversity. Urdu contains many phonemes
borrowed from Arabic, Persian, and South Asian regional languages, and therefore needs a
dedicated phoneme-aware recognition model. Models like Wav2Vec 2.0 and Whisper have advanced
considerably towards solving these issues. For instance, Whisper is trained on a huge
multilingual dataset and has demonstrated the capability to transcribe speech in many
languages, including Urdu, even in mixed-language cases [12] [10].
Whisper, a powerful ASR model by OpenAI, leverages a multitask learning approach,
allowing it to handle diverse speech transcription tasks, including language identification,
multilingual speech recognition, and speech translation. This ability is particularly valuable
in Urdu-English code-switching scenarios, as Whisper’s robustness allows it to perform
well in detecting and transcribing both languages simultaneously [13].

2.4 Language-Switching and Mixed-Language ASR

In professional environments, language-switching between Urdu and English is quite common,
especially in corporate meetings and academic settings. Research shows that ASR
systems such as Whisper handle mixed-language inputs fairly well, but the accuracy is
reduced in real-world applications because of switching between languages in a sentence
[2]. For instance, Whisper’s models have shown performance problems in cases where
there is a transition from Urdu to English, but fine-tuning on mixed-language datasets
can help improve such results [8].

2.5 Review of ASR Techniques


Advancements in speech-to-text technologies have led to significant improvements in Auto-
matic Speech Recognition (ASR) systems. These systems are central to enabling efficient
transcription and analysis of audio content, particularly for low-resource languages like
Urdu. This literature review explores relevant studies, evaluates the strengths and lim-
itations of various ASR approaches, and positions OpenAI’s Whisper V3 Turbo as a
promising candidate for Urdu transcription.
The paper titled “Careless Whisper: Speech-to-Text Hallucination Harms” [7] evaluates
OpenAI’s Whisper API for speech-to-text transcription, highlighting concerns about
hallucinations, where entire phrases are inaccurately generated. The study analyzed 13,140
audio segments, revealing a hallucination rate of 1.4%, with higher rates for speakers with
aphasia (1.7% versus 1.2% for controls). The study emphasized the need for better model-
ing and inclusive AI design to reduce transcription disparities, particularly for vulnerable
populations.
The development of efficient ASR architectures has been another area of focus. “Squeeze-
former: An Efficient Transformer for Automatic Speech Recognition” [9] introduces Squeeze-
former, a hybrid attention-convolution model that addresses inefficiencies in the Con-
former architecture. The model incorporates a Temporal U-Net structure, simplified
Transformer-style blocks, and depthwise separable subsampling layers, reducing compu-
tational overhead while achieving a Word Error Rate (WER) of 6.50% on the LibriSpeech
test-other dataset. Squeezeformer outperforms Conformer with 1.4% improved accuracy
and up to 40% fewer FLOPs, making it suitable for real-world deployment with reduced
inference costs.
In “Turning Whisper into Real-Time Transcription System” [4], researchers adapted
Whisper into a real-time transcription and translation system, achieving an average la-
tency of 3.3 seconds and a WER of 8.1% for English transcription on the ESIC test
set. By employing the LocalAgreement-2 algorithm, Voice Activity Detection (VAD), and
chunk segmentation techniques, the system balances latency and quality. While Whis-
per streaming modes increase WER by 2–6% compared to offline processing, the system
demonstrated robustness in live multilingual settings, highlighting its practical usability.
However, limitations include potential overlap in training data and the need for further
optimization.
Another notable contribution is the study “Improving Multilingual ASR in the Wild
Using Simple N-best Re-ranking” [6], which introduces a method for enhancing ASR ac-
curacy in real-world multilingual scenarios. Using N-best re-ranking with features such as
language models, text length, and acoustic models, the approach improved Speech Language
Identification (SLID) accuracy from 18.1% to 83.1% for tail languages and reduced WER
from 67.4% to 39.3%. For Whisper, SLID accuracy improved by 6.1%, and WER dropped
by 2.0%. The lightweight implementation of this method underscores its cost-effectiveness
and adaptability across ASR systems.
Phoneme-based models have also shown promise in multilingual ASR, as demonstrated
in the paper “Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition
via Weakly Phonetic Supervision” [15]. The study utilized International Phonetic Alpha-
bet (IPA) transcriptions with grapheme-to-phoneme (G2P) models, achieving a WER of
6.56% on the CommonVoice dataset, outperforming subword-based and self-supervised
models. Whistle demonstrated particular efficiency in low-resource settings, reducing
WER by up to 18% compared to subword-based models and requiring 24% fewer training
epochs. This approach highlights the potential of phoneme-based modeling for low-resource
languages like Urdu.
While several ASR models, such as Wav2Vec, Google Speech-to-Text, and Amazon
Transcribe, are widely used, they have limitations for Urdu transcription. Wav2Vec re-
quires significant fine-tuning for languages with limited representation in its pre-trained
data, whereas Google and Amazon systems rely on cloud-based APIs, raising cost and
privacy concerns. Whisper, in contrast, is open-source, supports offline processing, and
is pre-trained on a diverse multilingual dataset, making it better suited for Urdu tran-
scription with code-switching and varied accents.
Despite Whisper’s advantages, studies like [7] raise concerns about hallucinations and
ethical considerations in ASR systems. Addressing these limitations through techniques
such as N-best re-ranking [6] or incorporating phoneme-based approaches [15] could en-
hance the system’s reliability. Additionally, architectures like Squeezeformer [9] offer
alternative pathways for achieving cost-effective, high-accuracy ASR.
In conclusion, OpenAI’s Whisper presents a robust foundation for Urdu speech-to-text
transcription. Its adaptability for real-time applications, multilingual capabilities, and
offline processing make it a strong candidate. However, leveraging insights from studies on
hallucination mitigation, phoneme-based modeling, and novel architectures could further
optimize the system, ensuring it meets the linguistic and computational requirements of
Urdu transcription.

Careless Whisper
• WER: N/A

• Accuracy/Performance: Hallucination rate: 1.4% (1.7% for aphasia; 1.2% for controls); 38% of hallucinations contain harmful content

• Latency/Cost: N/A

• Techniques Used: Analysis of hallucination patterns in Whisper API transcription

• Key Insights: Ethical and legal concerns raised due to hallucination issues, espe-
cially for vulnerable populations like aphasia speakers.

Whisper-Streaming
• WER: 8.1% (English), 9.4% (German), 12.9% (Czech)

• Accuracy/Performance: Real-time transcription with a 2–6% WER increase compared to offline processing; robust multilingual conference performance

• Latency/Cost: 3.3-second average latency (English)

• Techniques Used: LocalAgreement-2 algorithm; Voice Activity Detection (VAD); chunk segmentation

• Key Insights: Trade-offs between latency, chunk size, and WER; practical applicability in real-time scenarios.

Improving Multilingual ASR (SLID)


• WER: Decreased by 3.3% (MMS), 2.0% (Whisper); tail languages: reduced from 67.4% to 39.3%

• Accuracy/Performance: SLID accuracy: +8.7% (MMS), +6.1% (Whisper); strong improvements for low-performing languages

• Latency/Cost: Lightweight implementation; cost-efficient with N=2

• Techniques Used: N-best re-ranking; incorporation of language models, text length, and acoustic models

• Key Insights: Tail languages and low-resource ASR benefit most; demonstrates the importance of feature selection for adaptability.

Squeezeformer
• WER: 6.5% (LibriSpeech test-other)

• Accuracy/Performance: Temporal U-Net reduces attention complexity by 4x; 1.4% WER improvement over Conformer

• Latency/Cost: 40% reduction in computational overhead

• Techniques Used: Hybrid attention-convolution architecture; Temporal U-Net; efficient subsampling layers

• Key Insights: Achieves state-of-the-art efficiency and accuracy for ASR tasks, with scalable performance improvements.

Whistle (Weakly Phonetic Supervision)
• WER: 6.56% (phoneme-based models), 9.3% (subword-based models)

• Accuracy/Performance: 18% relative WER reduction in crosslingual tasks; robust against catastrophic forgetting; 24% fewer training epochs

• Latency/Cost: Effective scaling with natural data mixing

• Techniques Used: Weakly phonetic supervision; IPA-based transcription via LanguageNet G2P models; WFST-based decoding

• Key Insights: Phoneme-based models outperform subword and self-supervised models, particularly in low-resource MCL-ASR tasks.

2.6 Performance of Whisper for Urdu Transcription

Recent performance analysis of Whisper on Urdu reveals encouraging results. For this language,
Urdu-Whisper-large-v2 obtained a WER of 0.3241 in the conversational speech scenario,
significantly outperforming models such as Conformer-MoE, which obtained a WER of 1.1624 on
the same language. This makes Whisper potentially effective for transcription-based Urdu
applications, assuming that relevant speech data for fine-tuning is readily available [8].

2.7 Previous Research on ASR

Table 2.1: Summary of Previous Research on Automatic Speech Recognition (ASR)

Paper Name: Wav2Vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Model: Wav2Vec 2.0 [3]
Description: A self-supervised learning model that uses contrastive learning to build robust speech representations. Fine-tuned for Urdu with datasets like Common Voice.

Paper Name: Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
Model: Whisper
Description: A multitask model trained on diverse datasets, enabling it to handle Urdu and multilingual speech recognition, including Urdu-English code-switching.

Paper Name: Urdu-Speech-Recognition-Conformer-MoE-vs-Whisper Model Comparison
Model: Whisper, Conformer-MoE
Description: Compared Whisper’s multilingual performance to a custom Mixture-of-Experts (MoE) model for Urdu, demonstrating Whisper’s superior accuracy for Urdu speech.

Paper Name: WER We Stand: Benchmarking Urdu ASR Models
Model: Whisper, Google Speech API
Description: Benchmarked Whisper against Google’s ASR API for Urdu. Whisper performed better in conversational Urdu, especially in noisy and code-switching environments.

Paper Name: Multilingual Speech Recognition with Transformer Models
Model: Transformer ASR
Description: Explored Transformer-based ASR models for South Asian languages, including Urdu. Focused on code-switching and improving multilingual recognition accuracy.

2.8 Word Error Rate (WER) in ASR Evaluation

Word Error Rate (WER) is a critical metric used to assess the performance of Automatic
Speech Recognition (ASR) systems. It quantifies the errors in transcriptions by comparing
the system’s output to a reference transcription (ground truth). The formula for WER

is:

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]
Where:
• S is the number of substitutions (incorrect words),

• D is the number of deletions (missing words),

• I is the number of insertions (extra words),

• N is the total number of words in the reference transcription.


WER provides a clear indication of the accuracy of an ASR system, with lower val-
ues indicating better performance. In addition to measuring word-level accuracy, it helps
identify specific error types—substitutions, deletions, and insertions—allowing for tar-
geted improvements in system design. WER is often used to compare different ASR
models, such as Whisper and Wav2Vec, by measuring how closely their transcriptions
match the actual speech content.
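As a concrete illustration of the metric, the short sketch below computes WER for one hypothesis against one reference using a word-level edit distance. It is an illustrative helper rather than part of the system described in this thesis, and the sample sentences are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + sub)     # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution out of five reference words gives WER = 0.2
print(wer("the meeting starts at noon", "the meeting started at noon"))
```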

2.9 Architecture of Whisper and its Application for Urdu

Whisper, developed by OpenAI, is a state-of-the-art Automatic Speech Recognition (ASR)
model designed for multilingual transcription tasks. It is based on a transformer archi-
tecture, which enables it to handle a wide variety of languages, including Urdu. Whis-
per’s architecture is trained on a large and diverse multilingual dataset, making it robust
for languages with limited resources like Urdu. The model is capable of handling both
clean speech and noisy environments, as well as code-switching between languages. It
utilizes a sequence-to-sequence framework where the input speech is first converted into
spectrograms, which are then processed by the encoder-decoder structure of the model.
The decoder generates the transcription by attending to the encoded representations. For
Urdu, Whisper leverages its multilingual capabilities, transcribing spoken Urdu into text
even in mixed-language scenarios like Urdu-English code-switching. The model’s ability
to perform well with varied accents and different dialects makes it a promising tool for
improving Urdu ASR systems. However, while Whisper performs well, challenges such as
dealing with the complex morphology of Urdu, and handling noise and regional accents,
remain areas for further optimization. [12]

2.10 Market survey

The Urdu Meeting Transcription System is a customized solution targeting the transcription
needs of an organization, focusing particularly on meetings recorded in Urdu and those held
in mixed-language settings (Urdu-English).

2.10.1 Current Challenges


• The organization relies on manual transcription techniques, which are time-consuming
and unproductive.

• Multi-speaker scenarios and mixed-language content tend to incur accuracy issues.

• The lack of functionalities like speaker identification and timestamping makes the
documentation process more complex.

• The existing framework rarely generates coherent and systematically arranged tran-
scripts, making it troublesome to use meeting documentation efficiently.

2.10.2 Proposed Features


• Accurate Transcription: Leverages a highly optimized Whisper ASR model for analyzing Urdu and multilingual recordings with high accuracy.

• Speaker Diarization: Uses Pyannote for speaker recognition and identification in multi-person meetings.

• Timestamping: Adds time markers to the transcription to facilitate easy navigation and clarity.

• Editing and Exporting Options: Enables users to edit their transcripts and export them into various formats, such as PDF, DOCX, and TXT.

• User-Friendly Interface: Provides a simplified, easy-to-use interface to upload audio, view transcriptions, and make edits conveniently.

2.10.3 Expected Benefits


• Efficiency: Automated transcription reduces manual work, enabling faster documentation.

• Greater Accuracy: Fine-tuning the ASR model ensures reliable transcription outputs, even with mixed-language material.

• Structured Transcripts: Diarization and timestamping facilitate easy referencing of well-organized meeting transcriptions.

• Cost Savings: Eliminating manual transcription reduces reliance on external services and operational costs.

• User Convenience: The system’s editing and exporting features allow users to edit and use transcriptions as desired.

2.10.4 Targeted Impact


The system is specifically designed to address the unique transcription requirements of the
organization, thereby improving productivity and operational effectiveness. It aligns with
the organization’s objectives of maintaining accurate, accessible, and professional meeting
records. This market analysis underlines the unique requirements of the organization
and explains how the Urdu Meeting Transcription System addresses these issues using its
high-end features and customized functionality.

CHAPTER 3: PROPOSED SOLUTION

3.1 Methodology

The approach taken in creating the project National Language Processing (Urdu) is
described in this section. Combining aspects of the Waterfall and Agile models, a hybrid
development approach was adopted to ensure both structure and adaptability throughout
the process.
The project began with preliminary studies on the linguistic challenges involved in
processing Urdu, a low-resource language. This included analyzing its phonetic and syn-
tactic characteristics and exploring existing tools for speech recognition, transcription,
and speaker diarization.
During the requirements analysis phase, the main objectives were defined: pri-
marily, to transcribe Urdu speech and perform speaker diarization. To support model
training and evaluation, Urdu audio datasets with various speaker counts and regional
accents were collected.
In the design phase, the system architecture was outlined, including stages such
as audio preprocessing, embedding extraction, clustering, and transcription. The user
interface flow and API design were also completed during this stage.
The development phase involved implementing core components, including a custom
speaker diarization module using embeddings and clustering algorithms, a FastAPI-based
backend, and integration of Whisper for transcription. The system was optimized for
deployment in GPU-based environments.
In the testing phase, sample Urdu recordings were used to evaluate the application’s
transcription accuracy and speaker separation performance. Iterative improvements were
made to handle common challenges such as background noise and code-switching.
Using this methodology, an efficient, scalable, and practical Urdu NLP application was
developed to meet real-world requirements.

3.1.1 Requirement Gathering


The requirements gathering phase entailed interaction with prospective end-users, specifically
professionals who regularly hold meetings in Urdu or a combination of Urdu and English.
These discussions clarified practical problems and user expectations around the application.
The gathered insights revealed the following essential requirements:

• Accurate transcription: The system must accurately transcribe both pure Urdu and mixed-language (Urdu-English) audio content.

• Speaker diarization: The application must be able to identify and classify distinct speakers, particularly in multi-participant meetings.

• Time-stamping: Every section of the transcription must incorporate timestamps to assist users in navigating and identifying specific portions of the audio.

• An understandable interface: A straightforward and intuitive design is essential for uploading audio, editing text, and keeping transcription records.

• Export functionality: Users must have the capability to download the final transcription in multiple formats, including PDF, DOCX, and TXT.

• Support for long audio recordings: The system must proficiently manage extended audio recordings, ensuring optimised performance on GPU-based systems to enhance processing speed.

These requirements became the basis for the design and development of a functional
and efficient Urdu Natural Language Processing program appropriate for use in the real
world.

3.1.2 System Design


The design process entails formulating a precise and comprehensive architecture to guarantee
the seamless integration of all components. The system workflow, illustrated by the flowchart
in Section 3.1.6 (Figure 3.3), encompasses the following steps:

• Uploading Urdu or multilingual audio recordings via the web-based user interface.

• Preprocessing and transforming the audio into a suitable format for analysis.

• Employing the Whisper ASR model to convert audio material into text.

• Utilising a bespoke speaker diarization module that employs audio embeddings and clustering methods to distinguish between various speakers.

• Creating timestamps for each segment of the transcription to improve clarity and traceability.

• Enabling users to access, assess, and modify the transcription via an intuitive interface.

• Exporting the completed transcriptions in multiple accessible formats, including PDF, DOCX, and TXT.

The system architecture guarantees that each module (audio preprocessing, automatic speech
recognition, speaker diarization, timestamp creation, and user interface) functions in unison.
The architecture is refined for performance and precision while ensuring usability in practical
meeting transcription contexts involving the Urdu language.

3.1.3 System Development


The Urdu Natural Language Processing system comprises three essential elements: the web
application interface, the FastAPI-driven backend, and the machine learning pipeline for
transcription and speaker diarization. The system is engineered to be lightweight, scalable,
and efficient in GPU-based settings, ensuring accessibility and cost-effectiveness for research
and practical applications.

1. Web Application Interface


The front-end is constructed with HTML, Tailwind CSS, and JavaScript, guaran-
teeing a responsive, streamlined, and user-centric interface. It offers a cohesive
experience for file uploads, transcription reviews, and output management. The
system consists of the following pages:

• Home Page: Serves as the entry point of the program, providing users with
a succinct summary of the system’s functionalities and objectives. It features
intuitive navigation to all other aspects of the website.
• Transcribe Page: This is the fundamental operational page of the system.
Users may submit audio files in Urdu or code-switched (Urdu-English) via an
easy form. Upon submission, the audio file is transmitted to the backend API,
which processes it and delivers a transcription. The page thereafter presents:
– Transcription of text, categorised by speaker.
– Speaker designations (e.g., Speaker 1, Speaker 2).
– Time markers indicating the commencement of each speaker section.
– Export of the transcription in formats such as TXT and PDF.
• About Page: Outlines the system’s rationale, objectives, fundamental technologies (Whisper, embeddings-based diarization), and its significance to the Urdu-speaking population. It additionally comprises developer information and acknowledgements.
• FAQs Page: Offers responses to frequently asked questions, including supported file formats, average processing durations, use-case suggestions, and system constraints. It additionally instructs users on file uploads and result interpretation.

User Experience Emphasis: The interface is crafted for usability by experts, re-
searchers, and linguists with diverse technical proficiency, prioritising accessibility,
clarity, and streamlined interaction stages.

2. Backend Development
The backend is constructed with FastAPI, selected for its efficiency, adaptability, and
seamless connection with contemporary machine learning processes. It functions as the
intermediary between the user-facing frontend and the machine learning models. Essential
backend functionalities comprise:

• File Management: The backend receives audio files from users and tem-
porarily stores them in a secure area for processing.
• Model Execution and Integration: Upon file reception, the backend invokes
the transcription and speaker diarization modules. It guarantees optimal use
of GPU resources for rapid processing.
• Response Structuring: The backend structures the transcription output with
distinct speaker labels and timestamps after processing, subsequently sending
the structured results to the front-end.
• Scalability and Security: The modular API is designed to accommodate
several concurrent users and has fundamental error handling and validation to
ensure data integrity.

3. Machine Learning Workflow


The machine learning pipeline constitutes the computational foundation of the sys-
tem, refined for precision and efficacy in processing Urdu speech recordings:

• Audio Preprocessing: The input audio is standardized to a uniform format (e.g., mono channel, 16 kHz sampling rate, WAV) to guarantee compatibility across all models; a conversion sketch is given after this list.
• Transcription: The Whisper ASR model, developed by OpenAI, is utilized
for transcribing audio into text. Whisper exhibits strong performance in low-
resource language contexts, particularly with Urdu and code-switched audio. It
also aids in managing ambient noise and diverse speaker accents.

• Tailored Speaker Diarization Module:
– The system uses pre-trained speaker embeddings (ECAPA), derived from the audio signal, rather than depending on comprehensive diarization pipelines like Pyannote.
– Clustering algorithm K-Means is utilized to categorize segments according
to speaker identification.
– The diarization process is efficient and flexible, facilitating its optimization
and integration with the overall pipeline.
• Timestamp Alignment: Each spoken part is associated with specific time
intervals in the audio, facilitating navigation, editing, and verification. This
also facilitates the segmentation of content for subsequent applications such as
summarization.
• Output Generation: The final output comprises a comprehensive transcrip-
tion, categorized by speaker, enhanced with timestamps, and provided in ex-
portable forms (PDF, DOCX, TXT).
• GPU Optimization: The pipeline is meticulously calibrated for GPU-enabled
settings, significantly decreasing the duration needed to transcribe and diarize
lengthy recordings (1–2 hours or more). This facilitates practical application
in real-time or near-real-time environments.
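The thesis does not name the conversion utility used for the preprocessing step above, so the following is a minimal sketch assuming the pydub library (which wraps ffmpeg); the file names are illustrative.

```python
from pydub import AudioSegment  # pydub requires ffmpeg to be available on the system

def preprocess_audio(input_path: str, output_path: str = "preprocessed.wav") -> str:
    """Convert an uploaded recording to the mono, 16 kHz WAV format expected by the pipeline."""
    audio = AudioSegment.from_file(input_path)   # accepts mp3, m4a, wav, and other common formats
    audio = audio.set_channels(1)                # down-mix to a single (mono) channel
    audio = audio.set_frame_rate(16000)          # resample to 16 kHz
    audio.export(output_path, format="wav")
    return output_path

# Example (hypothetical file name): preprocess_audio("meeting_recording.mp3")
```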

3.1.4 Automatic Speech Recognition (ASR) Model


Developed by OpenAI [12], the Whisper Automatic Speech Recognition (ASR) model drives the
fundamental transcription component of this system. Trained on a vast collection of varied
audio and text pairs, Whisper is a flexible, multilingual speech recognition model. It is
especially effective for low-resource languages like Urdu, which are generally underrepresented
in traditional ASR systems.
Whisper can manage pure Urdu as well as code-switched Urdu-English discourse. Its robust
training helps it handle the differences in accent, background noise, and recording quality
that are common in real-world conference settings. Audio is passed to the Whisper model once
it has been preprocessed into a mono, 16 kHz WAV file. The output is a raw transcription of
the audio, time-aligned on a per-utterance basis but without speaker segmentation.
In this system, the Whisper model is run in a GPU-enabled environment such as Google Colab,
which improves its performance. This enables scalability and faster inference, particularly
when processing long recordings of one to two hours or more from formal meetings or lectures.

Whisper forms the basis upon which additional processing such as speaker separation
and timestamping is built, allowing the system to provide high-quality, ordered transcrip-
tions fit for both academic and professional use.
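As an illustration of this transcription step, the sketch below shows a typical invocation of the openai-whisper package on a preprocessed file; the model variant, language hint, and file name are assumptions for demonstration, not an exact reproduction of the project code.

```python
import whisper  # openai-whisper package

# Load a Whisper checkpoint; the project reports the "turbo" variant (available in recent
# openai-whisper releases), but any size ("tiny" to "large") can be substituted.
model = whisper.load_model("turbo")

# Transcribe a preprocessed mono 16 kHz WAV file; a language hint can help with Urdu,
# while omitting it lets Whisper auto-detect, which may suit code-switched audio.
result = model.transcribe("preprocessed.wav", language="ur")

for seg in result["segments"]:
    # Each segment carries start/end times (in seconds) and the recognised text.
    print(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {seg['text']}")
```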

Figure 3.1: Whisper ASR Model Variants by OpenAI [11]

The bar chart below (Figure 3.2) shows the Word Error Rate (WER) performance of several
Whisper models on Urdu transcription tasks. WER is a standard metric for speech recognition
accuracy, where a lower WER denotes better transcription quality.
Six Whisper model variants (tiny, base, small, medium, large, and turbo) are included in this
comparison. Every model was evaluated consistently and fairly on the same Urdu audio collection.
Using the turbo model adopted in this project, we obtain a WER of around 0.16. This performance
presents a good balance between speed and accuracy and is comparable to that of the medium model.
Figure 3.2: Comparison of Word Error Rate (WER) for different Whisper models on
Urdu

3.1.5 Speaker Diarization Techniques


While Whisper excels at transcription, it does not inherently support speaker diariza-
tion—the process of distinguishing and labeling multiple speakers within an audio file. To
address this, the system incorporates a custom diarization module based on embedding
extraction and clustering strategies.

• Embedding Extraction:
The system captures speaker-specific vocal characteristics using pre-trained speaker
embedding models such as ECAPA-TDNN. These embeddings are derived from fixed-
length segments of audio aligned with the timestamped output from the Whisper ASR
model. Each segment is transformed into a high-dimensional vector that uniquely
represents the speaker’s vocal identity.

• Clustering:
Once the embeddings are extracted, an unsupervised clustering method is employed to group similar speaker segments. The primary approach used is K-Means clustering, which is effective in scenarios where the number of speakers is known or can be estimated beforehand and offers a simple, efficient way to partition speaker embeddings.

Based on the clustering results, speaker labels such as Speaker 1, Speaker 2, etc., are
assigned to corresponding transcription segments. These are then aligned with Whisper’s
timestamps to produce a structured, speaker-annotated transcript.

By integrating a high-quality ASR system with a modular diarization pipeline, the
system achieves a lightweight yet powerful alternative to fully integrated solutions like
Pyannote. This hybrid approach proves especially valuable for long Urdu-language record-
ings where real-time speaker tracking and user-friendly editing support are critical. Ad-
ditionally, the modularity ensures that the system can be easily updated or extended with
improved diarization models in the future.
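A condensed sketch of the module described above is given below, assuming the SpeechBrain ECAPA-TDNN checkpoint and scikit-learn's K-Means; the segment dictionaries follow the shape produced by Whisper, and practical details such as minimum segment length or estimating the speaker count are omitted.

```python
import torchaudio
from sklearn.cluster import KMeans
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer releases

# Pre-trained ECAPA-TDNN speaker encoder (assumed checkpoint name).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def diarize(wav_path: str, segments: list, num_speakers: int = 2) -> list:
    """Assign a speaker label to each Whisper segment via ECAPA embeddings and K-Means."""
    waveform, sr = torchaudio.load(wav_path)          # mono, 16 kHz after preprocessing
    embeddings = []
    for seg in segments:
        start, end = int(seg["start"] * sr), int(seg["end"] * sr)
        chunk = waveform[:, start:end]                # audio for this timestamped segment
        emb = encoder.encode_batch(chunk)             # shape: (1, 1, embedding_dim)
        embeddings.append(emb.squeeze().detach().numpy())
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
    for seg, label in zip(segments, labels):
        seg["speaker"] = f"Speaker {label + 1}"       # attach the cluster as a speaker tag
    return segments
```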

3.1.6 Design Diagram


1. Flowchart Diagram

Figure 3.3: Flowchart of National Language Processing(NLP)

The NLP Diarize system implements an automated pipeline for meeting audio. The process starts
when a user submits an audio file. The system then preprocesses this audio and transcribes it
into timestamped text segments. Different speakers are then located and labeled using speaker
clustering methods. If processing succeeds, the user is shown a diarized transcript; otherwise,
an error is shown. This streamlined approach turns spoken meeting material into a structured,
speaker-attributed textual form.

3.1.7 System Architecture


The system architecture of the Urdu Meeting Transcription web application is designed to
create speaker-attributed transcripts by processing uploaded audio recordings. The procedure
starts when the user uploads an audio file via the frontend web application. This file is sent
to the backend API server, which handles all processing. The backend first forwards the audio
to the preprocessing module, which segments and prepares the data. The audio is then passed to
two separate modules: the Urdu Transcription Module and the Speaker Diarization Module. The
transcription module turns speech into text, while the diarization module identifies and
separates individual speakers.

Figure 3.4: System Architecture of National Language Processing

The Transcript Assembly Module receives both outputs and combines speaker information with
the transcribed text to create a complete transcript. This final transcript is saved to the
Storage System, from which the user may access or download it via the frontend interface.
This design guarantees a disciplined data flow from input to final output with speaker
separation, supporting reliable Urdu transcription.
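To make the assembly step concrete, here is a minimal sketch, assuming the speaker-labelled, timestamped segment list produced by the diarization step; the grouping rule and the output layout are illustrative rather than the system's exact export format.

```python
def assemble_transcript(segments: list) -> str:
    """Merge consecutive segments from the same speaker into a timestamped, readable transcript."""
    lines, current_speaker, buffer, start = [], None, [], 0.0
    for seg in segments:
        if seg["speaker"] != current_speaker:
            if buffer:  # flush the previous speaker's turn
                lines.append(f"[{start:8.2f}s] {current_speaker}: {' '.join(buffer)}")
            current_speaker, buffer, start = seg["speaker"], [], seg["start"]
        buffer.append(seg["text"].strip())
    if buffer:  # flush the final turn
        lines.append(f"[{start:8.2f}s] {current_speaker}: {' '.join(buffer)}")
    return "\n".join(lines)

# The resulting string can be saved directly as the TXT export or fed to a DOCX/PDF writer.
```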

3.1.8 Web Application Interface Development


The Urdu Speaker Diarization and Transcription System provides a user-friendly web-
based interface designed to simplify interaction for users with varying technical back-
grounds. The front-end is developed using HTML, Tailwind CSS, and JavaScript, ensur-
ing responsiveness and accessibility across devices.
There are four primary pages to the web application:

• Home Page: Acting as the landing page, the home page briefly introduces the program, its goals, and its features, and gives simple access to the other areas of the website.

• Transcribe Page: Users can upload Urdu or mixed-language (Urdu-English) audio files on this core functional page. Upon upload, the file is sent to the backend API for processing. Once processed, the transcribed text is returned and shown together with speaker labels and timestamps.

• About Page: Describes the development of the system, its objectives, and underlying technologies such as the Whisper ASR model and speaker embedding-based diarization, and notes the developers’ contributions.

• FAQs Page: Answers frequently asked questions, helping users grasp how the system operates, which formats are supported, processing constraints, and basic troubleshooting.

3.1.9 Backend Integration using FastAPI


The system combines the speaker diarization and transcription elements via a custom-built
RESTful API created using FastAPI. This backend is tuned to work in GPU-enabled settings such
as Google Colab to guarantee fast and scalable processing, especially for long-form audio inputs.
The integration process consists of the following steps:

• A user submits an audio recording via the Transcribe Page.

• The front-end forwards this file to a FastAPI endpoint using a POST request.

• The backend handles audio preprocessing, Whisper ASR transcription, and speaker diarization with embeddings and clustering algorithms, among other tasks.

• The final result, a structured transcript including timestamps and speaker labels, is returned in JSON format.

• The front-end then displays this information to the user.

This modular approach lets the system be readily maintained, upgraded with better models, or
deployed on other cloud platforms with GPU support (such as AWS or GCP). FastAPI ensures
smooth, high-performance interaction between the front-end and back-end components.
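The thesis does not reproduce the endpoint code, so the sketch below only suggests how such a FastAPI route could tie the earlier pieces together; it reuses the hypothetical preprocess_audio, model, and diarize names from the previous sketches, and the route path, accepted extensions, and response shape are assumptions rather than the project's exact API.

```python
import os
import shutil
import tempfile

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI(title="Urdu Meeting Transcription API")

@app.post("/transcribe")
async def transcribe_endpoint(audio: UploadFile = File(...)):
    """Accept an uploaded recording, run the pipeline, and return a speaker-labelled transcript."""
    if not audio.filename.lower().endswith((".wav", ".mp3", ".m4a")):
        raise HTTPException(status_code=400, detail="Unsupported file format")
    # Store the upload temporarily for processing.
    suffix = os.path.splitext(audio.filename)[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(audio.file, tmp)
        tmp_path = tmp.name
    try:
        wav_path = preprocess_audio(tmp_path)               # mono 16 kHz WAV (see earlier sketch)
        result = model.transcribe(wav_path, language="ur")  # Whisper ASR (see earlier sketch)
        segments = diarize(wav_path, result["segments"])    # speaker labels (see earlier sketch)
        return {"segments": segments}                       # JSON: start, end, text, speaker
    finally:
        os.remove(tmp_path)
```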

3.1.10 Testing System


Throughout several stages of development, a thorough testing plan was used to guarantee the
usability, performance, and dependability of the system. The testing focused on individual
module validation, evaluation of integrated functionality, and user experience assessment.

1. Unit Testing:
Every fundamental component of the system was tested separately to ensure that, prior to integration, each part performs as it should. This comprised:

• The Whisper model was validated against noise and accent fluctuations by
testing several Urdu and Urdu-English audio samples, hence verifying its tran-
scription accuracy and resilience.
• Correct speaker separation was ensured by testing individually the embedding
extraction and clustering components using synthetic and real-world speech
samples.
• Tests guaranteed the right conversion of supplied audio files into the necessary
mono-channel 16kHz WAV format.
• FastAPI endpoints were tested under several conditions for correct response
formats, file validation, error handling, and response times.

2. Integration Testing:
The whole pipeline was gradually combined as individual units passed their tests.
Integration testing proved that data flows naturally from the frontend to the backend
and via the ML pipeline and that components interact successfully. More specifically:

• Front-end file uploads were tested to guarantee they reached the backend accurately.
• The backend’s capacity to trigger the ASR and diarization pipeline, process outputs, and produce structured results was verified.
• System resilience was examined using edge scenarios including extended audio durations and empty or invalid file types.

3. System Testing:
The system was examined as a whole to guarantee that it satisfies the overall project objectives. This stage confirmed end-to-end functionality comprising:

• Uploading Urdu or Urdu-English audio files.
• Processing and generating complete, speaker-labeled transcriptions.
• Displaying results on the web interface with speaker tags and timestamps.
• Exporting final transcripts in TXT format.

4. Functional Testing:
Functional testing confirmed that every feature performs in line with the software requirements. This comprised:

• Components of the user interface (buttons, forms, file inputs, etc.) were tested for intended behavior.
• Transcription and diarization outputs were matched against expected results for both accuracy and format.
• Tests of error messages, loading states, and user feedback ensured responsiveness and clarity.

5. Usability Testing:
Usability testing evaluated the system from the end-user’s point of view, assessing user satisfaction and ease of use. Important areas under examination included:

• Easy navigation among the Home, Transcribe, About, and FAQ pages.
• Accessibility, including clarity of instructions and responsiveness across devices such as desktops and mobiles.
• Test users worked through the full interaction flow, from uploading a file to viewing and analyzing results.
• Users were requested to offer comments on system performance, visual clarity, and result interpretability.

3.2 Project timeline

The following Gantt charts represent the project timeline, depicting how work was divided
between the two team members and how deliverables were completed over the span of the project.

Figure 3.5: Project Gantt Chart (a)

Figure 3.6: Project Gantt Chart (b)

3.3 Experimental/Simulation setup

Our experimental/simulation setup is as follows:

• Operating System: Windows 11

• Hardware Components:

– Laptop Specifications
∗ DESKTOP-6AV1I6J
∗ RAM: 16.0 GB (15.7 GB usable)
∗ Processor: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz 1.90 GHz.
∗ GPU: NVIDIA GeForce GTX 1650
– Google Colab Specifications
∗ GPU: Tesla T4
∗ RAM: 13 GB

∗ Disk Space: 108 GB

• Tools: Google Colab, Whisper (ASR model), Pyannote, SpeechBrain, VS Code, FastAPI, Google Drive

• Documentation Resources: LaTeX

• System testing and running can be performed on any device.

3.4 Evaluation Parameters


The following key performance indicators will be used to assess the Urdu Meeting Transcription System:

• Accuracy: For Urdu and mixed Urdu-English audio, including multi-speaker settings, the transcription system should reach at least 85% accuracy, i.e., roughly a word error rate (WER) of 15% or less (a short WER computation sketch follows this list).

• Speaker Diarization Quality: In meetings involving up to five participants, the diarization module must distinguish between speakers with at least 90% accuracy.

• User Satisfaction: Feedback from educators, researchers, and office workers, together with usability testing, will be used to evaluate the simplicity and effectiveness of the transcription interface. Exported TXT transcripts should contain no formatting errors or data loss.

• Scalability: The system will be assessed for its capacity to manage higher workloads, such as processing several audio files or supporting several concurrent users, without performance deterioration.

• Robustness to Audio Quality: To ensure dependable operation in real-world use scenarios, the system will be evaluated against diverse audio inputs, including recordings with background noise, different accents, and varying speech clarity.
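
To make the accuracy target measurable, transcription quality is normally expressed as word error rate (WER); an accuracy of at least 85% corresponds roughly to a WER of 15% or less. The short sketch below computes WER with the open-source jiwer library (an assumption; any edit-distance-based implementation would serve) on a small illustrative example.

from jiwer import wer

reference = "اجلاس کا آغاز صبح دس بجے ہوا"    # ground-truth transcript (illustrative)
hypothesis = "اجلاس کا آغاز صبح دس بجے ہوگا"  # system output with one wrong word

error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%} | approximate accuracy: {1 - error_rate:.2%}")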

CHAPTER 4: RESULT AND DISCUSSION

The main elements and techniques applied in building the Urdu Speaker Diarization and Transcription System are described in this chapter. It addresses speaker diarization methods, backend API integration, application of the Whisper ASR model, and web interface development. The chapter also highlights the testing techniques, the difficulties experienced during development, and the remedies used to raise system performance and accuracy.

4.1 Utilization (end users/beneficiaries)

The main end users of the Urdu Speaker Diarization and Transcription System are professionals and organizations who conduct official meetings, interviews, academic discussions, or legal proceedings in the Urdu language or in a code-switched Urdu-English format. These comprise:

• Researchers and Academics: For speaker-wise transcriptions used to record and analyze conversations, lectures, or interviews.

• Journalists and Media Professionals: To faithfully record press conferences, interviews, and reports, including speaker identity.

• Legal and Government Bodies: For recording investigation proceedings, witness testimony, or multilingual hearings.

• Healthcare Providers: Psychologists and psychiatrists can use the system for recording therapy sessions in Urdu.

• Business Professionals and NGOs: To create transcripts of team conversations, board meetings, and field interviews involving several speakers.

4.2 Simulation results

4.2.1 Test Cases


Testing was conducted to verify that the system accurately identifies and labels multiple speakers, consistently transcribes Urdu and Urdu-English audio, and displays the result free from mistakes. This ensured reliable performance across the web interface for users of different technical backgrounds. From audio upload to final transcript display, every element was evaluated for correctness, stability, and usability. The objectives were to guarantee consistent performance, high-quality speaker-labeled transcripts, and a user experience free of errors or interruptions. Testing for National Language Processing was conducted manually. Numerous test cases were run; a few of them are presented here:

Table 4.1: Test case to check Web Application Loading.

Test ID Test-1
Test name Web Interface Loads Successfully
Date of test 20/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The Home Page should load when the user opens
the application URL
Input Navigate to the application URL in a browser
Expected output Home Page loads with navigation to Transcribe,
About, and FAQs pages
Actual output Home Page loaded successfully with all navigation
links working
Test Role (Actor) Team Member
Test verified by Supervisor

Table 4.2: Test case to check audio upload and transcription.

Test ID Test-2
Test name Audio File Upload and Transcription Processing
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The system should accept and transcribe an Urdu
or mixed Urdu-English audio file
Input Upload a sample Urdu audio file (WAV format)
from the Transcribe page
Expected output Transcription with speaker labels and timestamps
displayed on screen
Actual output Transcription generated successfully with correct
speaker segmentation
Test Role (Actor) Developer
Test verified by Supervisor

Table 4.3: Test case to check handling of unsupported file formats.

Test ID Test-3
Test name Handling of Invalid Audio File Format
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The system should reject unsupported file types (e.g., .mp3, .aac)
Input Upload an unsupported file format from the Transcribe page
Expected output Error message shown: "Unsupported file format.
Please upload a WAV file."
Actual output Proper error message displayed, upload rejected
Test Role (Actor) Developer
Test verified by Supervisor

Table 4.4: Test case to check speaker diarization accuracy.

Test ID Test-4
Test name Accurate Speaker Diarization
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The system should correctly identify and segment
speakers in multi-speaker audio
Input Upload a 2-minute audio file with two alternating
speakers
Expected output Transcript segments labeled as Speaker 1 and
Speaker 2 with correct timestamps
Actual output Diarization accurate; each speaker segment labeled
correctly
Test Role (Actor) Tester
Test verified by Supervisor

Table 4.5: Test case to evaluate system performance on long audio input.

Test ID Test-5
Test name Processing Long Audio (1 Hour)
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The backend should process long audio files efficiently and without crashing
Input Upload a 1-hour recorded meeting in Urdu
Expected output Full transcript with speakers and timestamps returned in under 10 minutes
Actual output Transcript returned in 9 minutes; no crash or timeout observed
Test Role (Actor) Developer
Test verified by Supervisor

Table 4.6: Test case to verify transcript download functionality.

Test ID Test-6
Test name Transcript Download Functionality
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description This test ensures that users can download the final transcription file after processing is complete. The download should work for .txt format.
Input Upload a valid Urdu audio file, wait for transcription, and click the download button for the transcript.
Expected output Transcript is successfully downloaded in the selected format with proper file name and content.
Actual output Transcript downloaded as "transcript.txt"; content matched the displayed result.
Test Role (Actor) End User
Test verified by Supervisor

Table 4.7: Test case to verify front-end responsiveness on different devices.

Test ID Test-7
Test name Responsive UI on Mobile and Desktop
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The interface should render properly across screen
sizes and remain usable
Input Open application on desktop browser and Android
mobile browser
Expected output Layout adjusts correctly; all elements visible and
functional
Actual output Responsive design works as intended on both devices
Test Role (Actor) Frontend Developer
Test verified by Supervisor

Table 4.8: Test case to verify frontend behavior during API failure.

Test ID Test-8
Test name Backend API Failure Handling
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The frontend should handle API downtime or network errors gracefully
Input Disable internet and try to upload an audio file for
transcription
Expected output "Connection error. Please try again later." mes-
sage displayed
Actual output Connection error message appeared correctly
Test Role (Actor) Frontend Developer
Test verified by Supervisor

Table 4.9: Test case to validate JSON response from API.

Test ID Test-9
Test name API Response Format Validation
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription System
Description The backend should return a JSON response with
keys: "transcript", "speakers", "timestamps"
Input Upload a test audio file and inspect response
Expected output JSON structure with correct keys and valid data
Actual output JSON received with keys: transcript, speaker_id, start_time, end_time
Test Role (Actor) Backend Developer
Test verified by Supervisor

As shown above, all of these test cases passed, verifying that the system worked correctly and that the user requirements were met.
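
To make the API contract checked in Table 4.9 concrete, the sketch below expresses the response structure as Pydantic models, which is how FastAPI typically validates and serialises responses. The field names follow the keys observed in the test output (speaker_id, start_time, end_time); the exact schema shown here is an illustrative assumption rather than a verbatim copy of the project's code.

from typing import List

from pydantic import BaseModel  # Pydantic v2 assumed

class Segment(BaseModel):
    speaker_id: str     # e.g. "Speaker 1"
    start_time: float   # seconds from the start of the audio
    end_time: float
    transcript: str     # Urdu / Urdu-English text for this segment

class DiarizedTranscript(BaseModel):
    segments: List[Segment]

# A representative payload such a response might carry:
example = DiarizedTranscript(
    segments=[
        Segment(speaker_id="Speaker 1", start_time=0.0, end_time=4.2,
                transcript="اجلاس کا آغاز کرتے ہیں"),
        Segment(speaker_id="Speaker 2", start_time=4.2, end_time=9.8,
                transcript="شکریہ، پہلا نکتہ بجٹ کے بارے میں ہے"),
    ]
)
print(example.model_dump_json(indent=2))  # model_dump_json is the Pydantic v2 API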

4.3 Application demo screen shots

The web application is designed around a professional, simple, and efficient user interface for users who need audio transcription and speaker diarization services. It is developed with a modern frontend stack (HTML, CSS/Tailwind CSS, JavaScript) and a Python (FastAPI) backend. The following demo screenshots of the NLP Diarize: Urdu Meeting Transcription and Diarization System web application show its important modules and user interactions:

4.3.1 Home Page


The Home Screen is the main entry point and primary landing page of the NLP Diarize web application, designed to give users an easy and simple route to its core Urdu audio transcription and speaker diarization capabilities. As shown in Figure 4.1, this page presents the goal and branding elements of the application right away, typically containing the project title or a distinctive logo that orients the user. Prominently placed buttons and navigation links direct visitors to the major parts of the application. Along with links to informational pages such as "About" and "FAQs", these include a main call to action leading to the "Transcribe Audio" page, where users may upload their recordings and set processing options. The overall design and layout of the Home Screen emphasize a clean, professional, and user-centric experience, ensuring that users can quickly grasp the capabilities of the application and move efficiently to their intended tasks.

Figure 4.1: Uppermost Section of NLP Home Page.

Figure 4.2: Middle Section of NLP Home Page.

Figure 4.3: Lower Section of NLP Home Page.

4.3.2 About Page


The "About" page of the NLP Diarize web application (Figure 4.4) gives visitors an understanding of the core of the project. It describes the AI technologies used, such as Whisper and speaker embedding models, introduces the development team, and summarises the aims for future improvements to Urdu and mixed-language audio analysis. This page seeks to be transparent and to help users grasp the goals and capabilities of the system.

Figure 4.4: Upper Section of NLP About Page.

Figure 4.5: Lower Section of NLP About Page.

4.3.3 FAQ Page
The "FAQs" (Frequently Asked Questions) page of the NLP Diarize web application (Figure 4.6) is designed to proactively address common user inquiries and to offer easily accessible information about the functionality, capabilities, and constraints of the system. Usually presented in an accordion-style arrangement for a neat and orderly layout, this section lets users readily expand each question to examine its answer. The material covers important topics such as supported audio formats, Urdu and mixed-language transcription accuracy, the speaker diarization technique, and any restrictions on audio file size or duration. It also clarifies how the system handles situations such as an unknown number of speakers and offers general information about data security and privacy. By providing clear and concise answers to expected inquiries, the FAQs page seeks to increase user knowledge, reduce support needs, and improve the overall user experience of the NLP Diarize tool.

Figure 4.6: FAQ page of NLP.

4.3.4 Transcribe Page
The "Transcribe Page" is the main hub where users of the NLP Diarize web application interact directly with the core transcription and speaker diarization capabilities. Figures 4.7, 4.8, and 4.9 show how this page is thoughtfully laid out in a two-column arrangement to guarantee a clear and effective user workflow. The "Controls" section in the left column allows users to choose an audio file for processing, give the estimated number of speakers in the recording, and optionally enter a language code. A prominent "Upload and Diarize" button starts the process. Beneath the controls, a dedicated status section gives the user real-time feedback on the current operating state, such as "Uploading", "Processing", or "Diarization complete", together with any error warnings. The right column is reserved for the "Transcription Output", where the resulting diarized transcript, complete with speaker labels and timestamps, is shown in a readable monospace font inside a scrollable area. Once the results are produced, this section shows a "Download Transcript (.txt)" button that lets users quickly save the output. From input to result retrieval, the methodical design aims to provide a seamless experience.
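
The same interaction that the Transcribe Page performs can also be exercised programmatically against the backend. The sketch below uses the requests library; the endpoint path and form-field names (num_speakers, language) are assumptions for illustration and simply mirror the controls described above.

import requests

API_URL = "http://localhost:8000/transcribe"  # assumed local backend address and path

with open("meeting.wav", "rb") as audio:
    response = requests.post(
        API_URL,
        files={"file": ("meeting.wav", audio, "audio/wav")},
        data={"num_speakers": 2, "language": "ur"},  # optional hints, like the UI controls
        timeout=600,  # long recordings can take several minutes to process
    )

response.raise_for_status()
for segment in response.json().get("segments", []):
    print(f'{segment["speaker_id"]} '
          f'[{segment["start_time"]:.1f}s-{segment["end_time"]:.1f}s]: {segment["transcript"]}')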

Figure 4.7: Initial user interface of the NLP Transcribe Page.

Figure 4.8: Transcribe Page during audio processing.

Figure 4.9: ’Transcribe Page’ displaying the diarized results.

4.4 Results discussion

The main aim of the Urdu Speaker Diarization and Transcription System was to develop a web-based solution able to automatically transcribe Urdu and mixed Urdu-English audio while distinguishing between different speakers. The project sought to enable correct transcription through a simple online interface and to address the lack of easily available tools for processing Urdu audio in multi-speaker situations. All of the system's main goals were achieved. Users of the application can upload long-form Urdu audio recordings, run them through the diarization and transcription pipeline, and view transcripts clearly segmented by speaker changes. The transcript is available in .txt form for further use. Built with HTML, Tailwind CSS, and JavaScript, the minimal and simple web interface provides access even to people with limited technical knowledge. FastAPI drives the backend, which handles audio processing efficiently and interfaces cleanly with the diarization and transcription components. By combining a clean user interface with sophisticated speaker diarization and transcription capability, the application fulfils its stated purpose as a useful tool for storing, analyzing, and reviewing Urdu-language meetings, interviews, and discussions. All main project goals, including multi-speaker diarization, transcription, downloadable output, and a responsive UI, were successfully implemented and confirmed through functional testing.
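
To illustrate how such a diarization and transcription pipeline can be assembled, the sketch below combines Whisper segments with SpeechBrain ECAPA speaker embeddings and agglomerative clustering. It is a simplified assumption of the approach, intended only as an outline, and omits the error handling and GPU management of the deployed system.

import torchaudio
import whisper
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer versions

AUDIO = "meeting_16k_mono.wav"  # assumed: already converted to mono 16 kHz WAV
NUM_SPEAKERS = 2                # estimate supplied by the user

# 1. Transcribe with Whisper; each segment carries start/end times and text.
asr = whisper.load_model("medium")
segments = asr.transcribe(AUDIO, language="ur")["segments"]

# 2. Extract a speaker embedding for every Whisper segment.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
waveform, sr = torchaudio.load(AUDIO)   # expected to be 16 kHz mono here
embeddings = []
for seg in segments:
    start, end = int(seg["start"] * sr), int(seg["end"] * sr)
    chunk = waveform[:, start:end]                       # shape (1, samples)
    emb = encoder.encode_batch(chunk).squeeze().detach().cpu().numpy()
    embeddings.append(emb)

# 3. Cluster the embeddings into the estimated number of speakers.
labels = AgglomerativeClustering(n_clusters=NUM_SPEAKERS).fit_predict(embeddings)

# 4. Attach speaker labels to produce a diarized, time-stamped transcript.
for seg, label in zip(segments, labels):
    print(f'Speaker {label + 1} [{seg["start"]:.1f}s-{seg["end"]:.1f}s]: {seg["text"].strip()}')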

4.5 Detailed work plan

The project was carried out in line with a set of well-defined milestones to ensure steady and timely progress. Every phase contributed to creating a fully working Urdu Speaker Diarization and Transcription System capable of processing multilingual audio input and producing accurate, speaker-labeled transcriptions. The main milestones are listed below:

1. Research and Requirement Analysis

• Carefully reviewed the existing body of research on speaker diarization and automatic speech recognition (ASR), particularly for low-resource languages like Urdu.
• Investigated tools and technologies including clustering methods, Whisper, Pyannote, and FastAPI to identify the best-fit components for diarization and transcription.
• Studied current transcription systems such as Otter.ai and HappyScribe to understand industry standards and user expectations.
• Identified important use cases for Urdu transcription and speaker diarization, including government records, interviews, and academic meetings.

2. Model Selection, Adaptation, and Backend Development

• Analyzed pre-trained ASR and speaker embedding models for compatibility with Urdu and mixed-language datasets.
• Selected Whisper for transcription and performed diarization with a speaker embedding model combined with a clustering technique.
• Separately refined and verified the diarization and transcription components to ensure language coverage and correctness.
• Designed a FastAPI backend to manage model inference, response delivery, and audio file processing (an illustrative endpoint sketch is given after this work plan).
• Implemented and tested support for 1–2 hour audio files while optimizing for GPU-based deployment.

3. Web Application Design and Development

• Designed a responsive, clean frontend with HTML, Tailwind CSS, and JavaScript to ensure accessibility.
• Designed the major pages, including About, Login, Signup, Home, and Transcribe.
• Integrated the frontend with the backend API, uploading audio and retrieving diarized transcripts via asynchronous requests.
• Implemented transcript download in .txt format.
• Ensured compatibility with desktop and mobile browsers to enhance accessibility.

4. Testing and Evaluation

• Tested individual components, including audio upload, diarization, transcription, and download capabilities, at the unit level.
• Conducted integration testing to verify the interaction between frontend and backend modules and confirm the end-to-end workflow.
• Tested edge cases such as large files, silence detection, and mixed-language input by simulating several usage scenarios.
• Gathered feedback from supervisors and peers to improve usability and system performance.
• Iteratively improved the application based on identified issues and user feedback.
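
As a companion to the backend milestone above, the following is a minimal FastAPI endpoint sketch showing how an upload could be validated and handed to the processing pipeline. The /transcribe path, the form-field names, and the run_pipeline placeholder are assumptions for illustration; run_pipeline stands in for the Whisper + embedding + clustering pipeline sketched earlier in this chapter.

import shutil
import tempfile
from typing import Optional

from fastapi import FastAPI, File, Form, HTTPException, UploadFile

app = FastAPI()

def run_pipeline(wav_path: str, num_speakers: int, language: Optional[str]) -> dict:
    """Placeholder for the ASR + diarization pipeline (see the earlier sketch)."""
    raise NotImplementedError

@app.post("/transcribe")  # assumed endpoint path
async def transcribe(
    file: UploadFile = File(...),
    num_speakers: int = Form(2),
    language: Optional[str] = Form(None),
):
    # Reject anything that is not a WAV upload, mirroring Test-3 in Table 4.3.
    if not file.filename.lower().endswith(".wav"):
        raise HTTPException(
            status_code=400,
            detail="Unsupported file format. Please upload a WAV file.",
        )
    # Persist the upload to a temporary file before running the models on it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        wav_path = tmp.name
    return run_pipeline(wav_path, num_speakers, language)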

4.6 Budget requirements

Table 4.10: Budget Requirements Description

System Requirements Budget


Google Colab Pro 9.8 USD
Total Cost 9.8 USD

4.7 Market Forecasting

We assessed the demand for speech-to-text services by means of market research on current transcription tools. Our study revealed a notable gap in systems supporting the Urdu language with speaker diarization, especially for long-form audio. The application targets fields including education, journalism, law, and government, wherever precise Urdu transcription with speaker separation is required. Its competitive edge comes from key elements including diarization, a simple web interface, and Urdu-English code-mixed support. Subscriptions, pay-per-use API access, and institutional licenses can all help to create income. We intend to leverage academic networks, tech incubators, and government channels for dissemination and reach. Because of low competition and growing demand in local and institutional sectors, the market presents strong possibilities for a specialized Urdu transcription solution.

CHAPTER 5: CONCLUSION

The Urdu Speaker Diarization and Transcription System was designed primarily to transcribe Urdu and code-mixed audio recordings and to distinguish between different speakers. It meets the demand for an easily available transcription tool for the Urdu language, which is often underserved by mainstream voice processing technology. Combining speaker diarization with automatic speech recognition lets users upload long-form audio, process it quickly, and obtain precise transcripts. The simple and functional online interface further improves usability, so the system fits academic, governmental, and organisational use cases where documenting spoken Urdu is crucial. The findings verify that the main goals were met: diarization and transcription were effectively implemented and combined, the transcript download capability operated as intended, and user interaction was kept straightforward and understandable.

5.1 Limitations and Future Work


Although the project reached its main objectives, several areas still need work. These include:

• More sophisticated speaker embedding models and fine-tuning on Urdu-specific datasets would raise diarization accuracy even further.

• Future versions might provide real-time transcription and speaker identification in live meetings or webinars.

• Exporting transcripts in PDF, DOCX, and subtitle formats (e.g., SRT) would make the output more usable.

• An advanced transcript editor combining temporal navigation, audio playback, and speaker-name customisation.

• Integration with online platforms: building APIs or plugins for video conferencing systems such as Zoom or Google Meet would increase adoption.

ABBREVIATIONS

API: Application Programming Interface


DL: Deep Learning
ASR: Automatic Speech Recognition
FYP: Final Year Project
FYDP: Final Year Design Project
CNN: Convolutional Neural Network
HTML5: Hypertext Markup Language, version 5
CSS: Cascading Style Sheets
SRS: Software Requirements Specification
UI: User Interface
DOCX: Microsoft Word Document Format
WER: Word Error Rate
PDF: Portable Document Format
TXT: Text File
GPU: Graphics Processing Unit
AI: Artificial Intelligence
UAT: User Acceptance Testing
