NLP FYP II
Authors
Muhammad Ahmad 21-SE-21
Haris Bin Shakeel 21-SE-76
Supervisor
Engr. Maria Andleeb
Lecturer
UNDERTAKING
We, the undersigned, hereby certify that our project work titled National Language Pro-
cessing (Urdu) is the result of our own efforts, research, and original thought. We confirm
that this study has not been submitted elsewhere for assessment, award, or recognition.
Where information or content has been sourced from existing literature or external tools,
proper acknowledgements and citations have been provided as per standard academic con-
ventions. This project has been carried out in accordance with the ethical guidelines
and academic policies set forth by our institution. We have maintained integrity in both
the development and reporting of this research and have taken care to avoid any form
of academic misconduct or plagiarism. We fully understand that any violation of these
statements may result in disciplinary actions by the institution, including but not limited
to the revocation of our degree as well as academic penalties.
Signature of Student
Muhammad Ahmad
21-SE-21
PLAGIARISM DECLARATION
We take full responsibility for the project work conducted during the Final Year Project
(FYP) titled “National Language Processing (Urdu)”. We solemnly declare that the
project work presented in the FYP report is carried out solely by us, without significant
assistance from any external individual; however, any minor help taken has been duly ac-
knowledged. Furthermore, we confirm that this FYP (or any substantially similar project
work) has not been submitted previously to any other degree-awarding institution, either
within Pakistan or abroad.
We further acknowledge that if we are found guilty of plagiarism in our FYP report,
the University reserves the right to withdraw our BSc degree. In addition, the University
reserves the right to publish our names on its official website that maintains a record of
students found guilty of plagiarism in their FYP reports.
Signature of Student
Muhammad Ahmad
21-SE-21
Signature of Supervisor
Engr. Maria Andleeb
ACKNOWLEDGEMENTS
We extend our profound gratitude to Engr. Maria Andleeb, our esteemed supervisor,
for her exceptional guidance, insightful feedback, and unwavering support throughout the
research and development of the project titled National Language Processing (Urdu).
Her expert advice and encouragement played a crucial role in shaping the direction and
outcomes of this project. We would also like to thank our reviewers and all participants
involved in the evaluation and testing phases of the project. Their valuable input and
critical remarks enabled us to enhance the functionality and usability of the system.
TABLE OF CONTENTS
Abstract i
Undertaking ii
Acknowledgements iv
Table of Contents v
1 INTRODUCTION 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Project goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Deliverables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 LITERATURE REVIEW 4
2.1 Automatic Speech Recognition . . . . . . . . . . . . . . . . . . 4
2.2 Research Gap in Urdu ASR Technology . . . . . . . . . . . 4
2.3 ASR for Urdu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Language-Switching and Mixed-Language ASR . . . . . . 5
2.5 Review of ASR Techniques . . . . . . . . . . . . . . . . . . . . . . . 5
2.6 Performance of Whisper for Urdu Transcription . . . . . 9
2.7 Previous Research on ASR . . . . . . . . . . . . . . . . . . . . 10
2.8 Word Error Rate (WER) in ASR Evaluation . . . . . . . 10
2.9 Architecture of Whisper and its Application for Urdu . 11
2.10 Market survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 PROPOSED SOLUTION 14
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Project timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Experimental/Simulation setup . . . . . . . . . . . . . . . . . 27
3.4 Evaluation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 RESULT AND DISCUSSION 29
5 CONCLUSION 45
References 46
Abbreviations 48
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 Introduction
The rapid development of digital technologies has greatly improved global communication by helping to overcome language barriers. In fields including education, industry, and legal systems, speech-to-text processing has become increasingly important for documentation, teamwork, and information retrieval. Although speech recognition has advanced, Urdu, a language used by over 230 million people globally, remains underserved in this domain. Most commercial transcription tools ignore Urdu and code-mixed Urdu-English speech in favor of English and other widely spoken languages. For Urdu-speaking populations, this technological divide compromises access, record-keeping, and communication in several important sectors.
Conventional ASR systems struggle in multilingual contexts, especially those involving intra-sentential code-mixing of Urdu and English. Professional settings such as business meetings, academic discussions, interviews, and court proceedings intensify these challenges, since speaker identification and transcription accuracy are crucial. To close this gap, this project proposes the development of the Urdu Meeting Transcription Generator, an artificial intelligence-powered speech-to-text system designed specifically for Urdu and mixed-language speech. It combines speaker diarization, timestamping, and noise reduction for maximum accuracy and contextual clarity, while using OpenAI's Whisper ASR model to manage language variability, background noise, and speaker diversity.
The system is optimized for GPU-accelerated environments, ensuring fast and dependable performance in professional use cases and supporting near real-time transcription of lengthy and complex audio. Users interact through a web interface that enables audio upload, transcript editing, and export in DOCX and PDF formats. This guarantees usability across a range of professional domains without requiring highly developed technical skills.
By providing a context-aware, scalable solution that empowers Urdu-speaking professionals and improves accessibility, inclusivity, and productivity across various industries, this project ultimately seeks to close the linguistic and technological gap.
1.2 Project goal
2. Use speaker diarization techniques to isolate and label particular speakers in record-
ings involving multiple speakers, making the resultant transcriptions more readable
and useful.
4. Develop an intuitive interface where users can upload audio files, edit and view
transcriptions, and export the completed content in popular formats like .txt and
.pdf.
5. Develop the system on a modular and scalable framework to incorporate future en-
hancements, including multilingual functionality, real-time transcription, and auto-
mated summarization of meetings via natural language processing methods.
1.4 Deliverables
2. An intuitive and interactive interface that allows users to upload meeting audio
files, edit the generated transcriptions, and export the finalized content in widely
used formats such as .docx and .pdf.
3. A detailed thesis report documenting all aspects of the project, including the sys-
tem architecture, literature review, proposed methodology, technical implementation,
evaluation metrics, and the overall workflow of the transcription and diarization
system.
CHAPTER 2: LITERATURE REVIEW
Speech-to-text technologies have changed the ways information is documented and retrieved and play a critical role in education, business, and legal services. Meeting transcription systems aim to automate the process of transcribing oral conversations into text. However, for languages like Urdu, transcription systems face significant challenges resulting from limited resources, linguistic complexity, and the widespread mixing of two major languages, Urdu and English. This literature review presents a thorough survey of current technologies, challenges, and opportunities, leading to the proposed formulation of the Urdu Meeting Transcription System.
ASR systems are designed to convert spoken language into text by leveraging AI models.
For languages like English and Mandarin, ASR systems achieve high accuracy due to
abundant datasets and robust language models. However, for Urdu, the development of
ASR systems is hindered by limited linguistic datasets and phonetic complexities [5] [14].
Multilingual ASR models, such as the Transformer-based systems, have demonstrated
success in low-resource languages, including Hindi and regional South Asian dialects [5].
For Urdu, these models need adaptation to its phonetic rules and unique script. Ali et
al. (2021) highlighted that integrating phoneme-based recognition systems and leveraging
transfer learning can improve Urdu ASR performance [1].
While substantial progress has been made in Automatic Speech Recognition (ASR) for
widely spoken languages, there exists a notable research gap when it comes to ASR sys-
tems for low-resource languages like Urdu. Despite the availability of large-scale datasets
and models like Whisper, Wav2Vec, and others, Urdu ASR still faces challenges such
as handling diverse accents, code-switching between Urdu and English, and dealing with
noisy environments. The lack of comprehensive and high-quality annotated speech data
for Urdu further limits the development of robust ASR systems. Additionally, existing
ASR systems often fail to provide satisfactory performance for Urdu’s unique linguistic
features, such as complex morphology and phonetic variations. Addressing these gaps
requires further research focused on improving speech data collection, fine-tuning exist-
ing models for Urdu-specific tasks, and exploring novel approaches like phoneme-based or
hybrid models to enhance transcription accuracy.
The biggest challenge for Urdu ASR is phonetic diversity. Urdu contains many phonemes borrowed from Arabic, Persian, and South Asian regional languages, and therefore needs a dedicated phoneme-aware recognition model. Models such as Wav2Vec 2.0 and Whisper have made substantial progress toward solving these issues. For instance, Whisper is trained on a very large multilingual dataset and has demonstrated the capability to transcribe speech in many languages, including Urdu, even when languages are mixed [12] [10].
Whisper, a powerful ASR model by OpenAI, leverages a multitask learning approach,
allowing it to handle diverse speech transcription tasks, including language identification,
multilingual speech recognition, and speech translation. This ability is particularly valuable
in Urdu-English code-switching scenarios, as Whisper’s robustness allows it to perform
well in detecting and transcribing both languages simultaneously [13].
One study of Whisper's behaviour [7] examined hallucinations, where entire phrases are inaccurately generated. The study analyzed 13,140 audio segments, revealing a hallucination rate of 1.4%, with higher rates for speakers with aphasia (1.7% versus 1.2% for controls). The study emphasized the need for better modeling and inclusive AI design to reduce transcription disparities, particularly for vulnerable populations.
The development of efficient ASR architectures has been another area of focus. “Squeeze-
former: An Efficient Transformer for Automatic Speech Recognition” [9] introduces Squeeze-
former, a hybrid attention-convolution model that addresses inefficiencies in the Con-
former architecture. The model incorporates a Temporal U-Net structure, simplified
Transformer-style blocks, and depthwise separable subsampling layers, reducing compu-
tational overhead while achieving a Word Error Rate (WER) of 6.50% on the LibriSpeech
test-other dataset. Squeezeformer outperforms Conformer with 1.4% improved accuracy
and up to 40% fewer FLOPs, making it suitable for real-world deployment with reduced
inference costs.
In “Turning Whisper into Real-Time Transcription System” [4], researchers adapted
Whisper into a real-time transcription and translation system, achieving an average la-
tency of 3.3 seconds and a WER of 8.1% for English transcription on the ESIC test
set. By employing the LocalAgreement-2 algorithm, Voice Activity Detection (VAD), and
chunk segmentation techniques, the system balances latency and quality. While Whis-
per streaming modes increase WER by 2–6% compared to offline processing, the system
demonstrated robustness in live multilingual settings, highlighting its practical usability.
However, limitations include potential overlap in training data and the need for further
optimization.
Another notable contribution is the study “Improving Multilingual ASR in the Wild
Using Simple N-best Re-ranking” [6], which introduces a method for enhancing ASR ac-
curacy in real-world multilingual scenarios. Using N-best re-ranking with features such as
language models, text length, and acoustic models, the approach improved Speech Language
Identification (SLID) accuracy from 18.1% to 83.1% for tail languages and reduced WER
from 67.4% to 39.3%. For Whisper, SLID accuracy improved by 6.1%, and WER dropped
by 2.0%. The lightweight implementation of this method underscores its cost-effectiveness
and adaptability across ASR systems.
Phoneme-based models have also shown promise in multilingual ASR, as demonstrated
in the paper “Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition
via Weakly Phonetic Supervision” [15]. The study utilized International Phonetic Alpha-
bet (IPA) transcriptions with grapheme-to-phoneme (G2P) models, achieving a WER of
6.56% on the CommonVoice dataset, outperforming subword-based and self-supervised
models. Whistle demonstrated particular efficiency in low-resource settings, reducing
WER by up to 18% compared to subword-based models and requiring 24% fewer training
epochs. This approach highlights the potential of phoneme-based modeling for low-resource
languages like Urdu.
While several ASR models, such as Wav2Vec, Google Speech-to-Text, and Amazon
Transcribe, are widely used, they have limitations for Urdu transcription. Wav2Vec re-
quires significant fine-tuning for languages with limited representation in its pre-trained
data, whereas Google and Amazon systems rely on cloud-based APIs, raising cost and
privacy concerns. Whisper, in contrast, is open-source, supports offline processing, and
is pre-trained on a diverse multilingual dataset, making it better suited for Urdu tran-
scription with code-switching and varied accents.
Despite Whisper’s advantages, studies like [7] raise concerns about hallucinations and
ethical considerations in ASR systems. Addressing these limitations through techniques
such as N-best re-ranking [6] or incorporating phoneme-based approaches [15] could en-
hance the system’s reliability. Additionally, architectures like Squeezeformer [9] offer
alternative pathways for achieving cost-effective, high-accuracy ASR.
In conclusion, OpenAI’s Whisper presents a robust foundation for Urdu speech-to-text
transcription. Its adaptability for real-time applications, multilingual capabilities, and
offline processing make it a strong candidate. However, leveraging insights from studies on
hallucination mitigation, phoneme-based modeling, and novel architectures could further
optimize the system, ensuring it meets the linguistic and computational requirements of
Urdu transcription.
Careless Whisper
• WER: N/A
• Latency/Cost: N/A
• Key Insights: Ethical and legal concerns raised due to hallucination issues, espe-
cially for vulnerable populations like aphasia speakers.
Whisper-Streaming
• WER: 8.1% (English), 9.4% (German), 12.9% (Czech)
• Latency/Cost: 3.3-second average latency (English)
• Key Insights: Trade-offs between latency, chunk size, and WER; practical appli-
cability in real-time scenarios.
N-best Re-ranking
• Key Insights: Tail languages and low-resource ASR benefit most; demonstrates the importance of feature selection for adaptability.
Squeezeformer
• WER: 6.5% (LibriSpeech test-other)
• Key Insights: Achieves state-of-the-art efficiency and accuracy for ASR tasks,
with scalable performance improvements.
Whistle (Weakly Phonetic Supervision)
• WER: 6.56% (Phoneme-based models), 9.3% (Subword-based models)
Recent performance analysis of Whisper for Urdu reveals encouraging results. Urdu-Whisper-large-v2 obtained a WER of 0.3241 in the conversational speech scenario, significantly outperforming models such as Conformer-MoE, which obtained a WER of 1.1624 on the same language. This makes Whisper potentially effective for transcription-based Urdu applications, provided that relevant speech data for fine-tuning is readily available [8].
2.7 Previous Research on ASR
Word Error Rate (WER) is a critical metric used to assess the performance of Automatic Speech Recognition (ASR) systems. It quantifies the errors in transcriptions by comparing the system's output to a reference transcription (ground truth). The formula for WER is:

WER = (S + D + I) / N

Where:
• S is the number of substitutions (incorrect words),
• D is the number of deletions (missing words),
• I is the number of insertions (extra words), and
• N is the total number of words in the reference transcription.
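To make the metric concrete, the following minimal sketch computes WER with a standard word-level edit distance; the example strings are hypothetical, and packages such as jiwer provide an equivalent off-the-shelf implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (S + D + I) / N, computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution over four reference words gives WER = 0.25
print(wer("assalam o alaikum everyone", "assalam o alaikum sab"))
```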
• Existing transcription tools struggle with meetings recorded in Urdu and those held in mixed-language settings (Urdu-English).
• The lack of functionalities like speaker identification and timestamping makes the
documentation process more complex.
• The existing framework rarely generates coherent and systematically arranged tran-
scripts, making it troublesome to use meeting documentation efficiently.
• Editing and Exporting Options: Enables users to edit their transcripts and
export them into various formats, such as PDF, DOCX, and TXT.
• Greater Accuracy: Fine-tuning the ASR model ensures reliable transcription out-
puts, even with mixed-language material.
• User Convenience: The system’s editing and exporting features allow users to
edit and use transcriptions as desired.
CHAPTER 3: PROPOSED SOLUTION
3.1 Methodology
The approach taken in creating the project National Language Processing (Urdu) is
described in this section. Combining aspects of the Waterfall and Agile models, a hybrid
development approach was adopted to ensure both structure and adaptability throughout
the process.
The project began with preliminary studies on the linguistic challenges involved in
processing Urdu, a low-resource language. This included analyzing its phonetic and syn-
tactic characteristics and exploring existing tools for speech recognition, transcription,
and speaker diarization.
During the requirements analysis phase, the main objectives were defined: pri-
marily, to transcribe Urdu speech and perform speaker diarization. To support model
training and evaluation, Urdu audio datasets with various speaker counts and regional
accents were collected.
In the design phase, the system architecture was outlined, including stages such
as audio preprocessing, embedding extraction, clustering, and transcription. The user
interface flow and API design were also completed during this stage.
The development phase involved implementing core components, including a custom
speaker diarization module using embeddings and clustering algorithms, a FastAPI-based
backend, and integration of Whisper for transcription. The system was optimized for
deployment in GPU-based environments.
In the testing phase, sample Urdu recordings were used to evaluate the application’s
transcription accuracy and speaker separation performance. Iterative improvements were
made to handle common challenges such as background noise and code-switching.
Using this methodology, an efficient, scalable, and practical Urdu NLP application was
developed to meet real-world requirements.
English. These discussions explained practical problems and user expectations around the
application. The gathered insights revealed the following essential requirements:
• Accurate transcription: The system must accurately transcribe both pure Urdu
and mixed-language (Urdu-English) audio information.
• Speaker diarization: The application must be able to identify and classify distinct
speakers, particularly in multi-participant meetings.
• Export functionality: Users must have the capability to download the final tran-
scription in many formats, including PDF, DOCX, and TXT.
• Support for long audio recordings: The system must proficiently manage ex-
tended audio recordings, ensuring optimised performance on GPU-based systems to
enhance processing speed.
These requirements became the basis for the design and development of a functional
and efficient Urdu Natural Language Processing program appropriate for use in the real
world.
• Uploading Urdu or multilingual audio recordings via the web-based user interface.
• Preprocessing and transforming the audio into a suitable format for analysis (a minimal conversion sketch follows this list).
• Employing the Whisper ASR model to convert audio material into text.
• Utilising a bespoke speaker diarization module that employs audio embeddings and
clustering methods to distinguish between various speakers.
• Creating timestamps for each segment of the transcription to improve clarity and
traceability.
• Enabling users to access, assess, and modify the transcription via an intuitive in-
terface.
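As an illustration of the preprocessing step mentioned in the list above, the sketch below converts an uploaded recording into the mono-channel, 16 kHz WAV format expected by the downstream models. It assumes the pydub package (with ffmpeg installed) is available; the file names are placeholders.

```python
from pydub import AudioSegment  # pip install pydub; requires ffmpeg on the system path

def to_mono_16k_wav(input_path: str, output_path: str) -> str:
    """Convert any supported audio file to mono-channel, 16 kHz WAV for the ASR pipeline."""
    audio = AudioSegment.from_file(input_path)            # input format inferred from the extension
    audio = audio.set_channels(1).set_frame_rate(16000)   # downmix to mono and resample to 16 kHz
    audio.export(output_path, format="wav")
    return output_path

# Hypothetical usage:
# to_mono_16k_wav("meeting_recording.m4a", "meeting_recording_16k.wav")
```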
• Home Page: Serves as the entry point of the program, providing users with
a succinct summary of the system’s functionalities and objectives. It features
intuitive navigation to all other aspects of the website.
• Transcribe Page: This is the fundamental operational page of the system.
Users may submit audio files in Urdu or code-switched (Urdu-English) via an
easy form. Upon submission, the audio file is transmitted to the backend API,
which processes it and delivers a transcription. The page thereafter presents:
– Transcription of text, categorised by speaker.
– Speaker designations (e.g., Speaker 1, Speaker 2).
– Time markers indicating the commencement of each speaker section.
– Export of the transcription in formats such as TXT and PDF.
• About Page: Outlines the system’s rationale, objectives, fundamental tech-
nologies (Whisper, embeddings-based diarization), and its significance to the
Urdu-speaking population. It additionally comprises developer information and
acknowledgements.
• FAQs Page: Offers responses to frequently asked questions, including supported file formats, average processing durations, use-case suggestions, and system constraints. It additionally instructs users on file uploads and result interpretation.
User Experience Emphasis: The interface is crafted for usability by experts, re-
searchers, and linguists with diverse technical proficiency, prioritising accessibility,
clarity, and streamlined interaction stages.
2. Backend Development
The backend is constructed with FastAPI, selected for its efficiency, adaptability, and seamless integration with contemporary machine learning processes. It functions as the intermediary between the user-facing frontend and the machine learning models; a minimal endpoint sketch follows the list below. Essential backend functionalities comprise:
• File Management: The backend receives audio files from users and tem-
porarily stores them in a secure area for processing.
• Model Execution and Integration: Upon file reception, the backend invokes
the transcription and speaker diarization modules. It guarantees optimal use
of GPU resources for rapid processing.
• Response Structuring: The backend structures the transcription output with
distinct speaker labels and timestamps after processing, subsequently sending
the structured results to the front-end.
• Scalability and Security: The modular API is designed to accommodate
several concurrent users and has fundamental error handling and validation to
ensure data integrity.
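A minimal sketch of such an upload endpoint is given below. The route name, the placeholder helper functions, and the response fields are illustrative assumptions rather than the exact production code; they only show how FastAPI can receive a file, validate it, and return structured results.

```python
import shutil
import tempfile
from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

def run_transcription(audio_path: str) -> list[dict]:
    """Placeholder for the Whisper wrapper; returns timestamped text segments."""
    return [{"start": 0.0, "end": 2.5, "text": "..."}]

def run_diarization(audio_path: str, segments: list[dict]) -> list[dict]:
    """Placeholder for the embedding-and-clustering wrapper; attaches speaker labels."""
    return [{**seg, "speaker_id": "Speaker 1"} for seg in segments]

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    # Reject unsupported formats before any processing
    if not file.filename.lower().endswith(".wav"):
        raise HTTPException(status_code=400,
                            detail="Unsupported file format. Please upload a WAV file.")
    # Temporarily store the upload for processing
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        audio_path = tmp.name
    segments = run_transcription(audio_path)
    labeled = run_diarization(audio_path, segments)
    return {"transcript": labeled}
```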
• Tailored Speaker Diarization Module:
– The system uses proprietary embeddings (ECAPA), derived from the audio
signal, rather than depending on comprehensive diarization pipelines like
Pyannote.
– Clustering algorithm K-Means is utilized to categorize segments according
to speaker identification.
– The diarization process is efficient and flexible, facilitating its optimization
and integration with the overall pipeline.
• Timestamp Alignment: Each spoken part is associated with specific time
intervals in the audio, facilitating navigation, editing, and verification. This
also facilitates the segmentation of content for subsequent applications such as
summarization.
• Output Generation: The final output comprises a comprehensive transcription, categorized by speaker, enhanced with timestamps, and provided in exportable formats (PDF, DOCX, TXT); a sketch of the export step follows this list.
• GPU Optimization: The pipeline is meticulously calibrated for GPU-enabled
settings, significantly decreasing the duration needed to transcribe and diarize
lengthy recordings (1–2 hours or more). This facilitates practical application
in real-time or near-real-time environments.
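As one possible way to generate the exportable output mentioned above, the sketch below writes a speaker-annotated transcript to a Word document using python-docx; the segment structure and file names are illustrative assumptions.

```python
from docx import Document  # pip install python-docx

def export_docx(segments: list[dict], output_path: str) -> None:
    """Write speaker-labeled, timestamped segments into a .docx transcript."""
    doc = Document()
    doc.add_heading("Meeting Transcript", level=1)
    for seg in segments:
        # Produces lines like "[00:00:12 - 00:00:20] Speaker 1: ..."
        doc.add_paragraph(f"[{seg['start']} - {seg['end']}] {seg['speaker']}: {seg['text']}")
    doc.save(output_path)

# Hypothetical usage:
# export_docx([{"start": "00:00:12", "end": "00:00:20",
#               "speaker": "Speaker 1", "text": "..."}], "transcript.docx")
```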
Whisper forms the basis upon which additional processing such as speaker separation
and timestamping is built, allowing the system to provide high-quality, ordered transcrip-
tions fit for both academic and professional use.
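A minimal sketch of this transcription step using the open-source whisper package is shown below; the model size and file name are illustrative choices, not the project's fixed configuration.

```python
import whisper  # pip install openai-whisper

# Load a Whisper checkpoint; larger checkpoints trade speed for accuracy
model = whisper.load_model("medium")

# Transcribe an Urdu (or code-mixed) recording and keep per-segment timestamps
result = model.transcribe("meeting_16k.wav", language="ur", task="transcribe")

for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```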
The Word Error Rate (WER) achieved by several Whisper models on Urdu transcription tasks is shown below as a bar chart. Speech recognition accuracy is measured using WER, a standard metric where a lower WER denotes better transcription quality.
Six Whisper model variants (tiny, base, small, medium, large, and turbo) are included in this comparison. Every model was evaluated consistently and fairly on the same Urdu audio collection.
Using the turbo model adopted in this project, we obtain a WER of around 0.16. This presents a good balance between speed and accuracy and is comparable to the medium model; lower WER indicates better performance.
Figure 3.2: Comparison of Word Error Rate (WER) for different Whisper models on
Urdu
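The comparison in Figure 3.2 can be reproduced with a benchmarking loop of the following shape; the evaluation files, reference transcripts, and use of the jiwer package are illustrative assumptions, and the turbo checkpoint requires a recent release of the whisper package.

```python
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

# Hypothetical evaluation set: (audio file, reference transcript) pairs
eval_set = [("urdu_sample_01.wav", "..."), ("urdu_sample_02.wav", "...")]

for size in ["tiny", "base", "small", "medium", "large", "turbo"]:
    model = whisper.load_model(size)
    references, hypotheses = [], []
    for wav_path, reference in eval_set:
        hypotheses.append(model.transcribe(wav_path, language="ur")["text"])
        references.append(reference)
    print(size, jiwer.wer(references, hypotheses))  # lower WER means better transcription
```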
• Embedding Extraction:
The system captures speaker-specific vocal characteristics using pre-trained speaker
embedding models such as ECAPA-TDNN. These embeddings are derived from fixed-
length segments of audio aligned with the timestamped output from the Whisper ASR
model. Each segment is transformed into a high-dimensional vector that uniquely
represents the speaker’s vocal identity.
• Clustering:
Once the embeddings are extracted, unsupervised clustering methods are employed
to group similar speaker segments. Two primary clustering approaches are used:
K-Means Clustering:
Effective in scenarios where the number of speakers is known or can be estimated
beforehand. It offers a simple and efficient way to partition speaker embeddings.
Based on the clustering results, speaker labels such as Speaker 1, Speaker 2, etc., are
assigned to corresponding transcription segments. These are then aligned with Whisper’s
timestamps to produce a structured, speaker-annotated transcript.
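A simplified sketch of this embedding-and-clustering step is given below, assuming the SpeechBrain ECAPA-TDNN speaker model and scikit-learn's K-Means. Segment boundaries are taken from Whisper's timestamped output, and the default of two speakers is purely illustrative.

```python
import numpy as np
import torchaudio
from sklearn.cluster import KMeans
from speechbrain.pretrained import EncoderClassifier  # pip install speechbrain

# Pre-trained ECAPA-TDNN speaker embedding model
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def segment_embedding(wav_path: str, start: float, end: float) -> np.ndarray:
    """Embed one timestamped segment of a 16 kHz mono recording."""
    signal, sample_rate = torchaudio.load(wav_path)
    chunk = signal[:, int(start * sample_rate):int(end * sample_rate)]
    embedding = encoder.encode_batch(chunk)          # shape: (1, 1, 192)
    return embedding.squeeze().detach().cpu().numpy()

def assign_speakers(wav_path: str, segments: list[dict], num_speakers: int = 2) -> list[dict]:
    """Cluster segment embeddings with K-Means and attach Speaker 1, Speaker 2, ... labels."""
    embeddings = np.stack([segment_embedding(wav_path, s["start"], s["end"]) for s in segments])
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
    return [{**seg, "speaker": f"Speaker {label + 1}"} for seg, label in zip(segments, labels)]
```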
By integrating a high-quality ASR system with a modular diarization pipeline, the
system achieves a lightweight yet powerful alternative to fully integrated solutions like
Pyannote. This hybrid approach proves especially valuable for long Urdu-language record-
ings where real-time speaker tracking and user-friendly editing support are critical. Ad-
ditionally, the modularity ensures that the system can be easily updated or extended with
improved diarization models in the future.
The NLP Diarize system processes meeting audio through an automated pipeline. The process begins when a user submits an audio file. The system then preprocesses the audio and transcribes it into timestamped text segments. Different speakers are then identified and labeled using speaker clustering methods. If processing succeeds, the user is shown a diarized transcript; otherwise, an error message is displayed. This streamlined workflow turns spoken meeting content into a structured, speaker-attributed textual form.
The Transcript Assembly Module receives both outputs and combines speaker information with the transcribed text to create a complete transcript. The Storage System stores this final transcript, which the user may access or download via the frontend interface. This design guarantees a disciplined data flow from input to final output, with speaker separation, hence supporting reliable Urdu transcription.
• Home Page: Acting as the landing page, the home page briefly introduces the program, its goals, and features. It gives simple access to other areas of the website.
• About Page: This part notes the developers' contribution and offers specifics on the evolution of the system, its objectives, and underlying technologies such as the Whisper ASR model and speaker embedding-based diarization.
• FAQs Page: Designed to help users grasp how the system operates, what formats are supported, processing constraints, and basic troubleshooting, the FAQs page answers frequently asked questions.
• Using a POST request, the front-end forwards this file to a FastAPI endpoint.
• The backend handles audio preprocessing, Whisper ASR transcription, and speaker diarization with embeddings and clustering algorithms, among other tasks.
• The final result, a structured transcript including timestamps and speaker labels, is returned in JSON format.
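For illustration, a small Python client that exercises this API is sketched below; the endpoint URL and the exact response keys follow the structure described above but should be treated as assumptions.

```python
import requests

API_URL = "http://localhost:8000/transcribe"  # assumed local FastAPI deployment

with open("meeting_16k.wav", "rb") as audio_file:
    response = requests.post(API_URL,
                             files={"file": ("meeting_16k.wav", audio_file, "audio/wav")})

response.raise_for_status()
data = response.json()
# Illustrative response shape: a list of segments with speaker labels and timestamps
for seg in data["transcript"]:
    print(seg["speaker_id"], seg["start_time"], seg["end_time"], seg["text"])
```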
This modular approach lets the system be readily maintained, upgraded with better models, or deployed on other cloud platforms with GPU support (such as AWS or GCP). FastAPI ensures smooth interaction between front-end and back-end components as well as strong performance.
1. Unit Testing:
Every fundamental component of the system was tested separately to ensure that each part performs as intended prior to integration. This comprised:
• The Whisper model was validated against noise and accent fluctuations by
testing several Urdu and Urdu-English audio samples, hence verifying its tran-
scription accuracy and resilience.
• Correct speaker separation was ensured by testing individually the embedding
extraction and clustering components using synthetic and real-world speech
samples.
• Tests guaranteed the right conversion of supplied audio files into the necessary
mono-channel 16kHz WAV format.
• FastAPI endpoints were tested under several conditions for correct response
formats, file validation, error handling, and response times.
2. Integration Testing:
As individual units passed their tests, the whole pipeline was gradually integrated. Integration testing verified that data flows correctly from the frontend to the backend and through the ML pipeline, and that components interact successfully. More specifically:
• Front-end file uploads were tested to guarantee they reached the backend accu-
rately.
• The backend's capacity to trigger the ASR and diarization pipeline and to produce structured outputs was verified.
• System resilience was examined using edge scenarios including extended audio
durations and empty or invalid file types.
3. System Testing:
The system was examined as a whole to guarantee it satisfies the general project objectives. This stage confirmed end-to-end functionality comprising:
4. Functional Testing:
Functional testing confirmed that every feature performs in line with the software requirements. This comprised:
5. Usability Testing:
Usability testing evaluated the system from the end-user's point of view, assessing user satisfaction and ease of use. Important areas under examination include:
3.2 Project timeline
The following Gantt charts represent the project timeline, depicting how work was divided between the two team members and how deliverables were completed over the span of the project.
Figure 3.6: Project Gantt Chart (b)
• Hardware Components:
– Laptop Specifications
∗ DESKTOP-6AV1I6J
∗ RAM: 16.0 GB (15.7 GB usable)
∗ Processor: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz 1.90 GHz.
∗ GPU: NVIDIA GeForce GTX 1650
– Google Colab Specifications
∗ GPU: Tesla T4
∗ RAM: 13 GB
∗ Disk Space: 108 GB
• Tools: Google Colab, Whisper (ASR model), Pyannote, SpeechBrain, VS Code, FastAPI, Google Drive
• Accuracy: For Urdu and mixed Urdu-English audio including multi-speaker set-
tings, the transcription system should reach at least 85% accuracy.
• Scalability: The system will be assessed for its capacity to manage higher workloads, such as processing several audio files or supporting several concurrent users, without performance deterioration.
CHAPTER 4: RESULT AND DISCUSSION
The main elements and techniques applied in building the Urdu Speaker Diarization and
Transcription System are described in this chapter. It addresses speaker diarization meth-
ods, backend API integration, Whisper ASR model application, and web interface develop-
ment. The chapter also emphasizes the testing techniques, difficulties experienced during
development, and the remedies used to raise system performance and accuracy.
Professionals and companies who run official meetings, interviews, academic debates, or
legal actions in the Urdu language or in a code-switched Urdu-English format are the main
end users of the Urdu Speaker Diarization and Transcription System. These comprise:
• Legal and government bodies: for recording investigation processes, witness testi-
mony, or multilingual hearings.
• Healthcare Providers: Psychologists and psychiatrists can use the system for
recording therapy sessions in Urdu.
users of different technological backgrounds. Every element, from audio uploading to final transcript display, was evaluated for correctness, stability, and usability. The objectives were to guarantee consistent performance, high-quality speaker-labeled transcripts, and a user experience free of errors or interruptions. Testing for National Language Processing (Urdu) was conducted manually. Numerous test cases were run; a few of them are presented here:
Test ID Test-1
Test name Web Interface Loads Successfully
Date of test 20/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The Home Page should load when the user opens
the application URL
Input Navigate to the application URL in a browser
Expected output Home Page loads with navigation to Transcribe,
About, and FAQs pages
Actual output Home Page loaded successfully with all navigation
links working
Test Role (Actor) Team Member
Test verified by Supervisor
Test ID Test-2
Test name Audio File Upload and Transcription Processing
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The system should accept and transcribe an Urdu
or mixed Urdu-English audio file
Input Upload a sample Urdu audio file (WAV format)
from the Transcribe page
Expected output Transcription with speaker labels and timestamps
displayed on screen
Actual output Transcription generated successfully with correct
speaker segmentation
Test Role (Actor) Developer
Test verified by Supervisor
Table 4.3: Test case to check handling of unsupported file formats.
Test ID Test-3
Test name Handling of Invalid Audio File Format
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The system should reject unsupported file types
(e.g., .mp3, .aac)
Input Upload an unsupported file format from the Tran-
scribe page
Expected output Error message shown: "Unsupported file format.
Please upload a WAV file."
Actual output Proper error message displayed, upload rejected
Test Role (Actor) Developer
Test verified by Supervisor
Test ID Test-4
Test name Accurate Speaker Diarization
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The system should correctly identify and segment
speakers in multi-speaker audio
Input Upload a 2-minute audio file with two alternating
speakers
Expected output Transcript segments labeled as Speaker 1 and
Speaker 2 with correct timestamps
Actual output Diarization accurate; each speaker segment labeled
correctly
Test Role (Actor) Tester
Test verified by Supervisor
Table 4.5: Test case to evaluate system performance on long audio input.
Test ID Test-5
Test name Processing Long Audio (1 Hour)
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The backend should process long audio files effi-
ciently and without crashing
Input Upload a 1-hour recorded meeting in Urdu
Expected output Full transcript with speakers and timestamps re-
turned in under 10 minutes
Actual output Transcript returned in 9 minutes; no crash or time-
out observed
Test Role (Actor) Developer
Test verified by Supervisor
Test ID Test-6
Test name Transcript Download Functionality
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description This test ensures that users can download the final
transcription file after processing is complete. The
download should work for .txt format.
Input Upload a valid Urdu audio file, wait for transcrip-
tion, and click the download button for the tran-
script.
Expected output Transcript is successfully downloaded in the se-
lected format with proper file name and content.
Actual output Transcript downloaded as "transcript.txt"; con-
tent matched the displayed result.
Test Role (Actor) End User
Test verified by Supervisor
Table 4.7: Test case to verify front-end responsiveness on different devices.
Test ID Test-7
Test name Responsive UI on Mobile and Desktop
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The interface should render properly across screen
sizes and remain usable
Input Open application on desktop browser and Android
mobile browser
Expected output Layout adjusts correctly; all elements visible and
functional
Actual output Responsive design works as intended on both de-
vices
Test Role (Actor) Frontend Developer
Test verified by Supervisor
Table 4.8: Test case to verify frontend behavior during API failure.
Test ID Test-8
Test name Backend API Failure Handling
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The frontend should handle API downtime or net-
work errors gracefully
Input Disable internet and try to upload an audio file for
transcription
Expected output "Connection error. Please try again later." mes-
sage displayed
Actual output Connection error message appeared correctly
Test Role (Actor) Frontend Developer
Test verified by Supervisor
Table 4.9: Test case to validate JSON response from API.
Test ID Test-9
Test name API Response Format Validation
Date of test 21/05/2025
Name of application Urdu Speaker Diarization and Transcription Sys-
tem
Description The backend should return a JSON response with
keys: "transcript", "speakers", "timestamps"
Input Upload a test audio file and inspect response
Expected output JSON structure with correct keys and valid data
Actual output JSON received with keys: transcript, speaker_id, start_time, end_time
Test Role (Actor) Backend Developer
Test verified by Supervisor
As shown above, all of these test cases passed, verifying that the system was working and that the user's requirements were met properly.
The web application is designed to host a user interface that is professional, simple, and efficient for users needing audio transcription and speaker diarization services. It is developed with a modern frontend stack (HTML, CSS/Tailwind CSS, JavaScript) and a Python (FastAPI) backend. The demo screenshots of the NLP Diarize: Urdu Meeting Transcription and Diarization System web application below show its important modules and user interactions:
Figure 4.1: Uppermost Section of NLP Home Page.
Figure 4.2: Middle Section of NLP Home Page.
Figure 4.3: Lower Section of NLP Home Page.
Figure 4.4: Upper Section of NLP About Page.
4.3.3 FAQ Page
The "FAQs" (Frequently Asked Questions) page of the NLP Diarize web application (Figure 4.6) is designed to proactively address common user inquiries and offer easily accessible information about the functionalities, capabilities, and constraints of the system. Using an accordion-style arrangement for a neat and orderly presentation, this page lets users readily expand questions to examine their corresponding answers. The material addresses important topics such as supported audio formats, Urdu and mixed-language transcription accuracy, the speaker diarization technique, and any restrictions on audio file size or duration. Moreover, it clarifies how the system manages situations such as an unknown number of speakers and offers general information about data security and privacy. By providing clear and succinct answers to expected inquiries, the FAQs page seeks to increase user knowledge, lower support needs, and improve the overall user experience with the NLP Diarize tool.
4.3.4 Transcribe Page
The "Transcribe Page" is the major hub of the NLP Diarize web application, where users interact directly with the fundamental transcription and speaker diarization capabilities. Figures X.X, Y.Y, and Z.Z show how this page is thoughtfully laid out using a two-column arrangement to guarantee a clear and effective user workflow. The "Controls" section in the left column allows users to choose an audio file for processing, give the estimated number of speakers in the recording, and optionally input a language code. A prominent "Upload and Diarize" button starts the process. Underneath the controls, a dedicated status section shows the user real-time feedback indicating the current operating state, such as "Uploading," "Processing," or "Diarization complete," together with any error warnings. The right column is reserved for the "Transcription Output," where the resulting diarized transcript, complete with speaker labels and timestamps, is shown in a readable monospace font inside a scrollable area. Once the results are correctly produced, this section shows a "Download Transcript (.txt)" button letting users quickly save the output. From input to result retrieval, this methodical design seeks to give a seamless experience.
Figure 4.8: Transcribe Page during audio processing.
4.4 Results discussion
The main aim of the Urdu Speaker Diarization and Transcription System was to develop a web-based solution able to automatically transcribe Urdu and mixed Urdu-English audio while separating distinct speakers. The project sought to facilitate accurate transcription via a simple online interface and to address the scarcity of easily accessible tools for processing Urdu audio in multi-speaker situations. Every main goal of the system was fulfilled successfully. Users of the application can upload long-form Urdu audio recordings, run them through a diarization and transcription pipeline, and view transcripts clearly segmented by speaker changes. For further use, the transcript is available in .txt form. Built with React and styled with Tailwind CSS, the minimal and simple web interface provides access even to people with limited technical knowledge. FastAPI drives the backend, which efficiently handles audio processing and interfaces cleanly with the diarization and transcription components. By combining a clean user interface with speaker diarization and transcription capability, the application satisfies its stated use as a practical tool for storing, analyzing, and reviewing Urdu-language meetings, interviews, and discussions. Functional testing confirmed that all main project goals, including multi-speaker diarization, transcription, downloadable output, and responsive UI, were successfully implemented.
The project was carried out in line with a set of well-defined benchmarks to guarantee successful and timely progress. Every phase contributed to a fully working Urdu Speaker Diarization and Transcription System capable of processing multilingual audio input and producing accurate, speaker-labeled transcriptions. The main milestones are enumerated here:
• We carefully reviewed the body of current studies on speaker diarization and automatic speech recognition (ASR), particularly for low-resource languages like Urdu.
• Investigated tools and technologies including clustering methods, Whisper, Pyannote, and FastAPI to identify the best-fit components for diarization and transcription.
• Researching current transcription systems such as Otter.ai and HappyScribe helped us grasp industry standards and user expectations.
• Identified important Urdu transcription and speaker diarization use cases, including government records, interviews, and academic gatherings.
• Analyzed pre-trained ASR and embedding models for compatibility with Urdu and mixed-language datasets.
• Selected Whisper for transcription; diarized using a speaker embedding model
along with a clustering technique.
• Separately refined and verified the diarization and transcription components to
guarantee language coverage and correctness.
• Designed a FastAPI backend to manage model inference, response delivery,
and audio file processing.
• Implemented and tested support for lengthy (1–2 hour) audio files while optimizing for GPU-based deployment.
• Designed a responsive, clean frontend with HTML, Tailwind CSS, and JavaScript to guarantee accessibility.
• Designed major pages including About, Login, Signup, Home, and Transcribe.
• Integrated the frontend with the backend API, uploading audio and obtaining diarized transcripts via asynchronous requests.
• Implemented transcript download in .txt format.
• Guaranteed compatibility with desktop and mobile browsers to enhance acces-
sibility.
4.6 Budget requirements
We assessed the demand for speech-to-text services by means of market research on current transcription tools. Our study exposed a notable gap in systems supporting the Urdu language with speaker diarization, especially for long-form audio. The application covers fields including education, journalism, legal services, and government, wherever precise Urdu transcription with speaker separation is required. Competitive edge comes from key elements including diarization, a simple web interface, and Urdu-English code-mixed support. Subscriptions, pay-per-use API access, and institutional licenses can all help to generate revenue. We intend to leverage academic networks, tech incubators, and government channels for dissemination and reach. Because of low competition and growing demand in local and institutional sectors, the market presents great potential for a specialized Urdu transcription solution.
CHAPTER 5: CONCLUSION
The Urdu Speaker Diarization and Transcription System was designed primarily to transcribe Urdu and code-mixed audio recordings and to separate the various speakers in them. It meets the demand for an easily accessible transcription tool for Urdu, a language often disadvantaged by mainstream voice processing technology. Combining speaker diarization with automatic speech recognition lets users upload long recordings, process them quickly, and obtain accurate transcripts. The basic and functional online interface improves usability even further, so the system fits academic, governmental, and organisational use cases where documenting spoken Urdu is crucial. The findings verify that the main goals were satisfied: diarization and transcription were effectively applied and combined, transcript download capabilities operated as intended, and user interaction was kept straightforward and understandable.
• Integration with Online Platforms: Adoption will rise from building APIs or plugins for video conferencing systems like Zoom or Google Meet.
REFERENCES
[1] M. Ali, S. Hussain, and S. Rehman. Integrating phoneme-based recognition for im-
proved asr in urdu. Journal of Speech Processing, 15:100–110, 2021.
[2] Anonymous. Wer we stand: Benchmarking urdu asr models. arXiv preprint, 2023.
[3] Alexei Baevski, Hedi Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0:
A framework for self-supervised learning of speech representations. arXiv preprint
arXiv:2006.11477, 2020.
[4] A. Balgovind, D. Raj, and P. Kothari. Turning whisper into real-time transcription
system. In Proceedings of Interspeech 2023, 2023.
[5] CLE Research Group. Cle urdu corpus: Speech resources for urdu asr. Center for
Language Engineering, 2020.
[6] R. Gupta, A. Kannan, and V. Kumar. Improving multilingual asr in the wild using
simple n-best re-ranking. arXiv preprint arXiv:2304.06234, 2023.
[9] J. Kim, S. Jung, and S. Kim. Squeezeformer: An efficient transformer for automatic
speech recognition. arXiv preprint arXiv:2210.14756, 2022.
[10] Dominik Macháček, Raj Dabre, and Ondřej Bojar. Turning whisper into real-time transcription system. https://www.researchgate.net/publication/372684083_Turning_Whisper_into_Real-Time_Transcription_System, 2024. Accessed: 2024-12-29.
[11] OpenAI. Whisper: Robust speech recognition via large-scale weak supervision. https://github.com/openai/whisper/blob/main/model-card.md, 2022. Accessed: 2025-05-22.
[12] OpenAI. Whisper: Robust speech recognition via large-scale weak supervision. https://github.com/openai/whisper, 2022.
[13] Speechly. Analyzing OpenAI's Whisper ASR accuracy. https://www.speechly.com, 2023.
[14] ICASSP 2022 Team. Icassp multi-speaker meeting transcription challenge. ICASSP,
2022.
ABBREVIATIONS