
AI Model for Recognition and

Synchronization of Quranic Text with Audio

Author

Muhammad Zain Ul Abideen 19-EE-96

Muhammad Armaghan Ur Rehman 19-EE-114

Nashit 19-EE-187

Supervisor

Dr. Qamas Gul Khan Safi


Assistant Professor

DEPARTMENT OF ELECTRICAL ENGINEERING


FACULTY OF ELECTRONICS & ELECTRICAL ENGINEERING
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
TAXILA
July 2023

AI Model for Recognition and
Synchronization of Quranic Text with Audio

Author

Muhammad Zain Ul Abideen 19-EE-96

Muhammad Armaghan Ur Rehman 19-EE-114

Nashit 19-EE-187

A thesis submitted in partial fulfillment of the requirements for the degree of

B.Sc. Electrical Engineering

Supervisor:

Dr. Qamas Gul Khan Safi


Assistant Professor

External Examiner Signature:___________________________________________

Thesis Supervisor Signature: ___________________________________________

DEPARTMENT OF ELECTRICAL ENGINEERING


FACULTY OF ELECTRONICS & ELECTRICAL ENGINEERING
UNIVERSITY OF ENGINEERING AND TECHNOLOGY, TAXILA

July 2023

Undertaking

We certify that the research work titled “AI Model for Recognition and Synchronization of Quranic Text with Audio” is our own work. The work has not been presented elsewhere for assessment. Where material has been used from other sources, it has been properly acknowledged/referenced.

_______________________

Muhammad Zain Ul Abideen

19-EE-96

_______________________

Muhammad Armaghan Ur Rehman

19-EE-114

______________________

Nashit

19-EE-187

DEDICATION

We, with immense gratitude and heartfelt appreciation, dedicate this thesis to the

individuals who have played a pivotal role in making our academic journey a reality. Their

unwavering support, encouragement, and inspiration have collectively shaped us into the

individuals we are today.

First and foremost, we express our deepest thanks to our supervisor Dr. Qamas Gul Khan

Safi, for his exceptional guidance and mentorship throughout this research endeavor. His

expertise, patience, and invaluable feedback have greatly influenced the outcome of this

thesis.

We are also indebted to our families, whose love, understanding, and constant

encouragement have been the bedrock of support in every step of our education. Their

unwavering belief in our abilities has fueled our determination to pursue excellence.

This thesis stands as a testament to the collective efforts of these remarkable individuals,

and we are humbled to have had the opportunity to learn from and be supported by each

one of them.

To our friends and classmates, we extend our sincere appreciation for sharing this journey

with us. The camaraderie, stimulating discussions, and late-night study sessions have been

instrumental in keeping our passion for learning alive.

Lastly, we wish to acknowledge the countless researchers, scholars, and authors whose

works have laid the foundation for our study. The wealth of knowledge they have

contributed to the field has been paramount in shaping our research.

Abstract

AI Model for Recognition and Synchronization of Quranic Text with Audio

Muhammad Zain Ul Abideen 19-EE-96

Muhammad Armaghan Ur Rehman 19-EE-114

Nashit 19-EE-187

Thesis Supervisor: Dr. Qamas Gul Khan Safi

Assistant Professor, Electrical Engineering Dept.

In an era where technology revolutionizes traditional methods, the creation of Qur'anic

content often involves tedious manual editing, posing challenges for individuals and

organizations seeking to produce high-quality content efficiently. This Senior Design

Project responds to this challenge by introducing an innovative AI model that not only

streamlines the content creation process but also maintains the sanctity and accuracy of the

Qur'anic recitation.

By harnessing the power of Automatic Speech Recognition (ASR) technology, the model

exhibits a remarkable capability to analyze intricate audio files of Qur'anic recitations,

predict the recited surahs and verses, and seamlessly retrieve the corresponding text. The

integration of precise timestamps derived from the ASR stream further enhances the

authenticity and synchronization of the generated content. A unique feature of the model

lies in its ability to overlay the retrieved text onto a user-selected background, ensuring a

visually captivating representation of the recitation.

The culmination of these advanced functionalities is the production of

compelling video outputs, reflecting a harmonious blend of technology and spirituality. By

drastically reducing the time and effort traditionally associated with manual content

creation, this pioneering approach signifies a profound advancement in the realm of

Qur'anic content generation. It not only relieves users of the burdensome tasks that have

long characterized this domain but also paves the way for a new era of accelerated content

creation methods.

As the boundaries of technology continue to expand, the implications of this AI-driven

innovation extend beyond immediate applications. The project sets a precedent for the

integration of AI into religious content creation, sparking conversations about how

technology can play a constructive role in preserving and sharing sacred texts. The model's

success opens the door to further possibilities in content creation, encouraging researchers

and practitioners to explore novel ways of enhancing spiritual experiences through the

marriage of cutting-edge technology and age-old traditions.

ACKNOWLEDGEMENTS

All the praises, thanks and acknowledgements are for the creator, Allah Almighty, whose

blessings are unlimited, the most beneficent, the most merciful, who gave us strength and

enabled us to complete this task. Countless salutations upon the Holy Prophet (PBUH), his

family and his companions, source of knowledge for enlightening with the essence of faith

in Allah and guiding mankind to the true path of life. We would like to thank our project

supervisor, Dr. Qamas Gul Khan Safi, for the unwavering support and guidance throughout

the duration of this project and having empowered us to undertake and persist on this

journey. His constant encouragement has been a driving force, keeping us motivated every

step of the way. Likewise, Zikra Dar-ul-Nashr truly deserves immense praise for

facilitating our journey towards this innovative project. We would also like to express

special gratitude to our parents for providing us with all the conveniences and supporting

us morally and financially for this project. This project will serve as motivation for our future endeavors.

TABLE OF CONTENTS

Abstract ............................................................................................................................... v

ACKNOWLEDGEMENTS .............................................................................................. vii

LIST OF FIGURES .......................................................................................................... xii

LIST OF TABLES ........................................................................................................... xiv

DEFINITIONS .................................................................................................................. xv

ABBREVIATIONS ......................................................................................................... xvii

1. Chapter 1: Introduction .............................................................................................. 1

1.1. Problem Statement .............................................................................................. 1

1.2. Aims and Objectives ........................................................................................... 2

1.3. Related SDGs ...................................................................................................... 2

2. Chapter 2: Literature Review ..................................................................................... 3

2.1. Qur’anic Content Creation - Traditional Approach ............................................ 3

2.2. Our Former Methodology ................................................................................... 7

2.2.1. First Website ................................................................................................... 8

2.2.2. Second Website ............................................................................................. 13

2.3. Art of Qur’anic Recitation ................................................................................ 16

2.3.1. The Tajweed Art ............................................................................................ 17

2.3.2. Impact of Tajweed Art on the Acoustic Method ........................................... 22

2.4. Automatic Speech Recognition (ASR) ............................................................. 23

2.4.1. ASR Algorithms ............................................................................................ 24

2.4.1.1. Traditional ASR Algorithms ................................................... 24

2.4.1.2. Deep learning ASR algorithms ............................................................. 25

2.4.2. ASR Service Providers ................................................................................. 26

2.4.2.1. Google Cloud Speech-to-Text............................................................... 26

2.4.2.2. Microsoft Azure Speech Services ......................................................... 27

2.4.2.3. Amazon Transcribe ............................................................................... 27

2.4.2.4. IBM Watson Speech to Text.................................................................. 27

2.4.2.5. Speechmatics......................................................................................... 27

2.4.3. Usage of Cloud Speech to Text: .................................................................... 28

2.4.3.1. Speech-to-Text request construction ..................................................... 28

2.4.3.2. Encoding ............................................................................................... 31

2.4.3.3. Speech Accuracy ................................................................................... 33

2.4.3.4. Improve transcription results with model adaptation............................ 35

2.5. Databases .......................................................................................................... 36

2.5.1. Qur’anic Databases ....................................................................................... 38

2.5.1.1. Tanzil Project ........................................................................................ 38

2.5.1.2. The Holy Quran – Zeeshan Usmani...................................................... 39

2.6. Searching Algorithms........................................................................................ 40

2.6.1. Linear Searching ........................................................................................... 41

2.6.2. Binary Search ................................................................................................ 41

2.6.3. Hashing ......................................................................................................... 42

2.6.4. Interpolation Search ...................................................................................... 42

2.6.5. Boyer-Moore Algorithm ............................................................................... 42

2.6.6. Knuth-Morris-Pratt Algorithm ...................................................................... 43

2.6.7. Levenshtein Distance (Edit Distance) Algorithm ......................................... 43

3. Chapter 3: Design .................................................................................................... 44

3.1. Introduction ....................................................................................................... 44

3.2. Reading Audio and Applying ASR ................................................................... 45

3.2.1. Background ................................................................................................... 45

3.2.2. Process .......................................................................................................... 47

3.2.3. Explanation with Code .................................................................................. 48

3.3. Ayah Prediction ................................................................................................. 50

3.3.1. Background ................................................................................................... 50

3.3.2. Process .......................................................................................................... 53

3.3.3. Explanation with Code.................................................................................. 54

3.4. Database and Matching ..................................................................................... 58

3.4.1. Fetching Data from Database........................................................................ 58

3.4.1.1. Background ........................................................................................... 58

3.4.1.2. Explanation with Code .......................................................................... 61

3.4.2. Matching ....................................................................................................... 69

3.4.2.1. Background ........................................................................................... 69

3.4.2.2. Process .................................................................................................. 70

3.4.2.3. Explanation with Code .......................................................................... 71

3.5. Video Creation .................................................................................................. 80

3.5.1. Background ................................................................................................... 80

3.5.2. Process .......................................................................................................... 81

3.5.3. Explanation with Code .................................................................................. 84

3.5.3.1. Explanation ........................................................................................... 84

4. Chapter 4: Results .................................................................................................... 88

4.1. Generation of SRT ............................................................................................ 88

4.1.1. SRT - Words by Words .................................................................................. 88

4.1.2. SRT - Full Ayah ............................................................................................. 89

4.1.3. SRT – Full Ayah with Translation in Multiple Languages ............................ 90

4.2. Results of matching........................................................................................... 91

4.3. Generated Video ................................................................................................ 92

4.3.1. Word by Word ............................................................................................... 93

4.3.2. Video with Full Ayah .................................................................................... 94

4.3.3. Full Ayah with Translation in Multiple Languages ....................................... 95

5. Chapter 5: Conclusion.............................................................................................. 98

5.1. Source Files ....................................................................................................... 99

6. Chapter 6: Future Work/Aspects ............................................................................ 100

7. References .............................................................................................................. 102

LIST OF FIGURES

Figure 1: Qari Reciting Surah ............................................................................................. 3

Figure 2: Audio imported in an editing software ................................................................ 4

Figure 3: Background video file imported .......................................................................... 5

Figure 4: Adding Text layers ............................................................................................... 6

Figure 5: Adding translation layers ..................................................................... 7

Figure 6: First Site Interface ............................................................................................... 8

Figure 7: Adding text to site input box ............................................................................... 9

Figure 8: Text splitting ........................................................................................ 9

Figure 9: Ready to record Recitation ................................................................................ 10

Figure 10: Words appear individually, progressing sequentially upon pressing of the ‘next’

arrow key .......................................................................................................................... 12

Figure 11: Second Website Dialogue Box ........................................................ 13

Figure 12: Surah Verses with their Translations appearing one after another on pressing the

next arrow key................................................................................................................... 15

Figure 13. Deep learning speech recognition pipeline...................................................... 26

Figure 14: Big-O Complexity ........................................................................................... 41

Figure 15: Reading Audio and Applying ASR .................................................................. 47

Figure 16: ASR Transcription - asrt.txt ............................................................................. 50

Figure 17: Ayah Prediction - Searching Algorithm ........................................................... 53

Figure 18: Matching Algorithm ........................................................................................ 70

Figure 19: Video Creation ................................................................................................. 81

Figure 20: Time Comparison between Traditional Method and Proposed Method .......... 92

Figure 21: Video Word by Word ....................................................................... 93

Figure 22: Video with Full Ayah ....................................................................... 94

Figure 23: Code Allows the selection of 2 languages for Translation .............................. 95

Figure 24: Full Ayahs with Translation in English and Urdu ........................................... 96

Figure 25: Video with translation in Multiple Languages ................................ 97

LIST OF TABLES

Table 1: Supported Audio Codecs..................................................................................... 32

Table 2: Searching Algorithm Complexities ................................................... 40

Table 3: Problem Letters (‫ ة‬and ‫ أ‬in this case) cause mismatching (highlighted in red) ... 51

Table 4: Problem Characters (‫ صےل‬in this case) cause mismatching (highlighted in red) .. 51

Table 5: Excluded Characters - 'exclusion chars' .............................................................. 52

Table 6: Exclusion Chars Occupy a Cell during Text Splitting ........................................ 59

Table 7: Mis-Mapping due to exclusion chars .................................................................. 59

Table 8: Exclusion chars combined with Previous word for Correct Mapping ............... 60

Table 9: DATABASE.xlsx (Qur'an Database) .................................................................. 61

Table 10: Exclusion Chars appended to Previous Word ................................................... 65

Table 11: search_match.xlsx ............................................................................................. 68

Table 12: Results of Matching .......................................................................................... 76

Table 13: result.srt ............................................................................................................. 78

Table 14: Results of Matching ........................................................................ 91

DEFINITIONS

Within the thesis context, the following terms are introduced in the main text with distinctive marking on their first occurrence, guiding the reader to this section for a precise understanding of their definitions:

Artificial Intelligence (AI): encompasses the development of computer systems capable

of emulating human-like intelligence, enabling them to learn from experience, reason, and

solve complex tasks.

Natural Language Processing (NLP): is an AI discipline that involves machines

understanding and processing human language for various applications.

Automatic Speech Recognition (ASR): is an advanced technology that utilizes

sophisticated algorithms and machine learning to seamlessly transform spoken language

into accurate and coherent written text.

Speech to text (STT): is a technology that enables the automatic conversion of spoken

language into written text, facilitating easy transcription of audio or speech recordings.

Hidden Markov Model (HMM): is a sequential probabilistic speech model utilized for

speech recognition and synthesis, effectively predicting acoustic and linguistic features.

Qur'anic Content Creation: involves producing multimedia, including beautifully recited

videos of the Holy Qur'an, to disseminate its teachings and spiritual essence to a broader

audience.

Arabic Phonology: analyzes Arabic speech sounds (phonemes), exploring their distinct

features, functions, and organizational rules within the language's sound system.

Arabic Phonetics: studies the physical aspects of speech sounds—production,

articulation, and acoustics—and how they are generated and perceived by the human vocal

tract and ear.

Intonation: refers to the rise and fall of pitch in spoken language, conveying emotions,

meaning, and grammatical elements in communication.

Coarticulation: is a phenomenon observed during rapid speech, where speech sounds

blend and overlap to facilitate smooth and efficient communication in natural speech.

Tawwuz (‫)تَعوذ‬: is the act of seeking refuge and protection in Allah from the influence and

whispers of the devil.

Tasmiyah (‫)تَسْمِ يَة‬: involves invoking the name of Allah before beginning an action or

consuming something.

Zikra Dar-Un-Nashr: Zikra (ٰ‫ ) ِذ ْك َرى‬is a name of the Quran, and it is the advice that inspires

people to acquire real knowledge. The main difference between humans and all other

creatures is that humans have a wide field of real and acquired knowledge apart from

physical sciences, and this knowledge is what nurtures humanity in them. Zikra, then, is the name of an effort to develop humanity among humans. This true knowledge is available to

humans in the form of the Holy Qur'an on behalf of Allah. Since it is in Arabic, we need

the translation of the Qur'an to understand it. In this regard, Zikra Dar-ul-Nashr has

strived to make people's access to the understanding of the Holy Qur'an easier.

ABBREVIATIONS

ASR: Automatic Speech Recognition

SDP: Senior Design Project

AI: Artificial Intelligence

STT: Speech-to-Text

NLP: Natural Language Processing

HMM: Hidden Markov Model

API: Application Programming Interface

AWS: Amazon Web Services

WER: Word Error Rate

1. Chapter 1: Introduction

1.1. Problem Statement

Creation of Qur’anic recitation videos is a hectic job. It involves multiple steps, starting with recording the audio, importing it into an editing software (e.g., Adobe Premiere Pro), importing the desired background video(s), and, the truly hectic part, adding Arabic subtitles (Qur’anic text) verse by verse (long verses are split into parts). These subtitles are then synchronized manually with the audio, which is extremely laborious. Adding translations makes it even more arduous.

The proposed senior design project aims to revolutionize Qur'anic content creation by

creating an innovative Model that encompasses an Automatic Speech Recognition (ASR)

model along with a Searching and Matching Algorithm, contributing to an exceptional

Synchronization model. The demand for Qur'anic content creation has never been greater,

prompting the need for an automated approach to ease this tiresome process.

ASR and transcription of continuous speech are difficult due to coarticulation and the inherent variability in pronunciation, especially relating to the Qur’an (where there is no room for mistakes in the transcription). Moreover, ASR recognizes Qur’anic phonetics with only 80 to 90% accuracy. Additionally, retrieving the required data from the

Qur'anic database posed certain challenges (explained in section 3.3). Furthermore, the

issue of synchronizing the Qur'anic transcript with audio arises, specifically concerning the

accurate appearance of subtitles (burnt-in text) in complete sync with the speaker's

recitation. Therefore, the primary focus of this work is to develop a comprehensive model

that effectively addresses and overcomes all these challenges.

1.2. Aims and Objectives

The main objective of this work is to design a novel model, which is capable of:

1. Recognition of Arabic (Qur’anic) audio.

2. Synchronization of Qur’anic Text with audio.

3. Making Qur’anic recitation videos in a fraction of the usual time.

4. Automatic subtitling.

5. Easing and Lowering the Barrier of Entry for Qur’anic Content Creation

6. Requiring negligible human labor.

1.3. Related SDGs

The following SDGs are related to our project:

Goal 4 – Quality Education

Goal 9 – Industry, Innovation and Infrastructure

2. Chapter 2: Literature Review

2.1. Qur’anic Content Creation - Traditional Approach

Qur'anic videos feature a reciter engaging in the recitation of Qur'an, accompanied by a

complementary background video. These videos also include Qur'anic subtitles that appear

synchronously with the recitation, displaying the verses and their translations (optional).

Presented below is an illustrative video of a reciter reciting Qur'an, where the reciter's

rendition is complemented by on-screen verses and their corresponding translations.

Figure 1: Qari Reciting Surah

This is how it is done. First, the audio is imported into an editing software (in this case, Premiere Pro),
Figure 2: Audio imported in an editing software

Then the background video is imported (shown in purple color),
Figure 3: Background video file imported

Now each verse is manually aligned to the audio. First يس is matched to the respective audio segment, then وَٱلْقُرْءَانِ ٱلْحَكِيمِ, then إِنَّكَ لَمِنَ ٱلْمُرْسَلِينَ. These subtitles are visible in pink color in the following screenshot,
Figure 4: Adding Text layers

Translations are added in a similar manner,

Figure 5: Adding translation layers

Following this procedure takes around 5 to 6 hours for surahs of moderate length, such as Surah Mulk, Surah Yaseen, Surah Ibrahim, and other similar ones.

2.2. Our Former Methodology

As previously stated, Qur’anic content creation is onerous, and Zikra Dar-un-Nashr (the organization we work with) operated in an identical manner. After experiencing this ordeal, the development of relatively less demanding procedures kicked off.
Presently, we manage two dedicated (non-public) websites (that serve as platforms for

integrating our model). These websites have been developed as part of our ongoing efforts

to streamline the process of creating Qur'anic content.

2.2.1. First Website

Figure 6: First Site Interface

Qur’anic text is inserted in the first field, and the Juz’ number is selected from the dropdown menu. Entering the text of Juz’ 1 and choosing 1,
Figure 7: Adding text to site input box

Figure 8: Text splitting
The algorithm identifies the spaces between the words and places them in an array so that, when the next arrow key is pressed, the next word appears sequentially.
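
The splitting logic itself is simple. A minimal sketch of the idea in Python is given below; the actual site is a web page, so this is only an illustration, and the sample text is a placeholder:

# Placeholder: the Qur'anic text pasted into the first input field of the site
text = "بسم الله الرحمن الرحيم"
words = text.split()   # spaces delimit the words, giving an array (list) of words
index = 0

def next_word():
    # Mimics the 'next' arrow-key handler: returns the next word to display, if any.
    global index
    if index < len(words):
        word = words[index]
        index += 1
        return word
    return None

Each call to next_word() corresponds to one press of the ‘next’ key.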

Figure 9: Ready to record Recitation

Now, on pressing the next button () (the next button with respect to Arabic script), the text appears word by word.
Pressing the next button again,

Again,

Again,

Figure 10: Words appear individually, progressing sequentially upon pressing of the ‘next’ arrow key
The objective of this site is to record videos or livestream. A reciter recites the Qur’an while a person glued to the laptop diligently presses the next button as the Qari proceeds. Evidently, another person alongside the reciter is indispensable for livestreaming or recording.

2.2.2. Second Website

This website has additional features such as background videos and translation (in one language so far) along with the Qur’anic text. Instead of words appearing individually, a part of a long verse or a complete short verse is displayed, and the next part or verse appears on pressing next (). The aim is similar: eliminating the dependency on the ‘next’ person.

Figure 11: Second Website Dialogue Box

Qur’anic text is entered in the first field and its translation is entered in the second field.

On pressing the next button,

Again,

Figure 12: Surah Verses with their Translations appearing one after another on pressing the next arrow key

2.3. Art of Qur’anic Recitation

In the fast-paced world of modern technology, there has been huge advancement in the field of artificial intelligence. ASR, also known as computer speech recognition or speech-to-text (STT), incorporates knowledge and research from the computer science, linguistics, NLP and computer engineering fields [1]–[3]. ASR has been envisioned as the future dominant

method for human-computer interaction. Both acoustic modeling and language modeling

are important parts of modern statistically based speech recognition algorithms [4], [5].

Hidden Markov models (HMMs) are also widely used in many systems. Language

modeling is also used in many other natural language processing applications such as

document classification or statistical machine translation [1], [6].

The Qur’an is written in the language of classical Arabic. The number of words in the entire Qur’an is 77,794 (including bismillah). Therefore, the Qur’an is considered a source of specific vocabulary with a considerable number of Arabic words [7]–[9]. These Arabic words comprise the 28 (29 with the hamza) letters of the Arabic alphabet. In the Arabic language, diacritics (I’raab ‘إِعْرَاب’) are symbols added to letters. In Arabic, they correspond to short-vowel phonemes, gemination, or the absence of a short vowel (sukoon) [10]. Diacritization is the process of adding those diacritics to Arabic script so that it is pronounced correctly, because the Arabic language is characterized by the presence of words that are identical in their letters but differ in meaning [11], [12].
The meaning is determined by the diacritics (“تَشْكِيل” Tashkeel), which are reflected in the difference in pronunciation [8], [13]. The main diacritic marks include:

• Damma (ـُ)

• Fatha (ـَ)

• Kasra (ـِ)

• Tanween al-damm (ـٌ)

• Tanween al-fath (ـً)

• Tanween al-kasr (ـٍ)

• Madd (~)

Hence, the Arabic text can be written in more than one form, depending on the number of

symbols and diacritic marks that are used. The Qur’an is read according to the rules of recitation and the provisions of intonation so that it is recited in the correct way according to the commands of Islam. Tajweed, which is considered a refinement of the recitation, articulates each letter and gives it its correct attributes [14]–[16].

2.3.1. The Tajweed Art

The Arabic word ‘Tajweed’ (تجويد) means proper recitation with pronunciation at a moderate speed. It refers to a group of rules that governs the recitation of the Holy Qur’an. Therefore, it is regarded as an art, since the reciters perform it in a similar form.

The Tajweed Art refers to the rules that are flexible and well-defined for reciting the Holy

Qur’an [9], [10], [17].

Tajweed is the science concerned with applying the rules of pronunciation to the letters during the recitation of the Qur’an, within specific provisions that include the following:
• Izhaar (‫)إظهار‬: it is an Arabic term used in the field of Tajweed, which refers to the

clear and distinct pronunciation of Arabic letters and sounds. In the context of

Quranic recitation, "Izhar" signifies articulating each letter with clarity and

openness, avoiding any merging or assimilation with neighboring letters. This

precise pronunciation is essential for accurate recitation of the Quranic text,

maintaining the integrity of each letter's individual sound.
• Idghaam (إدغام): A term in Arabic phonetics and Tajweed that denotes the assimilation or

merging of certain sounds or letters within the pronunciation of words. In the

context of Quranic recitation, "Idgham" refers to the specific articulation where a

non-vowel sound (consonant) is smoothly and gradually merged into the following

sound, resulting in a combined and connected pronunciation. This phenomenon is

an important aspect of accurate Quranic recitation, contributing to the rhythmic and

melodious cadence of the verses.
• Iqlab (إقلاب): In the context of Quranic recitation, the term "Iqlab" refers to the phonetic transformation of a particular consonant sound into a different sound when certain letters interact within specific word combinations: the "ن" (noon) sound is transformed into a "م" (meem) sound when it is followed by the letter "ب" (ba). This phonetic change contributes to the accurate and melodious recitation of the Quranic text, maintaining the proper articulation and flow of sounds.

• Ikhfa (‫ ) ِإ ْخ َفاء‬: In the realm of Quranic recitation, Ikhfa (‫ )إخفاء‬is a vital Tajweed

technique where certain letters with specific diacritics are softly and smoothly

assimilated into neighboring letters, contributing to the fluidity and musicality of

the recitation. This practice ensures a balanced and melodious rendition of the

Quranic verses, embodying the essence of accurate and beautiful recitation.
• Qalqalah (قَلْقَلَة): It is a fundamental concept within the realm of Quranic recitation

and Tajweed. It pertains to the slight echoing or bouncing articulation of certain

Arabic letters when they carry a sukun (no vowel). These letters, namely "‫( "ق‬qaf),

"‫( "ط‬ṭa), "‫( "ب‬ba), "‫( "ج‬jim), and "‫( "د‬dal), produce a distinctive sound that adds a

rhythmic quality to the recitation. Qalqalah embodies the art of delivering the

Quranic verses with precise phonetic resonance and melodic resonance, enriching

the auditory experience for both reciter and listener.

• Waqf (وَقْف): In Arabic, "pauses" are known as "وَقْف" (waqf). It refers to the act of

stopping or pausing while reciting the Quran or during formal speech to observe

the appropriate punctuation and rules of recitation. Waqf helps in maintaining the

correct meaning and pronunciation of the words in the Quranic verses.

• Madd (مَدّ): In Arabic, it is a fundamental concept in Arabic phonetics and Tajweed,

referring to the elongation or prolongation of a specific vowel sound within a word.

It involves lengthening the pronunciation of a vowel for a specific duration,

contributing to the rhythmic and melodic cadence of Quranic recitation. Madd is an

essential aspect of proper Tajweed, enhancing the beauty, clarity, and correct pronunciation of the Quranic text.

• Huroof al-Halaqi (حُرُوف الحَلْقِي): It includes the letters that require the emission of

sound from the throat during their pronunciation. The letters in "Huroof al-Halaqi"

are:

o ‫( ح‬ḥā)

o ‫( خ‬khā)

o ‫'( ع‬ayn)

o ‫( غ‬ghayn)

o ‫( همزة‬hamza)

o ‫( هـ‬hā)

These letters have a distinct and guttural pronunciation involving the constriction of

the throat while producing the sound [3], [9], [10], [18], [19].

• Tarteel (تَرْتِيل): It presents a captivating journey into the realm of Quranic recitation,

akin to a serene river of melodies that courses through the heart, arousing the

deepest of emotions and forging a spiritual bond with the divine. This practice is

grounded in the intricate weave of Quranic tradition, extending beyond mere

recitation to emerge as an art form that invites one to plunge into the ocean of verses

with profound awe and reverence.

As the voyage into Tarteel unfolds, each syllable transforms into a note, and every

word morphs into a melodious strand. The heart, much like a skilled conductor,

orchestrates the tempo of each verse in harmony with the very rhythm of creation.

These Quranic words become steadfast companions, igniting within the heart a

symphony of emotions—reverence, jubilation, and an overwhelming sense of

connection.

Envision the heart as a vessel brimming with the brilliance of the Quran. During

recitation, even the pores of one's being seemed to resonate in perfect unity, attuned

to the resonance of the divine message. The verses envelop and embrace, cradling

the soul within their profound wisdom. The reciter metamorphoses into a vessel of

light, radiating the beauty of the Quran's teachings to those in their midst [20].

This journey of Tarteel transcends the bounds of mere recitation; it is a symphony

that resonates within the heart. The Quranic verses wrap themselves around one's

very essence, immersing the reciter in a profound dialogue with the divine. With

each utterance, the heart dances in elation, casting off the burdens of the world and

nestling within a cocoon of tranquility.

Tarteel acts as a bridge, harmonizing the heart with the Quran. It charts a melodious

course toward a deeper comprehension and an intensified connection with the

Creator. Within each verse, a treasury of wisdom lies waiting to be uncovered. The

heart, aflutter with emotions, discovers solace, inspiration, and unbounded joy

within the embrace of the Quran's timeless and resounding message [10], [21], [22].

Neglecting any of the intonation rules is considered a mistake that does not change the meaning. Changing or deleting a word or letter, or changing the diacritic of a vowel, is a type of mistake that probably changes the meaning. This error is considered more serious than a mistake of intonation and must be corrected [23]. The recitations of the Qur’an are organized, and the differences between them relate to changes in some of the diacritics of a vowel or changes of letters. The Holy Qur’an’s 10 readings (قراءات) also differ in Tajweed rules, which constitute the major reason for variation [24]–[26].

Notably, research on recognizing the reciters of the Holy Qur’an lacks common datasets. Many speech recognition researchers have carried out their studies on English-language datasets [14], [27], while only a few studies have been made on the Arabic language. Arabic speech recognition systems are limited compared with speech recognition systems for other languages. Recognition of a Qur’an reciter is a complex task, as recitation is done in the way of Tajweed. Every Qur’anic

reciter has a differentiating signal. This reciter signal encapsulates a dynamic range, which

is temporary and altered over time based on the pronunciation basis and the recitation way

called Tarteel (‫)ترتيل‬. Classifying and recognizing the reciters of the holy Qur’an are

regarded as parts of the field of speech and voice recognition systems [8].

2.3.2. Impact of Tajweed Art on the Acoustic Method

The voice of every person is unique. As a result, recordings of the Qur’an recited by different reciters differ significantly from one another. Even if a sentence is taken from the same place in the Holy Qur’an, the manner of delivering or reciting it is not the same [13], [16]. Different reciters produce different sounds or segments. Many challenges are encountered in handling the specialties of Holy Qur’an recitation due to the variations between the recitation and the written language [27]. Many similar letter combinations are pronounced in different ways because of the utilization of diacritics (ـَ ، ـُ ، ـِ ، ـً ، ـٌ ، ـٍ).

Mastering correct Arabic pronunciation, especially in the context of the Quran, is crucial

for effective communication and deepening one's understanding of the sacred text. Proper

pronunciation of Arabic in the Quran ensures clarity and reverence enabling a profound

connection with its divine message. It allows for a more meaningful recitation and

engagement with the Quranic verses, fostering a spiritual and linguistic connection with

the Quran [28].

2.4. Automatic Speech Recognition (ASR)

ASR (Automatic Speech Recognition) is a type of software that enables computers to

recognize and transcribe spoken language into text. These algorithms are complex and

involve several components, including an acoustic model, a language model, and a decoder

[29], [30].

The acoustic model analyzes the audio signal and converts it into a series of acoustic feature

vectors [2]. The language model then determines the most likely sequence of words that

corresponds to the acoustic feature vectors, based on statistical models and machine

learning algorithms. Finally, the decoder combines the output of the acoustic and language

models to generate the final transcription of the speech signal [11], [30].

ASR algorithms are used in a wide variety of applications, including speech recognition

for virtual assistants, automated transcription of audio and video content, and real-time

translation of spoken language. [1]

2.4.1. ASR Algorithms

An ASR pipeline consists of the following components (a minimal sketch of the first stage is shown after the list):

• Spectrogram generator that converts raw audio to spectrograms.

• Acoustic model that takes the spectrograms as input and outputs a matrix of

probabilities over characters over time.

• Decoder (optionally coupled with a language model) that generates possible

sentences from the probability matrix.

• Punctuation and capitalization model that formats the generated text for easier

human consumption. [2]
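
To make the first component concrete, the following minimal Python sketch (using only NumPy; it is not the pipeline used in this project) frames a raw waveform, applies a window, and converts it into a magnitude spectrogram:

import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    # Frame the waveform, apply a Hann window, and take the magnitude of the
    # short-time Fourier transform of each frame.
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))   # shape: (time, frequency)

audio = np.random.randn(16000)    # stand-in for one second of 16 kHz audio
print(spectrogram(audio).shape)   # e.g. (98, 201)

The acoustic model, decoder, and punctuation model then operate on representations of this kind.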

2.4.1.1. Traditional ASR Algorithms

Traditional statistical techniques for speech recognition involve using probabilistic models

to represent the relationships between speech sounds, words, and language. These models

are trained on large amounts of speech data and can recognize speech by calculating the

probability that a given sequence of sounds corresponds to a particular word or phrase. One

popular statistical technique used in speech recognition is Hidden Markov Models

(HMMs), which model the probability of a sequence of observations (in this case, speech

sounds) given a hidden state sequence (in this case, the corresponding sequence of words).

[3]
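
As a toy illustration of how an HMM assigns a probability to a sequence of observed speech sounds, the following sketch implements the standard forward algorithm with made-up numbers (real recognizers use far larger models and continuous acoustic features):

import numpy as np

A = np.array([[0.7, 0.3],          # hidden-state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],          # emission probabilities: P(observation | state)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])          # initial state distribution
obs = [0, 1, 0]                    # a toy sequence of quantized acoustic observations

alpha = pi * B[:, obs[0]]          # forward algorithm: initialization
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]  # induction step over time
print("P(observation sequence) =", alpha.sum())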

2.4.1.2. Deep learning ASR algorithms

In recent years, deep learning techniques such as neural networks have been applied to

speech recognition with great success. In these systems, the acoustic properties of speech

signals are first extracted and transformed into a sequence of feature vectors, which are

then processed by one or more layers of neural networks to produce the final transcription.

These deep learning techniques have been shown to significantly improve the accuracy of

speech recognition systems, particularly in noisy environments or when dealing with

speakers with non-standard accents or speech patterns. [3][31]

Some of the most popular state-of-the-art speech recognition acoustic models

are Quartznet, Citrinet, and Conformer [31], [32]. In a typical speech recognition

pipeline, you can choose and switch any acoustic model that you want based on your use

case and performance.

A typical deep learning pipeline for speech recognition includes the following components:

• Data pre-processing

• Neural acoustic model

• Decoder (optionally coupled with an n-gram language model)

• Punctuation and capitalization model

Figure 13 shows an example of a deep learning speech recognition pipeline. [2]

Figure 13. Deep learning speech recognition pipeline

2.4.2. ASR Service Providers

There are many ASR (Automatic Speech Recognition) service providers that offer cloud-

based APIs and SDKs for developers to integrate speech recognition capabilities into their

applications. Here are a few examples:[33]

2.4.2.1. Google Cloud Speech-to-Text

Google Cloud Speech-to-Text is a cloud-based ASR service that supports real-time and

batch transcription of audio files. It offers high accuracy and supports a wide range of

languages and dialects. The service also provides features such as speaker diarization and

punctuation.[34]

2.4.2.2. Microsoft Azure Speech Services

Microsoft Azure Speech Services provides cloud-based ASR capabilities for developers. It

supports real-time and batch transcription of audio files and offers features such as speaker

recognition and custom keyword spotting.[35]

2.4.2.3. Amazon Transcribe

Amazon Transcribe is a speech recognition service offered by Amazon Web Services

(AWS). It can transcribe audio and video files into text and supports a wide range of

languages and accents. It also provides features such as speaker identification and custom

vocabulary.[35]

2.4.2.4. IBM Watson Speech to Text

IBM Watson Speech to Text is a cloud-based ASR service that supports real-time and batch

transcription of audio files. It offers high accuracy and supports a wide range of languages

and dialects. The service also provides features such as speaker diarization and

customization options.[35][36]

2.4.2.5. Speechmatics

Speechmatics is a cloud-based ASR service that supports real-time and batch transcription

of audio and video files. It offers high accuracy and supports a wide range of languages

and dialects. The service also provides features such as speaker identification and

customizable language models.[35][37]

2.4.3. Usage of Cloud Speech-to-Text

2.4.3.1. Speech-to-Text request construction

Speech-to-Text has three main methods to perform speech recognition. These are listed

below:

• Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-

Text API, performs recognition on that data, and returns results after all audio has

been processed. Synchronous recognition requests are limited to audio data of 1

minute or less in duration.

• Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-

Text API and initiates a Long Running Operation. Using this operation, you can

periodically poll for recognition results. Use asynchronous requests for audio data

of any duration up to 480 minutes.

• Streaming Recognition (gRPC only) performs recognition on audio data provided

within a gRPC bi-directional stream. Streaming requests are designed for real-time

recognition purposes, such as capturing live audio from a microphone. Streaming

recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.

A synchronous Speech-to-Text API request consists of a speech recognition configuration and audio data. A sample request is shown below:

{
  "config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "en-US"
  },
  "audio": {
    "uri": "gs://bucket-name/path_to_audio_file"
  }
}

All Speech-to-Text API synchronous recognition requests should include a speech

recognition config field called RecognitionConfig. A RecognitionConfig consists of the

following sub-fields:

• encoding (required): specifies the encoding scheme of the provided audio, such as

FLAC or LINEAR16. Lossless encodings are recommended for optimal

performance.

• sampleRateHertz (required): specifies the sample rate (in Hertz) of the audio,

determining the number of samples per second in the audio file. It must match the

actual sample rate of the audio. For Speech-to-Text, the supported range is between

8000 Hz and 48000 Hz. In the case of FLAC or WAV files, you can alternatively

specify the sample rate in the file header. To achieve optimal speech recognition

accuracy, it is recommended to use a sample rate of 16000 Hz. Resampling audio

that has already been recorded at a different sample rate, especially legacy

telephony audio at 8000 Hz, should be avoided. Instead, provide the audio at its

native sample rate.

• languageCode (required): indicates the language and region/locale for speech

recognition. It follows the BCP-47 identifier format and helps the API understand

the language being spoken. Since we are recognizing Arabic, we used ar-SA for

ASR.

• maxAlternatives (optional, defaults to 1): indicates the number of alternative

transcriptions to provide in the response. A higher value can be set to evaluate

different alternatives, particularly useful for real-time applications.

• profanityFilter (optional): determines whether to filter out profane words or

phrases from the transcriptions.

• speechContexts (optional): provides additional contextual information for

processing the audio, including word or phrase boosts and hints for the speech

recognition task.

Audio can be provided to the Speech-to-Text API through the audio parameter of type RecognitionAudio. The audio field can be specified in two ways:

• content: contains the audio content embedded within the request. This method has

a duration limit of 1 minute.

• uri: contains a URI pointing to the audio content, such as a Google Cloud Storage

URI (gs://bucket-name/path_to_audio_file). The audio file must not be

compressed.

These parameters and options allow for customization and control over the speech

recognition process when using the Speech-to-Text API.

Speech-to-Text can also include time offset values (timestamps) for the beginning and end

of each spoken word that is recognized in the supplied audio. A time offset value represents

the amount of time that has elapsed from the beginning of the audio, in increments of

100ms.

Time offsets are especially useful for analyzing longer audio files, where you may need to

search for a particular word in the recognized text and locate it (seek) in the original audio.

Time offsets are supported for all our recognition methods: recognize, streaming

recognize, and long running recognize.

To include time offsets in the results of your request, we have to set the enableWordTimeOffsets parameter to true in the request configuration. [38]

{
  "config": {
    "languageCode": "en-US",
    "enableWordTimeOffsets": true
  },
  "audio": {
    "uri": "gs://gcs-test-data/gettysburg.flac"
  }
}
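
For reference, the same kind of request can be issued from Python. The sketch below assumes the google-cloud-speech client library and a placeholder Cloud Storage URI; it prints each recognized word together with its time offsets, which is how per-word timestamps of the kind used later in this project can be obtained:

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="ar-SA",            # Arabic, as used in this work
    enable_word_time_offsets=True,    # request per-word timestamps
)
audio = speech.RecognitionAudio(uri="gs://bucket-name/path_to_audio_file")  # placeholder URI

response = client.recognize(config=config, audio=audio)
for result in response.results:
    for word_info in result.alternatives[0].words:
        # start_time and end_time are offsets measured from the beginning of the audio
        print(word_info.word,
              word_info.start_time.total_seconds(),
              word_info.end_time.total_seconds())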

2.4.3.2. Encoding

Please note that an audio format should not be confused with an audio encoding. For

example, the .WAV file format defines the structure of the header in an audio file but does

not specify the encoding itself. The actual audio encoding used in a .WAV file can vary, so

it is essential to inspect the header to determine the encoding.

On the other hand, FLAC serves as both a file format and an encoding, which can

sometimes cause confusion. In the case of FLAC files, the sample rate must be included in

the FLAC header for proper submission to the Speech-to-Text API. It's important to note

that FLAC is the specific codec used in this context, while the term "FLAC file format"

refers to a file with a .FLAC extension.

Remember to verify the header and encoding of your audio files to ensure compatibility

and accurate processing when working with the Speech-to-Text API.

The Speech-to-Text API supports a number of different encodings. The following table lists

supported audio codecs:[34]

Table 1: Supported Audio Codecs

Codec | Name | Lossless | Usage Notes
MP3 | MPEG Audio Layer III | No | MP3 encoding is a Beta feature and only available in v1p1beta1. See the RecognitionConfig reference documentation for details.
FLAC | Free Lossless Audio Codec | Yes | 16-bit or 24-bit required for streams
LINEAR16 | Linear PCM | Yes | 16-bit linear pulse-code modulation (PCM) encoding. The header must contain the sample rate.
MULAW | μ-law | No | 8-bit PCM encoding
AMR | Adaptive Multi-Rate Narrowband | No | Sample rate must be 8000 Hz
AMR_WB | Adaptive Multi-Rate Wideband | No | Sample rate must be 16000 Hz
OGG_OPUS | Opus encoded audio frames in an Ogg container | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz
SPEEX_WITH_HEADER_BYTE | Speex wideband | No | Sample rate must be 16000 Hz
WEBM_OPUS | WebM Opus | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz

2.4.3.3. Speech Accuracy

Speech accuracy can be assessed using various metrics, depending on the specific

requirements. However, the widely accepted standard for comparison is the Word Error

Rate (WER). WER measures the percentage of incorrect word transcriptions within a given

set, serving as an indicator of accuracy. A lower WER value implies a higher level of

accuracy in the system's performance.

In the context of ASR accuracy assessment, the term "ground truth" refers to the 100%

accurate transcription typically provided by humans. This ground truth is used as a

reference to compare and measure the accuracy of the ASR system under evaluation.

• Word Error Rate (WER)

The Word Error Rate (WER) calculation accounts for three types of transcription errors

that may occur. These errors include:

• Insertion Error (I): Words present in the hypothesis transcript that aren't present

in the ground truth.

• Substitution errors (S): Words that are present in both the hypothesis and ground

truth but aren't transcribed correctly.

• Deletion errors (D): Words that are missing from the hypothesis but present in the

ground truth.

WER = (S + D + I) / N                                                                  (2.1)

To find the WER, add the total number of each one of these errors, and divide by the total

number of words (N) in the ground truth transcript. The WER can be greater than 100% in

situations with very low accuracy, for example, when a large amount of new text is inserted.
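
As a concrete illustration, the following Python sketch computes the WER of Eq. (2.1) as the word-level edit distance (substitutions, deletions and insertions) between a hypothesis and the ground truth, divided by the number of ground-truth words:

def wer(ground_truth: str, hypothesis: str) -> float:
    ref, hyp = ground_truth.split(), hypothesis.split()
    # d[i][j] = minimum number of edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the weather is nice", "the whether is very nice"))   # 0.5, i.e. 50% WER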

• Relation of WER to a confidence score

It is crucial to differentiate the WER from the confidence score as they are independent

metrics that do not necessarily correlate. A confidence score is based on likelihood, while

the WER is based on the accuracy of word identification. Grammatical errors, even if

minor, can lead to a high WER if words are not correctly identified. Conversely, a high

confidence score can be attributed to frequently occurring words that are more likely to be

transcribed correctly by the ASR system. Therefore, the confidence score and WER are not

expected to have a direct correlation.[34], [38]

• Normalization

Normalization is an essential step in calculating the WER metric. Both the machine

transcription and the human-provided ground truth transcription are normalized before

comparison. This normalization involves removing punctuation and disregarding

capitalization when comparing the machine transcription with the human-provided ground

truth.[38]

2.4.3.4. Improve transcription results with model adaptation

The model adaptation feature can be leveraged in Speech-to-Text systems to enhance the

recognition of specific words or phrases by biasing the system towards those preferred

options over other alternatives. This feature is especially beneficial in several use cases,

which are outlined below:

• Enhancing the accuracy of frequently occurring words and phrases in the audio

data: By employing model adaptation, users can train the recognition model to

prioritize accurately transcribing commonly spoken words or phrases. For instance,

if the word "weather" frequently appears in the audio data, the model can be adapted

to consistently transcribe it as "weather" instead of "whether."

• Expanding the vocabulary recognized by Speech-to-Text: Although Speech-to-Text

already encompasses an extensive vocabulary, it may not include certain words that

are specific to particular domains or rare in general language usage, such as proper

names. Model adaptation allows the addition of such words to the system's

vocabulary, enabling accurate transcription in specialized contexts.

• Improving transcription accuracy in the presence of noise or unclear audio: In

scenarios where the provided audio is noisy or lacks clarity, model adaptation can

aid in improving the accuracy of speech transcription. By adapting the model to

account for the characteristics of the audio, the system can better handle challenging

acoustic conditions and produce more reliable transcriptions [38].

In summary, model adaptation in Speech-to-Text systems offers valuable capabilities for

enhancing word recognition accuracy, expanding vocabulary coverage, and improving

transcription quality in the presence of audio challenges. These applications can greatly

contribute to the overall performance and usability of the system [33], [38], [39],[29].
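
As an illustration, the Google Cloud Speech-to-Text API used later in this project exposes a simple form of adaptation through phrase hints in the speech_contexts field of RecognitionConfig; the sketch below is our own example (the bucket URI and phrase list are placeholders, and the boost value may require a recent client version), biasing recognition toward the word "weather":

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=44100,
    language_code="en-US",
    # Phrase hints bias the recognizer toward the listed words
    speech_contexts=[speech.SpeechContext(phrases=["weather"], boost=10.0)],
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/example.flac")
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)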

2.5. Databases

A database is an organized collection of data, usually stored in a digital format, that is

designed to be easily accessed, managed, and updated. It is used to store, organize, and

retrieve data for a wide range of applications, from simple data management tasks to

complex data analysis and reporting [40].

Databases are made up of tables, which are organized into rows and columns, with each

row representing a unique record and each column representing a data field. The data stored

in a database can be accessed and manipulated using Structured Query Language (SQL), a

programming language designed for managing data in relational databases [41].
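
As a small illustration (not part of our pipeline; the table and column names are made up for this example), the snippet below creates a relational table and queries it with SQL using Python's built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")   # throw-away in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE ayahs (surah_no INTEGER, ayah_no INTEGER, text TEXT)")
cur.execute("INSERT INTO ayahs VALUES (?, ?, ?)", (18, 1, "..."))
conn.commit()

# Retrieve every row belonging to surah 18
for row in cur.execute("SELECT ayah_no, text FROM ayahs WHERE surah_no = ?", (18,)):
    print(row)

conn.close()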

There are several different types of databases, including:

1. Relational databases: Relational databases are the most common type of database

and are based on the relational model of data, which organizes data into tables with

relationships defined between them. Examples of popular relational databases

include Oracle, MySQL, and Microsoft SQL Server.

2. NoSQL databases: NoSQL databases are non-relational databases that are designed

to handle large amounts of unstructured or semi-structured data. Examples of

popular NoSQL databases include MongoDB and Cassandra.

3. Object-oriented databases: Object-oriented databases are based on the object-

oriented programming model and store data as objects, which encapsulate both data

and behavior. Examples of popular object-oriented databases include db4o and

ObjectStore.

Databases are an essential component of many modern applications and are used in a wide

range of industries, including finance, healthcare, retail, and manufacturing. The effective

management and use of databases can provide organizations with valuable insights and

efficiencies, making them an important asset for businesses of all sizes [4], [40], [41].

2.5.1. Qur’anic Databases

A Qur'anic database is a digital repository that stores the text of the Quran and related

metadata, such as transliterations, translations, commentaries, and historical context. These

databases are used to facilitate the study and analysis of the Quran by researchers, scholars,

and students.

Qur'anic databases typically include various features, such as search functionality,

annotation tools, and the ability to compare and analyze different versions of the text. They

can also incorporate multimedia elements, such as audio recordings and videos of

recitations, to provide a more immersive experience for users.

Qur'anic databases play an important role in advancing the study and understanding of the

Quran, by providing researchers and scholars with easy access to the text and related

materials, as well as advanced analytical tools to aid in their research [5]. The following are some Qur'anic databases [41].

2.5.1.1. Tanzil Project

It is an initiative that provides the Quranic text in various languages, including Arabic,

English, French, and Urdu. The project also includes a web-based Quranic reader that

allows users to view and search the text and includes features such as highlighting,

bookmarking, and recitation playback [42].

2.5.1.2. The Holy Quran – Zeeshan Usmani

This dataset has been put together to help data scientists run their NLP algorithms and kernels on the sacred text and explore it for themselves. The data contains the complete Holy Quran in 21 languages, so data scientists from different parts of the world can work with it. The first file holds the original Arabic text, while the remaining files are translations of it.

We used this database in our project because the data is provided in MS Excel format, which makes it easy to visualize and work with. Excel's database capabilities are also quite powerful: it can serve not only as a simple searchable database but also as a proper relational database, in which a master table links to its child tables. For these reasons, we opted for this database [43].

2.6. Searching Algorithms

Table 2: Searching and Sorting Algorithm Complexities

Algorithm | Best Time Complexity | Average Time Complexity | Worst Time Complexity | Worst Space Complexity
Linear Search | O(1) | O(n) | O(n) | O(1)
Binary Search | O(1) | O(log n) | O(log n) | O(1)
Bubble Sort | O(n) | O(n^2) | O(n^2) | O(1)
Selection Sort | O(n^2) | O(n^2) | O(n^2) | O(1)
Insertion Sort | O(n) | O(n^2) | O(n^2) | O(1)
Merge Sort | O(n log n) | O(n log n) | O(n log n) | O(n)
Quick Sort | O(n log n) | O(n log n) | O(n^2) | O(log n)
Heap Sort | O(n log n) | O(n log n) | O(n log n) | O(n)
Bucket Sort | O(n+k) | O(n+k) | O(n^2) | O(n)
Radix Sort | O(nk) | O(nk) | O(nk) | O(n+k)
Tim Sort | O(n) | O(n log n) | O(n log n) | O(n)
Shell Sort | O(n) | O((n log n)^2) | O((n log n)^2) | O(1)
Figure 14: Big-O Complexity

2.6.1. Linear Searching

A linear search algorithm has a worst-case time complexity of O(n), where 'n' represents

the number of elements in the dataset. It sequentially examines each element until a match

is found [44]–[48].
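
A minimal sketch of linear search in Python (our own illustration) is:

def linear_search(items, target):
    # Scan every element in order until the target is found; O(n) worst case
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1   # target not present

print(linear_search(["alpha", "beta", "gamma"], "beta"))  # -> 1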

2.6.2. Binary Search

Binary search is an efficient algorithm used for sorted arrays, with a worst-case time

complexity of O(log n). It employs a divide-and-conquer strategy, repeatedly dividing the

dataset in half to narrow down the search range until the desired element is found. In our specific application, however, the words must keep their original index order, so the data cannot be kept sorted and binary search is not directly applicable [49].
2.6.3. Hashing

Hashing is a fast-searching algorithm that involves generating a unique hash value for each

word. This hash value serves as an index or identifier, enabling efficient retrieval of the

word. Hashing offers constant-time complexity on average, ensuring rapid word lookup.

In our context, however, preparing the data for such an index poses a challenge, as it distorts the word order [49].
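
For illustration, a Python dictionary provides exactly this kind of hash-based lookup (a sketch with made-up data):

# Build a hash index that maps each word to the positions where it occurs
words = ["alpha", "beta", "gamma", "beta"]
index = {}
for position, word in enumerate(words):
    index.setdefault(word, []).append(position)

print(index.get("beta", []))   # -> [1, 3], average O(1) lookup
print(index.get("delta", []))  # -> [] when the word is absent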

2.6.4. Interpolation Search

Interpolation search is an efficient algorithm designed for searching sorted datasets. By

estimating the position of the desired element using a formula that considers the values of

the first and last elements, along with the desired value, interpolation search effectively

narrows down the search range. This approach allows for quick search range reduction in

specific scenarios[50].
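
A minimal sketch of interpolation search over a sorted list of integers (our own illustration, assuming roughly uniformly distributed values) is:

def interpolation_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:                      # avoid division by zero
            return lo if arr[lo] == target else -1
        # Estimate the position of the target from the value range
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

print(interpolation_search([10, 20, 30, 40, 50], 40))  # -> 3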

2.6.5. Boyer-Moore Algorithm

The Boyer-Moore algorithm is a powerful and efficient string-searching algorithm used to

find occurrences of a given pattern within a text. It utilizes two key preprocessing steps,

the "bad character" and "good suffix" rules, to skip unnecessary comparisons and quickly

narrow down the search space. Due to its ability to handle repetitive patterns effectively,

the Boyer-Moore algorithm is a valuable tool for various applications in computer science

and data processing[50].

2.6.6. Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt Algorithm uses a similar preprocessing step to the Boyer-Moore

Algorithm, but it also uses a table to keep track of which characters in the pattern

match[50].

2.6.7. Levenshtein Distance (Edit Distance) Algorithm

The Levenshtein Distance, also known as the Edit Distance, is a string similarity metric

that measures the minimum number of single-character insertions, deletions, or

substitutions required to transform one string into another[50].
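
A compact Python sketch of the Levenshtein distance (our own illustration) is:

def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3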

3. Chapter 3: Design

3.1. Introduction

The primary objective of this project is to create an innovative model that takes a Quranic

recitation audio as input and generates a video which includes the original audio, a user

selected background video, and synchronized Quranic text with beautiful font, accurately

aligned with the audio.

The outline of our designed procedure is as follows,

• Reading an audio file,

• Generating transcript with timestamps through ASR,

• Prediction of ayahs by ayah prediction algorithm,

• Retrieving corresponding text from database,

• Matching of words by matching algorithm.

• Outputting SRT,

• Generating video by combining the output SRT with the selected background video

and input audio.

This process is explained in detail with corresponding code blocks in the subsequent

headings.

3.2. Reading Audio and Applying ASR

3.2.1. Background

The initial challenge in our project lies in transcribing the audio file. We explored two main

methods to achieve this goal,

1. Designing a specialized ASR model tailored for Qur'anic Recognition.

2. Utilizing an existing general ASR model.

After careful consideration, we chose the latter approach for the following compelling

reasons,

• Developing a custom ASR model requires substantial effort, time, and a trial-and-

error process. It entails addressing various challenges, ranging from the model's

inability to recognize certain words or letters to achieving accurate transcriptions.

Moreover, fixing these issues for some words or letters might introduce new

recognition problems for others. The complexity of this process could divert our

focus from the primary objective of synchronizing Qur'anic text with audio.

• Our research findings indicate that existing custom ASR models primarily focus on

transcribing a limited set of short surahs. Designing a comprehensive model for the

entire Qur'an is a significant undertaking that we have not come across in the

literature. Even developing a small-scale model for specific short surahs requires

considerable effort and a trial-and-error approach, which might steer us away from

our main objective.

• Existing general ASR models for Arabic have been trained on vast and diverse

datasets, giving them an edge in terms of efficiency and accuracy. Moreover, these

models provide additional functionalities, including access to timestamps that

indicate the start and end times for each transcribed word, empowering us with

valuable features for further analysis.

Considering these significant factors, we made the decision to employ an existing simple

Arabic ASR model. The next step was to choose the most suitable model from the array of

available options, such as those offered by Google, Microsoft, Amazon, and others. After

thorough evaluation, we ultimately selected Google's ASR model due to its ease of access,

cost-effectiveness, and its extensive capabilities in catering to our project requirements.

3.2.2. Process

Figure 15: Reading Audio and Applying ASR

1. Reading Audio file containing Recitation,

2. Performing ASR on it,

3. Getting ASR Transcript with Start and End times for each word (timestamps),

3.2.3. Explanation with Code

Importing the necessary libraries:

import speech_recognition as sr
import numpy as np
from google.cloud import speech
import os
import pandas as pd
import csv

Setting up the Google Cloud Speech-to-Text client:

from google.cloud import speech


client = speech.SpeechClient.from_service_account_file('quran374914-afa6baf2bbe3.json')

Defining the audio file or URI for speech recognition:

gcs_uri = "gs://quran_mulk/audio-files/kahaf.flac"

Implementing the transcribe_speech() function for speech recognition:

def transcribe_speech():
    # Configure the recognition audio and settings
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=44100,
        language_code="ar-SA",
        model="default",
        audio_channel_count=2,
        enable_word_time_offsets=True,
    )

    # Perform speech recognition using a long-running operation
    operation = client.long_running_recognize(config=config, audio=audio)
    result = operation.result(timeout=90)

    # Extract the recognized words with their timestamps and save them to asrt.txt
    num = 1
    with open("asrt.txt", "w", encoding="utf-8") as file:
        for res in result.results:
            alternative = res.alternatives[0]
            for word_info in alternative.words:
                speech_recognition = word_info.word
                start_time = second_to_timecode(word_info.start_time.total_seconds())
                end_time = second_to_timecode(word_info.end_time.total_seconds())
                file.write(f"{num}\t{speech_recognition}\t{start_time}\t{end_time}\n")
                num += 1
    print("Speech recognition done.")

1 ‫وما‬ 13.7s 14.4s

2 ‫انزل‬ 14.4s 15.0s


3 ‫من‬ 15.0s 16.7s
4 ‫قبلك‬ 16.7s 17.0s

Figure 16: ASR Transcription - asrt.txt

3.3. Ayah Prediction

3.3.1. Background

The process of predicting which part of the Qur'an is recited in the uploaded audio

presented us with unforeseen challenges. Initially, we attempted a seemingly

straightforward approach by comparing the ASR transcript with the vowel-less column of

the database. We started by searching for T’awwuz (‫ )تعوذ‬and Tasmiyah (‫ )تسميه‬in the code,

checking the first 8 indices (4 for T’awwuz and 4 for Tasmiyah). Based on their presence

or absence, the code made decisions on whether to add either, both, or neither of them.

However, we encountered unexpected discrepancies during the initial testing phase. When

comparing the first six words (indices decided based on the presence of T’awwuz and

Tasmiyah) and the last six words with the entire Qur'anic database, no match was found.

Our thorough inspection of the database, facilitated by its Excel format, revealed that

mismatches occurred due to certain letter differences in the ASR transcript and the database

text. For instance, the ‫ ي‬of the actual Qur'anic text was represented as ‫ ى‬in the ASR

transcript, and ة was represented as ه, among many other such differences.

Table 3: Problem Letters (‫ ة‬and ‫ أ‬in this case) cause mismatching (highlighted in red)

Qur’anic Database
ٰ ASR Transcript
ٰ Text (Vowel-less)

‫والحياة‬ ‫والحياه‬

‫ليبلوكم‬ ‫ليبلوكم‬

‫أيكم‬ ‫ايكم‬

‫أحسن‬ ‫احسن‬

To address this issue, we undertook a comprehensive solution. We replaced all the

problematic letters in the database with their corresponding forms from the ASR transcript.

This adjustment significantly improved the matching accuracy, but we were still faced with

challenges in Surah and Ayah prediction.

After much effort and perseverance, we identified the second problem - mismatching due

to the presence of additional characters, such as ‫ صل‬,‫ ق‬,‫ ج‬, in the Qur'anic text from the

database, which were absent in the ASR stream.

Table 4: Problem Characters (‫ صےل‬in this case) cause mismatching (highlighted in red)

Qur’anic Database
ٰ ASR Transcript
Text (Vowel-less)

َ
‫فسوف‬

‫فسوف‬ ‫يعلمون‬

To overcome this obstacle, we applied further data filtering, designating these characters

as 'exclusion chars' in the code.

Table 5: Excluded Characters - 'exclusion chars'

ًٰ ًٰ ًٰ ًٰ ًٰ ًٰ ًَٰ ًٰ ًِٰ ًٰ ًْٰ ًٰ ۩ ًٰ ًٰ ًٰ

1 2 3 4 5 6 7 8 9

This meticulous process allowed us to fine-tune the Ayah Prediction algorithm and

ultimately achieved a functional and reliable prediction mechanism.

Through these iterative steps and continuous refinement, we overcame the complexities of

Ayah Prediction, paving the way for accurate and reliable identification of the specific

Surah and Ayah being recited in the uploaded audio. The success of this algorithm

constitutes a crucial milestone in achieving our project's objectives, since the upcoming

hurdle of text matching has already been cleared.

3.3.2. Process

Figure 17: Ayah Prediction - Searching Algorithm

1. Reading the ASR Transcript,

2. Replacing Problematic Letters with correspondents in the ASR Transcript,

3. Checking for the presence of T’awwuz (‫ )تعوذ‬and Tasmiyah (‫ )تسميه‬and assigning

True or False values to both,

4. Reading Database,

5. Searching First six words and Last six words (in ASR Transcript) in the entire

database,

6. Getting Prediction Results,

3.3.3. Explanation with Code

Import the necessary libraries: `pandas` (which relies on `openpyxl` to read Excel files).

import pandas as pd

Open the file asrt.txt and replace occurrences of the Arabic letter '‫ 'ي‬with '‫'ى‬.

with open("asrt.txt", "r", encoding="utf-8") as file:
    text = file.read()

updated_text = text.replace('ي', 'ى')

with open("asrt.txt", "w", encoding="utf-8") as file:
    file.write(updated_text)

Initialize auzbillah and bismillah variables to False.

auzbillah = False
bismillah = False

Read the lines of the file asrt.txt and store them in a list.

with open("asrt.txt", "r", encoding="utf-8") as file:


lines = file.readlines()

Check if the first line contains the string "اعوذ". If it does, set auzbillah to True.

if lines[0].split('\t')[1] == "اعوذ":
    auzbillah = True

Check if the sixth line contains the string "بسم". If it does, set bismillah to True.

if lines[5].split('\t')[1] == "بسم":
    bismillah = True

Based on the values of auzbillah and bismillah, construct the first_six_words

list. If both auzbillah and bismillah are True, append words from lines 8 to 14

(inclusive). If either auzbillah or bismillah is True, append words from lines 4 to

10 (inclusive). If neither auzbillah nor bismillah is True, append words from lines

0 to 5 (inclusive).

combined_words = ""
if auzbillah and bismillah:
for i in range(8, 15): # Adjusted the range to include line 14
combined_words += lines[i].split('\t')[1] + " "
elif auzbillah or bismillah:
for i in range(4, 11): # Adjusted the range to include line 10
combined_words += lines[i].split('\t')[1] + " "
else:
for i in range(0, 6):
combined_words += lines[i].split('\t')[1] + " "
first_six_words = combined_words.split()[-6:]

Construct the last_six_words string by concatenating words from the last six lines.

last_six_words = ""
for i in range(len(lines) - 6, len(lines)):
last_six_words += lines[i].split('\t')[1] + " "

Read the Excel file DATABASE.xlsx and store the data in a data frame.

df = pd.read_excel('DATABASE.xlsx')

Remove rows with missing values in the 'ArabicText' column.

df = df.dropna(subset=['ArabicText'])

Search for the string represented by `first_six_words` in the 'ArabicText' column

and store the resulting rows in `result`, and search for the string represented

by last_six_words in the 'ArabicText' column and store the resulting rows in result2.

result = df[df['ArabicText'].str.contains(' '.join(first_six_words),
na=False)]
result2 = df[df['ArabicText'].str.contains(last_six_words, na=False)]

If result is not empty, iterate over its rows and print the corresponding 'SurahNo' and

'AyahNo' values. Similarly, for result2.

if not result.empty:
for index, row in result.iterrows():
print(f"SurahNo: {row['SurahNo']}, AyahNo: {row['AyahNo']}")
else:
print("Not found: Ayah start")

if not result2.empty:
for index, row in result2.iterrows():
print(f"SurahNo: {row['SurahNo']}, AyahNo: {row['AyahNo']}")
else:
print("Not found: Ayah end")

Finally, print the values of auzbillah, bismillah, first_six_words, and

last_six_words.

print("auzbillah value:", auzbillah)


print("bismillah value:", bismillah)
print("First six words:", ' '.join(first_six_words))
print("Last six words:", last_six_words)

This code can be used to process the text file asrt.txt and search for corresponding

values in the DATABASE.xlsx file.

3.4. Database and Matching

3.4.1. Fetching Data from Database

3.4.1.1. Background

The quest for a suitable database presented a significant challenge in our project. We

meticulously explored various options, including an SQLite database that had plain text

with each letter, character, and space as an array element. This approach provided granular

data but lacked the desired data representation, ease of visualization, and flexibility for

alterations and data fetching. Subsequently, we turned to the API of quran.com, hoping it

would resolve our data representation challenges. However, this too presented certain

problems that hindered seamless integration with our project.

Various other databases were also examined, but unfortunately, none of them met our

requirements, leaving us unsatisfied with the available options. In essence, the databases

we explored failed to offer the ideal data representation, making it challenging to visualize

and manipulate the data efficiently.

After extensive research, our quest led us to discover 'The Holy Quran' database by Zeeshan

Usmani on Kaggle. This database proved to be an ideal fit for much of our requirements.

It was presented in MS Excel format, featuring two columns for Qur'anic text—one with

vowels and another without vowels. Splitting the text into individual words is enabled by identifying the spaces (' ') between them.

Furthermore, we effectively filtered the vowel-less text column for specific characters

(referred to as 'exclusion chars' in subsequent headings). These characters were not present

in the ASR transcript and excluding them was essential for enhancing the efficiency of our

Matching Algorithm.

After this text matching between Arabic Text column (vowel-less) and the ASR Transcript,

the words from the vowel-less column are to be mapped on to the column containing

vowels (called ‘Original Arabic’ in the database). The Original Arabic column contains both

vowels and the ‘exclusion chars’. When text is split into ‘Word by Word’ form for matching

and mapping, the exclusion chars also get separated and occupy a cell. See the table below,

Table 6: Exclusion Chars Occupy a Cell during Text Splitting

َْ
‫اْل َمل‬

َ ‫َف َس ْو‬
‫ف‬

This disturbs the mapping of text with vowels on to the text without vowels,

Table 7: Mis-Mapping due to exclusion chars

Exclusion Chars Separate Arabic Text (Vowel-less)

َْ
‫اْل َمل‬ ‫ويلههم‬

َ ‫االمل‬

َ ‫َف َس ْو‬
‫ف‬ ‫فسوف‬

َْ
‫ ويلههم‬will be mapped on to ‫ اْل َمل‬and ‫ االمل‬will be mapped on to ‫صل‬. To address this issue,

whenever the code encounters an exclusion char during text splitting, it appends it to the previous word in the Original Arabic column. This way we get correctly mapped output

text containing both vowels and the exclusion chars (which are necessary in Qur’anic text).

Table 8: Exclusion chars combined with Previous word for Correct Mapping

Exclusion Chars
Exclusion Chars Arabic Text ASR
Combined with
Separate (Vowel-less) Transcript
Previous Word

َْ ْ
‫اْل َمل‬ ‫َويل ِه ِهم‬ ‫ويلههم‬ ‫ويلههم‬

َْ
َ ‫اْل َمل‬ ‫االمل‬ ‫االمل‬

َ ‫َف َس ْو‬ َ ‫َف َس ْو‬


‫ف‬ ‫ف‬ ‫فسوف‬ ‫فسوف‬

The use of an Excel-based representation simplified the application of the matching

algorithm, as it allowed for easy identification of problematic words or characters. This

seamless identification process facilitated the integration of the Matching Algorithm within

our model effortlessly.

3.4.1.2. Explanation with Code

This code is designed to process Quranic verses stored in an Excel file. It allows the user

to confirm the predicted Surah No. and Ayah No. and then filters the data. The code then

performs several operations on the filtered verses, including adding optional strings,

splitting the verses into individual words based on spaces, and incorporating

supplementary data from a separate file. The processed data is organized and stored in a

pandas DataFrame, which is subsequently exported to an Excel file.

import pandas as pd
import csv

The code begins by importing the necessary libraries - pandas for data manipulation and

csv for handling CSV files.

wb = pd.read_excel('DATABASE.xlsx')
ws = wb

The code loads the data from the ‘DATABASE.xlsx’ Excel file into a pandas DataFrame

called wb and assigns it to ws as well. Essentially, both wb and ws will point to the same

DataFrame.

Table 9: DATABASE.xlsx (Qur'an Database)

SurahNo SurahNameArabic AyahNo OrignalArabicText ArabicText

1 ‫الفاتحة‬ 1 َ ‫ْال َح ْمد هّلِل َر ِّب ْال َع َالم‬


‫ي‬ ‫الحمد هلل رب العالمي‬
ِ ِ ِ

1 ‫الفاتحة‬ 2 َّ ‫الر ْح َم َٰ ن‬
‫الر ِح ِيم‬ َّ ‫الرحمن الرحىم‬
ِ

ِّ ْ َ َ
1 ‫الفاتحة‬ 3 ‫ين‬
ِ ‫م ِال ِك يو ِم الد‬ ‫مالك ىوم الدىن‬

َ َ َ َ َ
1 ‫الفاتحة‬ 4 ‫ِإ َّياك ن ْعبد َو ِإ َّياك ن ْست ِعي‬ ‫اىاك نعبد واىاك نستعي‬

surah_no = int(input("Enter SurahNo: "))


ayah_start = int(input("Enter AyahNo start: "))
ayah_end = int(input("Enter AyahNo end: "))

The code prompts the user to enter the Surah number (surah_no), the starting Ayah number

(ayah_start), and the ending Ayah number (ayah_end) through standard input (keyboard).

The input values are converted to integers.

add_auzbillah = True
add_bismillah = True

Two boolean variables add_auzbillah and add_bismillah are initialized to True as default

values. These variables are used to control whether additional strings ‘ٰ‫اّللٰ أَع ْوذ‬
ِ ٰ ِ‫ان مِ نَٰ ب‬ َّ ‫ال‬
ِٰ ‫شيْط‬

َّ and ‘ٰ‫ٱّلل بِسۡ ِم‬


ٰ‫’الر ِجي ِْم‬ ِٰ ‫ٱلر ۡح َم‬
َِّٰ ‫ن‬ َّ ‫يم‬
ِٰ ِ‫’ٱلرح‬
َّ are added at the start of filtered data.

change_auzbillah = input("Enter 'F' to exclude 'Auzbillah' from the start


of filtered data: ")
if change_auzbillah == "F":
add_auzbillah = False
change_bismillah = input("Enter 'F' to exclude 'Bismillah' from the start
of filtered data: ")
if change_bismillah == "F":
add_bismillah = False

The code gives the user the option to change the default values for add_auzbillah and

add_bismillah. If the user enters ‘F’ for either prompt, the respective variable is set to

False, indicating that the corresponding string should not be added to the start of the filtered

data.

filtered_data_Orignal_arabic = []
filtered_data_arabic_text = []
for row in ws.itertuples():
if row[5] == surah_no and ayah_start <= row[11] <= ayah_end:
ayah_no = row[11] # Get the AyahNo from column K
filtered_data_Orignal_arabic.append(row[13] + "" + str(ayah_no))
filtered_data_arabic_text.append(row[14])

The code filters the original Arabic text and Arabic text data based on the provided Surah

and Ayah range. It iterates through the rows of the DataFrame ws using itertuples(). If

the row’s Surah number (at index 5) matches surah_no and the Ayah number (at index 11)

(indices 5 and 11 tell the code which columns to filter the data from)

is within the specified range, the Arabic text and Arabic text with Ayah number are

appended to the filtered_data_Orignal_arabic and filtered_data_arabic_text lists,

respectively.

if add_auzbillah and add_bismillah:
filtered_data_Orignal_arabic.insert(0, "ٰ‫اّلل أَع ْوذ‬
ِٰٰ ‫ان مِنَٰ ِب‬ َّ ‫الر ِجي ِْٰم ال‬#
ِٰ ‫شيْط‬ َّ " + "ٰ‫ِبسۡ ِم‬
َّ ‫ن‬
ِٰ‫ٱّلل‬ ِٰ ‫ٱلر ۡح َم‬
َّ ‫يم‬
ِٰ ِ‫ٱلرح‬%
َّ ")
filtered_data_arabic_text.insert(0, "‫" الرجيم الشيطان من باهلل اعوذ‬+ "‫الرحمن هللا بسم‬
‫)" الرحيم‬
elif add_auzbillah:
filtered_data_Orignal_arabic.insert(0, "ٰ‫اّلل أَع ْوذ‬
ِٰٰ ‫ان مِنَٰ ِب‬ َّ ‫الر ِجي ِْٰم ال‬#")
ِٰ ‫شيْط‬ َّ
filtered_data_arabic_text.insert(0, "‫)"الرجيم الشيطان من باهلل اعوذ‬
elif add_bismillah:

َّ ‫ٰٱلر ۡح َم ِن‬
filtered_data_Orignal_arabic.insert(0, "ٰ‫ٰٱلرحِ ِيم‬ ِ َّ ‫ ِبسۡ ِم‬%")
َّ ‫ٰٱّلل‬
filtered_data_arabic_text.insert(0, "‫)"الرحيم الرحمن هللا بسم‬

Based on the values of add_auzbillah and add_bismillah, additional strings are inserted

at the start of the filtered_data_Orignal_arabic and filtered_data_arabic_text lists.

exclusion_chars = [К '', ًٰ '', ًٰ '', ًٰ '', У '', ًٰ '', Й '', Т '', ً


ِٰ
'', П '', О '', С '', '۩', ًٰ '', ًٰ '','ًٰٰ' , '1', '2', '3', '4', '5',
'6', '7', '8', '9',ًٰ '1']

A list called exclusion_chars is defined, which contains characters that should be excluded

when splitting the words later on.

The reason is to keep the Arabic text and Original Arabic text indexes aligned, so that we can map the Arabic text onto the Original Arabic text during the subsequent matching and mapping steps.
Ayahs_oa = ' '.join(filtered_data_Orignal_arabic)
words = Ayahs_oa.split(" ")
split_words_oa = [words[0]]
for i in range(1, len(words)):
if words[i] in exclusion_chars:
split_words_oa[-1] += words[i]
else:
split_words_oa.append(words[i])
split_words_oa = list(filter(None, split_words_oa))

The code joins the filtered original Arabic data into a single string separated by spaces and

then splits it into words based on space characters. The resulting words are stored in the

split_words_oa list. If a word contains a character in exclusion_chars, it is appended to

the previous word. Finally, empty words are filtered out using filter(None,

split_words_oa).

Table 10: Exclusion Chars appended to Previous Word

Exclusion Chars Combined with


Exclusion Chars Separate
Previous Word

َْ ْ
‫اْل َمل‬ ‫َويل ِه ِهم‬

َ َْ
‫اْل َمل‬

َ ‫َف َس ْو‬ َ ‫َف َس ْو‬


‫ف‬ ‫ف‬

Ayahs_at = ' '.join(filtered_data_arabic_text)
split_words_at = Ayahs_at.split()

Similarly, the code joins the filtered Arabic text data into a single string and splits it into

words. The resulting words are stored.

asrt_data = []
with open("asrt.txt", "r", encoding="utf-8") as asrt_file:
reader = csv.reader(asrt_file, delimiter='\t')
for row in reader:
if len(row) >= 4:
asrt_data.append(row)

The code reads the data from the “asrt.txt” file, which is assumed to be a tab-separated

values (TSV) file. Each row is appended to the asrt_data list if it contains at least four

elements.

output_data = {
"Index": [],
"Original Arabic": [],
"Arabic text": [],
"Speech Recognition": [],
"Start Time": [],
"End Time": []
}

A dictionary called output_data is created with keys representing the column names and

empty lists as initial values.

max_length = max(len(asrt_data), len(split_words_oa),
len(split_words_at))

The code determines the maximum length among asrt_data, split_words_oa, and

split_words_at lists. This will be used to iterate and populate the output_data dictionary.

for i in range(max_length):
output_data["Index"].append(i)
if i < len(asrt_data):
output_data["Speech Recognition"].append(asrt_data[i][1])
output_data["Start Time"].append(asrt_data[i][2])
output_data["End Time"].append(asrt_data[i][3])
else:
output_data["Speech Recognition"].append(None)
output_data["Start Time"].append(None)
output_data["End Time"].append(None)
if i < len(split_words_oa):
output_data["Original Arabic"].append(split_words_oa[i])
else:
output_data["Original Arabic"].append(None)
if i < len(split_words_at):
output_data["Arabic text"].append(split_words_at[i])
else:
output_data["Arabic text"].append(None)

The code populates the output_data dictionary by iterating from 0 to max_length. For each

index, the corresponding values are appended to the respective lists in the dictionary. If the

index exceeds the length of a particular list, None is appended as a placeholder.

output_df = pd.DataFrame(output_data)

A pandas DataFrame called output_df is created using the output_data dictionary.

output_df.to_excel("search_match.xlsx", index=False)
print("search matching file created")

The DataFrame output_df is exported to an Excel file named “search_match.xlsx”. Finally,

a message indicating the successful creation of the search matching file is printed.

Table 11: search_match.xlsx

Original Speech

Index ْ Arabic ْ Arabic Text ْ Recognition Start Time End Time

ْ
9 ‫ال َح ْمد‬ ‫الحمد‬ ‫الحمد‬ 00:00:10,199 00:00:12,199

‫ه‬
10 ‫ّلِل‬
ِ ِ ‫هلل‬ ‫هلل‬ 00:00:12,199 00:00:12,400

‫ه‬
11 ‫ال ِذي‬ ‫الذى‬ ‫الذى‬ 00:00:12,400 00:00:13,199

َْ
12 ‫أن َز َل‬ ‫انزل‬ ‫انزل‬ 00:00:13,199 00:00:14,599

13 َٰ َ ‫َع‬
‫ل‬ ‫عل‬ ‫عل‬ 00:00:14,599 00:00:15,199

َ
14 ‫ع ْب ِد ِه‬ ‫عبده‬ ‫عبده‬ 00:00:15,199 00:00:15,300

15 َ ‫ْالك َت‬
‫اب‬ ‫الكتاب‬ ‫الكتاب‬ 00:00:15,300 00:00:15,800
ِ

3.4.2. Matching

3.4.2.1. Background

Section 3.3.1 has comprehensively covered the challenges encountered in text matching

and mapping, where we employed the Linear Search algorithm. Addressing potential

worst-case scenarios with a time complexity of O(n), we implemented a technique to limit

the search range, thereby optimizing the matching process.

The technique involves calculating the difference in the number of words between the

Speech Recognition Transcript and the Original Arabic (vowel-less) text from the database.

By adding a specific number to this calculated difference, we effectively narrow down the

search range. For instance, if the difference is 15, we add a value of 10 to restrict the search

within a range of 25 indices. This strategic approach ensures that the impact of worst-case

scenarios is mitigated, as we consider both positive and negative offsets. By incorporating

this tailored method within the Linear Search algorithm, we significantly enhance the

efficiency of our text matching process.
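
A minimal sketch of this window computation (mirroring the const + diff bound used in the code of Section 3.4.2.3; the function name is ours and const is a tunable constant) is:

def search_window(i, diff, total_rows, const=10):
    # Restrict the linear search to a band of rows around index i
    lo = max(0, i - (const + diff))
    hi = min(total_rows, i + (const + diff))
    return lo, hi

# Example: a word-count difference of 15 plus a constant of 10
# limits the search to 25 indices on either side of row 100
print(search_window(100, 15, 1000))  # -> (75, 125)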

3.4.2.2. Process

Figure 18: Matching Algorithm

1. The vowel-less Arabic Text (AT) is matched with the ASR transcript using Linear

Searching algorithm.

2. For the matched indices, the corresponding Original Arabic text (OA) containing

vowels is replaced with the ASR transcript. The timestamps from the ASR

transcript are also applied.

3. For the unmatched indices, the indices are assigned a value of ‘-1’.

4. All the index values of ‘-1’ are interpolated and the corresponding Original Arabic

(OA) Text replaces the ASR text.

5. Hence output SRT is generated.

3.4.2.3. Explanation with Code

In this part, a pandas DataFrame named output_df is created using the output_data

dictionary. The DataFrame is then exported to an Excel file named "search_match.xlsx"

using the to_excel() function. The index=False argument ensures that the row index is not

included in the exported Excel file. Finally, a message is printed indicating that the search

matching file has been created.

The rest of the code performs additional operations on the generated data, such as counting

non-null cells, matching Speech Recognition and Arabic text, handling consecutive -1

values, mapping indices and time information, and saving the updated DataFrame and SRT

content to files.

import pandas as pd

This line imports the pandas library, which is used for data manipulation and analysis.

df = pd.read_excel('search_match.xlsx')

This line reads the data from an Excel file named 'search_match.xlsx' and stores it in a

DataFrame called df.

at_count = df['Arabic text'].count()


sr_count = df['Speech Recognition'].count()

These lines count the number of non-null cells in the columns 'Arabic text' and 'Speech

Recognition' of the DataFrame df and store the counts in the variables at_count and

sr_count, respectively.

diff = sr_count - at_count if sr_count > at_count else at_count -

sr_count

This line calculates the difference between sr_count and at_count. It checks if sr_count is

greater than at_count and subtracts at_count from sr_count if it is, otherwise, it subtracts

sr_count from at_count. The result is stored in the variable diff.

matched_data = []
const = 2

These lines initialize an empty list matched_data to store the modified data and set the value

of const as 2. const is used to define the number of rows to search before and after the

current row for a match.

i = 0
while i < len(df):
if df.loc[i, 'Speech Recognition'] == df.loc[i, 'Arabic text']:
start_time = df.loc[i, 'Start Time']
end_time = df.loc[i, 'End Time']
matched_data.append((i, start_time, end_time))
else:
match_found = False
for j in range(max(0, i - (const + diff)), min(i + (const + diff),
len(df))):
if df.loc[i, 'Speech Recognition'] == df.loc[j, 'Arabic text']:
start_time = df.loc[i, 'Start Time']
end_time = df.loc[i, 'End Time']
matched_data.append((j, start_time, end_time))
match_found = True
break
if not match_found:
matched_data.append((-1, df.loc[i, 'Start Time'],
df.loc[i, 'End Time']))
i += 1

This block of code iterates over the rows of the DataFrame df using a while loop. It

checks if the 'Speech Recognition' column value matches the corresponding value in the

'Arabic text' column for each row. If a match is found, it extracts the start and end

times from the 'Start Time' and 'End Time' columns, respectively, and appends a tuple

(i, start_time, end_time) to the matched_data list. Here, i represents the index of the

matched row.

If a match is not found, it searches for a match in the next and previous (const + diff)

rows. If a match is found, it appends a tuple (j, start_time, end_time) to matched_data,

where j represents the index of the matched row. If no match is found, it appends (-1,

start_time, end_time) to indicate that no match was found after searching the

neighboring rows.

data = [x[0] for x in matched_data]
start_times = [x[1] for x in matched_data]
end_times = [x[2] for x in matched_data]

i = 0
while i < len(data):
    if data[i] == -1:
        above = data[i-1] if i > 0 else None
        below = None
        j = i + 1
        while j < len(data):
            if data[j] != -1:
                below = data[j]
                break
            j += 1
        if above is not None and below is not None and (below - above) <= 2:
            if below - above == 1:
                data[i] = above + 1
                start_times[i] = df.loc[above, 'Start Time']
                end_times[i] = df.loc[above, 'End Time']
            else:
                data[i] = above + 1
                start_times[i] = df.loc[above, 'Start Time']
                end_times[i] = df.loc[above, 'End Time']
                data[i+1] = above + 2
                start_times[i+1] = df.loc[above, 'Start Time']
                end_times[i+1] = df.loc[above, 'End Time']
        else:
            i += 1
    i += 1

In this part, the code processes the matched_data list to replace consecutive -1 values using

the +1 approach. It iterates over the data list, which contains the indices from

matched_data, and checks for consecutive -1 values. If found, it looks for the nearest non-

-1 indices above and below. If the difference between them is less than or equal to 2, it

replaces the -1 values with the corresponding indices and copies the start and end times

from the previous matched (non -1) row.

df['matched_index'] = data
df['start_time'] = start_times
df['end_time'] = end_times
df['matched_orignal_Arabic'] = ['' if x == -1 else df.loc[x, 'Original Arabic']
                                for x in data]
df['matched'] = ['Matched' if x != -1 else 'Not matched' for x in data]

These lines add new columns to the DataFrame df to store the matched index, start time,

end time, original Arabic text, and a column indicating if a word was matched or not. The

values are populated based on the data list.

df.to_excel('results of matching.xlsx', index=False)

This line saves the updated DataFrame df to a new Excel file named 'results of

matching.xlsx' without including the index column.

Table 12: Results of Matching

Original Arabic Speech


Index Start Time End Time
arabic text Recognition

ْ
9 ‫ال َح ْمد‬ ‫الحمد‬ ‫الحمد‬ 00:00:10,199 00:00:12,199

‫ه‬
10 ‫ّلِل‬
ِ ِ ‫هلل‬ ‫هلل‬ 00:00:12,199 00:00:12,400

‫ه‬
11 ‫ال ِذي‬ ‫الذى‬ ‫الذى‬ 00:00:12,400 00:00:13,199

َْ
12 ‫أن َز َل‬ ‫انزل‬ ‫انزل‬ 00:00:13,199 00:00:14,599

13 َٰ َ ‫َع‬
‫ل‬ ‫عل‬ ‫عل‬ 00:00:14,599 00:00:15,199

14 ‫َع ْب ِد ِه‬ ‫عبده‬ ‫عبده‬ 00:00:15,199 00:00:15,300

َ ‫ْالك َت‬
‫اب‬ ‫الكتاب‬ ‫الكتاب‬
15 ِ 00:00:15,300 00:00:15,800

Matched index start_time end_time matched orignal Arabic matched

ْ
9 00:00:06,700 00:00:07,200 ‫ال َح ْمد‬ Matched

‫ه‬
10 00:00:12,199 00:00:12,400 ‫ّلِل‬
ِ ِ Matched

‫ه‬
11 00:00:12,400 00:00:13,199 ‫ال ِذي‬ Matched

َْ
12 00:00:13,199 00:00:14,599 ‫أن َز َل‬ Matched

13 00:00:14,599 00:00:15,199 َٰ َ ‫َع‬


‫ل‬ Matched

14 00:00:15,199 00:00:15,300 ‫َع ْب ِد ِه‬ Matched

َ ‫ْالك َت‬
‫اب‬
15 00:00:15,300 00:00:15,800 ِ Matched

num = 1
srt_content = ""
for start_time, end_time, matched_text in zip(start_times, end_times,
df['matched_orignal_Arabic']):
srt_content += f"{num}\n{start_time} --> {end_time}\n{matched_text}\n\n"
num += 1

This loop iterates over the start_times, end_times, and matched_orignal_Arabic columns

of the DataFrame df using the zip function. It constructs the content for the SRT file by

concatenating the line number, start time, end time, and matched text for each row.

with open('results.srt', 'w', encoding='utf-8') as f:


f.write(srt_content)

This code block opens a file named 'results.srt' in write mode with UTF-8 encoding

and writes the srt_content to the file. It saves the content in the SubRip Text (SRT) format,

which is commonly used for subtitles.

Table 13: result.srt

Word by Word Full Ayah Translation

10 2 2

00:00:06,700 --> 00:00:07,200 00:00:07,200 --> 00:00:07,200 -->

ْ 00:00:18,100 00:00:18,100
‫ال َح ْمد‬
َ
َٰ َ ‫ّلِل هال ِذي أ ْن َز َل َع‬
‫ل َع ْب ِد ِه‬
‫ه‬ َ ْ
ِ ِ ‫الح ْمد‬ Praise be to Allah, Who
َ َ َ َ ْ
١ ‫اب َول ْم َي ْج َع ْل له ِع َو ًجا‬‫ال ِكت‬ hath sent to His Servant

the Book, and hath

allowed therein no

Crookedness:

‫سب تعریف خدا یہ کو ےہ‬

)‫جس ن اپن بندے (محمدﷺ‬

‫پر (یہ) کتاب نازل یک اور اس‬

‫کج (اور‬
‫مي کیس طرح یک ی‬
‫ی‬
‫پیچیدیک) نہ رکیھ‬

11 3 3

‫‪00:00:12,199 --> 00:00:12,400‬‬ ‫>‪00:00:18,100 --‬‬ ‫>‪00:00:18,100 --‬‬

‫ه‬ ‫‪00:00:29,699‬‬ ‫‪00:00:29,699‬‬


‫ّلِل‬
‫ِ ِ‬
‫َ ْ‬ ‫ْ َ ً‬ ‫ْ‬ ‫َ‬
‫ق ِّي ًما ِلين ِذ َر َبأ ًسا ش ِديدا ِم ْن لدنه‬ ‫)‪(He hath made it‬‬
‫ُ َ‬ ‫ي هالذ َ‬
‫ين َي ْع َملون‬ ‫ّش ْالم ْؤمن َ‬
‫َوي َب ِِّ َ‬ ‫‪Straight (and Clear) in‬‬
‫ِ‬ ‫ِ ِ‬
‫ً‬ ‫َ َّ َ َ‬ ‫َّ َ‬
‫ات أن له ْم أ ْج ًرا َح َسنا‪٢‬‬
‫الص ِالح ِ‬ ‫‪order that He may warn‬‬

‫‪(the godless) of a terrible‬‬

‫‪Punishment from Him,‬‬

‫‪and that He may give‬‬

‫‪Glad Tidings to the‬‬

‫‪Believers who work‬‬

‫‪righteous deeds, that they‬‬

‫‪shall have a goodly‬‬

‫‪Reward,‬‬

‫سیدیھ (اور سلیس اتاری) تاکہ‬

‫لوگوں کو عذاب سخت ےس جو‬

‫اس یک طرف ےس (آن واال) ےہ‬

‫ڈر ےائ اور مومنوں کو جو نیک‬


‫کرئ ہي خوشخبی سنا ے‬
‫ن‬ ‫عمل ے‬
‫ی‬
‫ُ‬
‫کہ ان ےک ےلئ (ان ےک کاموں کا)‬

‫نیک بدلہ (یعن) بہشت ےہ‬

‫‪12‬‬ ‫‪4‬‬ ‫‪4‬‬

‫‪79‬‬
00:00:12,400 --> 00:00:13,199 00:00:29,699 --> 00:00:29,699 -->

‫ه‬ 00:00:32,500 00:00:32,500


‫ال ِذي‬
ً َ
٣‫يه أ َبدا‬ َ َ
ِ ‫م ِاك ِثي ِف‬ Wherein they shall

remain forever:

‫ی‬
‫جس مي وه ابدا الآباد رہي ےک‬

3.5. Video Creation

3.5.1. Background

Video creation is a crucial step in our project, and it involves processing the output SRTs

(section 4.1) generated by the matching algorithm. Users have the flexibility to choose any

of the three SRTs for video generation. These SRTs offer options for word-by-word

Qur'anic text, full ayahs, and translations in two languages.

The video creation process demands some time, especially during tasks like video

rendering. On machines equipped with a good GPU, this time-consuming step can be

expedited by more than 80%. For instance, for medium-range surahs (section 2.1), the

processing time can be reduced from around 5 minutes to just 1 minute with a good-

performance GPU.

3.5.2. Process

Figure 19: Video Creation

1. We begin by importing Background Video, Recitation Audio file, Qur’anic Subtitle

file, and Translation Subtitle file.

2. Background video is either trimmed or looped based on the duration of input Audio

Recitation, and Audio and Video is combined.

3. Text clips for both Qur’anic subtitles and Translation are generated each catering to

acceptable font style, size, and color. Then their positions are set in the video

frames.

4. Text clips with predetermined positions and timestamps are added to the Video

(containing both Recitation Audio and Background Video), and thus we have

arrived at our goal.

3.5.3. Explanation with Code

3.5.3.1. Explanation

import re
import pysrt
from moviepy.editor import VideoFileClip, AudioFileClip, TextClip,
CompositeVideoClip

In this section, the necessary modules are imported: re for regular expression operations,

pysrt for working with subtitle files, and various classes from the moviepy.editor

module for video and audio processing.

def time_to_seconds(time_obj):
    return (time_obj.hours * 3600 + time_obj.minutes * 60
            + time_obj.seconds + time_obj.milliseconds / 1000)

This function time_to_seconds converts a time object from the subtitle file to

seconds. It calculates the total time in seconds by adding up the hours, minutes, seconds,

and milliseconds components of the time object.

def create_subtitle_clips(subtitles, videosize, fontsize=24, font='Arial', color='yellow'):
    subtitle_clips = []
    for subtitle in subtitles:
        start_time = time_to_seconds(subtitle.start)
        end_time = time_to_seconds(subtitle.end)
        duration = end_time - start_time
        video_width, video_height = videosize
        text_clip = TextClip(subtitle.text, fontsize=60, font=font,
                             color='white', bg_color='transparent',
                             size=(video_width * 3 / 4, None),
                             method='caption').set_start(start_time).set_duration(duration)
        subtitle_x_position = 'center'
        subtitle_y_position = 'center'
        text_position = (subtitle_x_position, subtitle_y_position)
        subtitle_clips.append(text_clip.set_position(text_position))
    return subtitle_clips

This function create_subtitle_clips takes a list of subtitles, video size, and

optional parameters such as fontsize, font, and color. It iterates through each subtitle in

the list and creates a TextClip object for each subtitle. The start time and end time of the

subtitle are converted to seconds using the time_to_seconds function. The duration

is calculated as the difference between the end time and start time. The TextClip is created

with the specified text, font size, font, color, background color, size, and method. The

TextClip is positioned at the center of the video. The created subtitle clips are stored in a

list and returned.

video = VideoFileClip("video.mp4")

Here, the VideoFileClip class is used to load the video file "video.mp4" into the video

object.

video = video.resize((1280, 720))

This line resizes the video to a resolution of 1280x720 pixels using the resize method of

the VideoFileClip class.

audio = AudioFileClip("kahaf.mp3")

The AudioFileClip class is used to load the audio file "kahaf.mp3" into the audio object.

audio_duration = audio.duration

The duration attribute of the AudioFileClip object is assigned to the variable

audio_duration to retrieve the duration of the audio clip in seconds.

video = video.subclip(0, audio_duration)

This line trims or extends the video clip using the subclip method of the

VideoFileClip class. It sets the start and end times of the video to match the duration of

the audio clip.

video = video.set_audio(audio)

The set_audio method is used to combine the video and audio clips. It sets the audio of

the video clip to the loaded audio clip.

subtitles = pysrt.open("Full Ayah.srt")

The pysrt.open function is used to open the subtitle file "Full Ayah.srt" and load the

subtitles into the subtitles object.

subtitle_clips = create_subtitle_clips(subtitles, video.size)

The create_subtitle_clips function is called to generate a list of subtitle clips based

on the loaded subtitles and the size of the video.

final_video = CompositeVideoClip([video] + subtitle_clips)

The CompositeVideoClip class is used to combine the video clip, audio, and subtitle

clips into the final video. The video clip is passed as the first element of a list, followed by

the subtitle clips.

final_video.set_duration(audio_duration).write_videofile(output_video_file,
    codec='libx264', audio_codec='aac', fps=video.fps)

The set_duration method is used to set the duration of the final video to match the

audio duration. Then, the write_videofile method is called to write the final video to

the file specified by output_video_file. The video codec is set to "libx264", the audio

codec to "aac", and the frame rate is set to the same as the original video.

Overall, this code loads a video file, resizes it, loads an audio file, trims or extends the

video to match the audio duration, combines the video and audio, adds subtitles from the

generated SRT, and creates a final video with embedded subtitles.

4. Chapter 4: Results

4.1. Generation of SRT

4.1.1. SRT - Words by Words


10

0:00:10.200000 --> 0:00:12.200000


ْ
‫ال َح ْمد‬

11

0:00:12.200000 --> 0:00:12.400000


‫ه‬
‫ّلِل‬
ِ ِ

12

0:00:12.200000 --> 0:00:12.400000


‫ه‬
‫ال ِذي‬

13

0:00:12.200000 --> 0:00:12.400000


َْ
‫أن َز َل‬

14

0:00:14.600000 --> 0:00:15.200000

َٰ َ ‫َع‬
‫ل‬

‫‪4.1.2. SRT - Full Ayah‬‬

‫‪1‬‬

‫‪00:00:03,100 --> 00:00:04,400‬‬


‫ه َ َّ‬
‫الش ْي َٰطان َّ‬ ‫َ‬
‫الر ِج ْي ِم‬ ‫ِ‬ ‫اّلِل ِمن‬
‫أعوذ ِب ِ‬

‫‪2‬‬

‫‪00:00:04,400 --> 00:00:10,199‬‬

‫ٱلر ۡح َم َٰ ن َّ‬ ‫ه‬


‫ٱلر ِح ِيم‬ ‫ِ‬ ‫ِب ۡس ِم ِ‬
‫ٱّلِل َّ‬

‫‪3‬‬

‫‪00:00:10,199 --> 00:00:18,100‬‬


‫َ‬ ‫َ ْ َ َ َ َ َٰ َ ْ ْ َ َ َ‬ ‫ْ َ ْ ه ه‬
‫اب َول ْم َي ْج َع ْل له ِع َو ًجا ‪1‬‬‫ّلِل ال ِذي أنزل عل عب ِد ِه ال ِكت‬
‫الحمد ِ ِ‬

‫‪4‬‬

‫‪00:00:18,100 --> 00:00:29,699‬‬

‫ات‬ ‫َق ِّي ًما لي ْنذ َر َب ْأ ًسا َشد ًيدا م ْن َلد ْنه َوي َب ِِّ َ ْ ْ َ ه َ َ ْ َ ُ َ َّ َ‬
‫ّش المؤ ِم ِني ال ِذين يعملون الص ِالح ِ‬ ‫ِ‬ ‫ِ‬ ‫ِ ِ‬
‫َ َّ َ ْ َ ْ ً َ َ ً‬
‫أن لهم أجرا حسنا‪2‬‬

‫‪5‬‬

‫‪00:00:29,699 --> 00:00:32,500‬‬


‫َ ً‬
‫يه أ َبدا‪3‬‬ ‫َ‬ ‫َ‬
‫م ِاك ِثي ِف ِ‬

‫‪89‬‬
4.1.3. SRT – Full Ayah with Translation in Multiple Languages

00:00:06,700 --> 00:00:07,200

In the name of Allah, Most Gracious, Most Merciful.

‫ِشوع هللا کا نام ےل کر جو بڑا مہربان نہایت رحم واال ےہ‬

00:00:07,200 --> 00:00:18,100

Praise be to Allah, Who hath sent to His Servant the Book,


and hath allowed therein no Crookedness:

‫سب تعریف خدا یہ کو ےہ جس ن اپن بندے (محمدﷺ) پر (یہ) کتاب‬


‫ی‬
‫کج (اور پیچیدیک) نہ رکیھ‬
‫نازل یک اور اس مي کیس طرح یک ی‬

00:00:18,100 --> 00:00:29,699

(He hath made it) Straight (and Clear) in order that He may
warn (the godless) of a terrible Punishment from Him, and
that He may give Glad Tidings to the Believers who work
righteous deeds, that they shall have a goodly Reward,

‫سیدیھ (اور سلیس اتاری) تاکہ لوگوں کو عذاب سخت ےس جو اس یک‬


‫ے‬
‫کرئ ہي‬ ‫طرف ےس (آن واال) ےہ ڈر ےائ اور مومنوں کو جو نیک عمل‬
ُ ‫خوشخبی سنا ے‬
‫ن کہ ان ےک ےلئ (ان ےک کاموں کا) نیک بدلہ (یعن) بہشت‬ ‫ی‬
‫ےہ‬

00:00:29,699 --> 00:00:32,500

Wherein they shall remain forever:


4.2. Results of matching

Table 14: Results of Matching

Original Arabic Speech


Index Start Time End Time
arabic text Recognition

ْ
9 ‫ال َح ْمد‬ ‫الحمد‬ ‫الحمد‬ 00:00:10,199 00:00:12,199

‫ه‬
10 ‫ّلِل‬
ِ ِ ‫هلل‬ ‫هلل‬ 00:00:12,199 00:00:12,400

‫ه‬
11 ‫ال ِذي‬ ‫الذى‬ ‫الذى‬ 00:00:12,400 00:00:13,199

َْ
12 ‫أن َز َل‬ ‫انزل‬ ‫انزل‬ 00:00:13,199 00:00:14,599

13 َٰ َ ‫َع‬
‫ل‬ ‫عل‬ ‫عل‬ 00:00:14,599 00:00:15,199

14 ‫َع ْب ِد ِه‬ ‫عبده‬ ‫عبده‬ 00:00:15,199 00:00:15,300

15 َ ‫ْالك َت‬
‫اب‬ ‫الكتاب‬ ‫الكتاب‬ 00:00:15,300 00:00:15,800
ِ

Matched index start_time end_time matched orignal Arabic matched

ْ
9 00:00:06,700 00:00:07,200 ‫ال َح ْمد‬ Matched

‫ه‬
10 00:00:12,199 00:00:12,400 ‫ّلِل‬
ِ ِ Matched

‫ه‬
11 00:00:12,400 00:00:13,199 ‫ال ِذي‬ Matched

َْ
12 00:00:13,199 00:00:14,599 ‫أن َز َل‬ Matched

13 00:00:14,599 00:00:15,199 َٰ َ ‫َع‬


‫ل‬ Matched

14 00:00:15,199 00:00:15,300 ‫َع ْب ِد ِه‬ Matched

َ ‫ْالك َت‬
‫اب‬
15 00:00:15,300 00:00:15,800 ِ Matched

4.3. Generated Video

The proposed method is a milestone achievement over the traditional Qur'anic video creation method, reducing the time required by 98.4%, i.e., 5 hours of work can now be done in about 5 minutes.

Figure 20: Time Comparison between Traditional Method and Proposed Method

4.3.1. Word by Word

Figure 21: Video Word by Word

4.3.2. Video with Full Ayah

Figure 22: Video with Full Ayah

4.3.3. Full Ayah with Translation in Multiple Languages

Figure 23: Code Allows the selection of 2 languages for Translation

Figure 24: Full Ayahs with Translation in English and Urdu

Figure 25: Video with translation in Multiple Languages

5. Chapter 5: Conclusion
In this project, we have successfully addressed the challenge of automating the creation of

Qur’anic content videos, aiming to lower the difficulty for both new and existing content

creators. Through the implementation of advanced Qur’anic text recognition and

synchronization techniques, we developed a robust model that seamlessly aligns Qur’anic

text with audio, simplifying the video creation process and reducing the time required by 98.4%, i.e., 5 hours of work can now be done in 5 minutes.

Our automated system significantly reduces the time and effort required by content

creators, eliminating the need for manual editing using complex software. The process

becomes streamlined, efficient, and consistent, promoting accessibility and inclusivity of

the Qur'an. With our solution, we enable the rapid creation of a vast library of high-quality

videos, accommodating different recitations and translations of the Qur’an. This caters to

individuals seeking access to Qur’anic teachings in various languages.

Extensive testing and analysis have validated the accuracy and performance of our system,

achieving high Qur’anic text transcription accuracy and precise synchronization, ensuring

an immersive and seamless user experience.

In conclusion, this project successfully automates the creation of Qur’anic content,

providing a user-friendly and efficient solution for content creators and users alike. The

implementation of advanced speech recognition and synchronization techniques paves the

way for a more accessible, inclusive, and efficient approach to engaging with the Qur'an.

The potential impact of this work extends beyond the scope of this project, offering

opportunities for further research, development, and application in the field of Islamic

education and digital content creation.

5.1. Source Files

Link for codes and databases

https://github.com/Nashitsaleem1/FYP-AI-Model-for-Recognition-and-Synchronization-of-Quranic-Text-with-Audio

For Any Queries email us at

hafiznashitsaleem@gmail.com

zainulabideen7676@gmail.com

armaghan4201@gmail.com

6. Chapter 6: Future Work/Aspects
The primary objective of this research endeavor was to develop a model suitable for

automating video creation for Zikra Dar-ul-Nashr's content. However, our project took a

more generalized approach, focusing on Qur’anic recitation content creation to

facilitate video production for content creators worldwide, yielding remarkable results that

fulfill the needs of creators.

Zikra Dar-ul-Nashr videos involve multilingual elements, presenting a challenge in

transcribing audio content with both Arabic and Urdu speech. To address this need, future

work will explore the development of a Keyword Spotting Algorithm capable of efficiently

handling multilingual processing. This algorithm will accurately detect and timestamp

Arabic words in the audio, streamlining video creation for multilingual audios.

The Keyword Spotting Algorithm has implications beyond Zikra Dar-ul-Nashr, as it

streamlines video production by automating transcription efforts and identifying key

moments within audio content. Its application extends to diverse industries dealing with

multilingual content, enabling efficient translation, subtitling, and analysis of multilingual

podcasts.

Moreover, the algorithm's proficiency in Arabic word detection sets a precedent for similar

applications in other languages. As multilingual communication becomes prevalent, speech

recognition tools capable of handling diverse linguistic inputs become indispensable across

multiple sectors, including education, media, and customer service.

In addition to algorithm development, we aim to create a user-friendly website interface

for our algorithms. This platform will empower content creators and individuals from

various fields to efficiently automate their video creation processes, enhancing productivity

and leveraging the power of our work seamlessly. By empowering creators and

streamlining workflows, our work opens up new possibilities in speech recognition

technology, making content creation more efficient and creative in diverse linguistic

contexts.

7. References
[1] B. Mocanu and R. Tapu, “Automatic Subtitle Synchronization and Positioning System Dedicated to Deaf and Hearing Impaired People,” IEEE Access, vol. 9, pp. 139544–139555, 2021, doi: 10.1109/ACCESS.2021.3119201.

[2] N. O. Balula, M. Rashwan, and S. Abdou, “Automatic Speech Recognition (ASR) Systems for Learning Arabic Language and Al-Quran Recitation: A Review,” International Journal of Computer Science and Mobile Computing, vol. 10, no. 7, pp. 91–100, Jul. 2021, doi: 10.47760/ijcsmc.2021.v10i07.013.

[3] M. Alrabiah and N. Alhelewh, “An Empirical Study On The Holy Quran Based On A Large Classical Arabic Corpus,” 2014. [Online]. Available: www.islamqa.com

[4] H. A. Alsayadi, A. A. Abdelhamid, I. Hegazy, B. Alotaibi, and Z. T. Fayed, “Deep Investigation of the Recent Advances in Dialectal Arabic Speech Recognition,” IEEE Access, vol. 10, pp. 57063–57079, 2022, doi: 10.1109/ACCESS.2022.3177191.

[5] A. F. Ghori, A. Waheed, M. Waqas, A. Mehmood, and S. A. Ali, “Acoustic modelling using deep learning for Quran recitation assistance,” Int J Speech Technol, vol. 26, no. 1, pp. 113–121, Mar. 2023, doi: 10.1007/s10772-022-09979-4.

[6] A. H. A. Absa, M. Deriche, M. Elshafei-Ahmed, Y. M. Elhadj, and B. H. Juang, “A hybrid unsupervised segmentation algorithm for arabic speech using feature fusion and a genetic algorithm (July 2018),” IEEE Access, vol. 6, pp. 43157–43169, Jul. 2018, doi: 10.1109/ACCESS.2018.2859631.

[7] A. F. Ghori, A. Waheed, M. Waqas, A. Mehmood, and S. A. Ali, “Acoustic modelling using deep learning for Quran recitation assistance,” Int J Speech Technol, 2022, doi: 10.1007/s10772-022-09979-4.

[8] S. Al-Issa, M. Al-Ayyoub, O. Al-Khaleel, and N. Elmitwally, “Building a neural speech recognizer for quranic recitations,” Int J Speech Technol, 2022, doi: 10.1007/s10772-022-09988-3.

[9] A. N. Akkila and S. S. A. Naser, “Rules of Tajweed the Holy Quran Intelligent Tutoring System,” 2018.

[10] A. M. Alagrami and M. M. Eljazzar, “SMARTAJWEED Automatic Recognition of Arabic Quranic Recitation Rules,” Academy and Industry Research Collaboration Center (AIRCC), Dec. 2020, pp. 145–152. doi: 10.5121/csit.2020.101812.

[11] S. Hakak, A. Kamsin, P. Shivakumara, O. Tayan, M. Y. Idna Idris, and G. A. Gilkar, “An Efficient Text Representation for Searching and Retrieving Classical Diacritical Arabic Text,” in Procedia Computer Science, Elsevier B.V., 2018, pp. 150–157. doi: 10.1016/j.procs.2018.10.470.

[12] M. Alhawarat, M. Hegazi, and A. Hilal, “Processing the Text of the Holy Quran: a Text Mining Study.” [Online]. Available: www.ijacsa.thesai.org

[13] M. M. al Anazi and O. R. Shahin, “A Machine Learning Model for the Identification of the Holy Quran Reciter Utilizing K-Nearest Neighbor and Artificial Neural Networks,” Information Sciences Letters, vol. 11, no. 4, pp. 1093–1102, Jul. 2022, doi: 10.18576/isl/110410.

[14] K. M. O. Nahar, W. G. Al-Khatib, M. Elshafei, H. Al-Muhtaseb, and M. M. Alghamdi, “Arabic Phonemes Transcription Using Learning Vector Quantization: ‘Towards the Development of Fast Quranic Text Transcription,’” in Proceedings - 2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, NOORIC 2013, Institute of Electrical and Electronics Engineers Inc., Sep. 2015, pp. 407–412. doi: 10.1109/NOORIC.2013.85.

[15] B. Mocanu and R. Tapu, “Automatic Subtitle Synchronization and Positioning System Dedicated to Deaf and Hearing Impaired People,” IEEE Access, vol. 9, pp. 139544–139555, 2021, doi: 10.1109/ACCESS.2021.3119201.

[16] N. Shafie, M. Z. Adam, S. M. Daud, and H. Abas, “A model of correction mapping for Al-Quran recitation performance evaluation engine,” International Journal of Advanced Trends in Computer Science and Engineering, vol. 8, no. 1.3 S1, pp. 208–213, 2019, doi: 10.30534/ijatcse/2019/4181.32019.

[17] M. M. Al Anazi and O. R. Shahin, “A Machine Learning Model for the Identification of the Holy Quran Reciter Utilizing K-Nearest Neighbor and Artificial Neural Networks,” Information Sciences Letters, vol. 11, no. 4, pp. 1093–1102, Jul. 2022, doi: 10.18576/isl/110410.

[18] S. Larabi-Marie-Sainte, B. S. Alnamlah, N. F. Alkassim, and S. Y. Alshathry, “A new framework for Arabic recitation using speech recognition and the Jaro Winkler algorithm,” Kuwait Journal of Science, vol. 49, no. 1, pp. 1–19, Jan. 2022, doi: 10.48129/KJS.V49I1.11231.

[19] A. M. Basabrain, A. A. A. Shara, and I. A. Al-Bidewi, “Intelligent Search Engine For The Holy Quran,” International Journal on Islamic Applications in Computer Science And Technology, vol. 2, no. 2, 2014.

[20] A. Al Harere and K. Al Jallad, “Quran Recitation Recognition using End-to-End Deep Learning.”

[21] S. A. Y. Al-Galal, I. F. T. Alshaikhli, A. W. bin Abdul Rahman, and M. A. Dzulkifli, “EEG-based Emotion Recognition while Listening to Quran Recitation Compared with Relaxing Music Using Valence-Arousal Model,” in 2015 4th International Conference on Advanced Computer Science Applications and Technologies (ACSAT), IEEE, Dec. 2015, pp. 245–250. doi: 10.1109/ACSAT.2015.10.

[22] A. R. Ali, “Multi-Dialect Arabic Speech Recognition,” in 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, Jul. 2020, pp. 1–7. doi: 10.1109/IJCNN48605.2020.9206658.

[23] M. Osman Hegazi, A. Hilal, and M. Alhawarat, “Fine-Grained Quran Dataset,” 2015. [Online]. Available: www.ijacsa.thesai.org

[24] S. Al-Issa, M. Al-Ayyoub, O. Al-Khaleel, and N. Elmitwally, “Building a neural speech recognizer for quranic recitations,” Int J Speech Technol, 2022, doi: 10.1007/s10772-022-09988-3.

[25] K. M. O. Nahar, W. G. Al-Khatib, M. Elshafei, H. Al-Muhtaseb, and M. M. Alghamdi, “Arabic Phonemes Transcription Using Learning Vector Quantization: ‘Towards the Development of Fast Quranic Text Transcription,’” in Proceedings - 2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, NOORIC 2013, Institute of Electrical and Electronics Engineers Inc., Sep. 2015, pp. 407–412. doi: 10.1109/NOORIC.2013.85.

[26] N. Shafie, M. Z. Adam, S. M. Daud, and H. Abas, “A model of correction mapping for Al-Quran recitation performance evaluation engine,” International Journal of Advanced Trends in Computer Science and Engineering, vol. 8, no. 1.3 S1, pp. 208–213, 2019, doi: 10.30534/ijatcse/2019/4181.32019.

[27] H. A. Alsayadi, A. A. Abdelhamid, I. Hegazy, B. Alotaibi, and Z. T. Fayed, “Deep Investigation of the Recent Advances in Dialectal Arabic Speech Recognition,” IEEE Access, vol. 10, pp. 57063–57079, 2022, doi: 10.1109/ACCESS.2022.3177191.

[28] H. A. Alsayadi and M. Hadwan, “Automatic Speech Recognition for Qur’an Verses using Traditional Technique,” Journal of Artificial Intelligence and Metaheuristics, vol. 1, no. 2, pp. 17–23, 2022, doi: 10.54216/JAIM.010202.

[29] C.-C. Chiu et al., “State-of-the-Art Speech Recognition with Sequence-to-Sequence Models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Apr. 2018, pp. 4774–4778. doi: 10.1109/ICASSP.2018.8462105.

[30] S. R. El-Beltagy and A. Rafea, “QDetect: An Intelligent Tool for Detecting Quranic Verses in any Text,” in Procedia CIRP, Elsevier B.V., 2021, pp. 374–384. doi: 10.1016/j.procs.2021.05.107.

[31] J. E. Garcia, A. Ortega, E. Lleida, T. Lozano, E. Bernues, and D. Sanchez, “Audio and text synchronization for TV news subtitling based on automatic speech recognition,” in 2009 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, BMSB 2009, 2009. doi: 10.1109/ISBMSB.2009.5133758.

[32] B. Xue, C. Liu, and Y. Mu, “Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment,” in ICMR 2022 - Proceedings of the 2022 International Conference on Multimedia Retrieval, Association for Computing Machinery, Inc, Jun. 2022, pp. 342–350. doi: 10.1145/3512527.3531371.

[33] X. Huang and K. F. Lee, “On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 150–157, Apr. 1993, doi: 10.1109/89.222875.

[34] T. Kimura, T. Nose, S. Hirooka, Y. Chiba, and A. Ito, “Comparison of Speech Recognition Performance Between Kaldi and Google Cloud Speech API,” 2019, pp. 109–115. doi: 10.1007/978-3-030-03748-2_13.

[35] H. Tabbal, W. El Falou, and B. Monla, “Analysis and implementation of a ‘Quranic’ verses delimitation system in audio files using speech recognition techniques,” in 2006 2nd International Conference on Information & Communication Technologies, IEEE, pp. 2979–2984. doi: 10.1109/ICTTA.2006.1684889.

[36] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, May 2013, pp. 6645–6649. doi: 10.1109/ICASSP.2013.6638947.

[37] R. Pires Magalhães et al., “Evaluation of Automatic Speech Recognition Approaches,” Journal of Information and Data Management, vol. 13, no. 3, Sep. 2022, doi: 10.5753/jidm.2022.2514.

[38] N. Anggraini, A. Kurniawan, L. K. Wardhani, and N. Hakiem, “Speech Recognition Application for the Speech Impaired using the Android-based Google Cloud Speech API,” TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 16, no. 6, p. 2733, Dec. 2018, doi: 10.12928/telkomnika.v16i6.9638.

[39] Y.-S. Chang, S.-H. Hung, N. J. C. Wang, and B.-S. Lin, “CSR: A Cloud-Assisted Speech Recognition Service for Personal Mobile Device,” in 2011 International Conference on Parallel Processing, IEEE, Sep. 2011, pp. 305–314. doi: 10.1109/ICPP.2011.23.

[40] P. Revesz, Introduction to Databases. London: Springer London, 2010. doi: 10.1007/978-1-84996-095-3.

[41] E. C. Grimm et al., “POLLEN METHODS AND STUDIES | Databases and their Application,” in Encyclopedia of Quaternary Science, Elsevier, 2007, pp. 2521–2528. doi: 10.1016/B0-44-452747-8/00189-7.

[42] M. A. Sherif and A.-C. Ngonga Ngomo, “Semantic Quran,” Semant Web, vol. 6, no. 4, pp. 339–345, Aug. 2015, doi: 10.3233/SW-140137.

[43] D. A. Wiranata, Moch. A. Bijaksana, and M. S. Mubarok, “Quranic Concepts Similarity Based on Lexical Database,” in 2018 6th International Conference on Information and Communication Technology (ICoICT), IEEE, May 2018, pp. 264–268. doi: 10.1109/ICoICT.2018.8528794.

[44] C. W. Royer and S. J. Wright, “Complexity Analysis of Second-Order Line-Search Algorithms for Smooth Nonconvex Optimization,” SIAM Journal on Optimization, vol. 28, no. 2, pp. 1448–1477, Jan. 2018, doi: 10.1137/17M1134329.

[45] J. I. Munro, “On the Competitiveness of Linear Search,” 2000, pp. 338–345. doi: 10.1007/3-540-45253-2_31.

[46] F. M. auf der Heide, “Nondeterministic versus probabilistic linear search algorithms,” in 26th Annual Symposium on Foundations of Computer Science (sfcs 1985), IEEE, 1985, pp. 65–73. doi: 10.1109/SFCS.1985.38.

[47] R. A. Waltz, J. L. Morales, J. Nocedal, and D. Orban, “An interior algorithm for nonlinear optimization that combines line search and trust region steps,” Math Program, vol. 107, no. 3, pp. 391–408, Jul. 2006, doi: 10.1007/s10107-004-0560-5.

[48] J. J. Moré and D. J. Thuente, “Line search algorithms with guaranteed sufficient decrease,” ACM Transactions on Mathematical Software, vol. 20, no. 3, pp. 286–307, Sep. 1994, doi: 10.1145/192115.192132.

[49] W. Wang, X. Wang, and A. Zhou, “Hash-Search: An Efficient SLCA-Based Keyword Search Algorithm on XML Documents,” 2009, pp. 496–510. doi: 10.1007/978-3-642-00887-0_44.

[50] A. Martelli, “On the complexity of admissible search algorithms,” Artif Intell, vol. 8, no. 1, pp. 1–13, Feb. 1977, doi: 10.1016/0004-3702(77)90002-9.
