AIMRS
Author
Nashit 19-EE-187
Supervisor
      AI Model for Recognition and
Synchronization of Quranic Text with Audio
Author
Nashit 19-EE-187
Supervisor:
July 2023
                                    Undertaking
We certify that the research work titled “AI Model for Recognition and Synchronization of Quranic Text with Audio” is our own work. The work has not been presented elsewhere for assessment. Where material has been used from other sources, it has been properly acknowledged / referenced.
_______________________
19-EE-96
_______________________
19-EE-114
______________________
Nashit
19-EE-187
                                    DEDICATION
We, with immense gratitude and heartfelt appreciation, dedicate this thesis to the
individuals who have played a pivotal role in making our academic journey a reality. Their
unwavering support, encouragement, and inspiration have collectively shaped us into the individuals we are today.
First and foremost, we express our deepest thanks to our supervisor Dr. Qamas Gul Khan
Safi, for his exceptional guidance and mentorship throughout this research endeavor. His
expertise, patience, and invaluable feedback have greatly influenced the outcome of this
thesis.
We are also indebted to our families, whose love, understanding, and constant
encouragement have been the bedrock of support in every step of our education. Their
unwavering belief in our abilities has fueled our determination to pursue excellence.
This thesis stands as a testament to the collective efforts of these remarkable individuals,
and we are humbled to have had the opportunity to learn from and be supported by each
one of them.
To our friends and classmates, we extend our sincere appreciation for sharing this journey
with us. The camaraderie, stimulating discussions, and late-night study sessions have been an invaluable part of this experience.
Lastly, we wish to acknowledge the countless researchers, scholars, and authors whose
works have laid the foundation for our study. The wealth of knowledge they have shared has guided and enriched our work.
                                          Abstract
Nashit 19-EE-187
Creating Qur'anic video content often involves tedious manual editing, posing challenges for individuals and organizations alike. This project responds to this challenge by introducing an innovative AI model that not only
streamlines the content creation process but also maintains the sanctity and accuracy of the
Qur'anic recitation.
By harnessing the power of Automatic Speech Recognition (ASR) technology, the model is able to predict the recited surahs and verses and seamlessly retrieve the corresponding text. The integration of precise timestamps derived from the ASR stream further enhances the authenticity and synchronization of the generated content. A unique feature of the model lies in its ability to overlay the retrieved text onto a user-selected background, drastically reducing the time and effort traditionally associated with manual Qur'anic content generation. It not only relieves users of the burdensome tasks that have long characterized this domain but also paves the way for a new era of accelerated content creation methods.
The implications of this innovation extend beyond its immediate applications. The project sets a precedent for how technology can play a constructive role in preserving and sharing sacred texts. The model's success opens the door to further possibilities in content creation, encouraging researchers and practitioners to explore novel ways of enhancing spiritual experiences through thoughtful use of modern technology.
                              ACKNOWLEDGEMENTS
All the praises, thanks and acknowledgements are for the creator, Allah Almighty, whose
blessings are unlimited, the most beneficent, the most merciful, who gave us strength and
enabled us to complete this task. Countless salutations upon the Holy Prophet (PBUH), his
family and his companions, source of knowledge for enlightening with the essence of faith
in Allah and guiding mankind to the true path of life. We would like to thank our project
supervisor, Dr. Qamas Gul Khan Safi, for the unwavering support and guidance throughout
the duration of this project and having empowered us to undertake and persist on this
journey. His constant encouragement has been a driving force, keeping us motivated every
step of the way. Likewise, Zikra Dar-ul-Nashr truly deserves immense praise for
facilitating our journey towards this innovative project. We would also like to express
special gratitude to our parents for providing us with all the conveniences and supporting
us morally and financially for this project. This project will serve as motivation for our future endeavors.
                                              TABLE OF CONTENTS
Abstract ............................................................................................................................... v
DEFINITIONS .................................................................................................................. xv
   2.4.1.2.          Deep learning ASR algorithms ............................................................. 25
2.4.2.5. Speechmatics......................................................................................... 27
      2.6.7.      Levenshtein Distance (Edit Distance) Algorithm ......................................... 43
        3.5.3.1.          Explanation ........................................................................................... 84
5. Chapter 5: Conclusion.............................................................................................. 98
                                                 LIST OF FIGURES
Figure 10: Words appear individually, progressing sequentially upon pressing of the ‘next’
Figure 12: Surah Verses with their Translations appearing one after another on pressing the
Figure 20: Time Comparison between Traditional Method and Proposed Method .......... 92
Figure 21: Video Word by Word ........................................................................................ 93
Figure 23: Code Allows the selection of 2 languages for Translation .............................. 95
Figure 24: Full Ayahs with Translation in English and Urdu ........................................... 96
                                                  LIST OF TABLES
Table 3: Problem Letters (ة and أ in this case) cause mismatching (highlighted in red) ... 51
Table 4: Problem Characters (صلے in this case) cause mismatching (highlighted in red) .. 51
Table 8: Exclusion chars combined with Previous word for Correct Mapping ............... 60
                                     DEFINITIONS
Within the thesis context, the following terms are introduced in the main text with distinctive marking on their first occurrence. This guides the reader to reference this section for a detailed explanation.
Artificial Intelligence (AI): refers to the ability of computer systems to emulate human-like intelligence, enabling them to learn from experience, reason, and make decisions.
Speech to text (STT): is a technology that enables the automatic conversion of spoken
language into written text, facilitating easy transcription of audio or speech recordings.
Hidden Markov Model (HMM): is a sequential probabilistic speech model utilized for
speech recognition and synthesis, effectively predicting acoustic and linguistic features.
Qur'anic content creation: refers to producing videos of the Holy Qur'an, to disseminate its teachings and spiritual essence to a broader audience.
Arabic Phonology: analyzes Arabic speech sounds (phonemes), exploring their distinct
features, functions, and organizational rules within the language's sound system.
Arabic Phonetics: studies the physical aspects of speech sounds—production,
articulation, and acoustics—and how they are generated and perceived by the human vocal and auditory systems.
Intonation: refers to the rise and fall of pitch in spoken language, conveying emotions, emphasis, and meaning beyond the literal words.
Coarticulation: describes how adjacent speech sounds blend and overlap to facilitate smooth and efficient communication in natural speech.
Tawwuz (تَعَوُّذ): is the act of seeking refuge and protection in Allah from the influence and whispers of the devil.
Tasmiyah (تَسْمِيَة): involves invoking the name of Allah before beginning an action or consuming something.
Zikra Dar-ul-Nashr: Zikra (ذِكْرَى) is a name of the Quran, and it is the advice that inspires people to acquire real knowledge. The main difference between humans and all other creatures is that humans have a wide field of real and acquired knowledge apart from the physical sciences, and this knowledge is what nurtures humanity in them. Zikra, then, is the name of an effort to develop humanity among humans. This true knowledge is available to humans in the form of the Holy Qur'an on behalf of Allah. Since it is in Arabic, we need the translation of the Qur'an to understand it. In this regard, Zikra Dar-ul-Nashr has strived to make people's access to the understanding of the Holy Qur'an easier.
                              ABBREVIATIONS
STT: Speech-to-Text
                        1. Chapter 1: Introduction
Traditionally, producing a Qur'anic recitation video is a lengthy manual process: it starts with the recording of audio, importing it into an editing software (e.g., Adobe Premiere Pro), importing the desired background video(s), and, the truly hectic part, adding Arabic subtitles (Qur'anic text) verse by verse (big verses are split into parts). These subtitles are then synchronized manually with the audio, which is extremely time-consuming.
The proposed senior design project aims to revolutionize Qur'anic content creation by introducing an AI-based Recognition and Synchronization model. The demand for Qur'anic content creation has never been greater, prompting the need for an automated approach to ease this tiresome process.
ASR and transcription of continuous speech is difficult due to coarticulation, and the sensitive nature of the Qur'anic text leaves no room for error or mistake in the transcription. Moreover, ASR recognizes Qur'anic phonetics with only 80 to 90% accuracy. Additionally, retrieving the required data from the Qur'anic database posed certain challenges (explained in section 3.3). Furthermore, the issue of synchronizing the Qur'anic transcript with audio arises, specifically concerning the accurate appearance of subtitles (burnt-in text) in complete sync with the speaker's recitation. Therefore, the primary focus of this work is to develop a comprehensive model that addresses all of these challenges.
1.2.   Aims and Objectives
4. Automatic subtitling.
5. Easing and Lowering the Barrier of Entry for Qur’anic Content Creation
                    2. Chapter 2: Literature Review
A typical Qur'anic recitation video combines the recited audio with a complementary background video. These videos also include Qur'anic subtitles that appear synchronously with the recitation, displaying the verses and their translations (optional). Presented below is an illustrative walkthrough of how such a video is produced manually.
First, the audio is imported into an editing software (in this case, Adobe Premiere Pro).
                       Figure 2: Audio imported in an editing software
                                Figure 3: Background video file imported
Now each verse is manually aligned to the audio. First, يسٓ is matched to the respective portion of the audio.
                                Figure 4: Adding Text layers
Figure 5: Adding translation layers
Following this procedure takes around 5 to 6 hours for surahs of moderate length, such as Surah Al-Mulk, Surah Yaseen, Surah Ibraheem, and other similar ones.
Presently, we manage two dedicated (non-public) websites that serve as platforms for integrating our model. These websites have been developed as part of our ongoing efforts in this domain. On the first website, Qur'anic text is inserted in the first field, and the Juz' number is selected from the dropdown menu.
Figure 7: Adding text to site input box
The algorithm identifies the spaces between the words, splits the text on them, and places the resulting words in an array, so that each time the next arrow key is pressed, the next word appears sequentially.
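The idea can be illustrated with a small sketch. The snippet below is a Python approximation only (the actual site is implemented as a web page, and the names used here are hypothetical), showing how splitting on spaces yields an ordered array of words that is advanced one element at a time:

    # Illustrative sketch only: the production site is a web page; names are hypothetical.
    verse = "ذَٰلِكَ الْكِتَابُ لَا رَيْبَ فِيهِ"

    words = verse.split(" ")   # spaces delimit the words, giving an ordered array
    position = 0               # index of the next word to reveal

    def on_next_pressed():
        """Reveal one more word each time the 'next' arrow key is pressed."""
        global position
        if position < len(words):
            print(" ".join(words[:position + 1]))  # everything revealed so far
            position += 1

    # Pressing 'next' three times reveals the first three words in sequence.
    for _ in range(3):
        on_next_pressed()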
Now, on pressing the next button (the arrow that points "next" with respect to the Arabic script), the text begins to appear word by word.
Pressing the next button again,
Again,
Again,
Figure 10: Words appear individually, progressing sequentially upon pressing of the ‘next’ arrow key
The objective of this site is to record videos or livestream. A reciter recites the Quran and a person glued to the laptop diligently presses the next button as the Qari proceeds. Visibly, this is a laborious process.
The second website has additional features like background videos and translation (in one language as yet) along with the Quranic text. Instead of individual appearance of words, a part of a big verse or a complete small verse is displayed, and the next part or verse appears on pressing the next arrow key. Qur'anic text is entered in the first field and its translation is entered in the second field.
On pressing the next button,
Again,
Figure 12: Surah Verses with their Translations appearing one after another on pressing the next arrow key
2.3.   Art of Qur’anic Recitation
In the fast-paced world of modern technology, there have been huge advancements in the field of Artificial Intelligence. ASR, also known as computer speech recognition or speech-to-text (STT), incorporates knowledge and research in the computer science, linguistics, NLP,
and computer engineering fields [1]–[3]. ASR has been envisioned as the future dominant
method for human-computer interaction. Both acoustic modeling and language modeling
are important parts of modern statistically based speech recognition algorithms [4], [5].
Hidden Markov models (HMMs) are also widely used in many systems. Language
modeling is also used in many other natural language processing applications, such as machine translation and document classification.
The Qur'an is written in the language of classical Arabic. The entire Qur'an is a rich source of specific vocabulary with a considerable number of Arabic words [7]–[9]. These
Arabic words comprise the 28 (29 with the hamza) letters of the Arabic alphabet. In the Arabic language, diacritics (I'raab, إِعْرَاب) are symbols added to letters. In Arabic, they indicate short vowels and the absence of a vowel (sukoon) [10]. Diacritization is the process of adding those diacritics to Arabic script so that words are pronounced correctly, because the Arabic language is characterized by the fact that the same sequence of letters can be read in several ways. The meaning is determined by the diacritics (Tashkeel, تَشْكِيل), which are represented by the difference in pronunciation [8], [13]. The main diacritic marks include:
    •   Damma (ـُ)
    •   Fatha (ـَ)
    •   Kasra (ـِ)
    •   Tanween al-fath (ـً)
    •   Tanween al-kasr (ـٍ)
    •   Madd (ـٓ)
Hence, the Arabic text can be written in more than one form, depending on the number of
symbols and diacritic marks that are used. The Qur'an is read according to the rules of recitation and the provisions of intonation, in the correct way and at a moderate speed. Tajweed refers to a group of rules that governs the recitation of The Holy Qur'an. Therefore, it is regarded as an art due to the performance of the reciters in a similar form. The Tajweed art refers to the rules, flexible yet well-defined, for reciting the Holy Qur'an.
Tajweed is the science concerned with applying the rules of pronunciation to the letters during the recitation of the Qur'an, within specific provisions that include the following:
   •   Izhaar (إِظْهَار): an Arabic term used in the field of Tajweed which refers to the clear and distinct pronunciation of Arabic letters and sounds. In the context of Quranic recitation, Izhaar signifies articulating each letter with clarity and emphasis.
   •   Idgham (إِدْغَام): a rule in which a non-vowel sound (consonant) is smoothly and gradually merged into the following letter when certain letters interact within specific word combinations. This phonetic change contributes to the accurate and melodious recitation of the Quranic text.
   •   Iqlab (إِقْلَاب): refers to the transformation of the ن (noon) sound into a م (meem) sound when followed by the letter ب (ba). This phonetic change contributes to the accurate and melodious recitation of the Quranic text, maintaining the proper flow of sound.
   •   Ikhfa (إِخْفَاء): In the realm of Quranic recitation, Ikhfa is a vital Tajweed technique where certain letters with specific diacritics are softly and smoothly concealed (lightly nasalized) during the recitation. This practice ensures a balanced and melodious rendition of the verses.
   •   Qalqalah (قَلْقَلَة): refers to the echoing, bouncing sound produced by certain Arabic letters when they carry a sukun (no vowel). These letters, namely ق (qaf), ط (ṭa), ب (ba), ج (jim), and د (dal), produce a distinctive sound that adds a rhythmic quality to the recitation. Qalqalah embodies the art of delivering the Quranic verses with precise phonetic resonance and melodic resonance, enriching the recitation.
   •   Waqf (وَقْف): refers to stopping or pausing while reciting the Quran or during formal speech to observe the appropriate punctuation and rules of recitation. Waqf helps in maintaining the meaning of the verses and is an essential aspect of proper Tajweed, enhancing the beauty, clarity, and correct delivery of the recitation.
   •   Huroof al-Halaqi (حُرُوف الْحَلْق): the throat letters, which require a distinct sound from the throat during their pronunciation. The letters in Huroof al-Halaqi are:
o ح (ḥā)
o خ (khā)
o ع ('ayn)
o غ (ghayn)
o همزة (hamza)
o هـ (hā)
These letters have a distinct and guttural pronunciation involving the constriction of
the throat while producing the sound [3], [9], [10], [18], [19].
   •   Tarteel (تَرْتِيل): It presents a captivating journey into the realm of Quranic recitation,
akin to a serene river of melodies that courses through the heart, arousing the
deepest of emotions and forging a spiritual bond with the divine. This practice allows recitation to emerge as an art form that invites one to plunge into the ocean of verses.
As the voyage into Tarteel unfolds, each syllable transforms into a note, and every
word morphs into a melodious strand. The heart, much like a skilled conductor,
orchestrates the tempo of each verse in harmony with the very rhythm of creation.
These Quranic words become steadfast companions, igniting within the heart a
       symphony of emotions—reverence, jubilation, and an overwhelming sense of
connection.
Envision the heart as a vessel brimming with the brilliance of the Quran. During
recitation, even the pores of one's being seemed to resonate in perfect unity, attuned
to the resonance of the divine message. The verses envelop and embrace, cradling
the soul within their profound wisdom. The reciter metamorphoses into a vessel of
light, radiating the beauty of the Quran's teachings to those in their midst [20]. Tarteel becomes a melody that resonates within the heart. The Quranic verses wrap themselves around one's
very essence, immersing the reciter in a profound dialogue with the divine. With
each utterance, the heart dances in elation, casting off the burdens of the world.
Tarteel acts as a bridge, harmonizing the heart with the Quran. It charts a melodious path towards the Creator. Within each verse, a treasury of wisdom lies waiting to be uncovered. The
heart, aflutter with emotions, discovers solace, inspiration, and unbounded joy
within the embrace of the Quran's timeless and resounding message [10], [21], [22].
Neglecting any of the intonation rules is considered a mistake that does not change the meaning, while a second type of mistake is one that probably changes the meaning. This error is considered more important than the mistake of intonation and must be corrected [23]. The recitation of the Qur'an is organized into different readings, and the differences between them are related to changing some of the diacritics.
Notably, research on recognizing the reciters of the Holy Qur'an lacks common datasets. Many speech recognition researchers have carried out their examinations on datasets of the English language [14], [27]. However, only a few studies have been made in
the language of Arabic. There is a limitation on the Arabic speech recognition systems
compared with speech recognition systems of other languages. The recognition of the
Qur’an reciter is a complex task as it is done with the way of Tajweed. Every Qur’anic
reciter has a differentiating signal. This reciter signal encapsulates a dynamic range, which
is temporary and altered over time based on the pronunciation basis and the recitation way
called Tarteel ()ترتيل. Classifying and recognizing the reciters of the holy Qur’an are
regarded as parts of the field of speech and voice recognition systems [8].
The voice of every person is unique. As a result, the audios of the Qur’an recited by many
reciters differ significantly from one another. Even if a sentence is fetched from the same
place of the Holy Qur’an, the delivering or recitation way of the Holy Qur’an is not the
same [13], [16]. Various sounds or slices are created by various reciters. Many challenges
are encountered while reciting the Holy Qur’an specialties due to the variations between
the Holy Qur’an recitation and the written language [27]. Many similar letter combinations are pronounced in different ways because of the utilization of diacritics such as fatha, damma, and kasra.
Mastering correct Arabic pronunciation, especially in the context of the Quran, is crucial
for effective communication and deepening one's understanding of the sacred text. Proper
pronunciation of Arabic in the Quran ensures clarity and reverence enabling a profound
connection with its divine message. It allows for a more meaningful recitation and
engagement with the Quranic verses, fostering a spiritual and linguistic connection with the sacred text.
ASR algorithms enable computers to recognize and transcribe spoken language into text. These algorithms are complex and
involve several components, including an acoustic model, a language model, and a decoder
[29], [30].
The acoustic model analyzes the audio signal and converts it into a series of acoustic feature
vectors [2]. The language model then determines the most likely sequence of words that
corresponds to the acoustic feature vectors, based on statistical models and machine
learning algorithms. Finally, the decoder combines the output of the acoustic and language
models to generate the final transcription of the speech signal [11], [30].
ASR algorithms are used in a wide variety of applications, including speech recognition
for virtual assistants, automated transcription of audio and video content, and real-time captioning of live speech.
2.4.1. ASR Algorithms
• Acoustic model that takes the spectrograms as input and outputs a matrix of probabilities over characters or words for each time step.
• Punctuation and capitalization model that formats the generated text for easier readability.
Traditional statistical techniques for speech recognition involve using probabilistic models
to represent the relationships between speech sounds, words, and language. These models
are trained on large amounts of speech data and can recognize speech by calculating the
probability that a given sequence of sounds corresponds to a particular word or phrase. One widely used approach is Hidden Markov Models (HMMs), which model the probability of a sequence of observations (in this case, speech
sounds) given a hidden state sequence (in this case, the corresponding sequence of words).
[3]
2.4.1.2. Deep learning ASR algorithms
In recent years, deep learning techniques such as neural networks have been applied to
speech recognition with great success. In these systems, the acoustic properties of speech
signals are first extracted and transformed into a sequence of feature vectors, which are
then processed by one or more layers of neural networks to produce the final transcription.
These deep learning techniques have been shown to significantly improve the accuracy of speech recognition. Examples of such acoustic model architectures are QuartzNet, Citrinet, and Conformer [31], [32]. In a typical speech recognition
pipeline, you can choose and switch any acoustic model that you want based on your use case.
A typical deep learning pipeline for speech recognition includes the following components:
• Data pre-processing
                      Figure 13. Deep learning speech recognition pipeline
There are many ASR (Automatic Speech Recognition) service providers that offer cloud-
based APIs and SDKs for developers to integrate speech recognition capabilities into their applications. Some widely used providers are described below.
2.4.2.1. Google Cloud Speech-to-Text
Google Cloud Speech-to-Text is a cloud-based ASR service that supports real-time and
batch transcription of audio files. It offers high accuracy and supports a wide range of
languages and dialects. The service also provides features such as speaker diarization and
punctuation.[34]
2.4.2.2. Microsoft Azure Speech Services
Microsoft Azure Speech Services provides cloud-based ASR capabilities for developers. It
supports real-time and batch transcription of audio files and offers features such as speaker identification and customization options.
2.4.2.3. Amazon Transcribe
Amazon Transcribe is a cloud-based ASR service offered by Amazon Web Services (AWS). It can transcribe audio and video files into text and supports a wide range of
languages and accents. It also provides features such as speaker identification and custom
vocabulary [35].
2.4.2.4. IBM Watson Speech to Text
IBM Watson Speech to Text is a cloud-based ASR service that supports real-time and batch
transcription of audio files. It offers high accuracy and supports a wide range of languages
and dialects. The service also provides features such as speaker diarization and
customization options.[35][36]
2.4.2.5. Speechmatics
Speechmatics is a cloud-based ASR service that supports real-time and batch transcription
of audio and video files. It offers high accuracy and supports a wide range of languages
and dialects. The service also provides features such as speaker identification and customization options.
2.4.3. Usage of Cloud Speech to Text:
Speech-to-Text has three main methods to perform speech recognition. These are listed
below:
• Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous requests are suited to short audio clips.
• Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for longer audio data.
• Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
 {
     "config": {
         "encoding": "LINEAR16",
         "sampleRateHertz": 16000,
         "languageCode": "en-US"
     },
     "audio": {
         "uri": "gs://bucket-name/path_to_audio_file"
     }
 }
A recognition request body is shown above. The config field of the request contains the following sub-fields:
• encoding (required): specifies the encoding scheme of the provided audio, such as FLAC or LINEAR16. Lossless encodings are preferred, since compression can affect recognition performance.
• sampleRateHertz (required): specifies the sample rate (in Hertz) of the audio,
determining the number of samples per second in the audio file. It must match the
actual sample rate of the audio. For Speech-to-Text, the supported range is between
8000 Hz and 48000 Hz. In the case of FLAC or WAV files, you can alternatively
specify the sample rate in the file header. To achieve optimal speech recognition results, capture the audio at 16000 Hz or higher where possible; re-sampling audio that has already been recorded at a different sample rate, especially legacy telephony audio at 8000 Hz, should be avoided. Instead, provide the audio at its native sample rate.
• languageCode (required): specifies the language to use for speech recognition. It follows the BCP-47 identifier format and helps the API understand
       the language being spoken. Since we are recognizing Arabic, we used ar-SA for
ASR.
• speechContexts (optional): provides additional hints for processing the audio, including word or phrase boosts and hints for the speech recognition task.
Audio can be provided to the Speech-to-Text API through the audio parameter, which supports two sub-fields:
• content: contains the audio content embedded directly within the request. This method has strict size limits and is suitable only for short audio.
• uri: contains a URI pointing to the audio content, such as a Google Cloud Storage URI (gs://bucket-name/path). The referenced file must be accessible and must not be gzip-compressed.
These parameters and options allow for customization and control over the speech recognition process.
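For reference, the same request can be issued from Python with the google-cloud-speech client library. The following is a minimal sketch of the synchronous method described above; the bucket path, sample rate, and encoding are placeholder values that must match the actual audio:

    from google.cloud import speech

    def synchronous_transcribe(gcs_uri):
        # Placeholder settings; they must match the real audio file.
        client = speech.SpeechClient()
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="ar-SA",   # Arabic, as used in this project
        )
        audio = speech.RecognitionAudio(uri=gcs_uri)

        # Synchronous recognition: blocks until all audio has been processed.
        response = client.recognize(config=config, audio=audio)
        for result in response.results:
            print(result.alternatives[0].transcript)

    # Example (hypothetical bucket path):
    # synchronous_transcribe("gs://bucket-name/path_to_audio_file")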
2.4.3.1. Time Offsets (Timestamps)
Speech-to-Text can also include time offset values (timestamps) for the beginning and end
of each spoken word that is recognized in the supplied audio. A time offset value represents
the amount of time that has elapsed from the beginning of the audio, in increments of
100ms.
Time offsets are especially useful for analyzing longer audio files, where you may need to
search for a particular word in the recognized text and locate it (seek) in the original audio.
Time offsets are supported for all the recognition methods: recognize, streamingrecognize, and longrunningrecognize.
To include time offsets in the results of your request, set the enableWordTimeOffsets field to true in the request configuration, for example:
   {
   "config": {
     "languageCode": "en-US",
     "enableWordTimeOffsets": true
     },
   "audio":{
     "uri":"gs://gcs-test-data/gettysburg.flac"
     }
   }
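When enableWordTimeOffsets is set, each alternative in the response carries a list of words with start and end offsets. The sketch below shows how they can be read in Python; it assumes the current (v2) google-cloud-speech client, where the offsets are returned as timedelta objects:

    def print_word_offsets(response):
        # Each result corresponds to a consecutive portion of the audio.
        for result in response.results:
            alternative = result.alternatives[0]
            for word_info in alternative.words:
                # start_time / end_time are offsets from the beginning of the audio
                # (timedelta objects in the v2 client library).
                start = word_info.start_time.total_seconds()
                end = word_info.end_time.total_seconds()
                print(f"{word_info.word}\t{start:.1f}s\t{end:.1f}s")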
2.4.3.2. Encoding
Please note that an audio format should not be confused with an audio encoding. For
example, the .WAV file format defines the structure of the header in an audio file but does
not specify the encoding itself. The actual audio encoding used in a .WAV file can vary, so it should be checked and specified explicitly in the request.
On the other hand, FLAC serves as both a file format and an encoding, which can
sometimes cause confusion. In the case of FLAC files, the sample rate must be included in
the FLAC header for proper submission to the Speech-to-Text API. It's important to note that FLAC here denotes the specific codec, while the term "FLAC file format" refers to the file container that carries it. Remember to verify the header and encoding of your audio files to ensure compatibility with the Speech-to-Text API.
The Speech-to-Text API supports a number of different encodings. The following list summarizes some of the supported codecs:
• FLAC (Free Lossless Audio Codec): lossless; 16-bit or 24-bit required for streams.
• AMR (Adaptive Multi-Rate Narrowband): lossy; sample rate must be 8000 Hz.
• OGG_OPUS (Opus encoded audio frames in an Ogg container): lossy; sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz.
• SPEEX_WITH_HEADER_BYTE (Speex wideband): lossy; sample rate must be 16000 Hz.
Speech accuracy can be assessed using various metrics, depending on the specific
requirements. However, the widely accepted standard for comparison is the Word Error
Rate (WER). WER measures the percentage of incorrect word transcriptions within a given
set, serving as an indicator of accuracy. A lower WER value implies a higher level of transcription accuracy.
In the context of ASR accuracy assessment, the term "ground truth" refers to the 100% accurate, human-verified transcription that is used as the reference to compare and measure the accuracy of the ASR system under evaluation.
The Word Error Rate (WER) calculation accounts for three types of transcription errors:
• Insertion errors (I): words present in the hypothesis transcript that aren't present in the ground truth.
• Substitution errors (S): words that are present in both the hypothesis and ground truth but are transcribed incorrectly.
• Deletion errors (D): words that are missing from the hypothesis but present in the ground truth.
WER = (S + D + I) / N                               (2.1)
To find the WER, add the total number of each one of these errors, and divide by the total
number of words (N) in the ground truth transcript. The WER can be greater than 100% in
situations with very low accuracy, for example, when a large amount of new text is inserted.
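To make the metric concrete, the sketch below computes WER with a standard word-level edit-distance alignment. This is a generic implementation for illustration, not a routine taken from the project; any implementation that counts substitutions, deletions, and insertions behaves the same way:

    def word_error_rate(ground_truth, hypothesis):
        """WER = (S + D + I) / N, computed via word-level edit distance."""
        ref = ground_truth.split()
        hyp = hypothesis.split()

        # d[i][j] = minimum number of edits to turn the first i reference
        # words into the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                       # deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                       # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1    # substitution cost
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # match / substitution
        return d[len(ref)][len(hyp)] / len(ref)

    # Example: one substituted word out of four gives WER = 0.25
    print(word_error_rate("بسم الله الرحمن الرحيم", "بسم الله الرحمن الرحىم"))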
It is crucial to differentiate the WER from the confidence score as they are independent
metrics that do not necessarily correlate. A confidence score is based on likelihood, while
the WER is based on the accuracy of word identification. Grammatical errors, even if
minor, can lead to a high WER if words are not correctly identified. Conversely, a high
confidence score can be attributed to frequently occurring words that are more likely to be
transcribed correctly by the ASR system. Therefore, the confidence score and WER are not directly comparable.
• Normalisation
Normalization is an essential step in calculating the WER metric. Both the machine
transcription and the human-provided ground truth transcription are normalized before comparison; typically, this involves removing punctuation and ignoring capitalization when comparing the machine transcription with the human-provided ground truth [38].
The model adaptation feature can be leveraged in Speech-to-Text systems to enhance the
recognition of specific words or phrases by biasing the system towards those preferred
options over other alternatives. This feature is especially beneficial in several use cases, such as the following:
• Enhancing the accuracy of frequently occurring words and phrases in the audio data: By employing model adaptation, users can train the recognition model to favor those words over similar-sounding alternatives. For example, if the word "weather" frequently appears in the audio data, the model can be adapted to prefer "weather" over "whether".
   •   Expanding the vocabulary recognized by Speech-to-Text: Although Speech-to-Text
already encompasses an extensive vocabulary, it may not include certain words that
are specific to particular domains or rare in general language usage, such as proper
names. Model adaptation allows the addition of such words to the system's vocabulary.
• Improving transcription accuracy for noisy or unclear audio: In scenarios where the provided audio is noisy or lacks clarity, model adaptation can bias recognition towards the expected vocabulary; by accounting for the characteristics of the audio, the system can better handle challenging conditions and maintain transcription quality in the presence of audio challenges.
These applications can greatly contribute to the overall performance and usability of the system [33], [38], [39], [29].
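In the Google Cloud client library, this biasing is exposed through speech contexts (phrase hints). The snippet below is an illustrative sketch only; the phrases and boost value are placeholders and would need to be tuned for Qur'anic vocabulary:

    from google.cloud import speech

    # Illustrative only: phrases and boost are placeholder values.
    speech_context = speech.SpeechContext(
        phrases=["بسم الله الرحمن الرحيم", "الحمد لله رب العالمين"],
        boost=10.0,   # a positive boost increases the likelihood of these phrases
    )

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=44100,
        language_code="ar-SA",
        speech_contexts=[speech_context],   # bias recognition towards the phrases
    )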
2.5. Databases
A database is an organized collection of data, designed to be easily accessed, managed, and updated. It is used to store, organize, and retrieve data for a wide range of applications, from simple data management tasks to complex, large-scale systems.
Databases are made up of tables, which are organized into rows and columns, with each
row representing a unique record and each column representing a data field. The data stored
in a database can be accessed and manipulated using Structured Query Language (SQL), a standardized language for defining and querying relational data. Databases can be classified into several types:
1. Relational databases: Relational databases are the most common type of database
and are based on the relational model of data, which organizes data into tables with rows and columns linked by keys.
2. NoSQL databases: NoSQL databases are non-relational databases that are designed to handle large volumes of unstructured or semi-structured data.
3. Object-oriented databases: Object-oriented databases follow the object-oriented programming model and store data as objects, which encapsulate both data and behaviour. An example is ObjectStore.
Databases are an essential component of many modern applications and are used in a wide
range of industries, including finance, healthcare, retail, and manufacturing. The effective
management and use of databases can provide organizations with valuable insights and
efficiencies, making them an important asset for businesses of all sizes [4][40], [41].
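As a small illustration of relational storage and SQL access (this is not the database actually used in this project, which is described in the next section), verses could be kept in a table and queried with Python's built-in sqlite3 module; the schema below is hypothetical:

    import sqlite3

    # Illustrative schema only; the project itself uses an Excel-based database.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE ayah (
            surah_no    INTEGER,
            ayah_no     INTEGER,
            arabic_text TEXT
        )
    """)
    conn.execute("INSERT INTO ayah VALUES (1, 1, 'بسم الله الرحمن الرحيم')")

    # SQL lets us retrieve rows by column values.
    rows = conn.execute(
        "SELECT surah_no, ayah_no, arabic_text FROM ayah WHERE surah_no = ?", (1,)
    ).fetchall()
    print(rows)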
2.5.1. Qur’anic Databases
A Qur'anic database is a digital repository that stores the text of the Quran and related information, such as translations and recitations. These databases are used to facilitate the study and analysis of the Quran by researchers, scholars, and students.
Such databases typically provide features such as search capabilities, annotation tools, and the ability to compare and analyze different versions of the text. They can also incorporate multimedia elements, such as audio recordings and videos of recitations.
Qur'anic databases play an important role in advancing the study and understanding of the
Quran, by providing researchers and scholars with easy access to the text and related
materials, as well as advanced analytical tools to aid in their research [5]. Following are the databases that were considered:
It is an initiative that provides the Quranic text in various languages, including Arabic,
English, French, and Urdu. The project also includes a web-based Quranic reader that
allows users to view and search the text and includes features such as highlighting,
2.5.1.2. The Holy Quran – Zeeshan Usmani
This dataset has been put together to aid data scientists in running their NLP algorithms and Kernels to find and explore the sacred text by themselves. The data contains the complete Holy Quran in 21 languages (so data scientists from different parts of the world can work with it). The first file is the original Arabic text; the other files are its translations in the remaining languages.
We have used this database in our project since the data is represented in MS Excel, making it easier to visualize and work with. Also, the database capabilities of Excel are very powerful: not only can Excel hold a simple searchable database, it can also be made to behave like a proper relational database. A relational database consists of a master table that links with its slave tables, which are also known as child tables. Hence, for the aforementioned reasons, this database was selected for the project.
2.6.     Searching Algorithms
                                 Figure 14: Big-O Complexity
2.6.1. Linear Search
A linear search algorithm has a worst-case time complexity of O(n), where 'n' represents
the number of elements in the dataset. It sequentially examines each element until a match
is found [44]–[48].
2.6.2. Binary Search
Binary search is an efficient algorithm used for sorted arrays, with a worst-case time complexity of O(log n). It repeatedly divides the dataset in half to narrow down the search range until the desired element is found. In our specific application, however, sorting the data disturbs the index order, so the words do not remain in their original sequence [49].
2.6.3. Hashing
Hashing is a fast-searching algorithm that involves generating a unique hash value for each
word. This hash value serves as an index or identifier, enabling efficient retrieval of the
word. Hashing offers constant-time complexity on average, ensuring rapid word lookup.
In our context, data sorting poses a challenge as it distorts the words [49].
2.6.4. Interpolation Search
Interpolation search is an improvement over binary search for uniformly distributed sorted data. By estimating the position of the desired element using a formula that considers the values of the first and last elements, along with the desired value, interpolation search effectively narrows down the search range. This approach allows for quick search range reduction in specific scenarios [50].
2.6.5. Boyer-Moore Algorithm
The Boyer-Moore algorithm is an efficient string-searching algorithm used to find occurrences of a given pattern within a text. It utilizes two key preprocessing steps,
the "bad character" and "good suffix" rules, to skip unnecessary comparisons and quickly
narrow down the search space. Due to its ability to handle repetitive patterns effectively,
the Boyer-Moore algorithm is a valuable tool for various applications in computer science and text processing.
2.6.6. Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt (KMP) algorithm is similar to the Boyer-Moore algorithm, but it also uses a table to keep track of which characters in the pattern match [50].
The Levenshtein Distance, also known as the Edit Distance, is a string similarity metric that measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
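A minimal character-level implementation of this metric in Python is shown below. It is a generic textbook version for illustration, not the exact routine used later in the project:

    def levenshtein_distance(a, b):
        """Minimum number of single-character insertions, deletions, or
        substitutions needed to turn string a into string b."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[len(b)]

    # Example: 'ايكم' and 'أيكم' differ by a single character substitution.
    print(levenshtein_distance("ايكم", "أيكم"))  # -> 1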
                            3. Chapter 3: Design
3.1. Introduction
The primary objective of this project is to create an innovative model that takes a Quranic
recitation audio as input and generates a video which includes the original audio, a user
selected background video, and synchronized Quranic text in a beautiful font, accurately aligned with the recitation. The main steps of the pipeline include:
   •   Matching of words by matching algorithm.
   •   Outputting an SRT subtitle file,
   •   Generating the video by combining the output SRT with the selected background video.
This process is explained in detail with corresponding code blocks in the subsequent headings; a brief illustrative sketch of the SRT and video step follows.
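As a preview of that step, the sketch below shows how word timestamps can be written in SubRip (SRT) format and how the subtitle file could then be burnt onto a background video with ffmpeg. The file names and example timestamps are placeholders, and the ffmpeg command is shown only as an illustration; the exact procedure used in the project appears in the later sections:

    def to_srt_timestamp(seconds):
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def write_srt(entries, path="output.srt"):
        """entries: list of (start_seconds, end_seconds, text) tuples."""
        with open(path, "w", encoding="utf-8") as f:
            for i, (start, end, text) in enumerate(entries, start=1):
                f.write(f"{i}\n{to_srt_timestamp(start)} --> "
                        f"{to_srt_timestamp(end)}\n{text}\n\n")

    # Hypothetical entries taken from ASR word timestamps:
    write_srt([(13.7, 14.4, "وَمَا"), (14.4, 15.1, "لَنَا")])

    # Burning the subtitles onto a user-selected background video (requires ffmpeg
    # and a real background.mp4; shown here only for illustration):
    # ffmpeg -i background.mp4 -vf subtitles=output.srt output_video.mp4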
3.2.1. Background
The initial challenge in our project lies in transcribing the audio file. We explored two main options:
   1. Designing a specialized ASR model tailored for Qur'anic recognition.
   2. Employing an existing general-purpose Arabic ASR model.
After careful consideration, we chose the latter approach for the following compelling
reasons,
• Developing a custom ASR model requires substantial effort, time, and a trial-and-
error process. It entails addressing various challenges, ranging from the model's architecture and training data to pronunciation-specific recognition errors. Moreover, fixing these issues for some words or letters might introduce new
recognition problems for others. The complexity of this process could divert our
focus from the primary objective of synchronizing Qur'anic text with audio.
• Our research findings indicate that existing custom ASR models primarily focus on
transcribing a limited set of short surahs. Designing a comprehensive model for the
entire Qur'an is a significant undertaking that we have not come across in the
literature. Even developing a small-scale model for specific short surahs requires
considerable effort and a trial-and-error approach, which might steer us away from our main objective.
• Existing general ASR models for Arabic have been trained on vast and diverse
datasets, giving them an edge in terms of efficiency and accuracy. Moreover, these models can return word-level timestamps that indicate the start and end times for each transcribed word, empowering us with the timing information needed for synchronization.
Considering these significant factors, we made the decision to employ an existing simple
Arabic ASR model. The next step was to choose the most suitable model from the array of
available options, such as those offered by Google, Microsoft, Amazon, and others. After
thorough evaluation, we ultimately selected Google's ASR model due to its ease of access, accuracy, and support for word-level timestamps.
3.2.2. Process
      3. Getting ASR Transcript with Start and End times for each word (timestamps),
   import speech_recognition as sr
   import numpy as np
   from google.cloud import speech
   import os
   import pandas as pd
   import csv
gcs_uri = "gs://quran_mulk/audio-files/kahaf.flac"
def transcribe_speech():
    # Configure the recognition audio and settings
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=44100,
        language_code="ar-SA",
        model="default",
        audio_channel_count=2,
        enable_word_time_offsets=True,
    )
    # Run long-running (asynchronous) recognition on the Cloud Storage file
    client = speech.SpeechClient()
    response = client.long_running_recognize(config=config, audio=audio).result()
    # Write each recognized word with its start and end time to asrt.txt
    # (tab-separated columns: index, word, start, end)
    with open("asrt.txt", "w", encoding="utf-8") as file:
        words = [w for result in response.results
                   for w in result.alternatives[0].words]
        for i, word_info in enumerate(words):
            file.write(f"{i}\t{word_info.word}\t"
                       f"{word_info.start_time.total_seconds():.1f}s\t"
                       f"{word_info.end_time.total_seconds():.1f}s\n")
Example output row (index, word, start time, end time):   1   وما   13.7s   14.4s
3.3.1. Background
The process of predicting which part of the Qur'an is recited in the uploaded audio went through several iterations. Initially, we adopted a straightforward approach by comparing the ASR transcript with the vowel-less column of the database. We started by searching for T'awwuz (تعوذ) and Tasmiyah (تسميه) in the code,
checking the first 8 indices (4 for T’awwuz and 4 for Tasmiyah). Based on their presence
or absence, the code made decisions on whether to add either, both, or neither of them.
However, we encountered unexpected discrepancies during the initial testing phase. When
comparing the first six words (indices decided based on the presence of T’awwuz and
Tasmiyah) and the last six words with the entire Qur'anic database, no match was found.
Our thorough inspection of the database, facilitated by its Excel format, revealed that
mismatches occurred due to certain letter differences in the ASR transcript and the database
text. For instance, the ي of the actual Qur'anic text was represented as ى in the ASR transcript, ة was represented as ه, and many more characters differed like that.
Table 3: Problem letters (ة and أ in this case) cause mismatching (highlighted in red)
Qur'anic Database Text (vowel-less)   |   ASR Transcript
والحياة   |   والحياه
ليبلوكم   |   ليبلوكم
أيكم   |   ايكم
أحسن   |   احسن
To resolve this, we replaced the problematic letters in the database with their corresponding forms from the ASR transcript. This adjustment significantly improved the matching accuracy, but we were still faced with unexplained mismatches.
After much effort and perseverance, we identified the second problem - mismatching due
to the presence of additional characters (pause marks), such as صل, ق, and ج, which exist in the Qur'anic text from the database but never appear in the ASR transcript.
Table 4: Problem characters (صلے in this case) cause mismatching (highlighted in red)
Qur'anic Database Text (vowel-less)   |   ASR Transcript
                                                               فسوف
فسوف يعلمون
To overcome this obstacle, we applied further data filtering, designating these characters as 'exclusion characters' and removing them before comparison.
This meticulous process allowed us to fine-tune the Ayah Prediction algorithm and achieve reliable matching.
Through these iterative steps and continuous refinement, we overcame the complexities of
Ayah Prediction, paving the way for accurate and reliable identification of the specific
Surah and Ayah being recited in the uploaded audio. The success of this algorithm
constitutes a crucial milestone in achieving our project's objectives, since the upcoming stages of the pipeline depend on its output.
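The letter substitutions and exclusion-character filtering described above can be summarized in a short sketch. The replacement set shown here is partial and illustrative; the full set used in the project is applied in the code of the following sections:

    # Partial, illustrative normalisation: map database letter forms to the forms
    # produced by the ASR transcript so that word-level comparison succeeds.
    LETTER_MAP = {
        "ي": "ى",   # ya -> alif maqsura, as observed in the ASR output
        "ة": "ه",   # ta marbuta -> ha
        "أ": "ا",   # hamza on alif -> bare alif
    }

    # Pause marks present in the Qur'anic text but never produced by ASR
    # (illustrative subset of the project's exclusion characters).
    EXCLUSION_CHARS = {"صل", "ق", "ج"}

    def normalise(word):
        for src, dst in LETTER_MAP.items():
            word = word.replace(src, dst)
        return word

    def filter_words(words):
        """Drop exclusion characters and normalise the remaining words."""
        return [normalise(w) for w in words if w not in EXCLUSION_CHARS]

    print(filter_words(["أيكم", "صل", "والحياة"]))  # -> ['اىكم', 'والحىاه']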
3.3.2. Process
   4. Reading Database,
5. Searching First six words and Last six words (in ASR Transcript) in the entire
database,
Open the file asrt.txt and replace occurrences of the Arabic letter 'ي' with 'ى'.
   with open("asrt.txt", "r", encoding="utf-8") as file:
      text = file.read()
       updated_text = text.replace('ي', 'ى')
   auzbillah = False
   bismillah = False
Read the lines of the file asrt.txt and store them in a list.
Check if the first line contains the string "اعوذ". If it does, set auzbillah to True.
    if lines[0].split('\t')[1] == "اعوذ":
        auzbillah = True
Check if the sixth line contains the string "بسم". If it does, set bismillah to True.
    if lines[5].split('\t')[1] == "بسم":
        bismillah = True
Next, combine words from the lines list into a single string. If both auzbillah and bismillah are True, words from lines 8 to 14 (inclusive) are appended. If only one of them is True, words from lines 4 to 10 (inclusive) are appended. If neither auzbillah nor bismillah is True, words from lines 0 to 5 (inclusive) are appended.
   combined_words = ""
   if auzbillah and bismillah:
        for i in range(8, 15): # Adjusted the range to include line 14
            combined_words += lines[i].split('\t')[1] + " "
   elif auzbillah or bismillah:
        for i in range(4, 11): # Adjusted the range to include line 10
            combined_words += lines[i].split('\t')[1] + " "
   else:
        for i in range(0, 6):
              combined_words += lines[i].split('\t')[1] + " "
   first_six_words = combined_words.split()[-6:]
Construct the last_six_words string by concatenating words from the last six lines.
   last_six_words = ""
   for i in range(len(lines) - 6, len(lines)):
        last_six_words += lines[i].split('\t')[1] + " "
Read the Excel file DATABASE.xlsx and store the data in a data frame.
df = pd.read_excel('DATABASE.xlsx')
df = df.dropna(subset=['ArabicText'])
Search for the string formed by first_six_words in the 'ArabicText' column and store the resulting rows in result, then search for the string represented by last_six_words in the 'ArabicText' column and store the resulting rows in result2.
    result = df[df['ArabicText'].str.contains(' '.join(first_six_words), na=False)]
    result2 = df[df['ArabicText'].str.contains(last_six_words, na=False)]
If result is not empty, iterate over its rows and print the corresponding 'SurahNo' and
   if not result.empty:
        for index, row in result.iterrows():
            print(f"SurahNo: {row['SurahNo']}, AyahNo: {row['AyahNo']}")
   else:
        print("Not found: Ayah start")
   if not result2.empty:
        for index, row in result2.iterrows():
            print(f"SurahNo: {row['SurahNo']}, AyahNo: {row['AyahNo']}")
   else:
        print("Not found: Ayah end")
Otherwise, a message is printed indicating that no match was found for last_six_words.
This code can be used to process the text file asrt.txt and search the Qur'anic database for the corresponding Surah and Ayah numbers.
3.4.   Database and Matching
3.4.1.1. Background
The quest for a suitable database presented a significant challenge in our project. We
meticulously explored various options, including an SQLite database that had plain text
with each letter, character, and space as an array element. This approach provided granular data but lacked the desired data representation, ease of visualization, and flexibility for
alterations and data fetching. Subsequently, we turned to the API of quran.com, hoping it
would resolve our data representation challenges. However, this too presented certain limitations.
Various other databases were also examined, but unfortunately, none of them met our
requirements, leaving us unsatisfied with the available options. In essence, the databases
we explored failed to offer the ideal data representation, making it challenging to visualize and work with the data.
After extensive research, our quest led us to discover 'The Holy Quran' database by Zeeshan
Usmani on Kaggle. This database proved to be an ideal fit for most of our requirements. It was presented in MS Excel format, featuring two columns for Qur'anic text—one with vowels and another without vowels. Splitting of the text into individual words, enabled by its structure, was straightforward.
Furthermore, we effectively filtered the vowel-less text column for specific characters
(referred to as 'exclusion chars' in subsequent headings). These characters were not present
in the ASR transcript and excluding them was essential for enhancing the efficiency of our
Matching Algorithm.
After this text matching between Arabic Text column (vowel-less) and the ASR Transcript,
the words from the vowel-less column are to be mapped on to the column containing
vowels (called 'Original Arabic' in the database). The Original Arabic column contains both
vowels and the ‘exclusion chars’. When text is split into ‘Word by Word’ form for matching
and mapping, the exclusion chars also get separated and occupy a cell. See the table below:
ٱلْأَمَلُ
[exclusion char]
فَسَوْفَ
This disturbs the mapping of the text with vowels onto the text without vowels:
Original Arabic (split)   |   Arabic Text (vowel-less)
ٱلْأَمَلُ   |   ويلههم
[exclusion char]   |   االمل
فَسَوْفَ   |   فسوف
Thus ويلههم would be mapped onto ٱلْأَمَل, and االمل would be mapped onto the exclusion character (صل). To address this issue, whenever the code encounters an exclusion char during text splitting, it appends it to the
previous word in the Original Arabic column. This way we get correctly mapped output
text containing both vowels and the exclusion chars (which are necessary in Qur’anic text).
Table 8: Exclusion chars combined with the previous word for correct mapping
Exclusion Chars Separate   |   Exclusion Chars Combined with Previous Word   |   Arabic Text (Vowel-less)   |   ASR Transcript
وَيُلْهِهِمُ   |   وَيُلْهِهِمُ   |   ويلههم   |   ويلههم
ٱلْأَمَلُ , [exclusion char]   |   ٱلْأَمَلُ with the exclusion char appended   |   االمل   |   االمل
This seamless identification process facilitated the integration of the Matching Algorithm within the overall pipeline.
3.4.1.2. Explanation with Code
This code is designed to process Quranic verses stored in an Excel file. It allows the user
to confirm the predicted Surah No. and Ayah No. and then filters the data. The code then
performs several operations on the filtered verses, including adding optional strings,
splitting the verses into individual words based on spaces, and incorporating
supplementary data from a separate file. The processed data is organized and stored in a dictionary for the subsequent matching and mapping steps.
  import pandas as pd
  import csv
The code begins by importing the necessary libraries: pandas for data manipulation and csv for reading and writing delimited text files.
  wb = pd.read_excel('DATABASE.xlsx')
  ws = wb
The code loads the data from the ‘DATABASE.xlsx’ Excel file into a pandas DataFrame
called wb and assigns it to ws as well. Essentially, both wb and ws will point to the same
DataFrame.
SurahNo   |   Surah Name   |   AyahNo   |   Original Arabic   |   Arabic Text
1   |   الفاتحة   |   2   |   الرَّحْمَٰنِ الرَّحِيمِ   |   الرحمن الرحىم
1   |   الفاتحة   |   3   |   مَالِكِ يَوْمِ الدِّينِ   |   مالك ىوم الدىن
1   |   الفاتحة   |   4   |   إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ   |   اىاك نعبد واىاك نستعي
The code prompts the user to enter the Surah number (surah_no), the starting Ayah number
(ayah_start), and the ending Ayah number (ayah_end) through standard input (keyboard).
  add_auzbillah = True
  add_bismillah = True
Two boolean variables add_auzbillah and add_bismillah are initialized to True as default
values. These variables are used to control whether the additional strings for Ta'awwuz (أعوذ بالله من الشيطان الرجيم) and Tasmiyah (بسم الله الرحمن الرحيم) should be prepended to the filtered data.
The code gives the user the option to change the default values for add_auzbillah and
add_bismillah. If the user enters ‘F’ for either prompt, the respective variable is set to
False, indicating that the corresponding string should not be added to the start of the filtered
data.
 filtered_data_Orignal_arabic = []
 filtered_data_arabic_text = []
 for row in ws.itertuples():
     if row[5] == surah_no and ayah_start <= row[11] <= ayah_end:
         ayah_no = row[11]  # Get the AyahNo from column K
         filtered_data_Orignal_arabic.append(row[13] + "" + str(ayah_no))
         filtered_data_arabic_text.append(row[14])
The code filters the original Arabic text and Arabic text data based on the provided Surah
and Ayah range. It iterates through the rows of the DataFrame ws using itertuples(). If
the row's Surah number (at index 5) matches surah_no and the Ayah number (at index 11) is within the specified range (indexes 5 and 11 tell the code from which columns to filter the data), the original Arabic text with the Ayah number appended and the vowel-less Arabic text are added to filtered_data_Orignal_arabic and filtered_data_arabic_text, respectively.
 if add_auzbillah and add_bismillah:
     filtered_data_Orignal_arabic.insert(0, "أَعُوذُ بِاللَّهِ مِنَ الشَّيْطَانِ الرَّجِيمِ # بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ %")
     filtered_data_arabic_text.insert(0, "اعوذ بالله من الشيطان الرجيم بسم الله الرحمن الرحيم")
 elif add_auzbillah:
     filtered_data_Orignal_arabic.insert(0, "أَعُوذُ بِاللَّهِ مِنَ الشَّيْطَانِ الرَّجِيمِ #")
     filtered_data_arabic_text.insert(0, "اعوذ بالله من الشيطان الرجيم")
 elif add_bismillah:
     filtered_data_Orignal_arabic.insert(0, "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ %")
     filtered_data_arabic_text.insert(0, "بسم الله الرحمن الرحيم")
Based on the values of add_auzbillah and add_bismillah, the corresponding strings are inserted
at the start of both filtered lists. A list called exclusion_chars is also defined, which contains
characters that should be merged into the preceding word rather than treated as separate words.
This keeps the indexes of the plain Arabic text and the Original Arabic text aligned, so that the
plain text can be mapped back to the Original Arabic text during the matching and mapping stage.
Ayahs_oa = ' '.join(filtered_data_Orignal_arabic)
words = Ayahs_oa.split(" ")
split_words_oa = [words[0]]
for i in range(1, len(words)):
    if words[i] in exclusion_chars:
        split_words_oa[-1] += words[i]
    else:
        split_words_oa.append(words[i])
split_words_oa = list(filter(None, split_words_oa))
The code joins the filtered Original Arabic data into a single string separated by spaces and
then splits it into words. The resulting words are stored in split_words_oa; whenever a word is
found in exclusion_chars, it is appended to the previous word instead of starting a new entry.
Finally, empty entries are filtered out using filter(None, split_words_oa).
 Ayahs_at = ' '.join(filtered_data_arabic_text)
 split_words_at = Ayahs_at.split()
Similarly, the code joins the filtered plain Arabic text into a single string and splits it into
individual words, which are stored in split_words_at.
asrt_data = []
with open("asrt.txt", "r", encoding="utf-8") as asrt_file:
    reader = csv.reader(asrt_file, delimiter='\t')
    for row in reader:
        if len(row) >= 4:
            asrt_data.append(row)
The code reads the data from the “asrt.txt” file, which is assumed to be a tab-separated
values (TSV) file. Each row is appended to the asrt_data list if it contains at least four
elements.
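The layout of asrt.txt is not reproduced here; the later indexing (row[1] for the recognized word, row[2] and row[3] for the timestamps) suggests a tab-separated layout along these lines, though the exact column order is an assumption:

    0	الحمد	00:00:10,199	00:00:12,199
    1	لله	00:00:12,199	00:00:12,400
    2	الذى	00:00:12,400	00:00:13,199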
 output_data = {
 "Index": [],
 "Original Arabic": [],
 "Arabic text": [],
 "Speech Recognition": [],
 "Start Time": [],
 "End Time": []
 }
A dictionary called output_data is created with keys representing the column names and
empty lists as values, which will be populated with the aligned data.
max_length = max(len(asrt_data), len(split_words_oa), len(split_words_at))
The code determines the maximum length among asrt_data, split_words_oa, and
split_words_at lists. This will be used to iterate and populate the output_data dictionary.
for i in range(max_length):
    output_data["Index"].append(i)
    if i < len(asrt_data):
        output_data["Speech Recognition"].append(asrt_data[i][1])
        output_data["Start Time"].append(asrt_data[i][2])
        output_data["End Time"].append(asrt_data[i][3])
    else:
        output_data["Speech Recognition"].append(None)
        output_data["Start Time"].append(None)
        output_data["End Time"].append(None)
    if i < len(split_words_oa):
        output_data["Original Arabic"].append(split_words_oa[i])
    else:
        output_data["Original Arabic"].append(None)
    if i < len(split_words_at):
        output_data["Arabic text"].append(split_words_at[i])
    else:
        output_data["Arabic text"].append(None)
The code populates the output_data dictionary by iterating from 0 to max_length. For each
index, the corresponding values are appended to the respective lists in the dictionary. If an
index falls outside the length of a particular source list, None is appended instead, so that all
columns remain the same length.
 output_df = pd.DataFrame(output_data)
 output_df.to_excel("search_match.xlsx", index=False)
 print("search matching file created")
Finally, the output_data dictionary is converted into a pandas DataFrame, written to
'search_match.xlsx' using to_excel(), and a message indicating the successful creation of
the search matching file is printed.
Sample rows of search_match.xlsx:

Index | Original Arabic | Arabic text | Speech Recognition | Start Time | End Time
9  | الْحَمْدُ  | الحمد  | الحمد  | 00:00:10,199 | 00:00:12,199
10 | لِلَّهِ    | لله    | لله    | 00:00:12,199 | 00:00:12,400
11 | الَّذِي    | الذى   | الذى   | 00:00:12,400 | 00:00:13,199
12 | أَنْزَلَ   | انزل   | انزل   | 00:00:13,199 | 00:00:14,599
13 | عَلَىٰ     | عل     | عل     | 00:00:14,599 | 00:00:15,199
14 | عَبْدِهِ   | عبده   | عبده   | 00:00:15,199 | 00:00:15,300
15 | الْكِتَابَ | الكتاب | الكتاب | 00:00:15,300 | 00:00:15,800
3.4.2. Matching
3.4.2.1. Background
Section 3.3.1 has comprehensively covered the challenges encountered in text matching
and mapping, where we employed the Linear Search algorithm. To address the potential
worst-case behaviour of that search, we bound it to a limited window of indices.
The technique involves calculating the difference in the number of words between the
Speech Recognition Transcript and the Original Arabic (vowel-less) text from the database.
By adding a specific number to this calculated difference, we effectively narrow down the
search range. For instance, if the difference is 15, we add a value of 10 to restrict the search
within a range of 25 indices. This strategic approach ensures that the impact of worst-case
behaviour remains bounded. By embedding this tailored method within the Linear Search
algorithm, we significantly enhance the efficiency of the matching and mapping stage.
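A small sketch of this window computation (the numbers are the illustrative ones above; the variable names mirror the matching code shown later, and the current row index and row count are hypothetical):

    diff = 15                            # word-count difference between ASR transcript and database text
    const = 10                           # value added to the difference in this example
    i = 40                               # hypothetical current row index
    lo = max(0, i - (const + diff))
    hi = min(i + (const + diff), 1000)   # 1000 = hypothetical number of rows
    print(lo, hi)                        # 15 65 -> at most (const + diff) rows on each side of i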
3.4.2.2. Process
1. The vowel-less Arabic Text (AT) is matched with the ASR transcript using the Linear
Search algorithm.
2. For the matched indices, the ASR transcript entries are mapped to the corresponding
Original Arabic text (OA) containing vowels, and the timestamps from the ASR
transcript are retained for those entries.
3. For the unmatched indices, the indices are assigned a value of ‘-1’.
   4. All the index values of '-1' are interpolated, and the corresponding Original Arabic
text and timestamps are filled in from the neighbouring matched entries.
In this part, a pandas DataFrame named output_df is created from the output_data
dictionary and exported to an Excel file using the to_excel() function. The index=False
argument ensures that the row index is not included in the exported Excel file. Finally, a
message is printed indicating that the search matching file has been created.
The rest of the code performs additional operations on the generated data, such as counting
non-null cells, matching Speech Recognition and Arabic text, handling consecutive -1
values, mapping indices and time information, and saving the updated DataFrame and SRT
content to files.
import pandas as pd
This line imports the pandas library, which is used for data manipulation and analysis.
df = pd.read_excel('search_match.xlsx')
This line reads the data from the Excel file 'search_match.xlsx' and stores it in a DataFrame
called df.
The next lines count the number of non-null cells in the columns 'Arabic text' and 'Speech
Recognition' of the DataFrame df and store the counts in the variables at_count and
sr_count, respectively.
The code then calculates the absolute difference between sr_count and at_count: if sr_count
is greater than at_count it subtracts at_count from sr_count, otherwise it subtracts sr_count
from at_count, and stores the result in diff.
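The counting and difference computation are not fully reproduced in the extracted listing; a minimal sketch consistent with the description (the variable names at_count, sr_count and diff come from the text) is:

    at_count = df['Arabic text'].notnull().sum()
    sr_count = df['Speech Recognition'].notnull().sum()
    # absolute difference in word counts, later used to widen the search window
    diff = sr_count - at_count if sr_count > at_count else at_count - sr_count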
 matched_data = []
 const = 2
These lines initialize an empty list matched_data to store the modified data and set the value
of const to 2. const defines how many extra rows are searched before and after the current
index when a direct match is not found.
i = 0
while i < len(df):
    if df.loc[i, 'Speech Recognition'] == df.loc[i, 'Arabic text']:
        start_time = df.loc[i, 'Start Time']
        end_time = df.loc[i, 'End Time']
        matched_data.append((i, start_time, end_time))
    else:
        match_found = False
        for j in range(max(0, i - (const + diff)), min(i + (const + diff), len(df))):
            if df.loc[i, 'Speech Recognition'] == df.loc[j, 'Arabic text']:
                start_time = df.loc[i, 'Start Time']
                end_time = df.loc[i, 'End Time']
                matched_data.append((j, start_time, end_time))
                match_found = True
                break
        if not match_found:
            matched_data.append((-1, df.loc[i, 'Start Time'], df.loc[i, 'End Time']))
    i += 1
This block of code iterates over the rows of the DataFrame df using a while loop. It
checks if the 'Speech Recognition' column value matches the corresponding value in the
'Arabic text' column for each row. If a match is found, it extracts the start and end
times from the 'Start Time' and 'End Time' columns, respectively, and appends a tuple
(i, start_time, end_time) to the matched_data list. Here, i represents the index of the
matched row.
If a match is not found at the same index, it searches for a match within the previous and next
(const + diff) rows. If a match is found there, it appends (j, start_time, end_time),
where j represents the index of the matched row in the 'Arabic text' column. If no match is
found, it appends (-1, start_time, end_time) to indicate that no match was found after
searching the neighboring rows.
# (fragment: continuation of the loop that replaces runs of -1 values)
            start_times[i] = df.loc[above, 'Start Time']
            end_times[i] = df.loc[above, 'End Time']
        else:
            data[i] = above + 1
            start_times[i] = df.loc[above, 'Start Time']
            end_times[i] = df.loc[above, 'End Time']
            data[i + 1] = above + 2
            start_times[i + 1] = df.loc[above, 'Start Time']
            end_times[i + 1] = df.loc[above, 'End Time']
    else:
        i += 1
i += 1
In this part, the code processes the matched_data list to replace consecutive -1 values using
the '+1' approach. It iterates over the data list, which contains the indices from
matched_data, and checks for consecutive -1 values. If any are found, it looks for the nearest
non -1 indices above and below. If the difference between them is less than or equal to 2, it
replaces the -1 values with the corresponding consecutive indices and copies the start and end
times from the nearest matched row above.
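Since the listing above is only a fragment, the following is a minimal sketch of this interpolation step as described; the unpacking of matched_data and the gap-size threshold are assumptions based on the surrounding text:

    data = [m[0] for m in matched_data]
    start_times = [m[1] for m in matched_data]
    end_times = [m[2] for m in matched_data]

    for i in range(len(data)):
        if data[i] == -1:
            # nearest matched index before and after the gap
            above = next((data[j] for j in range(i - 1, -1, -1) if data[j] != -1), None)
            below = next((data[j] for j in range(i + 1, len(data)) if data[j] != -1), None)
            if above is not None and below is not None and below - above <= 2:
                # fill the gap with the next consecutive index and reuse the
                # timestamps of the row above
                data[i] = above + 1
                start_times[i] = df.loc[above, 'Start Time']
                end_times[i] = df.loc[above, 'End Time']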
df['matched_index'] = data
df['start_time'] = start_times
df['end_time'] = end_times
df['matched_orignal_Arabic'] = ['' if x == -1 else df.loc[x, 'Original Arabic'] for x in data]
df['matched'] = ['Matched' if x != -1 else 'Not matched' for x in data]
These lines add new columns to the DataFrame df to store the matched index, start time,
end time, matched Original Arabic text, and a column indicating whether each word was matched.
The updated DataFrame df is then saved to a new Excel file ('results of …').

Before matching (sample rows of search_match.xlsx):

Index | Original Arabic | Arabic text | Speech Recognition | Start Time | End Time
9  | الْحَمْدُ  | الحمد  | الحمد  | 00:00:10,199 | 00:00:12,199
10 | لِلَّهِ    | لله    | لله    | 00:00:12,199 | 00:00:12,400
11 | الَّذِي    | الذى   | الذى   | 00:00:12,400 | 00:00:13,199
12 | أَنْزَلَ   | انزل   | انزل   | 00:00:13,199 | 00:00:14,599
13 | عَلَىٰ     | عل     | عل     | 00:00:14,599 | 00:00:15,199
15 | الْكِتَابَ | الكتاب | الكتاب | 00:00:15,300 | 00:00:15,800
After matching (sample rows of the results file):

Matched index | start_time | end_time | matched orignal Arabic | matched
9  | 00:00:06,700 | 00:00:07,200 | الْحَمْدُ  | Matched
10 | 00:00:12,199 | 00:00:12,400 | لِلَّهِ    | Matched
11 | 00:00:12,400 | 00:00:13,199 | الَّذِي    | Matched
12 | 00:00:13,199 | 00:00:14,599 | أَنْزَلَ   | Matched
15 | 00:00:15,300 | 00:00:15,800 | الْكِتَابَ | Matched
num = 1
srt_content = ""
for start_time, end_time, matched_text in zip(start_times, end_times, df['matched_orignal_Arabic']):
    srt_content += f"{num}\n{start_time} --> {end_time}\n{matched_text}\n\n"
    num += 1
This loop iterates over the start_times, end_times, and matched_orignal_Arabic columns
of the DataFrame df using the zip function. It constructs the content for the SRT file by
concatenating the line number, start time, end time, and matched text for each row.
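An individual entry in srt_content therefore has the following shape (the values are illustrative, taken from the sample rows above):

    1
    00:00:10,199 --> 00:00:12,199
    الْحَمْدُ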
This code block opens a file named 'results.srt' in write mode with UTF-8 encoding and
writes the srt_content to the file, saving it in the SubRip Text (SRT) subtitle format.
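The writing step is not reproduced in the extracted listing; a minimal sketch matching the description is:

    with open("results.srt", "w", encoding="utf-8") as srt_file:
        srt_file.write(srt_content)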
Sample entries from the Full Ayah with Translation SRT (Surah Al-Kahf):

10  00:00:18,100 --> …
    الْحَمْدُ لِلَّهِ الَّذِي أَنْزَلَ عَلَىٰ عَبْدِهِ الْكِتَابَ وَلَمْ يَجْعَلْ لَهُ عِوَجًا ١
    Praise be to Allah, Who hath sent to His Servant the Book, and hath allowed therein no Crookedness:
    (followed by the Urdu translation of the same Ayah)

11  00:00:29,699 --> …
    (Ayah 2 with its English and Urdu translations)

3.5. Video Creation
3.5.1. Background
Video creation is a crucial step in our project, and it involves processing the output SRTs
(section 4.1) generated by the matching algorithm. Users have the flexibility to choose any
of the three SRTs for video generation. These SRTs offer options for word-by-word
subtitles, full-Ayah subtitles, and full-Ayah subtitles with translations.
The video creation process demands some time, especially during tasks like video
rendering. On machines equipped with a good GPU, this time-consuming step can be
expedited by more than 80%. For instance, for medium-range surahs (section 2.1), the
processing time can be reduced from around 5 minutes to just 1 minute with a good-
performance GPU.
3.5.2. Process
2. Background video is either trimmed or looped based on the duration of input Audio
3. Text clips for both Qur'anic subtitles and Translation are generated, each with an
acceptable font style, size, and color. Their positions within the video frames are
then set.
4. Text clips with predetermined positions and timestamps are added to the Video
(containing both Recitation Audio and Background Video), and thus we obtain the
final synchronized video.
3.5.3. Explanation with Code
3.5.3.1. Explanation
import re
import pysrt
from moviepy.editor import VideoFileClip, AudioFileClip, TextClip, CompositeVideoClip
In this section, the necessary modules are imported: re for regular expression operations,
pysrt for working with subtitle files, and various classes from the moviepy.editor module
for video editing.
def time_to_seconds(time_obj):
    return (time_obj.hours * 3600 + time_obj.minutes * 60
            + time_obj.seconds + time_obj.milliseconds / 1000)
This function time_to_seconds converts a time object from the subtitle file to
seconds. It calculates the total time in seconds by adding up the hours, minutes, seconds,
and milliseconds (the latter divided by 1000).
def create_subtitle_clips(subtitles, videosize, fontsize=24, font='Arial', color='yellow'):
    subtitle_clips = []
    for subtitle in subtitles:
        start_time = time_to_seconds(subtitle.start)
        end_time = time_to_seconds(subtitle.end)
        duration = end_time - start_time
        video_width, video_height = videosize
        text_clip = TextClip(subtitle.text, fontsize=60, font=font,
                             color='white', bg_color='transparent',
                             size=(video_width * 3 / 4, None),
                             method='caption').set_start(start_time).set_duration(duration)
        subtitle_x_position = 'center'
        subtitle_y_position = 'center'
        text_position = (subtitle_x_position, subtitle_y_position)
        subtitle_clips.append(text_clip.set_position(text_position))
    return subtitle_clips
This function takes a list of subtitles, the video size, and optional parameters such as
fontsize, font, and color. It iterates through each subtitle in the list and creates a TextClip
object for each subtitle. The start time and end time of the subtitle are converted to seconds
using the time_to_seconds function, and the duration is calculated as the difference between
them. The TextClip is created with the specified text, font size, font, color, background
color, size, and method, and is positioned at the center of the video. The created subtitle
clips are collected in a list, which is returned at the end.
video = VideoFileClip("video.mp4")
Here, the VideoFileClip class is used to load the video file "video.mp4" into the video
object.
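The resize call itself is missing from the extracted listing; a plausible form, given the target resolution stated below, is:

    video = video.resize((1280, 720))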
This line resizes the video to a resolution of 1280x720 pixels using the resize method of the
video clip.
audio = AudioFileClip("kahaf.mp3")
The AudioFileClip class is used to load the audio file "kahaf.mp3" into the audio object.
audio_duration = audio.duration
This line stores the duration of the audio. The video clip is then trimmed or extended using
the subclip method of the VideoFileClip class, setting the start and end times of the video
to match the duration of the audio.
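The trimming call is likewise missing from the extracted listing; a plausible form, assuming the background video is at least as long as the audio, is:

    video = video.subclip(0, audio_duration)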
video = video.set_audio(audio)
The set_audio method is used to combine the video and audio clips. It sets the audio of the
video clip to the loaded recitation audio.
 subtitles = pysrt.open("Full Ayah.srt")
The pysrt.open function is used to open the subtitle file "Full Ayah.srt" and load the
subtitles into the subtitles object.
The CompositeVideoClip class is used to combine the video clip, audio, and subtitle
clips into the final video. The video clip is passed as the first element of a list, followed by
the subtitle clips.
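These calls are not reproduced in the extracted listing; a plausible sketch, assuming the helper defined earlier supplies the subtitle clips and using a hypothetical output path, is:

    subtitle_clips = create_subtitle_clips(subtitles, video.size)
    output_video_file = "output.mp4"   # hypothetical output file name
    final_video = CompositeVideoClip([video] + subtitle_clips)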
final_video.set_duration(audio_duration).write_videofile(output_video_file, codec="libx264", audio_codec="aac", fps=video.fps)
The set_duration method is used to set the duration of the final video to match the
audio duration. Then, the write_videofile method is called to write the final video to
the file specified by output_video_file. The video codec is set to "libx264", the audio
codec to "aac", and the frame rate is set to the same as the original video.
Overall, this code loads a video file, resizes it, loads an audio file, trims or extends the
video to match the audio duration, combines the video and audio, adds subtitles from the
SRT file, and writes the final synchronized video to disk.
                       4. Chapter 4: Results
4.1. Output SRT Files
4.1.1. SRT – Word by Word
[Sample word-by-word SRT output: entries 11–14, each carrying a single word (e.g. عَلَىٰ) with its start and end timestamps.]
4.1.2. SRT - Full Ayah
[Sample full-Ayah SRT output: entries 1–5, each containing a complete Ayah, e.g. Ayah 2 of Surah Al-Kahf:
قَيِّمًا لِيُنْذِرَ بَأْسًا شَدِيدًا مِنْ لَدُنْهُ وَيُبَشِّرَ الْمُؤْمِنِينَ الَّذِينَ يَعْمَلُونَ الصَّالِحَاتِ أَنَّ لَهُمْ أَجْرًا حَسَنًا ٢]
4.1.3. SRT – Full Ayah with Translation in Multiple Languages
                          (He hath made it) Straight (and Clear) in order that He may
                          warn (the godless) of a terrible Punishment from Him, and
                          that He may give Glad Tidings to the Believers who work
                          righteous deeds, that they shall have a goodly Reward,
00:00:29,699 --> 00:00:32,500

Sample rows from search_match.xlsx (before matching):

Index | Original Arabic | Arabic text | Speech Recognition | Start Time | End Time
9  | الْحَمْدُ  | الحمد  | الحمد  | 00:00:10,199 | 00:00:12,199
10 | لِلَّهِ    | لله    | لله    | 00:00:12,199 | 00:00:12,400
11 | الَّذِي    | الذى   | الذى   | 00:00:12,400 | 00:00:13,199
12 | أَنْزَلَ   | انزل   | انزل   | 00:00:13,199 | 00:00:14,599
13 | عَلَىٰ     | عل     | عل     | 00:00:14,599 | 00:00:15,199
15 | الْكِتَابَ | الكتاب | الكتاب | 00:00:15,300 | 00:00:15,800
                                                                                         ْ
           9                    00:00:06,700        00:00:07,200                  ال َح ْمد        Matched
                                                                                       ه
           10                   00:00:12,199        00:00:12,400                    ّلِل
                                                                                     ِ ِ           Matched
                                                                                      ه
           11                   00:00:12,400        00:00:13,199                  ال ِذي           Matched
                                                                                         َْ
           12                   00:00:13,199        00:00:14,599                   أن َز َل        Matched
                                                            91
       14            00:00:15,199       00:00:15,300                  َع ْب ِد ِه       Matched
                                                                     َ ْالك َت
                                                                     اب
       15            00:00:15,300       00:00:15,800                       ِ            Matched
Our proposed approach drastically outperforms the traditional manual editing
method, reducing the time by 98.4%, i.e., 5 hours of work can now be done in 5 minutes.
Figure 20: Time Comparison between Traditional Method and Proposed Method
4.3.1. Word by Word
4.3.2. Video with Full Ayah
4.3.3. Full Ayah with Translation in Multiple Languages
Figure 24: Full Ayahs with Translation in English and Urdu
Figure 25: Video with Translation in Multiple Languages
                         5. Chapter 5: Conclusion
In this project, we have successfully addressed the challenge of automating the creation of
Qur'anic content videos, aiming to lower the barrier for both new and existing content
creators. Our AI model recognizes the recited verses and synchronizes the retrieved Qur'anic
text with the audio, simplifying the video creation process and reducing the time required by
98.4%, i.e., roughly 5 hours of manual work can now be completed in about 5 minutes.
Our automated system significantly reduces the time and effort required by content
creators, eliminating the need for manual editing using complex software, while the process
preserves the accuracy and reverence due to the Qur'an. With our solution, we enable the
rapid creation of a vast library of high-quality videos, accommodating different recitations
and translations of the Qur'an. This caters to a wide range of audiences and languages.
Extensive testing and analysis have validated the accuracy and performance of our system,
achieving high Qur'anic text transcription accuracy and precise synchronization, ensuring
that the displayed text faithfully follows the recitation. The system provides a user-friendly
and efficient solution for content creators and users alike, paving the way for a more
accessible, inclusive, and efficient approach to engaging with the Qur'an.
The potential impact of this work extends beyond the scope of this project, offering
opportunities for further research, development, and application in the field of Islamic
content technology. The project repository and contact addresses are listed below:
https://github.com/Nashitsaleem1/FYP-AI-Model-for-Recognition-and-Synchronization-
of-Quranic-Text-with-Audio
hafiznashitsaleem@gmail.com
zainulabideen7676@gmail.com
armaghan4201@gmail.com
                  6. Chapter 6: Future Work/Aspects
The primary objective of this research endeavor was to develop a model suitable for
automating video creation for Zikra Dar-ul-Nashr's content. However, our project took a
broader direction, aiming to facilitate video production for content creators worldwide, and
yielded remarkable results. One remaining requirement, however, involves
transcribing audio content with both Arabic and Urdu speech. To address this need, future
work will explore the development of a Keyword Spotting Algorithm capable of efficiently
handling multilingual processing. This algorithm will accurately detect and timestamp
Arabic words in the audio, streamlining video creation for multilingual audios.
Beyond Qur'anic recitation, such an algorithm can also detect and timestamp key
moments within audio content. Its application extends to diverse industries dealing with
spoken-word media such as podcasts.
Moreover, the algorithm's proficiency in Arabic word detection sets a precedent for similar
tools in other languages, as recognition tools capable of handling diverse linguistic inputs
become indispensable across industries. We also plan to build a dedicated platform for our
algorithms. This platform will empower content creators and individuals from various fields
to efficiently automate their video creation processes, enhancing productivity and leveraging
the power of our work seamlessly. By empowering creators and embracing this technology,
we aim to make content creation more efficient and creative in diverse linguistic contexts.