
BUILDING A TRANSLITERATION SYSTEM

USING LSTM & GRU

A project dissertation submitted to Bharathidasan


University in partial fulfillment of the requirements
for the award of the Degree of

MASTER OF SCIENCE IN DATA SCIENCE

Submitted by
ADELINE M
215229146

Guided by

Ms. G. RAJALAXMI, M.Sc.,


Assistant Professor

DEPARTMENT OF DATA SCIENCE


BISHOP HEBER COLLEGE (AUTONOMOUS)
(Nationally Reaccredited at the 'A' Grade by NAAC with the CGPA of 3.58 out of 4)
(Recognized by UGC as "College with Potential for Excellence")
(Affiliated to Bharathidasan University)

TIRUCHIRAPPALLI 620017

APRIL 2023
DECLARATION

I hereby declare that the project work presented is originally done by me under the guidance of

Ms. G. Rajalaxmi, M.Sc., Assistant Professor, Department of Data Science, Bishop Heber

College (Autonomous), Tiruchirappalli 620017, and has not been included in any other

thesis/project submitted for any other degree.

Name of the Candidate : Adeline. M

Register Number : 215229146

Batch : 2021-2023

Signature of the Candidate


DEPARTMENT OF DATA SCIENCE
BISHOP HEBER COLLEGE (AUTONOMOUS)
(Nationally Reaccredited at the ‘A’ Grade by NAAC with the CGPA of 3.58 out of 4)
(Recognized by UGC as “College with Potential for Excellence”)
(Affiliated to Bharathidasan University)

TIRUCHIRAPPALLI 620017

Date:

Course Title: Project Course Code: P21DS4PJ

BONAFIDE CERTIFICATE

This is to certify that the project work titled "BUILDING A TRANSLITERATION SYSTEM

USING LSTM & GRU" is a bonafide record of the project work done by Adeline. M, 215229146, in

partial fulfillment of the requirements for the award of the degree of MASTER OF SCIENCE

IN DATA SCIENCE during the period 2021 - 2023.

The Viva-Voce examination for the candidate Adeline.M, 215229146, was held on

Signature of the HOD Signature of the Guide

Examiners:

1.

2.
ACKNOWLEDGEMENTS

I take this opportunity to express my thanks to Dr. D. PAUL DHAYABARAN, M.Sc.,


PGDCA, M.Phil., Ph.D., Principal, Bishop Heber College, Trichy.

I would like to express my sincere gratitude and deep thanks to Dr. K. RAJKUMAR,
M.Sc., M.Phil., Ph.D., Associate Professor and Head of the Department of Data Science, who
has been a source of encouragement and moral strength throughout my study period.

This project would not have been possible without the motivation and guidance of my
internal guide and class in-charge, Ms. G. RAJALAXMI, M.Sc., Assistant Professor,
Department of Data Science. She has been the backbone of my project, and her encouragement
and moral support helped me to finish it.

I am extremely thankful to my family members and my dear friends who helped a lot in
the completion of the project.

ADELINE M
ABSTRACT

Transliteration is a process of mapping between one writing system and another based
on sound similarity. In a machine translation system, transliteration is mostly employed to
handle named entities and words that are not in the lexicon. For instance, in Hindi transliteration,
a user can type "thanyavaath" to get "थन्यवाथ", which sounds the same. Transliteration is simply
intended to change the letters or characters of a source language into the corresponding letters of the
target language. It is useful when a user knows a language but cannot write its script. Hindi is
India's 'lingua franca', and transliteration allows people to read and write words and names across
different languages.
Transliteration is designed to change only the letters or characters of a source language
into the corresponding letters of the target language. It does not convey meaning, as opposed to
translation, which converts the written or spoken meaning of words or text from a source language
into a target language. In this work, an English word is mapped to the same word in Devanagari
(Hindi) writing using two RNNs, i.e., an Encoder and a Decoder model. Transliteration can be used
with efficacy in the case of names. The accuracy of the model can be further increased when a
sequence-to-sequence approach is used in combination with the Encoder-Decoder model.
Transliteration from English to Hindi plays a very important role, as Hindi is an official language
of India, and much data present in Hindi has to be converted for worldwide use.

TABLE OF CONTENTS
Chapter Title Page No

Abstract vi
List of Figures vii
List of Tables viii
1 Introduction 01
1.1 Motivation
1.2 Existing Systems and Solutions
1.3 Product Needs and Proposed System
1.4 Product Development Timeline
2 Literature Review 04
2.1 Machine Learning Approach on Transliteration System
2.2 Deep Learning Approach on Transliteration system
2.3 Machine Translation Approach on Transliteration
2.4 NLP Approach on Transliteration System
3 Data Collection 08
3.1 Description of the Data
3.2 Source and Methods of Collecting Data
4 Preprocessing and Feature Selection 11
4.1 Overview of Preprocessing Methods
4.2 Overview of Feature Selection Methods
4.3 Preprocessing and Feature Selection Steps
5 Model Development 16
5.1 Model Architecture
5.2 Algorithms Applied
5.3 Training Overview
6 Experimental Design and Evaluation 22
6.1 Experimental Design
6.2 Experimental Evaluation
6.3 Customer Evaluation and Feedback
7 Model Optimization 25
7.1 Overview of Model Tuning and Best Parameter Selection
7.2 Model Tuning Process and Experiments
8 User Interface Design and Evaluation 30
8.1 Designing Graphical User Interface
8.2 Testing Graphical User Interface
9 Product delivery and deployment 32
10 Conclusion 33
10.1 Summary
10.2 Limitation and Future Work
References 35
Appendix-A: Data Set
Appendix-B: Source Code
Appendix-C: Output Screenshots

LIST OF FIGURES
Figure No. Description Page

1.1 Product Development Timeline 03


3.1 Dataset Description 08
3.2 Training set 09
3.3 Testing set 10
5.1 Model Architecture 16
5.2 LSTM Architecture 18
5.3 GRU Architecture 19
6.1 Model Workflow 22
6.2 Predicted Vanilla Model 23
7.3 Model Accuracy 29
8.1 Designing GUI 30
8.2 Testing GUI 31

LIST OF TABLES
Table No. Description Page

6.1 Evaluation Result 23


7.1 Experiment Parameters 28

9.1 Delivery Schedule 32

Chapter 1

INTRODUCTION

1.1 Motivation

Transliteration emphasizes pronunciation rather than meaning, which is especially useful in
discussions involving foreign people, places, and cultures. Transliteration can be necessary when
dealing with languages that have different writing systems, for example when converting from a
language that uses a non-Latin script, such as Arabic, Chinese, or Hindi, to a language that uses a
Latin script, such as English.

Transliteration may also be useful in situations where there is more than one spelling
system for the same language, for example conversion between simplified and traditional systems
or between the Cyrillic and Latin alphabets, while preserving meaning and pronunciation as far as
possible. Transliteration can also make it easier to communicate and understand among people who
speak different languages and use different writing systems. Such a system converts text from a
known script into an unfamiliar one; it helps users read another language, with the emphasis on
pronouncing it rather than understanding it. This project develops an automatic system for
performing English to Hindi transliteration, a method of converting a written word into a second
language using the alphabet of that language.

1.2 Existing Systems and Solutions

Many transliteration systems exist, but most have been trained only on target words seen
during training. Transliteration is important for communication and understanding across various
languages and writing systems. Many automatic transliteration algorithms have been created in
recent years using statistical and linguistic techniques, as well as phonetic models of the source and
target languages. The International Phonetic Alphabet (IPA) is one such resource: it is used for
transcribing speech sounds in any language and can also be used for transliterating between writing
systems. ASCII transliteration uses a combination of Latin letters and punctuation marks to
represent the sounds of the original writing system.

Google Transliteration offers a transliteration service that allows users to type in one language
and automatically converts the text into the writing system of another language. This system is
suitable for a wide variety of languages and writing systems. Such tools enable users to enter text
in one writing system and have it converted to another in real time; one of the most common online
transliteration tools is Transliteration.com. Machine learning based solutions have emerged from
recent advancements in machine learning and natural language processing, and there are now many
machine learning based solutions for transliteration. Common solutions of this kind include
transformer-based transliteration models and Seq2Seq models.

1.3 Product Needs and Proposed System

Building a transliteration system requires several components. Language data is a
comprehensive set of language data for both the source and target languages. Transliteration
rules should take into account various factors such as phonetics, context, and common spelling
patterns. A user interface allows users to input text in one script and view the output in the other
script. Machine learning models can be used to improve the accuracy and speed of the transliteration
system; these models can be trained on large datasets of transliterated text to improve the system's
ability to recognize patterns and make accurate predictions.

Evaluation metrics are used to measure the accuracy and effectiveness of the system, to
identify areas for improvement, and to fine-tune the system over time. Overall, building a
transliteration system requires a combination of linguistic expertise, software development skills,
and machine learning expertise. The proposed deep learning approach builds an RNN-architecture
(LSTM & GRU) based seq2seq model which contains the following layers:

an input layer for character embeddings; one encoder RNN over the input character sequence
(English); and one decoder RNN that takes the last state of the encoder as its initial state and
produces the output characters (Devanagari). The backpropagation algorithm calculates the gradient
of the error function; backpropagation is a group of methods used to effectively train artificial
neural networks with gradient descent by taking advantage of the chain rule, and the gradient
descent algorithm is used to update the weights.
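As a rough illustration of this architecture (not the exact configuration tuned later in this report), a character-level encoder-decoder model can be sketched in Keras as follows; the vocabulary sizes and hidden dimension are assumed values used only for the sketch.

from tensorflow.keras import layers, Model

# Assumed sizes for illustration only
num_encoder_tokens = 27    # 26 English letters + padding
num_decoder_tokens = 129   # Devanagari block + padding
latent_dim = 256

# Encoder: embed the English character sequence and keep only the final LSTM states
encoder_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(num_encoder_tokens, 64)(encoder_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: start from the encoder states and predict one Devanagari character per step
decoder_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(num_decoder_tokens, 64)(decoder_inputs)
dec_out, _, _ = layers.LSTM(latent_dim, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=[state_h, state_c])
decoder_outputs = layers.Dense(num_decoder_tokens, activation="softmax")(dec_out)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

Swapping layers.LSTM for layers.GRU (which carries a single state instead of two) gives the GRU variant of the same architecture.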

1.4 Product Development Timeline

The project duration was from December to April 2023.

Figure:1.1 Product Development Timeline

Chapter 2

LITERATURE REVIEW

2.1 Machine Learning Approach on Transliteration System

Kang et al. proposed an automatic English-to-Korean transliteration and back-transliteration system


based on decision tree learning. The suggested approach is fully bidirectional. They created a very
effective character alignment technique in which pairs of English words and their Korean
transliterations are phonetically aligned. The alignment reduces the number of decision trees that
must be learnt to 46 for Korean-to-English back-transliteration and 26 for English-to-Korean
transliteration, respectively. After learning, using a decision tree for transliteration and
back-transliteration is simple. [1]

P. J. et al. suggested a Support Vector Machine-based English to Kannada transliteration system.
The suggested system employs a two-step method for transliteration called sequence labelling. The
source string is segmented into transliteration units in the first step, and source and target
transliteration units are compared in the second. Moreover, it resolves many alignment and unit
mapping combinations. The entire procedure is broken down into three stages: preprocessing, SVM
training, and transliteration. The training file is transformed into the SVM-compatible format
during the preprocessing stage. For SVM training, the authors employ a database of 40,000
location names in India. [8]

Abbas Malik, Laurent Besacier, Christian Boitet, and Pushpak Bhattacharyya proposed an
Urdu to Hindi transliteration system using a hybrid approach in 2009. This hybrid approach combines
finite state machine-based techniques with a statistical word language model and achieved better
performance. The main aim of this system was the handling of the diacritical marks of the Urdu input
text. This system improved the accuracy by 28.3% compared to their previous finite-state
transliteration model. [11]

2.2 Deep Learning Approach on Transliteration system

Arbabi et al. used neural networks and knowledge-based systems to create an Arabic-English
transliteration system. In this approach, the initial step was to enter the names, taken from a
telephone directory, into the database. A knowledge-based technique is utilised to vowelize these
names in order to add the short vowels that are missing, since short vowels are typically not printed
in Arabic script. The words that the knowledge-based system is unable to appropriately vowelize
are subsequently removed using an artificial neural network. The cascade correlation method, a
supervised, feed-forward neural processing methodology, is used to train the network. Hence, neural
networks are used to assess the names' accuracy in terms of Arabic syllabification. The network
produces binary data as its output. [3]

A deep learning-based system for bilingual machine transliteration between Tamil and English
was presented by Sanjana Shree and Anand Kumar. The system uses a Deep Belief Network (DBN),
a generative graphical model. Sparse binary matrices are created from the data in both languages;
every word has character padding added at the end to keep the word length consistent when it is
encoded as a sparse binary matrix. A Deep Belief Network is composed of multiple layers of
Restricted Boltzmann Machines, a type of Boltzmann machine and Markov random field. [7]

Andy Way, Sudip Kumar Naskar, Sandipan Dandapat, Ankit Kumar Srivastava, and Rejwanul
Haque of CNGL put forward, in 2009, a context-informed phrase-based statistical machine
translation approach for English to Hindi transliteration. Depending on specific syntactic, lexical,
and semantic principles, an RBMT system transforms the source text into an intermediate
representation, from which the target language text is subsequently generated. Instead of
translating phrases, the suggested transliteration system was modelled by translating letters, as in
character-level approaches. They employed a memory-based classification framework that makes it
possible to estimate these attributes well while eliminating issues with noisy data. [13]

2.3 Machine Translation Approach on Transliteration System

Wan and Verspoor proposed a system for automatic English-Chinese name transliteration.
Using pronunciation, the system transliterated the words; in other words, the spoken form of the
word was used to map the written English word to written Chinese characters. Each phoneme in
an English word was mapped to a corresponding Chinese character in order for the system to
function. Five steps made up the transliteration process: semantic abstraction, syllabification,
sub-syllable division, mapping to Pinyin, and mapping to Han characters. To determine which parts
of the word should be translated or transliterated, the preprocessing step known as semantic
abstraction looked up the word in dictionaries. [2]
Deep and Goyal created a rule-based Punjabi to English transliteration method for common names.
Character sequence mapping rules are used to translate across the languages in the proposed system.
The rules are created with certain constraints to increase accuracy. The system was trained using
1013 person names and then evaluated using various person names, city names, river names, and so
on. The system recorded a 93.22% overall accuracy rate. [5]

Lehal and Saini created a Perso-Arabic to Indic script machine transliteration model: a hybrid
transliteration system that blends rules with word-level and character-level language models. The
system has undergone successful testing on Sindhi, Urdu, and Punjabi and is easily expandable to
new languages such as Kashmiri and Konkani. The transliteration accuracy for the three scripts
ranges from 91.68% to 97.75%, the best accuracy for Perso-Arabic and Indic script pairs ever
documented in the literature. [10]

Gurpreet Singh Josan and Jagroop Kaur created a Punjabi to Hindi transliteration algorithm
using a statistical approach. This approach attempted to determine enhancements over a simple
transfer baseline using statistical techniques. Instead of decoding words as in word-level translation
systems, phrase-based SMT (PB-SMT) algorithms are applied to transliteration at the character
level. The alignment heuristic used in the transliteration model training step was grow-diag-final,
while the other parameters kept their default values. They developed the system using free SMT
tools and a Punjabi-Hindi corpus for training. [12]

Taraka Rama and Karthik Gali, in 2009, handled the transliteration issue as a translation
issue. For this work, the researchers utilized phrase-based SMT algorithms. The method
developed a transliteration system using a beam search-based decoder together with GIZA++, both
of which are freely available. A well-aligned English-Hindi corpus is used to train the algorithm,
and the prototype reports a test accuracy of 46%. [15]

2.4 Natural Language Processing Approach on Transliteration System

Dhore et al. suggested transliterating names from Hindi to English utilising conditional random
fields. The programme receives Devanagari-written Hindi place names as input and transliterates
them into English. The data is supplied as features for the n-gram techniques. The objective is to
create an English transliteration of a Hindi name using the statistical probabilistic method of
CRF and a feature set of n-grams. The suggested strategy was tested using a bilingual corpus of
named items gathered from books and online sources. The system's 85.79% bigram accuracy for
Hindi as the source language is excellent. [4]

Harshit Surana and Anil Kumar Singh developed a transliteration scheme for Telugu and Hindi,
two Indian languages. They used character-based n-grams to determine whether a word was Indian
or foreign before classifying it as such. The likelihood of the word's origin was calculated using
symmetric cross entropy. Different methods of transliteration were then applied for the different
classes (Indian or foreign) based on this probability value. For the transliteration of an Indian word,
the method first divided the word into segments based on potential vowel and consonant
combinations, and then used various principles to map these segments to their closest letter
combinations. These steps produce transliteration candidates, which are then ranked and filtered
using fuzzy string matching. The target word is then produced by matching the transliteration
candidates to words in the corpus of the target language. [6]

Mathur and Saxena have created a hybrid method: a system for English-Hindi named entity
transliteration. The algorithm first analyses English words and applies rules for phoneme extraction.
The English phoneme is then translated into its comparable Hindi phoneme using a statistical
approach. The authors recovered 42,371 named entities using Stanford's NER for named entity
extraction. These entities were subjected to rules, and phonemes were extracted. A database of
English-Hindi phonemes was created after these English phonemes were transliterated into
Hindi. [9]

Amitava Das, Asif Ekbal, Tapabrata Mandal, and Sivaji Bandyopadhyay tackled the
transliteration issue in 2009. In the suggested approach, the transliteration problem was treated like
the letter-to-phoneme subtask of text-to-speech analysis. Without making any additional
adjustments, the researchers applied a state-of-the-art discriminative letter-to-phoneme system to
the problem. In this experiment, they showed that such an automated letter-to-phoneme converter
works effectively for transliteration without any modifications. [14]

Chapter 3

DATA COLLECTION

3.1 Description of the Data


A standard dataset for assessing the effectiveness of transliteration systems is the Dakshina
dataset. The method of transliteration involves changing characters of one script to another in order
to translate text from one writing system to another. For instance, "thanyavaath" would be the
consequence of Romanizing the Hindi word "थन्यवाथ".

The Dakshina dataset includes word pairs in English and several Indian languages, including
Bengali, Tamil, and Hindi, together with their transliterated equivalents. The Hindi portion used
here consists of about 2 lakh (200,000) words, split 80-20 into 80% for training and 20% for testing.
Researchers developed the dataset to build and analyze machine learning models for transliteration
tasks. It is among the larger publicly accessible transliteration datasets, with more than 1.5 million
word pairs overall. Training, development, and testing portions make up the dataset.

Figure:3.1 Dataset Description

Training Set
Devanagari script is used to write several South Asian languages, including Hindi, Nepali,
Marathi, and Sanskrit. It contains more than 1 lakh (100,000) words. The training set for Devanagari script is
a collection of text samples used to train a machine learning model to recognize and classify
Devanagari characters. In the case of text-based recognition systems, the training set may consist of
thousands of lines of text written in Devanagari script. Each line of text is labelled with the
corresponding transcription or translation in another language, which serves as the ground truth for
the machine learning algorithm.

The size and composition of the training set can greatly impact the accuracy and
performance of the machine learning model. A larger and more diverse training set can lead to
better results, but also requires more computational resources and time to train. Therefore, creating
an effective training set requires careful consideration of the intended application and available
resources.

Figure:3.2 Training Set

Testing Set
The Dakshina dataset is a large-scale, multi-script, multi-domain dataset for Indian language
understanding research. The testing set in the Dakshina dataset is the subset of the dataset used to
evaluate the performance and accuracy of machine learning models trained on the training set. It is
carefully selected to represent a diverse range of domains and genres, such as news, social media,
and literature, and covers multiple scripts and languages, including Devanagari, Tamil, and Telugu.
The dataset is annotated for tasks such as named entity recognition, part-of-speech tagging, and
sentiment analysis.

For example, for the named entity recognition task, the testing set consists of text samples
that have been manually annotated with the named entities present in the text, such as person names,
organization names, and location names.

Figure:3.3 Testing Set

3.2 Source and Methods of Collecting Data


The dataset is downloaded from the open-source Dakshina repository on GitHub, which is a
1.8 GB collection of words. From this repository, the Devanagari (Hindi) lexicon is extracted for the
transliteration system. The training and testing parts are split 80%-20%.

https://github.com/google-research-datasets/dakshina
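As a rough sketch of this step (the file name and tab-separated column order, native word / romanization / count, are assumptions based on the Dakshina lexicon format implied by the splitting code later in this report), the lexicon could be loaded and split as follows:

import random

# Assumed lexicon file name from the downloaded Dakshina Hindi folder
with open("hi.translit.sampled.train.tsv", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f if line.strip()]

random.seed(42)
random.shuffle(lines)
split = int(0.8 * len(lines))                    # 80% training, 20% testing
train_data_lines, test_data_lines = lines[:split], lines[split:]
print(len(train_data_lines), "training pairs,", len(test_data_lines), "testing pairs")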

Chapter 4

PREPROCESSING AND FEATURE SELECTION

4.1 Overview of Preprocessing Methods

Preprocessing is an important step in developing Deep learning systems for Devanagari


transliteration. Some common Preprocessing methods used in Devanagari transliteration systems
include:

Text Normalization: This involves converting the input text into a standardized format. In
Devanagari, this may involve converting different character variations, such as Matras, to their
standard form.

Tokenization: This involves breaking the input text into individual words or tokens. In Devanagari,
this may involve segmenting words based on spaces or other delimiters.

# Finding unique characters in the training pairs.

input_data_characters = set()
target_data_characters = set()
for line in train_data_lines[:lenk]:
    target_data, input_data, _ = line.split("\t")
    for ch in input_data:
        if ch not in input_data_characters:
            input_data_characters.add(ch)
    for ch in target_data:
        if ch not in target_data_characters:
            target_data_characters.add(ch)

Stop word Removal: This involves removing common words that are unlikely to contribute to the
transliteration process. In Devanagari, this may involve removing common function words such as
"हैं" (are) or "वाला" (of).

# Remove all Hindi non-letters

def cleanHindiVocab(line):
    line = line.replace('-', ' ').replace(',', ' ')
    cleaned_line = ''
    for char in line:
        if char in hindi_alpha2index or char == ' ':
            cleaned_line += char
    return cleaned_line.split()

Stemming: This involves reducing words to their root form, to improve the efficiency and accuracy
of the machine learning model. In Devanagari, this may involve removing suffixes such as "ता"
(ness) or "वाली" (of).

Feature Engineering: This involves selecting and engineering features that are most relevant to the
transliteration task. In Devanagari, this may involve using features such as phonetic similarity,
syllable structure, or character n-grams.
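For instance, character n-grams of a romanized word could be generated with a small helper like the one below (a generic sketch, not code taken from this project):

# Extract overlapping character n-grams from a word as candidate features
def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("thanyavaath", 2))   # ['th', 'ha', 'an', 'ny', ...]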

Overall, Preprocessing methods can greatly improve the performance and accuracy of
machine learning systems for Devanagari transliteration. The choice of Preprocessing methods
depends on the specific task and dataset at hand.

4.2 Overview of Feature Selection Methods

In the context of developing deep learning models based on LSTM and GRU for Devanagari
transliteration, feature selection plays an important role in improving the accuracy and efficiency
of the models. The goal of feature selection is to identify the most relevant features that can
accurately predict the transliteration output. Some common feature selection methods used in
LSTM and GRU models for Devanagari transliteration include the following.

Embedding Layer

This involves using an embedding layer to learn a dense representation of the input text. The
embedding layer maps each character or character sequence to a low-dimensional vector, which is
learned during training. This allows the model to capture the semantic meaning of the input text,
which can improve the accuracy of the transliteration output.

(encoder_input_data, decoder_input_data, decoder_target_data,
 num_encoder_tokens, num_decoder_tokens,
 input_token_idx, target_token_idx,
 encoder_max_length, decoder_max_length) = embed_train_data(train_data_lines)

(val_encoder_input_data, val_decoder_input_data, val_decoder_target_data,
 target_token_idx, val_target_data) = embed_val_data(
    val_data_lines, num_decoder_tokens, input_token_idx, target_token_idx)

reverse_input_char_index = dict((i, char) for char, i in input_token_idx.items())

reverse_target_char_index = dict((i, char) for char, i in target_token_idx.items())

Attention Mechanism

This involves using an attention mechanism to selectively attend to certain parts of the input
text. The attention mechanism weights the importance of each input feature based on its relevance
to the transliteration task. This can improve the efficiency of the model by reducing the number of
irrelevant features.
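A minimal way to express such a layer in Keras is sketched below; it only illustrates the mechanism, and encoder_seq and decoder_seq are assumed to be the full output sequences (return_sequences=True) of the encoder and decoder RNNs rather than variables defined elsewhere in this report.

from tensorflow.keras import layers

encoder_seq = layers.Input(shape=(None, 256))   # assumed encoder output sequence
decoder_seq = layers.Input(shape=(None, 256))   # assumed decoder output sequence

# Each decoder step attends over all encoder steps; the context is concatenated back in
context = layers.Attention()([decoder_seq, encoder_seq])
combined = layers.Concatenate()([decoder_seq, context])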

Dropout Regularization

This involves randomly dropping out some input features during training to prevent
overfitting. This can improve the generalization of the model and prevent it from memorizing the
training data.
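In Keras this typically amounts to passing dropout arguments to the recurrent layer, as in this small sketch (the rates shown are illustrative, not the tuned values of this project):

from tensorflow.keras import layers

# Dropout on the layer inputs and on the recurrent connections of an LSTM layer
lstm_layer = layers.LSTM(256, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)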

4.3 Preprocessing and Feature Selection Steps

In the case of building a transliteration system using a Devanagari dataset, there are two
critical steps: Preprocessing and Feature Selection. Preprocessing is the initial step in building a
transliteration system; its goal is to transform the dataset into a format that can be easily processed
by the deep learning algorithms.

Preprocessing includes the following steps


Text Cleaning: Remove any irrelevant characters, punctuation, and symbols from the text data.

import re

non_eng_letters_regex = re.compile('[^a-zA-Z ]')


# Remove all English non-letters
def cleanEnglishVocab(line):
    line = line.replace('-', ' ').replace(',', ' ').upper()
    line = non_eng_letters_regex.sub('', line)
    return line.split()

Tokenization: Split the text into individual words or characters to create a sequence.

Normalization: Convert the text to a standard format, such as converting uppercase to lowercase and
removing diacritics.
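A small sketch of such normalization for the Latin-script side, using Python's standard unicodedata module, might look like this:

import unicodedata

# Lowercase the text and strip combining diacritical marks
def normalize_latin(text):
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(normalize_latin("Adelíne"))   # prints "adeline"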

Feature Selection
The second step in building a transliteration system is feature selection. Feature selection is
the process of selecting the most relevant features from the pre-processed data. The goal of feature
selection is to reduce the number of features to improve the accuracy of the model and reduce the
time and computational resources required to train the model.

Feature selection includes the following steps

Feature Extraction: Extract relevant features from the preprocessed data. In the case of transliteration,
features may include the presence of specific characters, the frequency of character combinations, and
the position of characters within the word.

Character Mapping: Character mapping is the process of mapping input characters to output
characters. This is important in transliteration systems because the input and output characters may
not be the same.

Character Mapping for Hindi

hindi_alphabets = [chr(alpha) for alpha in range(2304, 2432)]


hindi_alphabet_size = len(hindi_alphabets)
hindi_alpha2index = {pad_char: 0}
for index, alpha in enumerate(hindi_alphabets):
    hindi_alpha2index[alpha] = index + 1
print(hindi_alpha2index)

(Output: a dictionary mapping the padding token and each of the 128 characters of the Devanagari
Unicode block, U+0900 to U+097F, to an integer index; the full listing is abridged here because the
combining characters do not print cleanly.)

Character Mapping for English

eng_alphabets = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
pad_char = '-PAD-'
eng_alpha2index_r = {}
for index, alpha in enumerate(eng_alphabets):
    eng_alpha2index_r[alpha] = index
print(eng_alpha2index_r)

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9, 'K': 10, 'L': 11, 'M': 12, 'N': 13, 'O':
14, 'P': 15, 'Q': 16, 'R': 17, 'S': 18, 'T': 19, 'U': 20, 'V': 21, 'W': 22, 'X': 23, 'Y': 24, 'Z': 25}
Chapter 5

MODEL DEVELOPMENT

5.1 Model Architecture

Figure:5.1 Model Architecture

LSTM & GRU ARCHITECTURE WITH SEQ2SEQ MODEL


An encoder that reads the input sequence and a decoder that creates the output sequence
make up an encoder-decoder architecture. In transliteration, the source language text is the input
sequence, and the target language text is the output sequence. Recurrent neural network (RNN)
units of the LSTM and GRU varieties are frequently employed in encoder-decoder architectures for
transliteration. They are designed to deal with long-term dependencies in sequential data and to
prevent the vanishing gradient issue that can occur with conventional RNNs. In a transliteration
system employing LSTM or GRU, the encoder transforms the input text sequence into a fixed-length
vector representation, known as the encoder hidden state. The hidden state acts as the decoder's
starting point and provides details about the input sequence.

The decoder then creates the output sequence one token at a time using the encoder's
hidden state. At each time step the decoder generates the following token by using the previous
token in the output sequence and the decoder's hidden state at that time. Based on the prior hidden
state and the current input token, the decoder's hidden state is updated at each time step.

The model is tuned during training to reduce the discrepancy between the intended output
sequence and the predicted output sequence. The specific transliteration task determines the loss
function to be utilized. The encoder-decoder architecture with LSTM or GRU units is a potent
method for transliteration that can manage complex sequences and yield precise results.

The model implements the transliteration process in two ways: using a plain Encoder-Decoder
recurrent neural network, and with an attention model. The application of teacher forcing at the
start of the training phase helped the algorithm develop an understanding of the data much faster.
Of the two methods, the attention model showed much greater accuracy, due to the fact that it pays
attention only to the important part of the input data rather than the entire input sequence.

Encoder and Decoder model to simplify our work

Encoder = Embedding layer (input language) + RNN layer

Decoder = Embedding layer (target language) + RNN layer + Dense layer (to predict
next character)

Vanilla model

In contrast to traditional feed-forward neural networks, recurrent neural networks are a sort
of network architecture that accepts variable inputs and variable outputs.

Sequence to Sequence

This is a blend of many-to-one and one-to-many architectures, used for sequence-to-sequence
models where you might want to perform tasks like machine translation. A variable-sized input,
such as an English sentence, is received by the encoder, which encodes it into a hidden state
vector. The hidden state vector is then received by the decoder, which generates a variable-sized
output. Utilizing this architecture is driven by its modularity: encoders and decoders may easily be
swapped out to accommodate various language pairs.
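At inference time the decoder of such a model is usually run one character at a time, feeding each predicted character back in until an end-of-sequence marker appears. The greedy-decoding sketch below assumes separate encoder and decoder inference models, an encode_word() helper, and tab/newline as start and end tokens; these are illustrative assumptions, not the exact code of this project.

import numpy as np

def transliterate(word, encoder_model, decoder_model, target_token_idx,
                  reverse_target_char_index, max_len=30):
    states = encoder_model.predict(encode_word(word))      # encode_word: assumed helper
    target_seq = np.array([[target_token_idx["\t"]]])      # assumed start-of-sequence token
    output = ""
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target_seq] + states)
        char_idx = int(np.argmax(probs[0, -1, :]))
        char = reverse_target_char_index[char_idx]
        if char == "\n":                                    # assumed end-of-sequence token
            break
        output += char
        target_seq = np.array([[char_idx]])                # feed the prediction back in
        states = [h, c]
    return output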

5.2 Algorithms Applied

LSTM Model Architecture

The LSTM model consists of multiple LSTM cells, which are capable of remembering and
forgetting information over long periods. Each LSTM cell has three gates: input gate, output gate,
and forget gate, which control the flow of information. The process of transliterating involves
changing a text's writing system. Using an encoder-decoder architecture with LSTM or GRU units
is one typical method of transliteration in machine learning.

Figure:5.2 LSTM Architecture

GRU Model Architecture


The GRU model is similar to the LSTM model, but it has only two gates: reset gate and update
gate, which control the flow of information. Recurrent neural network (RNN) units of the LSTM
and GRU varieties are frequently employed in encoder-decoder architectures for transliteration.

Figure:5.3 GRU Architecture

5.3 Training Overview


Transliteration models are typically trained on pairs of text in different scripts, where one
script serves as the source and the other as the target. The model compilation involves selecting the
optimizer algorithm, loss function, and evaluation metrics that are appropriate for the specific
problem being solved.

The optimizer algorithm could be Adam or SGD; the optimizer determines how the model's
weights are updated during training to minimize the loss function. RMSprop (Root Mean Square
Propagation) is an adaptive learning rate optimization algorithm used for training artificial neural
networks. It is designed to improve upon the problems with the Adagrad optimizer, which adapts
the learning rate to each individual parameter but accumulates squared gradients over the entire
training process. RMSprop addresses the problems with Adagrad by introducing an exponentially
weighted moving average of the squared gradients.

The algorithm calculates the running average of the squared gradients over time and divides
the current gradient by the root mean square (RMS) of these squared gradients. A decay rate controls
the weighted moving average of the squared gradients, epsilon is a small constant to avoid division
by zero, and the learning rate is the usual learning rate hyperparameter. The RMSprop optimizer is
effective at dealing with the problem of vanishing or exploding gradients, which can occur when
training deep neural networks. It has become a popular choice for optimizing deep neural networks
in a variety of applications, including computer vision, natural language processing, and speech
recognition.
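Concretely, the update the preceding paragraph describes can be written as a couple of lines per parameter; the numbers below are illustrative, not tuned values from this project.

# One RMSprop update step for a single parameter (illustrative values)
rho, learning_rate, epsilon = 0.9, 0.001, 1e-7
param, avg_sq_grad = 0.5, 0.0

grad = 0.2                                                  # gradient of the loss w.r.t. param
avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2     # running average of squared gradients
param -= learning_rate * grad / ((avg_sq_grad + epsilon) ** 0.5)
print(param)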

The loss function could be categorical cross-entropy, which measures the difference between
the model's predictions and the true values; the goal is to minimize the loss function during
training. Categorical cross-entropy is a common loss function used for training neural networks in
this task. It measures the difference between the predicted probability distribution and the true
probability distribution of the output. In a transliteration system, the output can be represented as a
sequence of characters or phonemes in the target writing system.

Let's assume that we have a dataset of pairs of words in the source writing system and their
corresponding transliterated versions in the target writing system. The goal is to train a neural
network to predict the target transliteration given a source word. For each word in the dataset, the
network produces a probability distribution over the possible target transliterations. The categorical
cross-entropy loss function is then calculated by comparing this predicted probability distribution
with the true probability distribution. The true probability distribution can be represented as a
one-hot vector, where the element corresponding to the correct transliteration is set to 1, and all
other elements are set to 0.

In other words, the loss function penalizes the network for assigning low probabilities to
the correct transliteration and high probabilities to incorrect transliterations. During training, the
network adjusts its parameters to minimize the average categorical cross-entropy loss over the entire
dataset. This encourages the network to make more accurate predictions for new words in the future.

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data],
          decoder_target_data,
          batch_size=batch_size,
          epochs=epochs)

The evaluation metrics are used to measure the model's performance during and after
training, and they help to assess how well the model can generalize to new data. Once the model
has been compiled, it is ready to be trained on the dataset, and the training process involves iteratively
adjusting the weights to minimize the loss function. Once training is complete, the model can be
used to transliterate text from one script to another.

Chapter 6

EXPERIMENTAL DESIGN AND EVALUATION

6.1 Experimental Design

Figure:6.1 Model workflow

Step 1: Input Layer: The input layer consists of a sequence of characters in the source
language (Devanagari script).

Step 2: Embedding Layer: The embedding layer converts the input sequence into a dense
vector representation.

Step 3: LSTM Layer: The LSTM layer processes the input sequence and generates a sequence
of hidden states, which capture the context and meaning of the input.
GRU Layer: The GRU layer processes the input sequence and generates a sequence of
hidden states, which capture the context and meaning of the input.

Step 4: Dense Layer: The dense layer maps the hidden states to the target language (Latin
script) characters.

Step 5: Output Layer: The output layer generates the predicted sequence of characters in the
target language.

Both LSTM and GRU models can be trained using the backpropagation algorithm with the
cross-entropy loss function. During training, the model learns to minimize the difference between
the predicted sequence and the ground truth sequence in the target language. Once the model is
trained, it can be used to transliterate new sequences of characters from the source language to the
target language.

6.2 Experimental Results

Experiment: Predicted Vanilla Model

Figure:6.2 Predicted Vanilla Model

Table 6.1 Evaluation Result

The performance of the LSTM and GRU cell types was superior to the simple RNN, and LSTM
outperformed GRU. Two encoder layers and three decoder layers produce good results; a model
should learn the dataset more effectively with a higher number of layers. Most of the time, a
learning rate of 0.001 results in good performance. Instead of giving a model 0 dropout, adding
dropout enhances the model and speeds up computation. The performance of the model is improved
by increasing the layer-by-layer embedding size. The best test accuracy obtained for these
hyperparameters in the seq2seq model is 34.273% (for exact string match).
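Exact string match accuracy here simply counts how many predicted transliterations are identical to the reference words. A minimal sketch of that metric, where predict_fn is an assumed wrapper around the trained model, is:

# Word-level (exact string match) accuracy over (english_word, reference_hindi_word) pairs
def exact_match_accuracy(pairs, predict_fn):
    correct = sum(1 for eng, ref in pairs if predict_fn(eng) == ref)
    return correct / len(pairs)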

6.3 CUSTOMER EVALUATION AND FEEDBACK

The proposed system attempts to correctly identify the input text and output its transliteration,
reducing manual work and functioning reliably.

EVALUATION

The transliteration system should be able to accurately convert text from one script to
another without any errors; customers would expect the system to produce accurate and error-free
results. In terms of speed, the system should be able to convert text quickly; customers would
expect it to be fast and efficient, especially if they have to process a large amount of text. For ease
of use, the transliteration system should be easy to learn and understand; customers would expect
a user-friendly interface that is easy to navigate.
The transliteration system should also be customizable according to the needs of the
customer; customers would expect the system to be flexible and adaptable. Customers would
further expect a responsive and helpful support team that can assist with any issues or questions
that arise. Overall, the customer evaluation of a transliteration system depends on how well the
system meets their specific needs and requirements.

FEEDBACK

Feedback is an important aspect of evaluating a transliteration system, as it provides valuable
insights into how well the system is performing and what improvements can be made. Feedback for
a transliteration system can come from various sources, such as end-users, system administrators,
and technical support staff. Feedback on the accuracy of the transliteration system is crucial, as
accuracy is the most important aspect of the system's performance.
Feedback on the customization options available in the system can be obtained by assessing
whether the system allows users to customize the transliteration rules or add their own custom rules.
Feedback on the performance of the system can be obtained by assessing how fast it transliterates
text. Feedback on the support provided can be obtained by assessing how responsive and helpful
the technical support team is. This feedback can be used to identify areas of improvement and
prioritize future development efforts to enhance the system's performance.

Chapter 7

MODEL OPTIMIZATION

7.1 Overview of Model Tuning and Best Parameters Selection

In a transliteration system using GRU, the batch size refers to the number of samples that will
be used to train the model at once. When training a neural network, the training data is usually
divided into batches, which are processed one at a time. This is done for computational efficiency
and to allow for the use of parallel processing. The batch size determines how many samples are
used to compute the gradient of the loss function, which is used to update the model's weights during
training. A larger batch size will generally result in more stable updates and faster convergence, but
it may also require more memory and computation power. In the model.fit function, you can specify
the batch_size parameter to control the batch size used during training.

For example:

model.fit(X_train, y_train, batch_size=128, epochs=10)

This code will train the model using batches of 128 samples at a time for 10 epochs. You can
experiment with different batch sizes to see how they affect the performance of your model.

Retraining model with best Parameters

train, enc, dec = build_model(units=256, dense_size=512, enc_layers=2, dec_layers=3,
                              cell="GRU", embedding_dim=64)
train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
checkpoint = tf.keras.callbacks.ModelCheckpoint('best_model.h5',
                                                monitor='val_accuracy', mode='max',
                                                save_best_only=True, verbose=1)
train.fit([trainx, trainxx], trainy,
          batch_size=128,
          validation_data=([valx, valxx], valy),
          epochs=5,
          callbacks=[checkpoint])

Validation data: during the model training process, it is essential to monitor the performance of
the model on a separate dataset to ensure that it is not overfitting. This dataset is called the validation
dataset, and it is used to evaluate the model's performance after each epoch. The validation dataset
should be different from the training dataset and should not be used for training the model. The
purpose of the validation dataset is to evaluate the model's performance on unseen data and to
prevent overfitting.

In the model.fit function, you can specify the validation data using the validation_data
parameter. The call below trains the model using the training data X_train and y_train with a batch
size of 32 for 10 epochs; after each epoch, the model's performance is evaluated on the validation
data X_val and y_val. An epoch is a single pass through the entire training dataset. During each
epoch, the model is trained on all the samples in the training dataset.
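A sketch of that call, with X_val and y_val being the validation arrays mentioned above, is:

model.fit(X_train, y_train,
          batch_size=32,
          epochs=10,
          validation_data=(X_val, y_val))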

The number of epochs determines how many times the model will iterate over the entire
training dataset. In the model.fit function, you can specify the number of epochs using the epochs
parameter. During each epoch, the model will iterate over all the samples in X_train and y_train.
After each epoch, the model's performance will be evaluated on the validation data X_val and y_val.
You can experiment with different values for the number of epochs to find the optimal number of
iterations required for the model to achieve good performance.

7.2 Model Tuning Process and Experiments
Early Stopping

Early stopping is a technique used during the training of deep learning models, including
transliteration systems, to prevent overfitting and improve generalization performance. Early
stopping works by monitoring the performance of the model on a validation dataset during the training
process. The validation dataset is a set of examples that the model has not seen before and is used to
assess the generalization performance of the model. During training, the performance of the model on
the validation dataset is evaluated after each epoch (i.e., one pass through the training dataset). If the
performance on the validation dataset does not improve for a certain number of consecutive epochs,
the training is stopped early, and the model with the best performance on the validation dataset is
saved.
By stopping the training early, it prevents the model from continuing to learn patterns in the
training data that may not generalize well to new examples. Instead, we choose the model that
performs the best on the validation dataset, which is a better indicator of its generalization
performance. Overall, early stopping is an effective technique to improve the generalization
performance of transliteration systems and prevent overfitting to the training data.

# Early Stopping
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

earlyStopping = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='min')

# Configuration values (units, dense_size, enc_layers, ...) come from the current hyperparameter setting
train, enc, dec = build_model(units=units,
                              dense_size=dense_size,
                              enc_layers=enc_layers,
                              dec_layers=dec_layers,
                              cell=cell,
                              dropout=dropout,
                              embedding_dim=embedding_dim)
train.compile(optimizer=Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

The model gives better performance when configured with these parameters:

Table 7.1 Experiment Parameters

Configuration     Value
Batch size        128
Cell              GRU
Decoder layers    2
Dense size        512
Dropout           0.2
Embedding size    64
Encoder layers    1
Learning rate     0.001
Units             256

The Hindi language corpus from the Dakshina dataset was used. These are the hyperparameters and
the values that were searched over in order to find the best setting:

Cell Type: RNN, LSTM, GRU


Number of encoder layers: 1, 2, 3
Number of decoder layers: 1, 2, 3
Neurons in dense layer: 64, 128, 512
Units in cell type: 256
Embedding layers: 64, 128, 256
Learning rate: 0.01, 0.001
Dropout: 0.0, 0.2, 0.4

The results obtained with the proposed model are:


Training set accuracy (character-wise) = 93%
Validation set accuracy (character-wise) = 80%

Figure:7.3 Model Accuracy

Chapter 8

USER INTERFACE DESIGN AND EVALUATION

8.1 Designing Graphical User Interfaces


The manual explains why the transliteration system is needed and what it is used for. It will
help users understand the context and goals of the system.

Installation and Setup: The manual should provide instructions for installing and setting up
the system on a user's computer or device. Flask and the other required libraries and packages are
installed inside a virtual environment that is first created and activated. Following a common
convention, the dependencies, including Flask, are listed in a requirements.txt file, and an app.py
file containing the full Python program is created; app.py is the entry point for Flask applications.
The manual also explains how to input text into the system and how the output will be displayed.
This may include information on supported languages and character sets, and on how specific
characters or combinations of characters are handled.
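A minimal app.py of the kind described might look like the sketch below; the template name, the form field, and the transliterate() helper wrapping the trained model are assumptions for illustration, not the project's exact code.

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    result = ""
    if request.method == "POST":
        word = request.form.get("word", "")
        result = transliterate(word)        # assumed wrapper around the trained seq2seq model
    return render_template("index.html", result=result)

if __name__ == "__main__":
    app.run(debug=True)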

Figure:8.1 Designing GUI

8.2 Testing Graphical User Interface


Testing a Graphical User Interface (GUI) for a transliteration system involves evaluating
the system's user interface components and functionality that allow users to input text in one script
and convert it to another script.
Verify Input and Output: The first step is to check whether the system is taking the input in
the correct format, and whether the output generated by the system is accurate and matches the
expected output. This can be done by comparing the transliterated text with a known transliteration
tool or by using native speakers to assess the accuracy of the output.

Functional Testing: Functional testing is important to ensure that the system is working
correctly and without any errors. This involves testing the system's functionality, such as checking
whether the system correctly handles special characters and punctuation marks, whether it can
handle different input formats, and whether it is capable of handling long texts.

Compatibility Testing is essential to ensure that the GUI is compatible with different
platforms and browsers, as well as with different screen sizes and resolutions. Compatibility testing
helps to ensure that users can access the transliteration system from different devices and platforms
without any issues. Performance Testing examines the system's behaviour under different
conditions, such as high traffic or when handling large input files. Performance testing helps to
identify any bottlenecks in the system that may affect its performance, and to optimize the system
to improve its speed and reliability.

Overall, testing a GUI for a transliteration system involves evaluating the system's input
and output accuracy, usability, functionality, compatibility, and performance. Proper testing
ensures that the system is efficient, reliable, and provides users with a seamless experience while
using the transliteration system.

Figure:8.2 Testing GUI

Chapter 9

PRODUCT DELIVERY AND DEPLOYMENT

The Delivery Schedule for the transliteration system using Flask would typically involve
the following stages:

Development: The development stage involves creating the transliteration system using the Flask
framework. This may include setting up the Flask environment, creating the necessary modules and
components, and testing the system for functionality and performance.

Testing: Once the system is developed, it must be tested thoroughly to ensure that it meets the project
requirements and functions as intended. This may involve testing for errors, bugs, and other issues,
as well as validating the accuracy of the transliteration system.

Deployment: Once the system has been tested and validated, it can be deployed to the target
environment. This may involve setting up the necessary infrastructure, configuring the system, and
deploying the application to the production environment.

Maintenance: After the system is deployed, it must be maintained to ensure that it continues to
function properly and meet the changing needs of the users. This may involve performing regular
updates and maintenance tasks, as well as providing technical support to users who encounter issues
with the system. The delivery schedule for a transliteration system using Flask would typically
depend on the scope and complexity of the model. The delivery schedule is shown in the table below.

Table 9.1 Delivery Schedule

Chapter 10

CONCLUSION

This work concludes with an encoder-decoder model in which the encoder is responsible for
reading a word in the source language and encoding it into an internal representation. The
transliteration of the word into the target language is produced by the decoder model, utilizing the
encoded representation of the source language. The entire encoded input is used as context for
generating each step of the output. The RNN can focus on certain parts of the input sequence when
predicting a certain part of the output sequence, which increases the quality of learning and also
makes learning easier.

The models have learned to map the English characters to their corresponding Hindi
characters and generate accurate transliterations. The use of LSTM and GRU models for the
transliteration system from English to Hindi is a promising approach that can help bridge the
language gap between these two languages and enable effective communication.

10.1 Summary
The purpose of a transliteration system is to enable people to read and write text in a language
they may not be familiar with by providing a phonetic representation of the original text.
Transliteration systems can be used for a variety of applications, including language translation
and communication between people who speak different languages. LSTM and GRU, used here to
build the transliteration system, are both types of recurrent neural networks that are designed to
handle sequential data such as text.

The training process for a transliteration system using LSTM and GRU involves feeding the
network with large amounts of data in both the source and target writing systems. The network is
then able to learn the patterns and relationships between the two writing systems and use this
knowledge to generate accurate transliterations for new input data. These networks are able to retain
information over long spans of input, making them well-suited for tasks such as transliteration,
where the output depends on the context and structure of the input text.
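As a small illustration of this setup, each (Latin, Devanagari) word pair can be converted into integer sequences over the two character vocabularies before being fed to the network. The snippet below is a simplified sketch with a toy pair; the variable names are illustrative and do not correspond one-to-one to the code in Appendix B.

# Simplified sketch: building character vocabularies and integer-encoding one word pair.
pairs = [("ankur", "\tअंकुर\n")]   # "\t" marks the start of the target, "\n" marks the end

src_chars = sorted({c for src, _ in pairs for c in src})
tgt_chars = sorted({c for _, tgt in pairs for c in tgt})
src_index = {c: i + 1 for i, c in enumerate(src_chars)}   # index 0 reserved for padding
tgt_index = {c: i + 1 for i, c in enumerate(tgt_chars)}

encoder_input = [[src_index[c] for c in src] for src, _ in pairs]
decoder_input = [[tgt_index[c] for c in tgt] for _, tgt in pairs]
print(encoder_input, decoder_input)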

10.2 Limitations and Future Work
Transliteration systems are used for a variety of purposes, including facilitating
communication between speakers of different languages and helping to transcribe names and terms
in other languages. The current system nevertheless has some limitations, and several directions for
future work can be identified.

Ambiguity: Many languages have multiple ways of representing the same sound or word, which
can make it difficult to determine the correct transliteration.

Lack of standardization: There are often multiple competing transliteration systems for a given
language, which can lead to confusion and inconsistencies in the way that names and terms are
written.

Artificial intelligence and machine learning: Advanced algorithms could be trained to recognize
patterns in language and improve the accuracy of transliteration.

Standardization: Efforts to establish a standard transliteration system for each language could
help to reduce confusion and ensure consistency across different applications.

Integration with other technologies: Transliteration systems could be integrated with other
technologies, such as speech recognition and natural language processing, to enable more
seamless communication across languages.

REFERENCES

[1] B.-J. Kang and K.-S. Choi, "Automatic Transliteration and Back-transliteration by Decision Tree Learning," in LREC, 2000.

[2] S. Wan and C. M. Verspoor, "Automatic English-Chinese name transliteration for development of multilingual resources," in Proceedings of the 17th International Conference on Computational Linguistics - Volume 2, 1998, pp. 1352-1356.

[3] M. Arbabi, S. M. Fischthal, V. C. Cheng, and E. Bart, "Algorithms for Arabic name transliteration," IBM Journal of Research and Development, vol. 38, pp. 183-194, 1994.

[4] M. L. Dhore, S. K. Dixit, and T. D. Sonwalkar, "Hindi to English machine transliteration of named entities using conditional random fields," International Journal of Computer Applications, vol. 48, pp. 31-37, 2012.

[5] K. Deep and V. Goyal, "Development of a Punjabi to English transliteration system," International Journal of Computer Science and Communication, vol. 2, pp. 521-526, 2011.

[6] H. Surana and A. K. Singh, "A More Discerning and Adaptable Multilingual Transliteration Mechanism for Indian Languages," in IJCNLP, 2008, pp. 64-71.

[7] P. Sanjanaashree, "Joint layer based deep learning framework for bilingual machine transliteration," in Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on, 2014, pp. 1737-1743.

[8] P. Antony, V. Ajith, and K. Soman, "Kernel method for English to Kannada transliteration," in Recent Trends in Information, Telecommunication and Computing (ITC), 2010 International Conference on, 2010, pp. 336-338.

[9] S. Mathur and V. P. Saxena, "Hybrid approach to English-Hindi named entity transliteration," in Electrical, Electronics and Computer Science (SCEECS), 2014 IEEE Students' Conference on, 2014, pp. 1-5.

[10] G. S. Lehal and T. S. Saini, "Development of a Complete Urdu-Hindi Transliteration System," in COLING (Posters), 2012, pp. 643-652.

[11] A. Malik, L. Besacier, and C. Boitet, "A Hybrid Model for Urdu Hindi Transliteration," in Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, Suntec, Singapore, August 2009, pp. 177-185.

[12] G. S. Josan and J. Kaur, "Punjabi to Hindi Statistical Machine Transliteration," International Journal of Information Technology & Knowledge Management (IJITKM), April 2011.

[13] A. Way, S. K. Naskar, S. Dandapat, A. K. Srivastava, and R. Haque (CNGL), context-informed phrase-based statistical machine transliteration for English to Hindi, 2009.

[14] A. Das, A. Ekbal, T. Mandal, and S. Bandyopadhyay, a letter-to-phoneme approach to the transliteration problem, 2009.

[15] T. Rama and K. Gali, "Modeling Machine Transliteration as a Phrase Based Statistical Machine Translation Problem," in Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, Suntec, Singapore, August 2009, pp. 124-127.

APPENDIX-A

DATASET

अं an 3
अंकगणित ankganit 3
अंकल uncle 4
अंकुर ankur 4
अंकुरण ankuran 3
अंकुरित ankurit 3
अंकुश aankush 1
अंकुश ankush 3
अंग ang 2
अंग anga 1
अंगद agandh 1
अंगद angad 2
अंगने angane 3
अंगभंग angbhang 3
अंगरक्षक angarakshak 1
अंगरक्षक angrakshak 2
अंगारा angara 3
अंगारे angaare 1
अंगारे angare 2
अंगी angi 3
अंगीकार angikar 3
अंगुठे anguthe 3
अंगुल angul 3
अंगुलियों anguliyon 3
अंगुली anguli 2
अंगुली ungli 1
अंगूठा angutha 3
अंगूठियों aanguthiyon 1
अंगूठियों anguthiyon 2
अंगूठी anguthi 3
अंगूठे anguthe 2
अंगूठे anguthon 1
अंगूर angoor 1
दुआओं duaon 2

दुकानदार dukaandaar 1
दुकानदार dukaardaar 1
दुकानदार dukandar 1
दुकानदारी dukaandaari 2
दुकानदारी dukandari 1
दुकानदारों dukaandaaron 2
दुकानदारों dukandaron 1
दुखद dukhad 5
दुखदाई dukhdaai 2
दुखदाई dukhdae 1
दुखने dukhne 3
दुखांत dukhaant 2
दुखांत dukhant
दुगना dugana 1
दुगना dugna 2
दुगनी dugani 1
दुगनी dugni 1
दुगनी duguni 1
दुगुनी duguni 3
दुग्ध dugdh 2
दुग्ध dugdha 1
दुधवा dudhavaa 2
दुधवा dudhwa 2
दुधारू dudhaaroo 1
दुधारू dudhaaru 1
दुधारू dudharu
दुनिया duniya 2
दुनिया duniyaa 2
दुनियां duniyaa 1
दुनियां duniyan 2
सिद्धि siddhi 6
सिद्धि sidhi 1
सिद्धियाँ siddhiyan
सिद्धियों siddhiyon 2
सिद्धियों siddiyon 1
सिद्धिविनायक siddhivinayak 3
सिद्धू siddhu 1
सिद्धू sidhu 2
सिद्धों siddhon 3
सिद्धांतों siddhanton 3
सिन scene 1
सिन sin 2
सिने cine 2
सिने sine 1
सिनेमा cinema 5
सिनेमा sinema 1
सिनेमाघर cinemaghar 3
सिनेमाघरों cinemagharon 3
सिन्हा sinha 4
सिप्पी sippy 3
सिप्ला cipla 2
सिप्ला sipla 1
सिफर sifar 2
सिफर siphar 2
हल hal 3
हलचल halchal 2
हलचल hulchul 1
हलद्वानी haldwani 3
हलफनामा halafnama
हलफनामा halfnama
हलाल halal 3
हल्क hulk 3
हल्का halka 3
हल्का hulka 1
हल्की halki 3
हल्की hulki 1
हल्के halke 3
हल्के hulke 1
हल्दीराम haldiram 2
हल्द्द्वानी haldrani 1
हल्द्द्वानी haldwani 2
हवन havan 2
हवन hawan 1
हवलदार havaldar
हवलदार hawaldar
हवा hava 1
हवा hawa 5
हवाई havai 1
हवाई hawai 2

APPENDIX-B

SOURCE CODE
# Required packages
import math
import random
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import backend
from random import randrange
from tensorflow import keras
from google.colab import files
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from tensorflow.python.keras.models import load_model
from tensorflow.python.keras.callbacks import EarlyStopping

# Downloading dataset
!wget https://storage.googleapis.com/gresearch/dakshina/dakshina_dataset_v1.0.tar
!tar -xf 'dakshina_dataset_v1.0.tar'

# Paths of train and valid datasets.
train_data_path = "dakshina_dataset_v1.0/ta/lexicons/ta.translit.sampled.train.tsv"
val_data_path = "dakshina_dataset_v1.0/ta/lexicons/ta.translit.sampled.dev.tsv"

# Saving the files in lists
with open(train_data_path, "r", encoding="utf-8") as file:
    train_data_lines = file.read().split("\n")
with open(val_data_path, "r", encoding="utf-8") as file:
    val_data_lines = file.read().split("\n")

# Fixed parameter
batch_size = 64

# embedding train data
def embed_train_data(train_data_lines):
    lenk = len(train_data_lines) - 1
    train_input_data = []
    train_target_data = []
    input_data_characters = set()
    target_data_characters = set()
    for line in train_data_lines[:lenk]:
        target_data, input_data, _ = line.split("\t")

        # We are using "tab" as the "start sequence" and "\n" as "end sequence".
        target_data = "\t" + target_data + "\n"
        train_input_data.append(input_data)
        train_target_data.append(target_data)

        # Finding unique characters.
        for ch in input_data:
            if ch not in input_data_characters:
                input_data_characters.add(ch)
        for ch in target_data:
            if ch not in target_data_characters:
                target_data_characters.add(ch)
    print("Number of samples:", len(train_input_data))

    # adding space
    input_data_characters.add(" ")
    target_data_characters.add(" ")

    # sorting
    input_data_characters = sorted(list(input_data_characters))
    target_data_characters = sorted(list(target_data_characters))

    # maximum length of the words
    encoder_max_length = max([len(txt) for txt in train_input_data])
    decoder_max_length = max([len(txt) for txt in train_target_data])
    print("Max sequence length for inputs:", encoder_max_length)
    print("Max sequence length for outputs:", decoder_max_length)

    # number of input and target characters
    num_encoder_tokens = len(input_data_characters)
    num_decoder_tokens = len(target_data_characters)
    print("Number of unique input tokens:", num_encoder_tokens)
    print("Number of unique output tokens:", num_decoder_tokens)

# embedding validation data
def embed_val_data(val_data_lines, num_decoder_tokens, input_token_idx, target_token_idx):
    val_input_data = []
    val_target_data = []
    lenk = len(val_data_lines) - 1
    for line in val_data_lines[:lenk]:
        target_data, input_data, _ = line.split("\t")

# Build RNN model
def seq2seq(embedding_size, n_encoder_tokens, n_decoder_tokens, n_encoder_layers,
            n_decoder_layers, latent_dimension, cell_type, target_token_idx,
            decoder_max_length, reverse_target_char_index, dropout,
            encoder_input_data, decoder_input_data, decoder_target_data,
            batch_size, epochs):
    encoder_inputs = keras.Input(shape=(None,), name='encoder_input')
    encoder = None
    encoder_outputs = None
    state_h = None
    state_c = None
    e_layer = n_encoder_layers

    # GRU
    elif cell_type == "GRU":
        embed = tf.keras.layers.Embedding(input_dim=n_encoder_tokens, output_dim=embedding_size,
                                          name='encoder_embedding')(encoder_inputs)
        encoder = keras.layers.GRU(latent_dimension, return_state=True, return_sequences=True,
                                   name='encoder_hidden_1', dropout=dropout)
        encoder_outputs, state_h = encoder(embed)

    # number of decoder layers
    d_layer = n_decoder_layers
    decoder = None
    decoder = keras.layers.GRU(latent_dimension, return_sequences=True, return_state=True,
                               name='decoder_hidden_1', dropout=dropout)

    decoder_dense = keras.layers.Dense(n_decoder_tokens, activation="softmax", name='decoder_output')
    decoder_outputs = decoder_dense(decoder_outputs)
    model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=['accuracy'])  # , metrics=[my_metric]
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
              batch_size=batch_size, epochs=epochs)

    # Inference Model
    encoder_inputs = model.input[0]
    encoder_outputs, state_h_enc = model.get_layer('encoder_hidden_' + str(n_encoder_layers)).output
    encoder_states = [state_h_enc]
    encoder_model = keras.Model(encoder_inputs, encoder_states)

    decoder_inputs = model.input[1]
    decoder_outputs = model.get_layer('decoder_embedding')(decoder_inputs)
    decoder_states_inputs = []
    decoder_states = []
    for j in range(1, n_decoder_layers + 1):
        decoder_state_input_h = keras.Input(shape=(latent_dimension,))
        current_states_inputs = [decoder_state_input_h]
        decoder = model.get_layer('decoder_hidden_' + str(j))
        decoder_outputs, state_h_dec = decoder(decoder_outputs, initial_state=current_states_inputs)
        decoder_states += [state_h_dec]
        decoder_states_inputs += current_states_inputs

    decoder_dense = model.get_layer('decoder_output')
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = keras.Model([decoder_inputs] + decoder_states_inputs,
                                [decoder_outputs] + decoder_states)
    return encoder_model, decoder_model

    # LSTM
    elif cell_type == "LSTM":
        embed = tf.keras.layers.Embedding(input_dim=n_encoder_tokens, output_dim=embedding_size,
                                          name='encoder_embedding')(encoder_inputs)
        encoder = keras.layers.LSTM(latent_dimension, return_state=True, return_sequences=True,
                                    name='encoder_hidden_1', dropout=dropout)
        encoder_outputs, state_h, state_c = encoder(embed)

        for i in range(2, e_layer + 1):
            layer_name = ('encoder_hidden_%d') % i
            encoder = keras.layers.LSTM(latent_dimension, return_state=True, return_sequences=True,
                                        name=layer_name, dropout=dropout)
            encoder_outputs, state_h, state_c = encoder(encoder_outputs, initial_state=[state_h, state_c])

        encoder_states = None
        encoder_states = [state_h, state_c]

    decoder_inputs = keras.Input(shape=(None,), name='decoder_input')
    embed_dec = tf.keras.layers.Embedding(n_decoder_tokens, embedding_size,
                                          name='decoder_embedding')(decoder_inputs)

    # number of decoder layers
    d_layer = n_decoder_layers
    decoder = None
    decoder = keras.layers.LSTM(latent_dimension, return_sequences=True, return_state=True,
                                name='decoder_hidden_1', dropout=dropout)

# Word-level (exact match) accuracy on the validation set
def accuracy(val_encoder_input_data, val_target_data, n_decoder_layers,
             encoder_model, decoder_model, verbose=False):
    correct_count = 0
    total_count = 0
    n_val_data = len(val_encoder_input_data)
    for seq_idx in range(n_val_data):

        # Updating the target sequence.
        target_charseq = np.zeros((1, 1))
        target_charseq[0, 0] = sampled_token_idx

        if decoded_sentence.strip() == val_target_data[seq_idx].strip():
            correct_count += 1
        total_count += 1
        if verbose:
            print('Prediction ', decoded_sentence.strip(), ', Ground Truth ', val_target_data[seq_idx].strip())
    accuracy = correct_count * 100.0 / total_count
    return accuracy

# parameters
embedding_size = 256
# n_encoder_tokens = num_encoder_tokens
# n_decoder_tokens = num_decoder_tokens
n_encoder_layers = 3
n_decoder_layers = 3
latent_dimension = 512
cell_type = 'LSTM'
# target_token_idx = target_token_idx
# decoder_max_length = decoder_max_length
# reverse_target_char_index = reverse_target_char_index
dropout = 0.3
epochs = 2

# Best model
encoder_model, decoder_model = seq2seq(embedding_size, num_encoder_tokens, num_decoder_tokens,
                                       n_encoder_layers, n_decoder_layers, latent_dimension,
                                       cell_type, target_token_idx, decoder_max_length,
                                       reverse_target_char_index, dropout, encoder_input_data,
                                       decoder_input_data, decoder_target_data, batch_size, epochs)

# encoder_model.save('encoder_model.h5')
# decoder_model.save('decoder_model.h5')

# val_accuracy = accuracy(val_encoder_input_data, val_target_data, n_decoder_layers,
#                         encoder_model, decoder_model)
# print('Validation accuracy: ', val_accuracy)

# val_accuracy = accuracy(val_encoder_input_data[0:subset], val_target_data[0:subset],
#                         n_decoder_layers, encoder_model, decoder_model) if subset > 0 \
#     else accuracy(val_encoder_input_data, val_target_data, n_decoder_layers,
#                   encoder_model, decoder_model)
print('Validation accuracy: ', val_accuracy)

# compute test accuracy
print('Reading test data')

test_data_path = "dakshina_dataset_v1.0/ta/lexicons/ta.translit.sampled.test.tsv"
with open(test_data_path, "r", encoding="utf-8") as f:
    test_lines = f.read().split("\n")

# embedding test data
test_input_data = []
test_target_data = []
for line in test_lines[: (len(test_lines) - 1)]:
    target_text, input_text, _ = line.split("\t")
    target_text = "\t" + target_text + "\n"
    test_input_data.append(input_text)
    test_target_data.append(target_text)

test_max_encoder_seq_length = max([len(txt) for txt in test_input_data])
test_max_decoder_seq_length = max([len(txt) for txt in test_target_data])

encoder_model = load_model("encoder_model.h5")
decoder_model = load_model("decoder_model.h5")
# Test accuracy
subset = 50
test_accuracy = accuracy(test_encoder_input_data[0:subset], test_target_data[0:subset],
                         n_decoder_layers, encoder_model, decoder_model) if subset > 0 \
    else accuracy(test_encoder_input_data, test_target_data, n_decoder_layers,
                  encoder_model, decoder_model)
print('Test accuracy (subset): ', test_accuracy)

test_accuracy = accuracy(test_encoder_input_data, test_target_data, n_decoder_layers,
                         encoder_model, decoder_model)
print('Test accuracy: ', test_accuracy)

# Seq2Seq model
# LSTM (Seq2Seq)
def build_model(cell="LSTM", units=32, enc_layers=1, dec_layers=1,
                embedding_dim=32, dense_size=32, dropout=None):
    keras.backend.clear_session()
    encoder_inputs = Input(shape=(None,))
    encoder_embedding = Embedding(input_dim=len(english_tokens) + 1,
                                  output_dim=embedding_dim, mask_zero=True)
    encoder_context = encoder_embedding(encoder_inputs)
    decoder_inputs = Input(shape=(None,))
    decoder_embedding = Embedding(input_dim=len(hindi_tokens) + 1,
                                  output_dim=embedding_dim, mask_zero=True)
    decoder_context = decoder_embedding(decoder_inputs)

    if cell == "LSTM":
        encoder_prev = [LSTM(units, return_sequences=True) for i in range(enc_layers - 1)]
        encoder_fin = LSTM(units, return_state=True)
        decoder = [LSTM(units, return_sequences=True, return_state=True) for i in range(dec_layers)]
        temp, sh, sc = decoder[0](decoder_context, initial_state=encoder_states)
        for i in range(1, dec_layers):
            temp, sh, sc = decoder[i](temp, initial_state=encoder_states)

    # gru
    elif cell == "GRU":
        encoder_prev = [GRU(units, return_sequences=True) for i in range(enc_layers - 1)]
        encoder_fin = GRU(units, return_state=True)
        temp = encoder_context
        for lay in encoder_prev:
            temp = lay(temp)
        temp, s = decoder[0](decoder_context, initial_state=state)
        for i in range(1, dec_layers):
            temp, s = decoder[i](temp, initial_state=state)

    elif cell == "RNN":
        encoder_prev = [SimpleRNN(units, return_sequences=True) for i in range(enc_layers - 1)]
        encoder_fin = SimpleRNN(units, return_state=True)
        temp = encoder_context
        for lay in encoder_prev:
            temp = lay(temp)

    dense_lay1 = Dense(dense_size, activation='relu')
    pre_out = dense_lay1(temp)
    final_output = dense_lay2(pre_out)
    decoder_model = Model(decoder_input_pass, [final_output] + state_outputs)
    return train, encoder_model, decoder_model

# Config is a variable (hyperparameters such as units and cell are read from the wandb sweep config)
train, enc, dec = build_model(units=units,
                              dense_size=dense_size,
                              enc_layers=enc_layers,
                              dec_layers=dec_layers,
                              cell=cell,
                              dropout=dropout,
                              embedding_dim=embedding_dim)
train.compile(optimizer=Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy', metrics=['accuracy'])

# Early Stopping
earlyStopping = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='min')

# To save the model with best validation accuracy
checkpoint = ModelCheckpoint('bestmodel.h5', monitor='val_accuracy', mode='max', verbose=0,
                             save_best_only=True)

train.fit([trainx, trainxx], trainy,
          batch_size=batch_size,
          validation_data=([valx, valxx], valy),
          epochs=10,
          callbacks=[WandbCallback(), earlyStopping, checkpoint])

print("model training done")
wandb.run.name = run_name
wandb.run.save()

return train

sweep_config = {
    'method': 'random',  # grid, random
    'metric': {
        'name': 'val_accuracy',
        'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'values': [0.01, 0.001]},
        'dense_size': {'values': [64, 128, 512]},
        'dropout': {'values': [0.0, 0.2, 0.4]},
        'units': {'values': [64, 128, 256]},
        'batch_size': {'values': [64, 128, 256]},
        'cell': {'values': ["LSTM", "GRU", "RNN"]},
        'embedding_size': {'values': [64, 128, 256]},
        'enc_layers': {'values': [1, 2, 3]},
        'dec_layers': {'values': [1, 2, 3]},
    }
}
sweep_id = wandb.sweep(sweep_config, entity="addy_15", project="v")
id = '0whn8jb2'
wandb.agent(id, train, entity="addy_15", project="v")

train, enc, dec = build_model(units=256, dense_size=512, enc_layers=2, dec_layers=3,
                              cell="GRU", embedding_dim=64)
train.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
checkpoint = train.fit([trainx, trainxx], trainy,
                       batch_size=128,
                       validation_data=([valx, valxx], valy),
                       epochs=5,
                       callbacks=[checkpoint])

# Defining inference model and getting predictions
def inference(inp, dec_layers, cell="LSTM"):
    statess = enc.predict(inp)
    target_seq = np.zeros((inp.shape[0], 1))
    target_seq[:, 0] = hin_token_map["\t"]
    states = []
    if cell == "LSTM":
        for c in range(dec_layers):
            states += [statess[0], statess[1]]
    else:
        for c in range(dec_layers):
            states += [statess]
    ans = np.zeros((inp.shape[0], max_hin_len))
    for i in range(max_hin_len):
        output = dec.predict([target_seq] + states, batch_size=64)
        ans[:, i] = np.argmax(output[0][:, -1, :], axis=1)
        target_seq[:, 0] = ans[:, i]
        states = output[1:]
    return ans

# printing 10 sample outputs
for i in range(10):
    idx = np.random.choice(testx.shape[0])
    orig = ""
    for ch in testx[idx]:
        orig += reverse_eng_map[ch]
        if reverse_eng_map[ch] == "\n":
            break
    deco = ""
    for ch in prediction[idx]:
        deco += reverse_hin_map[ch]
        if reverse_hin_map[ch] == "\n":
            break
    print("Input :", orig)
    print("Output:", deco)
    print("********")

# Getting output words
output_words = []
for i in range(testx.shape[0]):
    idx = i
    decode = ""
    for ch in prediction[idx]:
        decode += reverse_hin_map[ch]
        if reverse_hin_map[ch] == "\n":
            break
    output_words.append(decode)
print(len(output_words))

# Saving output as prediction_vanilla.csv from a dataframe
pred = pd.DataFrame(output_words)
pred1 = pred.replace('\\n', '', regex=True)
final_pred = pred1.replace('\\t', '', regex=True)
test['pred_vanilla'] = final_pred.values
test.rename(columns={'hi': 'target', 'en': 'input'}, inplace=True)
test.to_csv('prediction_vanilla.csv', sep='\t', encoding='utf-8')

# Test accuracy by matching the exact string
def test_accuracy(pred):
    acc = 0
    for i, pr in enumerate(pred):
        fl = 1
        for j, ch in enumerate(pr):
            if ch != np.argmax(testy[i, j, :]):
                fl = 0
                break
            if ch == hin_token_map["\n"]:
                break
        if fl == 1:
            acc += 1
    return (acc / len(pred)) * 100

print("Test accuracy with best parameters of s2s without attention ", test_accuracy(prediction))

APPENDIX-C

OUTPUT SCREEN SHOTS

