International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 11 Issue: 05 | May 2024 www.irjet.net p-ISSN: 2395-0072
Developing a System for Converting Sign Language to Text
Mr. Umesh Kumar1, Ishu Singh2, Raman Baliyan3, Ritik Chuhan4, Harsh Tyagi5
1Assistant Professor, Dept. of Computer Science and Engineering (Artificial Intelligence and Machine Learning),
Meerut Institute of Engineering and Technology, Meerut, Uttar Pradesh, India
2,3,4,5B.Tech Student, Dept. of Computer Science and Engineering (Artificial Intelligence and Machine Learning),
Meerut Institute of Engineering and Technology, Meerut, Uttar Pradesh, India
---------------------------------------------------------------------***--------------------------------------------------------------------
Abstract - This project focuses on the development of a Hand Sign Language to Text and Speech Conversion system using Convolutional Neural Networks (CNN). With an achieved accuracy of over 99%, the model accurately translates hand signs, including the 26 alphabets and the backslash character, to their corresponding text characters. The system utilizes the OpenCV library for image processing and gesture recognition, and the Keras library for the implementation of the CNN model. The process involves capturing real-time video input of hand gestures, preprocessing the images, and making predictions using the trained CNN model. The system is equipped with a Graphical User Interface (GUI) to display the captured video and the recognized hand sign, along with options for users to choose suggested words or clear the recognized sentence. Additionally, the system enables users to listen to the recognized sentence through text-to-speech functionality. The effectiveness and accuracy of the proposed system were evaluated through extensive testing, demonstrating its potential for real-world applications.

Keywords: CNN, Text to Speech, GUI, OpenCV, Suggested Words, Real Time, Mediapipe

1. INTRODUCTION

In the contemporary era of rapid technological advancements, the quest for innovative solutions that foster seamless communication for individuals with diverse linguistic abilities remains a pivotal focal point. Within this context, the development of a Hand Sign Language to Text and Speech Conversion system using Mediapipe and advanced Convolutional Neural Networks (CNN) represents a significant stride towards inclusivity and accessibility. The system stands as a testament to the fusion of state-of-the-art image processing, machine learning methodologies, and intuitive user interfaces, all converging to bridge the gap between conventional spoken language and the intricate nuances of sign language.

Amidst its multifaceted capabilities, one of the primary objectives of this system is the accurate detection and interpretation of an extensive range of hand signs, encompassing not only the 26 letters of the English alphabet but also the recognition of the backslash symbol, a crucial component for seamless textual communication. By harnessing the power of CNN, the system demonstrates an accuracy rate exceeding 99%, enabling the precise translation of intricate hand gestures into their corresponding textual representations.

The core architecture of the system integrates the OpenCV library for image processing and gesture recognition with the Keras library, which serves as the backbone for the implementation of the CNN model. The workflow encompasses real-time video input capture, image preprocessing, and prediction by the trained CNN model, with Mediapipe used for the recognition of hand landmark points.

Furthermore, the system is equipped with an intuitive Graphical User Interface (GUI) that displays the captured video feed and the recognized hand sign, providing users with a seamless way to interact with the system. Users are presented with an array of options, including the ability to select suggested words or clear the recognized sentence, fostering interactive and dynamic communication. Additionally, the integration of text-to-speech functionality allows users not only to read but also to hear the recognized sentence, enhancing overall accessibility and user experience.

Through rigorous and extensive testing, the efficacy and precision of the proposed system have been validated, underscoring its potential for real-world applications across a diverse spectrum of contexts. By facilitating the conversion of intricate hand gestures into coherent textual and auditory output, this system paves the way for enhanced communication and inclusivity, catering to the diverse needs of individuals with varying linguistic abilities and promoting a more connected and accessible society.
2. LITERATURE REVIEW

In the domain of sign language recognition and translation, Convolutional Neural Networks (CNNs) have emerged as a prominent technique, particularly for American Sign Language (ASL) recognition. Researchers like Hsien-I Lin et al. have utilized image segmentation to extract hand gestures and achieved high accuracy levels, around 95%, using CNN models trained on specific hand motions. Similarly, Garcia et al. developed a real-time ASL translator using pre-trained models like GoogLeNet, achieving accurate letter classification.

In Das et al.'s study [1], they developed an SLR system utilizing deep learning techniques, specifically training an Inception V3 CNN on a dataset comprising static images of ASL motions. Their dataset consisted of 24 classes representing alphabets from A to Z, except for J. Achieving an average accuracy rate of over 90%, with the best validation accuracy reaching 98%, their model demonstrated the effectiveness of the Inception V3 architecture for static sign language detection.

Sahoo et al. [2] focused on identifying Indian Sign Language (ISL) gestures related to numbers 0 to 9. They employed machine learning methods such as Naive Bayes and k-Nearest Neighbor on a dataset captured using a digital RGB sensor. Their models achieved impressive average accuracy rates of 98.36% and 97.79%, respectively, with k-Nearest Neighbor slightly outperforming Naive Bayes.

Ansari et al. [3] investigated ISL static gestures using both 3D depth data and 2D images captured with Microsoft Kinect. They utilized K-means clustering for classification and achieved an average accuracy rate of 90.68% for recognizing 16 alphabets, demonstrating the efficacy of incorporating depth information into the classification process.

Rekha et al. [4] analyzed a dataset containing static and dynamic signs in ISL, employing skin color segmentation techniques for hand detection. They trained a multiclass Support Vector Machine (SVM) using features such as edge orientation and texture, achieving a success rate of 86.3%. However, due to its slow processing speed, this method was deemed unsuitable for real-time gesture detection.

Bhuyan et al. [5] utilized a dataset of ISL gestures and employed a skin color-based segmentation approach for hand detection. They achieved a recognition rate of over 90% using the nearest neighbor classification method, showcasing the effectiveness of simple yet robust techniques.

Pugeault et al. [6] developed a real-time ASL recognition system utilizing a large dataset of 3D depth photos collected through a Kinect sensor. Their system achieved highly accurate classification rates by incorporating Gabor filters and multi-class random forests, demonstrating the effectiveness of integrating advanced feature extraction techniques.

Keskin et al. [7] focused on recognizing ASL numerals using an object identification technique based on components. With a dataset comprising 30,000 observations categorized into ten classes, their approach demonstrated strong performance in numeral recognition.

Sundar B et al. [8] presented a vision-based approach for recognizing ASL alphabets using the MediaPipe framework. Their system achieved an impressive 99% accuracy in recognizing 26 ASL alphabets through hand gesture recognition using an LSTM. The proposed approach offers valuable applications in human-computer interaction (HCI) by converting hand gestures into text, highlighting its potential for enhancing accessibility and communication.

Jyotishman Bora et al. [9] developed a machine learning approach for recognizing Assamese Sign Language gestures. They utilized a combination of 2D and 3D images along with the MediaPipe hand tracking solution, training a feed-forward neural network. Their model achieved 99% accuracy in recognizing Assamese gestures, demonstrating the effectiveness of their method and suggesting its applicability to other local Indian languages. The lightweight nature of the MediaPipe solution allows for implementation on various devices without compromising speed and accuracy.

In terms of continuous sign language recognition, systems have been developed to automate training sets and identify compound sign gestures using noisy text supervision. Statistical models have also been explored to convert speech data into sign language, with evaluations based on metrics like Word Error Rate (WER), BLEU, and NIST scores.
Overall, research in sign language recognition and translation spans various techniques and languages, aiming to improve communication accessibility for individuals with hearing impairments.

3. PROPOSED ARCHITECTURE

The proposed system aims to develop a robust and efficient Hand Sign Language to Text and Speech Conversion system using advanced Convolutional Neural Networks (CNN). With a primary focus on recognizing hand signs, including the 26 alphabets and the backslash character, the system integrates cutting-edge technologies to ensure accurate translation and interpretation. Leveraging the OpenCV library for streamlined image processing and gesture recognition, and the Keras library for the implementation of the CNN model, the system guarantees high precision in sign language interpretation.

The system involves the real-time capture of video input showcasing hand gestures, which are then pre-processed to enhance the quality of the images. These pre-processed images are fed into the trained CNN model, enabling precise predictions and accurate translation of the gestures into corresponding text characters. The integration of a user-friendly Graphical User Interface (GUI) provides an intuitive display of the captured video and the recognized hand sign, empowering users with the option to choose suggested words or clear the recognized sentence effortlessly.

Furthermore, the system is equipped with text-to-speech functionality, allowing users to listen to the recognized sentence, thereby enhancing the overall accessibility and usability of the system. The proposed system is designed with a focus on real-world applications, ensuring its effectiveness and accuracy through extensive testing and validation. The system's robust architecture and accurate translation capabilities position it as a promising solution for bridging communication gaps and facilitating seamless interaction for individuals using sign language.

Fig 3.1: Workflow for ASL model (Webcam -> Preprocessing -> Feature Extraction -> Model Prediction -> Post Processing -> Output Interface)
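To make the workflow of Fig 3.1 concrete, the following minimal Python sketch strings the stages together using OpenCV for capture and a Keras model for prediction. It is illustrative only: the model file name sign_model.h5, the 64x64 input size, the region-of-interest crop, and the grayscale preprocessing are assumptions for the example, not details fixed by the paper.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

MODEL_PATH = "sign_model.h5"   # hypothetical trained model file
IMG_SIZE = 64                  # assumed network input size
LABELS = [chr(ord("A") + i) for i in range(26)]  # class index -> letter

model = load_model(MODEL_PATH)
cap = cv2.VideoCapture(0)      # webcam stage of the workflow

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Preprocessing stage: crop a region of interest, convert to grayscale, resize, normalize.
    roi = frame[100:400, 100:400]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (IMG_SIZE, IMG_SIZE)) / 255.0
    batch = resized.reshape(1, IMG_SIZE, IMG_SIZE, 1)

    # Model prediction stage: probability distribution over the sign classes.
    probs = model.predict(batch, verbose=0)[0]
    letter = LABELS[int(np.argmax(probs))]

    # Post-processing / output stage: overlay the recognized letter on the video feed.
    cv2.putText(frame, letter, (50, 60), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
    cv2.imshow("ASL to text", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

In the full system described below, the raw crop is replaced by Mediapipe landmark features, and the recognized letters are accumulated into a sentence that the GUI can display, suggest completions for, and read aloud.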
A. Image Frame Acquisition

The data collection and preparation module involves sourcing and assembling a comprehensive dataset from reliable repositories, including platforms like Kaggle. This module focuses on curating a diverse and extensive dataset of hand sign language images, ensuring the inclusion of various gestures, hand positions, and lighting conditions. The collected dataset is then pre-processed to standardize image formats, remove noise, and enhance image quality, optimizing it for subsequent processing and analysis.
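As an illustration of the kind of preparation this module performs, the short sketch below loads a Kaggle-style folder of ASL images, applies mild denoising, resizes them to a common shape, and normalizes the pixel values. The directory layout (one sub-folder per letter), the 64x64 target size, and the Gaussian blur are assumptions made for the example, not specifics reported by the authors.

import os
import cv2
import numpy as np

DATASET_DIR = "asl_dataset"   # hypothetical layout: asl_dataset/A/*.jpg, asl_dataset/B/*.jpg, ...
IMG_SIZE = 64                 # assumed target resolution

def load_dataset(root):
    images, labels = [], []
    classes = sorted(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))
    for class_idx, letter in enumerate(classes):
        class_dir = os.path.join(root, letter)
        for name in os.listdir(class_dir):
            img = cv2.imread(os.path.join(class_dir, name), cv2.IMREAD_GRAYSCALE)
            if img is None:
                continue                                    # skip unreadable files
            img = cv2.GaussianBlur(img, (3, 3), 0)          # mild noise removal
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))     # standardize image format
            images.append(img.astype("float32") / 255.0)    # normalize to [0, 1]
            labels.append(class_idx)
    return np.array(images)[..., np.newaxis], np.array(labels)

X, y = load_dataset(DATASET_DIR)
print(X.shape, y.shape)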
Fig 3.2: ASL Dataset used for model training

B. Hand Tracking:

In our proposed architecture for Sign Language Recognition (SLR), we integrate the Mediapipe module, an open-source project developed by Google, to facilitate precise hand tracking. The utilization of the Mediapipe module empowers our system with robust and efficient hand pose estimation capabilities, enabling real-time tracking of hand movements and positions.

The Mediapipe module operates by analyzing video frames and identifying key points or landmarks corresponding to various parts of the hand. Through sophisticated algorithms, it accurately tracks the spatial configuration of the hand, including the positions of fingers, joints, and palm. From the hand tracking module, we extract a comprehensive set of 21 landmarks for each hand being tracked. These landmarks serve as pivotal features that encapsulate the intricate details of hand gestures, capturing nuances such as finger positions, hand orientation, and motion trajectories. By leveraging these 21 landmarks, our SLR system gains valuable insights into the dynamic movements and spatial relationships of the hands during sign language gestures. These landmarks serve as fundamental building blocks for subsequent stages of our recognition model, providing rich contextual information essential for accurate classification of sign language gestures.

Fig 3.3: Landmarks from Mediapipe Hand Tracking Module

C. Feature Extraction:

The feature extraction and representation module focuses on extracting relevant visual features from pre-processed hand sign language images to facilitate effective pattern recognition and classification. This module employs techniques such as edge detection, contour analysis, and texture extraction to identify and extract distinctive visual elements that represent different hand sign gestures. By extracting essential features, the module enables the system to capture and interpret key visual cues, enabling accurate and robust recognition of diverse hand sign language gestures.

Fig 3.4: Hand gesture formed by joining Mediapipe points (panels a and b)

D. Model Training and Optimization:

The model training and optimization module involves training the Convolutional Neural Network (CNN) using the pre-processed dataset and optimizing the network's architecture and parameters to achieve superior performance. This module includes procedures such as model configuration, hyperparameter tuning, and cross-validation to enhance the CNN's learning capabilities and generalization to various hand sign gestures. By conducting comprehensive model training and optimization, the module ensures the CNN's ability to accurately recognize and classify a wide range of hand sign language gestures with high precision and reliability.

Fig 3.4: Loss and Accuracy Graph during Model Training
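A minimal sketch of how the hand-tracking and feature-extraction stages described in B and C above can be realized with the MediaPipe Hands solution is shown below: the 21 (x, y, z) landmarks of one detected hand are flattened into a 63-value feature vector. The wrist-relative normalization and the single-hand limit are choices made for the example rather than details specified in the paper.

import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmark_features(bgr_frame, hands):
    """Return a flattened 21x3 landmark vector for the first detected hand, or None."""
    rgb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)   # MediaPipe expects RGB input
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return None
    hand = result.multi_hand_landmarks[0]
    points = np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark], dtype=np.float32)
    points -= points[0]        # coordinates relative to the wrist landmark (assumed normalization)
    return points.flatten()    # 63-dimensional feature vector

# Usage: one detector instance reused across frames or images.
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1, min_detection_confidence=0.5)
frame = cv2.imread("sample_sign.jpg")   # hypothetical test image
if frame is not None:
    features = extract_landmark_features(frame, hands)
    if features is not None:
        print(features.shape)   # (63,)

Working from landmark coordinates rather than raw pixels is what keeps the MediaPipe-based front end lightweight and comparatively robust to lighting and background variation.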
E. Real-time Gesture Recognition and Interpretation:

The real-time gesture recognition and interpretation module focuses on the rapid and accurate recognition of hand sign language gestures from live video input in real time. This module integrates optimized CNN inference mechanisms and real-time image processing techniques to enable the system to instantaneously recognize and interpret hand signs displayed by users. By leveraging efficient gesture recognition algorithms, the module enhances the system's responsiveness and usability, providing users with seamless and instantaneous translation of hand sign language to corresponding text characters.

F. Error Handling and Correction Mechanisms:

The error handling and correction mechanisms module addresses potential errors and uncertainties that may arise during the recognition and interpretation process. This module implements robust error detection algorithms and corrective measures to minimize misclassifications and inaccuracies in the recognized hand signs, ensuring the system's reliability and accuracy. By incorporating effective error handling and correction mechanisms, the module enhances the system's overall performance and fosters precise and dependable translations of hand sign language gestures.

G. User Interface and Experience Design:

The user interface and experience design module focuses on creating an intuitive and user-friendly graphical interface that enables seamless interaction between users and the hand sign language conversion system. This module includes designing a visually appealing and accessible interface that allows users to input hand sign language gestures, view recognized text characters, and access additional functionalities such as word suggestions and sentence clearing. By prioritizing user-centric design principles, the module enhances the overall user experience and promotes inclusivity for individuals with diverse communication needs and preferences.

H. Output Gesture:

In the classification phase of our proposed architecture, the objective is to predict and convert the ASL signs to text and speech. This prediction is based on the input features extracted from the hand gestures and processed by the trained model. Once the feature vector undergoes processing through the neural network, the final layer of the network generates a probability distribution across the classes or labels, each corresponding to a distinct sign language letter. These probabilities indicate the likelihood of each class being the correct gesture. To determine the predicted gesture, we identify the class with the highest probability in the distribution. This class represents the most probable sign language gesture based on the input features and the model's learned parameters. Finally, to derive the output gesture, we map the selected class to its corresponding letter between 'A' and 'Z'. This mapping allows us to interpret the prediction in terms of recognizable sign language symbols, facilitating communication and interaction for individuals using sign language.

Fig 3.5: Output of working ASL model

4. FINAL OUTPUT AND DISCUSSION

We cleaned the ASL dataset before using 4,500 photos per class to train our model; the original collection contained 166K photos. The dataset was split into an 80% training set and a 20% test set. To train the model, we tuned a range of hyperparameters, including the learning rate, batch size, and number of epochs.

Our test set evaluation metrics demonstrate the trained model's remarkable performance. It correctly identified every sample in the test set, earning an accuracy score of 100%. The classification report's precision, recall, and F1-score values are all 100%, showing that the model correctly identified each class's samples without making any errors.
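The following sketch illustrates a training and evaluation setup of the kind described in Section D and in this section: an 80/20 split, a small CNN built with Keras, and a scikit-learn classification report and confusion matrix. The network layout, the learning rate of 0.001, the batch size of 32, and the 20 epochs are assumed example values; the paper states that these hyperparameters were tuned but does not report the chosen settings.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 26
IMG_SIZE = 64

# X, y are assumed to come from a preprocessing step such as load_dataset() sketched earlier.
X = np.load("asl_images.npy")          # hypothetical cached arrays
y = np.load("asl_labels.npy")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(X_train, y_train, validation_split=0.1, epochs=20, batch_size=32)

# Evaluation: per-class precision/recall/F1 and the confusion matrix discussed below.
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred,
                            target_names=[chr(ord("A") + i) for i in range(26)]))
print(confusion_matrix(y_test, y_pred))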
The confusion matrix provides a summary of the performance of a classification model. Each row in the matrix represents the instances in the actual class, while each column represents the instances in the predicted class. Fig 3.7 shows the confusion matrix plotted over the 26 classes representing the alphabets (A-Z).

TABLE I: CLASSIFICATION REPORT FOR ASL-MODEL

Classes        Precision   Recall   F1-score   Support
A              1.00        1.00     1.00       912
B              1.00        1.00     1.00       940
C              1.00        1.00     1.00       921
D              1.00        0.99     1.00       927
E              1.00        1.00     1.00       900
F              1.00        0.99     1.00       923
G              1.00        1.00     1.00       910
H              1.00        1.00     1.00       895
I              1.00        1.00     1.00       884
J              1.00        1.00     1.00       874
K              1.00        0.99     1.00       868
L              1.00        1.00     1.00       893
M              0.99        1.00     0.99       884
N              1.00        0.99     1.00       935
O              1.00        1.00     1.00       887
P              1.00        1.00     1.00       898
Q              0.99        1.00     1.00       837
R              1.00        1.00     1.00       912
S              1.00        1.00     1.00       861
T              1.00        1.00     1.00       895
U              1.00        1.00     1.00       873
V              1.00        1.00     1.00       901
W              1.00        1.00     1.00       917
X              1.00        1.00     1.00       952
Y              1.00        1.00     1.00       897
Z              1.00        1.00     1.00       904
Accuracy                            1.00       23400
Macro avg      1.00        1.00     1.00       23400
Weighted avg   1.00        1.00     1.00       23400

Fig 3.7: Confusion matrix for the ASL model (26 classes, A-Z)

5. CONCLUSIONS AND FUTURE SCOPE

In summary, our ASL recognition model stands out with an accuracy rate of 99.50% in real-time Sign Language Recognition (SLR). This achievement is primarily attributed to the combination of Mediapipe for feature extraction and Convolutional Neural Networks (CNN) for classification. By leveraging these techniques, our model offers a robust and precise solution for interpreting ASL hand gestures.

Central to the success of our model is the meticulous curation and preprocessing of the dataset. From an initial collection of 13,000 photos, we carefully selected 500 representative images per class, ensuring a balanced and diverse training corpus. This approach enabled our model to generalize effectively, recognizing a broad spectrum of ASL gestures with remarkable accuracy.

Furthermore, we employed data augmentation techniques to enrich the training data, thereby enhancing the model's ability to handle variations in hand gestures, lighting conditions, and backgrounds. This augmentation strategy played a pivotal role in bolstering the model's robustness and performance in real-world scenarios.

Looking ahead, our future objectives are ambitious yet promising. We aim to explore the integration of additional deep learning architectures and methodologies to further elevate the precision and speed of our model. By harnessing the latest advancements in AI research, we aspire to push the boundaries of SLR technology, empowering our model to excel in diverse environments and contexts.
Moreover, we envision expanding the scope of our model to encompass a wider range of sign languages and gestures, fostering inclusivity and accessibility on a global scale. This expansion will involve extensive research and development efforts, including the integration of machine learning algorithms capable of comprehending entire sign language sentences and phrases.

Ultimately, our goal is to realize a comprehensive suite of SLR systems that transcend linguistic barriers, enabling seamless communication between sign language users and the broader community. With continued innovation and collaboration, we believe that this technology has the potential to revolutionize communication and accessibility for individuals with hearing impairments, facilitating greater integration and participation in society.

REFERENCES

[1] Das, S. Gawde, K. Suratwala and D. Kalbande. (2018) "Sign language recognition using deep learning on custom processed static gesture images," in International Conference on Smart City and Emerging Technology (ICSCET).

[2] A. K. Sahoo. (2021) "Indian sign language recognition using machine learning techniques," in Macromolecular Symposia.

[3] Z. A. Ansari and G. Harit. (2016) "Nearest neighbour classification of Indian sign language gestures using kinect camera," Sadhana, vol. 41, pp. 161–182.

[4] Rekha, J. Bhattacharya and S. Majumder. (2011) "Shape, texture and local movement hand gesture features for Indian sign language recognition," in 3rd International Conference on Trendz in Information Sciences & Computing (TISC 2011).

[5] M. K. Bhuyan, M. K. Kar and D. R. Neog. (2011) "Hand pose identification from monocular image for sign language recognition," in 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA).

[6] N. Pugeault and R. Bowden. (2011) "Spelling it out: Real-time ASL fingerspelling recognition," in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[7] C. Keskin, F. Kıraç, Y. E. Kara and L. Akarun. (2013) "Real time hand pose estimation using depth sensors," in Consumer Depth Cameras for Computer Vision, Springer, pp. 119–137.

[8] B. Sundar and T. Bagyammal. (2022) "American Sign Language recognition for alphabets using MediaPipe and LSTM," Procedia Computer Science, vol. 215, pp. 642–651. https://doi.org/10.1016/j.procs.2022.12.066

[9] J. Bora, S. Dehingia, A. Boruah, A. A. Chetia and D. Gogoi. (2023) "Real-time Assamese Sign Language recognition using MediaPipe and deep learning," Procedia Computer Science, vol. 218, pp. 1384–1393. https://doi.org/10.1016/j.procs.2023.01.117