
Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020)

IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2

Real-time Conversion of Sign Language to Text and Speech

Kohsheen Tiku, Jayshree Maloo, Aishwarya Ramesh, Indra R
Department of Information Science, BMS College of Engineering, Bangalore, India
kohsheen.t@gmail.com, jayshreemaloo03@gmail.com, ash.cancer98@gmail.com, indra.ise@bmsce.ac.in

Abstract—This paper presents an analysis of the performance of different techniques that have been used for the conversion of sign language to text/speech format. Using the best method identified by this analysis, an Android application is developed that can convert real-time ASL (American Sign Language) signs to text/speech.

Keywords—sign language, ASL, image processing, machine learning

I. INTRODUCTION

According to the World Health Organization (WHO), 466 million people worldwide have disabling hearing loss (more than 5 percent of the world's population), 34 million of whom are children. Studies expect these figures to surpass 900 million by 2050. Moreover, most cases of debilitating hearing loss are concentrated in low- and middle-income countries.

Sign languages allow deaf and mute people to communicate with each other and with the rest of the world. There are over 135 different sign languages around the world, including American Sign Language (ASL), British Sign Language (BSL) and Australian Sign Language (Auslan).

American Sign Language was created to reach the wider public and acts as the primary sign language of the Deaf populations in the United States and much of Anglophone Canada; it is also used in most of West Africa and areas of Southeast Asia.

People with hearing impairments are left behind in online conferences, office sessions and schools. They usually use basic text chat to converse, a method that is less than optimal. With the growing adoption of telehealth, deaf people need to be able to communicate naturally with their healthcare network, colleagues and peers regardless of whether the second person knows sign language.

Achieving a uniform sign language translation machine is not a simple task; however, there are two common methods used to address this problem, namely sensor-based sign language recognition and vision-based sign language recognition. Sensor-based sign language recognition [12] uses designs such as a robotic arm with a sensor, a smart glove or a golden glove for the conversion of ASL signs to speech. The issue is that many people do not use such devices; one must also spend money to purchase such a glove, which is not easily available. Vision-based sign language translation [13][14] uses digital image processing: a framework that recognizes and interprets continuous gesture-based communication as English text. In vision-based gesture recognition, a camera is used as input, and videos are broken down into frames before processing. Vision-based methods are therefore preferred over sensor-based approaches, as anyone with a smartphone can convert sign language to text/speech and it is relatively cost-effective.

In this paper, the method of developing an Android application is demonstrated for the vision-based approach of sign language to text/speech conversion without any sensors, by only capturing video of the hand gestures, completely free of any cost.

II. METHODOLOGY

A. Overview
In this paper, 26 ASL alphabets are used along with 1 customized symbol for 'Space', which is to be recognized in real-time using a smartphone. For this purpose, a OnePlus 6 smartphone with the OxygenOS (based on Android Oreo) operating system has been used. The algorithm is developed on

978-1-7281-5374-2/20/$31.00 ©2020 IEEE 346



top of a Java-based OpenCV wrapper. The entire system was developed using images of 200 × 200 pixels in RGB format.

To design an appropriate model, the first step is to understand which features are the most appropriate to extract from static images. Examples of such features include the radial signature, Histogram of Oriented Gradients (HOG) [1], the centroid distance signature and Fourier descriptors.

The technique most appropriate for this scenario is the Histogram of Oriented Gradients (HOG) descriptor. HOG is preferred because the appearance and shape of a local object can be easily detected by means of intensity gradients or edge directions. The image is divided into small connected regions called cells, and a histogram of gradient directions is compiled for the pixels within each cell. The descriptor is the concatenation of these histograms. For higher accuracy, local histograms are contrast-normalized by measuring the intensity variance over a wider area of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in greater invariance to shifts in lighting and shadowing.
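The cell-and-histogram construction described above can be sketched in a few lines. The following is a simplified, illustrative pure-Python version (the app itself uses Java and OpenCV): it computes unsigned gradient orientations with central differences and accumulates magnitude-weighted histograms per cell. The block-level contrast normalization is omitted, and the image, cell size and bin count are arbitrary choices for illustration.

```python
import math

def hog_descriptor(img, cell=8, bins=9):
    """Minimal HOG sketch: unsigned gradient histograms per cell, concatenated.
    img is a 2-D list of grayscale values."""
    h, w = len(img), len(img[0])
    desc = []
    for cy in range(0, h - cell + 1, cell):
        for cx in range(0, w - cell + 1, cell):
            hist = [0.0] * bins
            for y in range(cy, cy + cell):
                for x in range(cx, cx + cell):
                    # central differences, clamped at the image borders
                    gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
                    gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
                    mag = math.hypot(gx, gy)
                    ang = math.degrees(math.atan2(gy, gx)) % 180  # unsigned
                    hist[int(ang / (180 / bins)) % bins] += mag
            desc.extend(hist)
    return desc

# A 16x16 image gives 2x2 cells of 8x8 pixels -> 4 cells x 9 bins = 36 values
img = [[(x * 3 + y * 5) % 256 for x in range(16)] for y in range(16)]
print(len(hog_descriptor(img)))  # 36
```

A real implementation would add the block normalization described above and typically use a library routine rather than hand-rolled loops.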
Support Vector Machine (SVM) [8][9][10], a machine learning algorithm, uses the HOG descriptors as the features of the image. Hence, SVM is used to train our model, and this experimentation deals with using three different array parameters for SVM and comparing the results of each. The three array parameters are the detection method, the kernel and the dimensionality reduction type. The following are the values of each parameter used for training:

 Detection Method - Contour Mask, Canny Edges, Skeleton
 Kernel - Linear, Radial Basis Function (RBF)
 Dimensionality Reduction Type - None, Principal Component Analysis (PCA)

B. Dataset Used
The dataset used for this paper is the ASL Kaggle dataset [2], which contains 3000 images for every alphabet of the English vocabulary. Here, another character has been introduced which is distinct from all other hand gestures, for the purpose of acting as an indication of completion of the previous word. This special sign, called 'Space', allows the user to form sentences in a very simple fashion. For example, the space gesture is used to separate 'hello' and 'world' in the sentence 'hello world'.

Figure 1: Hand gesture for 'Space'

Since the dataset has 3000 images of each of the other characters, it was reduced to 100 distinct images per character for training, since the SVM algorithm works more precisely with smaller datasets. The dataset created for 'Space' consists of 100 images of the gesture as well. In total, there are 27 classes (26 alphabet classes + 1 'Space' class), with 'Space' considered a separate class.

Figure 2: Architecture of the android application

III. IMPLEMENTATION

The application is designed and implemented using Android Studio and OpenCV [15] functions in Java.

A. Calibration
Here, colour-based segmentation has been implemented using the libraries provided by OpenCV. This is done by understanding all the different skin tones and their HSVA (Hue, Saturation, Value, Alpha) configurations. The following lower and upper bounds cover all possible skin tones. Only if the image possesses pixel values in this range will the frame be considered for classification; otherwise it is discarded.

// H lowerBound.val[0] = 0; upperBound.val[0] = 25;
// S lowerBound.val[1] = 40; upperBound.val[1] = 255;
// V lowerBound.val[2] = 60; upperBound.val[2] = 255;
// A lowerBound.val[3] = 0; upperBound.val[3] = 255;

The image is then blurred using a Gaussian blur for easier processing. The next step is to find the contours of the largest area of the frame in which skin colour is present. The main contour is applied to the largest area, and a child contour is also applied within the largest skin-colour area, so that even if there are two patches of skin, say one full hand and someone else's finger, the two can be easily differentiated. A matrix is used to represent the contours of the skin area.
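The calibration bounds test can be illustrated with a small sketch. This is a hypothetical pure-Python rendering of the range check, not the app's Java/OpenCV code; the `min_fraction` threshold is an invented parameter for illustration.

```python
# HSVA skin-tone bounds, matching the calibration values in the text
LOWER = (0, 40, 60, 0)       # H, S, V, A lower bounds
UPPER = (25, 255, 255, 255)  # H, S, V, A upper bounds

def is_skin(pixel):
    """True if one HSVA pixel lies inside the skin-tone range."""
    return all(lo <= p <= hi for p, lo, hi in zip(pixel, LOWER, UPPER))

def keep_frame(pixels, min_fraction=0.05):
    """Consider a frame for classification only if at least
    min_fraction of its pixels look like skin (illustrative threshold)."""
    skin = sum(1 for p in pixels if is_skin(p))
    return skin >= min_fraction * len(pixels)

print(is_skin((12, 120, 200, 255)))  # True: inside every bound
print(is_skin((90, 120, 200, 255)))  # False: hue 90 exceeds 25
```

In the app, the equivalent check is done with OpenCV's in-range masking on whole frames rather than per-pixel Python loops.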


B. Processing of frame
The following diagram summarizes the steps involved in the processing of the frame.

Figure 3: Frame Processing Diagram

The steps involved in the processing of the frame are:

 An input image is read in BGR (Blue, Green, Red) format, since OpenCV uses BGR instead of RGB; the image is then converted to RGB for image processing.

 Downsampling [3] of the captured frame involves throwing away unnecessary image information by discarding rows and columns of data at the edges of the image, which reduces the storage requirement, as shown in figure 4. The image is then converted to grayscale, as shown in figure 5. Threshold contouring is performed to segment the hand image from the background, and threshold masking is used to exclude unnecessary regions from image processing. After this, the image is cropped, and normalization is performed to change the range of pixel intensities, increasing contrast and making feature extraction easier.

 Different feature preprocessing algorithms can now be applied; these consist of the contour mask, canny edges and skeleton methods, explained further in the paper.

Figure 4: Down sampling of the frame

Figure 5: Converting image to Grayscale

All these techniques have been experimented with. Since they cannot all be used together, results have been generated with each of these processes used alone, as shown in table 1.

C. Detection Method
1. Contour masking
A contour may be defined precisely as a curve that connects all the continuous points (along a boundary) having the same colour or intensity. Contours are a valuable resource for the study of structure and for the identification and recognition of an object such as a hand, as shown in figure 6.

Figure 6: Contour Masking of hand gesture

2. Skeletonization
Skeletonization [4] is a method for reducing foreground regions in a binary picture to a skeletal remnant that essentially retains the extent and connectivity of the original area while removing most of the original foreground pixels. The skeleton is valuable because it offers a clear and compact image of a form that retains much of the initial form's topological and scale characteristics. Refer to figure 7 for an example.
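The frame-processing steps listed above (grayscale conversion, edge-cropping downsampling and intensity normalization) can be sketched as follows. This is an illustrative pure-Python version under assumed conventions (Rec. 601 luma weights, min-max normalization); the app itself relies on the OpenCV equivalents.

```python
def to_grayscale(rgb):
    """Luma conversion of an RGB image (2-D list of (r, g, b) tuples)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]

def downsample_crop(img, margin):
    """Discard `margin` rows/columns of data at each edge of the image."""
    return [row[margin:len(row) - margin]
            for row in img[margin:len(img) - margin]]

def normalize(img):
    """Stretch pixel intensities to the full 0..255 range to raise contrast."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[(p - lo) * scale for p in row] for row in img]

rgb = [[(x, y, 100) for x in range(8)] for y in range(8)]
gray = to_grayscale(rgb)
small = downsample_crop(gray, 2)   # 8x8 -> 4x4
print(len(small), len(small[0]))   # 4 4
```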


Figure 7: Skeleton form of the image

3. Canny Edges
The step-by-step process to implement Canny edge detection is [5]:
 A Gaussian filter is applied to smooth the image and remove noise.
 Intensity gradients of the image are calculated.
 Non-maximum suppression is applied to remove the possibility of a false response.
 Double thresholding is done to determine the possible edges.
 Edges are finalized by identifying and removing all remaining edges that are weak and not linked to strong edges.

Figure 8: Canny Edges form of the image

D. Kernel
Kernels in SVM classification [6] refer to the function that is responsible for defining the decision boundaries between the classes. The SVM software has been used with the linear kernel and the RBF (radial basis function) kernel. Execution time for model selection is an important issue for practical applications of SVM.

1. Linear SVM [11]
Support vector learning is an algorithm which finds the separating hyperplane with the greatest margin between the positive instances (labelled +1) and the negative instances (labelled -1). The margin of a hyperplane is defined as the shortest distance between the positive and negative occurrences closest to the hyperplane. The intuition behind searching for the large-margin hyperplane is that a hyperplane with the largest margin should be more noise resistant than a smaller-margin hyperplane.

Formally, assume all data meet the constraints:

f(x) = +1 if w · x + b >= 1, and f(x) = -1 if w · x + b <= -1

where w is the normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. [6]

2. Radial Basis Function (RBF) Kernel [11]
This is a non-linear kernel which maps samples to a higher-dimensional space, unlike the linear kernel function. It can handle the case where the relation between class labels and attributes is nonlinear.

The RBF kernel is defined as:

K(x, x') = exp(-γ ||x - x'||²)

where γ is a parameter that sets the "spread" of the kernel.

A kernel is any function of the form:

K(x, y) = ψ(x) · ψ(y)

where ψ is a function that projects a vector x into a new vector space. The kernel function computes the inner product between two projected vectors.

E. Dimensionality Reduction

Principal Component Analysis [7]
PCA identifies a list of principal axes for the underlying dataset, ordered by the amount of variance captured by each axis. PCA makes the maximum variability of the dataset more visible by rotating the axes. The maximum number of principal components that can be constructed is, in general, equal to the number of dimensions of the dataset. The correlation between principal components is zero, since the residual variation is captured by the subsequent components; the eigenvectors are mutually orthogonal, i.e. the axes are perpendicular to each other in the data space.

F. Classification
SVM (Support Vector Machine) is a supervised learning technique. The objective is to find a hyperplane that distinctly classifies data points into classes. There are 27 classes in our model, each corresponding to a letter of the English alphabet or the 'Space' gesture. The SVM model classifies the images into these 27 classes to yield a result. SVM is used as it is a supervised learning technique which is apt for solving the problem statement.

G. Post Processing
a. UI String Writing – User interface string writing is used to print a message for an error. This makes the UI friendly to users.
b. Debugging – This function is mainly used for inspection of strings. It helps in removing errors and increasing the working efficiency of the app.
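The two kernels described in section D can be illustrated with a short sketch. This is not the OpenCV SVM implementation, just the underlying formulas: the linear decision function sign(w · x + b) and the RBF kernel K(x, y) = exp(-γ‖x − y‖²). The weight vector and γ value are arbitrary illustrative choices.

```python
import math

def linear_decision(w, b, x):
    """Sign of w·x + b: +1 on the positive side of the hyperplane, -1 otherwise."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq)

w, b = [1.0, -1.0], 0.0
print(linear_decision(w, b, [2.0, 0.5]))     # 1: positive side of the hyperplane
print(rbf_kernel([0, 0], [0, 0]))            # 1.0: identical points
print(round(rbf_kernel([0, 0], [1, 1]), 4))  # 0.3679, i.e. exp(-1)
```

Note how the RBF value decays with squared distance between the two samples; γ controls how quickly that decay happens.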

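The double-thresholding step of the Canny pipeline described above can be sketched as follows. This is a deliberately simplified illustration that classifies gradient magnitudes as strong, weak or suppressed; non-maximum suppression and the final hysteresis linking of weak edges to strong ones are omitted, and the threshold values are arbitrary.

```python
def double_threshold(magnitudes, low, high):
    """Classify gradient magnitudes: 2 = strong edge, 1 = weak edge
    (kept only if later linked to a strong edge), 0 = suppressed."""
    out = []
    for m in magnitudes:
        if m >= high:
            out.append(2)
        elif m >= low:
            out.append(1)
        else:
            out.append(0)
    return out

print(double_threshold([5, 30, 80, 12, 55], low=20, high=50))  # [0, 1, 2, 0, 2]
```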

H. Use Case Diagram

Figure 9: Use cases of application

The above use case diagram shows the communication between a deaf/mute person and a person with all senses. The application user installs our application on his/her phone and then points the camera at the deaf/mute person, who makes hand gestures to convey his/her message. The hand gestures are picked up by the app, and the processing explained above takes place. The app then recognizes the sign and prints it on the screen. The printed letters are then converted to speech by the application. The Android application consists of 4 main features:

 Add alphabet - This functionality adds a newly detected alphabet.
 Back - This functionality erases previously detected alphabets.
 Clear - This functionality clears the entire sentence.
 Speech - This functionality converts the entire text to speech format.

IV. RESULTS

A. Comparative Analysis of all array parameters in SVM

The table below presents a comparative analysis of the different detection methods, kernels and dimensionality reduction functions. As the table shows, the minimum average processing time per image, together with near-maximum accuracy, is achieved in the case of Canny Edges, RBF and PCA. Hence, these three array parameter values have been deployed in the SVM algorithm.

Table 1: Performance comparison of all array parameters in SVM

Sl. No | Detection Method | Kernel | Dimensionality Reduction | Average per-image processing time (ms) | Accuracy (%)
1 | Contour Masking | Linear | None | 18.2 | 97.45
2 | Contour Masking | Linear | PCA | 18.0 | 97.98
3 | Contour Masking | RBF | None | 18.4 | 98.12
4 | Contour Masking | RBF | PCA | 17.8 | 98.34
5 | Skeleton | Linear | None | 17.9 | 98.22
6 | Skeleton | Linear | PCA | 17.7 | 98.25
7 | Skeleton | RBF | None | 18.4 | 98.56
8 | Skeleton | RBF | PCA | 18.0 | 98.89
9 | Canny Edges | Linear | None | 18.1 | 98.52
10 | Canny Edges | Linear | PCA | 17.5 | 98.67
11 | Canny Edges | RBF | None | 18.3 | 98.74
12 | Canny Edges | RBF | PCA | 15.0 | 98.82

B. Testing Results using the selected parameters (Canny Edges, RBF and PCA)

Testing is performed on 20% of the dataset, i.e. 20 images of each alphabet and of the 'Space' gesture. After testing, a category matrix is obtained for each alphabet, giving the number of images classified as True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). Following is the category matrix for the alphabet 'A'.

[Category matrix for the alphabet 'A']

Figure 10: Confusion Matrix


A confusion matrix of each class against every other class is then obtained from the testing. For example, in the confusion matrix the entry (1,1) indicates that for 19 images out of 20, the predicted class and the actual class were the same, giving a precision of 0.95. Similarly, for class 'C', the entry (3,3) indicates that 20 out of 20 images are classified correctly, giving a precision value of 1.00.

Then, for each alphabet, the following measures are calculated:

 Precision = TP / (TP + FP)
 Recall = TP / (TP + FN)
 F1 score = 2 * (precision * recall) / (precision + recall)

The following table summarizes these measures for all alphabets.

Table 2: Measures of Precision, Recall and F-Measure for the technique deployed

Measure | Maximum Value | Minimum Value | Median Value
Precision | 1.00 | 0.76 | 0.91
Recall | 1.00 | 0.81 | 0.94
F-Measure | 1.00 | 0.79 | 0.93
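The precision, recall and F1 formulas above can be checked with a small sketch, using hypothetical counts matching the 19-of-20 example from the confusion-matrix discussion:

```python
def metrics(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class: 19 of 20 test images correct,
# one image of another class mistaken for this one.
p, r, f1 = metrics(tp=19, fp=1, fn=1)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.95 0.95 0.95
```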
C. End Results

Figure 11 below shows the user interface of the application. It shows the construction of the word 'ILL' and the various options 'Add', 'Back', 'Clear' and 'Speech'.

Figure 11: Construction of the word 'ILL'

CONCLUSION

This paper compares different techniques and chooses the most optimal approach for creating a vision-based application for sign language to text/speech conversion for deaf/mute people. The proposed system could efficiently recognize the alphabets from images using a customized SVM model. This project is aimed at societal contribution.

ACKNOWLEDGMENT

We thank the institute BMS College of Engineering for the wonderful learning opportunity. We also express our gratitude to the Department of Information Science and Engineering for facilitating the process.

REFERENCES

[1] Patwary, Muhammed J. A. & Parvin, Shahnaj & Akter, Subrina. (2015). Significant HOG-Histogram of Oriented Gradient Feature Selection for Human Detection. International Journal of Computer Applications. 132. 20-24. 10.5120/ijca2015907704.
[2] Nelson, Ann & Price, KJ & Multari, Rosalie. ASL Reverse Dictionary - ASL Translation Using Deep Learning. Southern Methodist University / Sandia National Laboratory.
[3] Dumitrescu & Boiangiu, Costin-Anton. (2019). A Study of Image Upsampling and Downsampling Filters. Computers. 8. 30. 10.3390/computers8020030.
[4] Saeed, Khalid & Tabedzki, Marek & Rybnik, Mariusz & Adamski, Marcin. (2010). K3M: A universal algorithm for image skeletonization and a review of thinning techniques. Applied Mathematics and Computer Science. 20. 317-335. 10.2478/v10006-010-0024-4.
[5] Mohan, Vijayarani. (2013). Performance Analysis of Canny and Sobel Edge Detection Algorithms in Image Mining. International Journal of Innovative Research in Computer and Communication Engineering. 1760-1767.
[6] Tzotsos, Angelos & Argialas, Demetre. (2008). Support Vector Machine Classification for Object-Based Image Analysis. 10.1007/978-3-540-77058-9_36.
[7] Mishra, Sidharth & Sarkar, Uttam & Taraphder, Subhash & Datta, Sanjoy & Swain, Devi & Saikhom, Reshma & Panda, Sasmita & Laishram, Menalsh. (2017). Principal Component Analysis. International Journal of Livestock Research. 1. 10.5455/ijlr.20170415115235.
[8] Evgeniou, Theodoros & Pontil, Massimiliano. (2001). Support Vector Machines: Theory and Applications. 2049. 249-257. 10.1007/3-540-44673-7_12.
[9] Banjoko, Alabi & Yahya, Waheed Babatunde & Garba, Mohammed Kabir & Olaniran, Oyebayo & Dauda, Kazeem & Olorede, Kabir. (2016). SVM Paper in Tibiscus Journal 2016.
[10] Pradhan, Ashis. (2012). Support vector machine - A survey. IJETAE. 2.
[11] Apostolidis-Afentoulis, Vasileios. (2015). SVM Classification with Linear and RBF kernels. 10.13140/RG.2.1.3351.4083.
[12] Kumar, Pradeep & Gauba, Himaanshu & Roy, Partha & Dogra, Debi. (2017). A Multimodal Framework for Sensor based Sign Language Recognition. Neurocomputing. 259. 10.1016/j.neucom.2016.08.132.
[13] Trigueiros, Paulo & Ribeiro, Fernando & Reis, Luís. (2014). Vision Based Portuguese Sign Language Recognition System. Advances in Intelligent Systems and Computing. 275. 10.1007/978-3-319-05951-8_57.
[14] Singh, Sanjay & Pai, Suraj & Mehta, Nayan & Varambally, Deepthi & Kohli, Pritika & Padmashri, T. (2019). Computer Vision Based Sign Language Recognition System.
[15] M. Khan, S. Chakraborty, R. Astya and S. Khepra, "Face Detection and Recognition Using OpenCV," 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 2019, pp. 116-119.
