2018 4th International Conference on Computing Communication and Automation (ICCCA)
Hand Gesture Recognition System using Convolutional Neural Networks
Raj Patel, Jash Dhakad, Kashish Desai, Tanay Gupta, Prof. Stevina Correia
Department of Information Technology, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
raj1471997@gmail.com, jashdhakad10@gmail.com, kashishdesai1997@gmail.com, tanaygupta3010@gmail.com, stevina.dias@djsce.ac.in
Abstract—Gesture recognition plays an important role in communication through sign language. It is a fast-growing domain within computer vision and has attracted significant research due to its widespread social impact. To tackle the difficulties faced by people with hearing impairment, there is a pressing need for a system that translates sign language into text which can be easily understood. In this paper, a static hand gesture recognition system is developed for American Sign Language using a deep Convolutional Neural Network. The system architecture is lightweight, making the system easily deployable and mobile. In order to achieve high accuracy in live scenarios, we employ a number of image processing techniques which assist in appropriate background subtraction and frame segregation. Our approach focuses on mobility, zero cost and easy deployment in low-computation environments. Our system achieved a testing accuracy of 96%.

Index Terms—Sign Recognition, Gesture Recognition, Computer Vision, Convolutional Neural Networks.

I. INTRODUCTION

Communication is the imparting, sharing and conveying of information, news, ideas and feelings. Sign language is a form of non-verbal communication which is gaining impetus and a strong foothold due to its applications in a large number of fields. Its most prominent application is its use by differently abled persons, such as deaf and mute people, who can thereby communicate with non-signing people without the help of a translator or interpreter. Other applications are found in the automotive sector, the transit sector, the gaming sector and in unlocking a smartphone [1]. Sign gesture recognition can be done in two ways: static gestures and dynamic gestures [2]. While communicating, static gestures make use of hand shapes, whereas dynamic gestures make use of the movements of the hand [2]. Our paper focuses on static gestures. Hand gesture recognition is a way of understanding and then classifying the movements made by the hands. However, human hands have very complex articulations with the human body, and therefore many errors can arise [3]; this makes hand gestures difficult to recognize. Our paper focuses on detecting and recognizing hand gestures using different methods, determining the accuracy achieved by those methods, and examining the performance, convenience and issues associated with each. Currently, many methods and technologies are used for sign and gesture recognition, the most common being Hand Glove Based Analysis, Microsoft Kinect Based Analysis, Support Vector Machines and Convolutional Neural Networks. One objective of these methods is to bridge the communication gap between speech- and hearing-impaired people and other people, and to enable the successful and smooth integration of these differently abled people into society. In this paper we build a real-time communication system using recent advances in Machine Learning. The systems currently in existence either work on a small dataset and achieve stable accuracy, or work on a large dataset with unstable accuracy. We try to resolve this problem by applying a Convolutional Neural Network (CNN) to a fairly large dataset to achieve good and stable accuracy.
II. LITERATURE SURVEY

In order to bridge the communication gap between hearing- and speech-impaired people and others, researchers have used different approaches for the recognition of various hand gestures. These approaches can be broadly divided into three categories: the Hand Segmentation Approach, the Gesture Recognition Approach and the Feature Extraction Approach.

Two categories of vision-based hand gesture recognition can be used. The first is a 3-D hand gesture model that works by comparing input frames and makes use of sensors such as gloves, helmets, etc. [4]. The other is Microsoft Kinect based analysis, which makes use of the Kinect camera; the Kinect hardware gives accurate tracking of several user joints. The 3-D hand gesture model requires a huge dataset and also incurs a higher hardware cost due to the sensors on the gloves. Such a glove-based model for American Sign Language was proposed by Starner and Pentland [5]; however, it is not practical for the user to wear gloves continuously.

The 2-D hand gesture model makes use of an image dataset for feature extraction and detection. Many other approaches are used for image-based gesture recognition, such as ANN (Artificial Neural Network), HMM (Hidden Markov Model), eigenvalue-based and perceptual-colour-based methods. The feature vectors extracted from the image are input to the HMM [6]. For classification, particle filtering and segmentation methods such as the Support Vector Machine (SVM) are used, where the image frame is converted into the HSV colour space, as it is less sensitive to lighting effects [7]. Feature extraction can be performed using various methods; one of the most widely used is the Contour Shape Technique, which extracts the boundary information of the sign.
III. CURRENTLY USED METHODOLOGIES

A. Feature Extraction

A feature is a function of one or more measurements computed so that it quantifies some significant characteristic of the object [9]. Feature extraction is a special form of dimensionality reduction. In pattern recognition and in image processing, if the given input is too large to process, it is suspected to be redundant, and the input data is transformed into a reduced representation set of features [8]. Feature extraction can thus be defined as the process of transforming input data into a set of features. The general expectation is that, if the features are chosen carefully, the feature set will extract the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input.

Some issues with feature extraction are as follows. Firstly, the features should carry enough information about the image and should not require any domain-specific knowledge for their extraction [9]. Secondly, the features should be easy to compute, in order to make feature extraction feasible for large image collections and rapid retrieval. Also, they should relate well to human perceptual characteristics, since users ultimately determine the suitability of the retrieved images.
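As a hedged illustration of such a feature extractor, the following Python/OpenCV sketch derives simple boundary-based descriptors (in the spirit of the Contour Shape Technique mentioned in Section II) from an assumed binary hand mask; the file name and preprocessing are illustrative assumptions:

import cv2

# Assumed input: a binary mask with the hand in white on a black background.
mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)
_, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

# OpenCV 4 signature; the largest external contour is taken as the hand boundary.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
hand = max(contours, key=cv2.contourArea)

# Compact boundary descriptors usable as features.
area = cv2.contourArea(hand)
perimeter = cv2.arcLength(hand, True)
x, y, w, h = cv2.boundingRect(hand)
print(area, perimeter, w / float(h))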
B. Hand Segmentation Approach

Hand tracking and segmentation should always be done efficiently, as they are key to the success of any gesture recognition system, given the challenges that vision-based methods pose, such as continuous variation in lighting intensity, many (complex) objects in the background, and detection of skin colour. Colour is a very powerful descriptor for object detection. Thus, colour information is used for segmentation, as it is invariant to rotation and to geometric variation of the hand [10]. Humans perceive a colour's features, such as saturation, hue and brightness, rather than the percentages of the primary colours red, green and blue [10]. Colour models represent colours in a standardized way: a space-coordinate system in which any specified colour is represented by a single point. Using different colour spaces for robust hand detection and segmentation, three techniques were introduced; the Hand Tracking and Segmentation (HTS) technique using the HSV colour space is identified for the pre-processing stage of an HGR system (a brief sketch is given at the end of this subsection).

Some issues with hand segmentation are as follows. Firstly, irrelevant objects might overlap with the hand. Also, the performance of the hand segmentation algorithm degrades when the distance between the user and the camera is more than 1.5 meters [11]. Lastly, hand segmentation restricts the user to making gestures in a particular manner: gestures must be made with the right hand only, the arm should be vertical, the palm should face the camera and the background should be clear and uniform.
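A minimal sketch of HSV-based skin segmentation in OpenCV, as used by the HTS technique above; the skin-tone bounds are illustrative assumptions and would need calibration per user and lighting:

import cv2
import numpy as np

frame = cv2.imread("frame.png")  # assumed BGR input frame

# HSV separates hue/saturation from brightness, easing skin detection.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Illustrative skin-tone bounds in (H, S, V); real systems calibrate these.
lower = np.array([0, 40, 60], dtype=np.uint8)
upper = np.array([25, 180, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)

# Retain only the hand pixels of the original frame.
hand = cv2.bitwise_and(frame, frame, mask=mask)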
C. Glove Based Hand Gesture Recognition

Glove-based approaches make use of gesture sensors or capacitive touch sensors embedded into gloves to recognize hand gestures. The widely used methods rely on hand motion to convey hand signs; the motion is tracked and translated into text. Hand motions are categorized using clustering techniques such as k-means. Other approaches use charge-transfer touch sensors and translate on/off binary signals. These approaches achieve high accuracy but incur a high cost due to the necessary hardware.

Fig. 1: System Block Diagram

IV. PROPOSED METHODOLOGY

In view of the limitations posed by the approaches mentioned above, our system focuses on mitigating those shortcomings. The system has to be deployable on a mobile or web application for far reach and easy accessibility, so it must be lightweight while remaining computationally capable of recognizing the signs appropriately.

We propose a computer-vision-based approach to recognize static hand gestures. The system analyzes a video feed, recognizes the hand gesture and outputs the correct class label. For each sign performed by the user, the system outputs one of 36 class labels comprising the ASL gestures for alphabets and numbers. The system takes in a video feed, which could be pre-recorded or coming live from an input device, segregates each sign and outputs the sign label accordingly. The major challenges identified were performing precise background subtraction and achieving accuracy high enough for the system to be usable for formulating sentences. Background subtraction required tackling changing illumination and foreground noise in the input images; operations such as Gaussian Mixture based segmentation and image morphology are performed to reduce background noise. Our system architecture therefore employs three phases, viz. frame segregation, image processing and image recognition. Fig. 1 illustrates the system block diagram.

V. IMPLEMENTATION

A. Frame Segregation

Frame segregation is the first stage: it involves identifying the frames which contain the sign gesture and segregating those frames for further processing and recognition. To extract the relevant individual frames from the video feed, we perform a frame-by-frame comparison, computing the Structural Similarity Index (SSIM) between adjacent frames and selecting distinct frames based on a threshold. The video feed is captured from a webcam at a resolution of 900 x 900 and 23 fps; it is reduced to 12 fps, and the user is given an ROI in which to perform the sign gesture, so the final cropped images are of size 300 x 300. For two images x and y, the SSIM is calculated according to equation (1):

SSIM(x, y) = [(2 μx μy + C1)(2 σxy + C2)] / [(μx² + μy² + C1)(σx² + σy² + C2)]    (1)

where μx is the average of x, μy is the average of y, σx² is the variance of x, σy² is the variance of y, σxy is the covariance of x and y; C1 = (k1 L)² and C2 = (k2 L)² are two variables that stabilize the division when the denominator is weak, L is the dynamic range of the pixel values, and k1 = 0.01 and k2 = 0.03 by default [12].
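As a minimal sketch, equation (1) can be transcribed directly in NumPy over whole grayscale frames (a simplified global variant; library implementations such as scikit-image's structural_similarity average SSIM over local windows):

import numpy as np

def ssim_global(x, y, L=255, k1=0.01, k2=0.03):
    # Global SSIM of equation (1), computed from whole-frame statistics.
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))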
An SSIM of 1 indicates perfect similarity. Through testing, the threshold value for two images to be considered distinct was identified as 0.45. Fig. 2 illustrates the SSIM between intermediate frames. As evident from the calculated values, the SSIM between two similar images is greater than the threshold value. For dissimilar frames such as frame 1 and frame 2, the calculated SSIM is below the threshold, but since the transition is from background to foreground, frame 2 is not considered distinct. On the contrary, frame 5, which moves from foreground back to background, is considered a distinct frame.

Fig. 2: SSIM between intermediate frames
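A sketch of the segregation loop under the settings stated above (SSIM threshold 0.45, 300 x 300 ROI); the capture index and ROI position are assumptions, and the background-to-foreground transition handling described above is omitted for brevity:

import cv2
from skimage.metrics import structural_similarity

THRESHOLD = 0.45  # distinct-frame threshold identified through testing

cap = cv2.VideoCapture(0)  # webcam feed (assumed device index)
distinct_frames = []
prev = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Assumed 300x300 ROI in which the user performs the sign.
    roi = cv2.cvtColor(frame[:300, :300], cv2.COLOR_BGR2GRAY)
    if prev is not None and structural_similarity(prev, roi) < THRESHOLD:
        distinct_frames.append(roi)  # candidate frame for further processing
    prev = roi
cap.release()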
B. Image Processing

After the frame segregation phase, the next phase is processing the output frames. The first step is background subtraction. Since our system is mobile, the input images vary a lot in terms of background and lighting conditions. Background subtraction is performed using the Gaussian Mixture based background segmentation algorithm, which uses a mixture of K Gaussian distributions to model each background pixel [13][14]. The output frames are compared against the background model, and a resulting image is obtained after subtraction. Plain background subtraction was ruled out due to its low tolerance to dynamic conditions. Noise reduction was then performed, since some observable noise was present in the resulting image. The two main types of noise observed were spatial noise due to motion, and salt-and-pepper noise due to changes in lighting conditions. To remove spatial noise, low-pass spatial filtering was used with a kernel of size 3; to reduce the remaining noise, morphological opening was performed on the subtracted image with a structuring element of size 5. Morphological opening performs erosion followed by dilation, which is useful for removing noise.

Fig. 3: Image Processing Steps

The resulting binary image after performing these operations is illustrated in Fig. 3. This resultant image alone would not yield high accuracy in image recognition tasks, because important positional features of the hand and fingers are lost when background subtraction is performed. Thus, in order to retain those positional features, an AND operation is performed between the original image and the noise-reduced subtracted image; the white pixels of the binary image act as a filter for the RGB image. The resulting RGB image is then converted to grayscale, to eliminate any bias due to the user's skin tone or foreground lighting during recognition.
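A condensed OpenCV sketch of this processing pipeline; we use the MOG2 background subtractor as the Gaussian Mixture implementation (the model cited in [13] is available in OpenCV's contrib module), and the kernel sizes follow the text:

import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def process(frame):
    # Gaussian Mixture based subtraction yields a binary foreground mask.
    mask = subtractor.apply(frame)

    # 3x3 low-pass spatial filtering against motion noise, then re-binarize.
    mask = cv2.blur(mask, (3, 3))
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

    # Morphological opening (erosion then dilation) with a 5x5 structuring
    # element removes salt-and-pepper noise.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # AND the mask with the original frame to retain positional features,
    # then convert to grayscale to suppress skin-tone and lighting bias.
    masked = cv2.bitwise_and(frame, frame, mask=mask)
    return cv2.cvtColor(masked, cv2.COLOR_BGR2GRAY)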
C. Image Recognition

The last phase in the system is the image recognition phase. In order to achieve higher accuracy than existing systems while keeping the system computationally lightweight, we make use of a Convolutional Neural Network (CNN) for image recognition. Convolutional neural networks are a class of feed-forward artificial neural networks commonly used for visual analysis tasks. They comprise neurons which act as learnable parameters, each having its own weights and biases. The network learns with the help of a loss function, and a learning rate is used to fine-tune the learning. The input layer takes in the data, which is propagated through the various layers until an output is generated; the generated output is compared with the actual output, and the system updates its weights and biases to correct itself. This crucial step is known as backpropagation, and this process, done iteratively, is called training. The training duration of a CNN depends on its size, the number of layers and the learning rate.

The CNN was trained using the ASL sign language image dataset, consisting of around 35K images with each class having a minimum of 800 images. The dataset consists of grayscale static sign images covering alphabets and numbers; Fig. 5 shows some images from the dataset. The character labels associated with each image were converted into binary vectors using one-hot encoding, thus converting categorical values into numbers.

Fig. 5: Images from dataset

The proposed architecture of our CNN is illustrated in Fig. 4. It consists of three convolutional layers with 32, 64 and 128 filters respectively, with intermediate max-pooling layers and ReLU activations; a kernel size of 3 and a pool size of 2 were used. The last layers consist of a flattening layer and fully connected layers, with dropout layers in between to avoid overfitting. The final dense layer is of size 36, corresponding to the number of class labels, with softmax activation. The input to the CNN is the grayscale processed image resized to 28 x 28 as per the dataset, and the output of the CNN is a probability distribution over the classes, with values between 0 and 1. The loss function used for training was categorical cross-entropy and the optimizer was RMSprop. Training was conducted for 250 epochs with a batch size of 512.

Fig. 4: CNN Architecture
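A Keras sketch consistent with the architecture described above and illustrated in Fig. 4 (dataset loading, the dense-layer width and the dropout rate are assumptions where the text does not specify them):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 36

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),              # 28x28 grayscale input
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),         # assumed width
    layers.Dropout(0.5),                          # assumed rate
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer="rmsprop", metrics=["accuracy"])

# Labels are one-hot encoded as described above, then the model is trained:
# y_train = keras.utils.to_categorical(labels, num_classes)
# model.fit(x_train, y_train, epochs=250, batch_size=512)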
For any image fed into the CNN, the network outputs a probability distribution; the node containing the highest probability value is taken as the output node, and the label associated with that node is output. In this way the system determines which sign the user performed.
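Continuing the sketch, the output node is recovered with an argmax over the predicted distribution (the label ordering is an assumption about how the 36 classes are indexed):

import numpy as np

# Assumed class ordering: digits 0-9 followed by letters A-Z.
LABELS = list("0123456789") + [chr(c) for c in range(ord("A"), ord("Z") + 1)]

def predict_sign(model, gray_28x28):
    x = gray_28x28.astype("float32")[None, :, :, None] / 255.0
    probs = model.predict(x)[0]           # distribution over the 36 classes
    return LABELS[int(np.argmax(probs))]  # label at the highest-probability node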
VI. RESULTS

The recognition accuracy of the CNN on the test set was 96.36%. Fig. 6 illustrates the model accuracy and loss during training and validation. From the figure it is evident that the model loss converges to almost zero, eliminating the case of under-fitting, i.e. the model is capable of generalizing to the dataset; the test accuracy obtained also rules out the suspicion of the model being over-fitted, i.e. the system classifies unseen data correctly. Tested in a live setting, our system yielded 38 correct predictions out of a set of 40 trials, an accuracy of 95%. Frame segregation was able to pick out the distinct frames correctly, and the system performed well even when the lighting conditions were changed. Under the constraints posed by the static gesture recognition scenario, our system produced better results than existing systems, which rely on feature extraction or the use of costly gloves. It can be deployed as an application on a minimal system and comes at zero cost.

Fig. 6: CNN performance graphs: (a) model accuracy, (b) model loss
VII. CONCLUSION AND FUTURE WORK

In this study, a system to classify static gestures was designed and implemented using a Convolutional Neural Network. Our system is adaptive and performs robustly under varied lighting and background conditions. The proposed system has a low computational cost and can be deployed in a mobile setting, which makes it suitable for real-time applications. Research on vision-based gesture recognition is still in progress, and our future work will focus on further improving the accuracy, expanding the classification dictionary and employing dynamic gestures for recognition.

REFERENCES

[1] Gesture Recognition (2018, October 4), Wikipedia [Online]. Available:
[2] Priyanka C. Pankajakshan and Thilagavathi B., "Sign Language Recognition System," IEEE Sponsored 2nd International Conference on Innovations in Information Embedded and Communication Systems (ICIIECS'15), 2015.
[3] Jobin Francis and Anoop B. K., "Significance of Hand Gesture Recognition Systems in Vehicular Automation: A Survey," International Journal of Computer Applications, 99(7):50-55, August 2014.
[4] T. Starner and A. Pentland, "Real-time American sign language recognition from video using hidden Markov models," Technical Report No. 375, M.I.T. Media Laboratory Perceptual Computing Section, 1995.
[5] F. Camastra and D. De Felice, "LVQ-based hand gesture recognition using a data glove," Neural Nets and Surroundings, Springer Berlin Heidelberg, 2013, pp. 159-168.
[6] S. Lang, M. Block and R. Rojas, "Sign Language Recognition Using Kinect," in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Springer Berlin / Heidelberg, pp. 394-402, 2011.
[7] V. K. Verma, S. Srivastava and N. Kumar, "A comprehensive review on automation of Indian sign language," IEEE International Conference on Advances in Computer Engineering and Applications, Mar. 2015, pp. 138-142.
[8] Sanaa Khudayer Jadwaa, "Feature Extraction for Hand Gesture Recognition: A Review," International Journal of Scientific & Engineering Research, Volume 6, Issue 7, July 2015.
[9] George Karidakis et al., Feature Extraction, Shodhganga.
[10] Archana Ghotkar and Gajanan K. Kharate, "Hand Segmentation Techniques to Hand Gesture Recognition for Natural Human Computer Interaction," International Journal of Human Computer Interaction.
[11] Rafiqul Zaman Khan and Noor Adnan Ibraheem, "Comparative Study of Hand Gesture Recognition System," SIPM, FCST, ITCA, WSE, ACSIT, CS & IT 06, pp. 203-213, 2012.
[12] Structural Similarity (2018, August 27), Wikipedia [Online]. Available:
[13] P. KaewTraKulPong and R. Bowden, "An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection," in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), Sept. 2001. Video Based Surveillance Systems: Computer Vision and Distributed Processing.
[14] Background Subtraction, Open Source Computer Vision (OpenCV) documentation.