Review

Hand Gesture Recognition Based on Computer Vision: A Review of Techniques
Munir Oudah 1, Ali Al-Naji 1,2,* and Javaan Chahl 2
 1    Electrical Engineering Technical College, Middle Technical University, Baghdad 10022, Iraq;
      Munir_aliraqi@yahoo.com
 2    School of Engineering, University of South Australia, Mawson Lakes SA 5095, Australia;
      Javaan.Chahl@unisa.edu.au
 *    Correspondence: ali_al_naji@mtu.edu.iq; Tel.: +96-477-1030-4768
                                                                                                       
 Received: 23 May 2020; Accepted: 21 July 2020; Published: 23 July 2020                                
 Abstract: Hand gestures are a form of nonverbal communication that can be used in several fields
 such as communication between deaf-mute people, robot control, human–computer interaction (HCI),
 home automation and medical applications. Research papers based on hand gestures have adopted
 many different techniques, including those based on instrumented sensor technology and computer
vision. Hand signs can be classified under many headings, such as posture and
 gesture, as well as dynamic and static, or a hybrid of the two. This paper focuses on a review of the
 literature on hand gesture techniques and introduces their merits and limitations under different
 circumstances. In addition, it tabulates the performance of these methods, focusing on computer
 vision techniques that deal with the similarity and difference points, technique of hand segmentation
 used, classification algorithms and drawbacks, number and types of gestures, dataset used, detection
 range (distance) and type of camera used. This paper is a thorough general overview of hand gesture
 methods with a brief discussion of some possible applications.
Keywords: hand gesture; hand posture; computer vision; human–computer interaction (HCI)
1. Introduction
     Hand gestures are an aspect of body language that can be conveyed through the center of the
palm, the finger position and the shape constructed by the hand. Hand gestures can be classified
into static and dynamic. As its name implies, the static gesture refers to the stable shape of the hand,
whereas the dynamic gesture comprises a series of hand movements such as waving. There are a
variety of hand movements within a gesture; for example, a handshake varies from one person to
another and changes according to time and place. The main difference between posture and gesture is
that posture focuses more on the shape of the hand whereas gesture focuses on the hand movement.
The main approaches to hand gesture research can be classified into the wearable glove-based sensor
approach and the camera vision-based sensor approach [1,2].
     Hand gestures offer an inspiring field of research because they can facilitate communication and
provide a natural means of interaction that can be used across a variety of applications. Previously,
hand gesture recognition was achieved with wearable sensors attached directly to the hand with gloves.
These sensors detected a physical response according to hand movements or finger bending. The data
collected were then processed using a computer connected to the glove with wire. This system of
glove-based sensor could be made portable by using a sensor attached to a microcontroller.
     As illustrated in Figure 1, hand gestures for human–computer interaction (HCI) started with the
invention of the data glove sensor. It offered simple commands for a computer interface. The gloves
used different sensor types to capture hand motion and position by detecting the correct coordinates of the palm and fingers.
     Figure 1. Different techniques for hand gestures. (a) Glove-based attached sensor either connected to the computer or portable; (b) computer vision–based camera using a marked glove or just a naked hand.
     Although the techniques mentioned above have provided good outcomes, they have various limitations that make them unsuitable for the elderly, who may experience discomfort and confusion due to wire connection problems. In addition, elderly people suffering from chronic disease conditions that result in loss of muscle function may be unable to wear and take off gloves, causing them discomfort and constraining them if used for long periods. These sensors may also cause skin damage, infection or adverse reactions in people with sensitive skin or those suffering burns. Moreover, some sensors are quite expensive. Some of these problems were addressed in a study by Lamberti and Camastra [9], who developed a computer vision system based on colored marked gloves. Although this study did not require the attachment of sensors, it still required colored gloves to be worn.
     These drawbacks led to the development of promising and cost-effective techniques that did not require cumbersome gloves to be worn. These techniques are called camera vision-based sensor technologies. With the evolution of open-source software libraries, it is easier than ever to detect
hand gestures that can be used under a wide range of applications like clinical operations [10], sign
language [11], robot control [12], virtual environments [13], home automation [14], personal computer
and tablet [15], gaming [16]. These techniques essentially involve replacement of the instrumented
glove with a camera. Different types of camera are used for this purpose, such as RGB camera, time of
flight (TOF) camera, thermal cameras or night vision cameras.
       Algorithms have been developed based on computer vision methods to detect hands using these
different types of cameras. The algorithms attempt to segment and detect hand features such as skin
color, appearance, motion, skeleton, depth, 3D model, deep learning detection and more. These methods
involve several challenges, which are discussed in this paper in the following sections.
       Several studies based on computer vision techniques were published in the past decade. A study
by Murthy et al. [17] covered the role and fundamental technique of HCI in terms of the recognition
approach, classification and applications, describing computer vision limitations under various
conditions. Another study by Khan et al. [18] presented a recognition system concerned with the
issue of feature extraction, gesture classification, and considered the application area of the studies.
Suriya et al. [19] provided a specific survey on hand gesture recognition for mouse control applications,
including methodologies and algorithms used for human–machine interaction. In addition, they
provided a brief review of the hidden Markov model (HMM). A study by Sonkusare et al. [20]
reported various techniques and made comparisons between them according to hand segmentation
methodology, tracking, feature extraction, recognition techniques, and concluded that the recognition
rate was a tradeoff with temporal rate limited by computing power. Finally, Kaur et al. [16] reviewed
several methods, both sensor-based and vision-based, for hand gesture recognition to improve the precision of algorithms through integrating current techniques.
     The studies above give insight into some gesture recognition systems under various scenarios, and address issues such as scene background limitations, illumination conditions, algorithm accuracy for feature extraction, dataset type, classification algorithm used and application. However, no review paper mentions camera type, distance limitations or recognition rate. Therefore, the objective of this study is to provide a comparative review of recent studies concerning computer vision techniques with regard to hand gesture detection and classification supported by different technologies. The current paper discusses the seven most reported approaches to the problem: skin color, appearance, motion, skeleton, depth, 3D model and deep learning. This paper also discusses these approaches in detail and summarizes some modern research under different considerations (type of camera used, resolution of the processed image or video, type of segmentation technique, classification algorithm used, recognition rate, type of region of interest processing, number of gestures, application area, limitation or invariant factor, detection range achieved and, in some cases, data set used, runtime speed, hardware run and type of error). In addition, the review presents the most popular applications associated with this topic.
     The remainder of this paper is organized as follows. Section 2 explains hand gesture methods, focusing on computer vision techniques and describing the seven most common techniques (skin color, appearance, motion, skeleton, depth, 3D model and deep learning), supported by tables. Section 3 illustrates in detail seven application areas that deal with hand gesture recognition systems. Section 4 briefly discusses research gaps and challenges. Finally, Section 5 presents our conclusions. Figure 2 below clarifies the classification of methods conducted by this review.
     Figure 2. Classification of methods conducted by this review.
2. Hand Gesture Methods

     The primary goal in studying gesture recognition is to introduce a system that can detect specific human gestures and use them to convey information or for command and control purposes. Therefore, it includes not only tracking of human movement, but also the interpretation of that movement as significant commands. Two approaches are generally used to interpret gestures for HCI applications. The first approach is based on data gloves (wearable or direct contact) and the second approach is based on computer vision without the need to wear any sensors.
     Figure 4. Using computer vision techniques to identify gestures, where the user performs a specific gesture with one or both hands in front of a camera connected to a system framework that involves different possible techniques to extract features and classify the hand gesture in order to control some possible application.
     Figure 6. Example of skin color detection. (a) Apply a threshold to the channels of YUV color space in order to extract only skin color, then assign the value 1 to skin and 0 to non-skin pixels; (b) detected and tracked hand using the resulting binary image.
     Several formats of color space are available for skin segmentation, as itemized below:
•    red, green, blue (R–G–B and RGB-normalized);
•    hue and saturation (H–S–V, H–S–I and H–S–L);
•    luminance (YIQ, Y–Cb–Cr and YUV).
     More detailed discussion of skin color detection based on RGB channels can be found in [28,29].
However, it is not preferred for skin segmentation purposes because the mixture of the color channel
and intensity information of an image has irregular characteristics [26]. Skin color can be detected by thresholding the three channels (red, green and blue). In the case of normalized-RGB, the color
information is simply separated from the luminance. However, under lighting variation, it cannot be
relied on for segmentation or detection purposes, as shown in the studies [30,31].
     Color spaces in the hue/saturation and luminance families perform well under lighting variations. However, transforming RGB to HSI or HSV takes time when there is substantial variation in color value (hue and saturation); therefore, pixels within a chosen intensity range are selected. The RGB to HSV transformation may also be time-consuming because it requires converting from Cartesian to polar coordinates. Thus, HSV space is most useful for detection in simple images.
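     For reference, the Cartesian-to-polar step mentioned above appears in the standard RGB-to-HSV conversion; this is the textbook formulation (with R, G, B scaled to [0, 1]), not a formula taken from the cited studies:

\[
V = \max(R, G, B), \qquad \Delta = V - \min(R, G, B), \qquad S = \begin{cases} \Delta / V, & V \neq 0 \\ 0, & V = 0 \end{cases}
\]
\[
H = 60^{\circ} \times \begin{cases} \big((G - B)/\Delta\big) \bmod 6, & V = R \\ (B - R)/\Delta + 2, & V = G \\ (R - G)/\Delta + 4, & V = B \end{cases}
\]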
     Transforming and splitting channels of Y–Cb–Cr color space is simple compared with the HSV color family in regard to skin color detection and segmentation, as illustrated in [32,33]. Skin tone detection based on Y–Cb–Cr is demonstrated in detail in [34,35].
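     To make this kind of luminance-family thresholding concrete, here is a minimal sketch using OpenCV (which orders the channels as Y–Cr–Cb); the Cr/Cb bounds are illustrative assumptions that would need tuning to a given dataset, not values from the cited studies:

```python
import cv2
import numpy as np

# Illustrative skin bounds in (Y, Cr, Cb) order; dataset-dependent.
LOWER = np.array([0, 135, 85], dtype=np.uint8)
UPPER = np.array([255, 180, 135], dtype=np.uint8)

def skin_mask(frame_bgr):
    """Return a binary mask (255 = skin, 0 = non-skin) for a BGR frame."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, LOWER, UPPER)
    # Morphological opening/closing suppress isolated noise pixels.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```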
     The image is processed to convert RGB color space to another color space in order to detect the
region of interest, normally a hand. This method can be used to detect the region through the range of
possible colors, such as red, orange, pink and brown. The training sample of skin regions is studied to
obtain the likely range of skin pixels with the band values for R, G and B pixels. To detect skin regions, the pixel colors in the region should be compared with the predetermined sample color. If similar,
then the region can be labeled as skin [36]. Table 1 presents a set of research papers that use different
techniques to detect skin color.
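     A small sketch of the sample-based labeling just described, where per-channel R, G and B bounds are learned from example skin patches; the percentile rule here is an illustrative assumption, not a method from the cited studies:

```python
import numpy as np

def learn_rgb_bounds(skin_patches, lo_pct=5, hi_pct=95):
    """Derive per-channel (R, G, B) bounds from labeled skin patches.

    skin_patches: list of H x W x 3 uint8 arrays cropped from skin regions.
    Percentile bounds are an illustrative choice to reject outlier pixels.
    """
    pixels = np.concatenate([p.reshape(-1, 3) for p in skin_patches])
    lower = np.percentile(pixels, lo_pct, axis=0)
    upper = np.percentile(pixels, hi_pct, axis=0)
    return lower, upper

def label_skin(image_rgb, lower, upper):
    """Label a pixel as skin (True) if all three channels fall in range."""
    in_range = (image_rgb >= lower) & (image_rgb <= upper)
    return in_range.all(axis=2)
```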
     The skin color method involves various challenges, such as illumination variation, background
issues and other types of noise. A study by Perimal et al. [37] provided 14 gestures under
controlled room lighting conditions using an HD camera at short distance (0.15 to 0.20 m), and the gestures were tested with three parameters (noise, light intensity and size of hand) which directly
affect recognition rate. Another study by Sulyman et al. [38] observed that using Y–Cb–Cr color space is
beneficial for eliminating illumination effects, although bright light during capture reduces the accuracy.
A study by Pansare et al. [11] used RGB to normalize and detect skin and applied a median filter to the
red channel to reduce noise on the captured image. The Euclidian distance algorithm was used for
feature matching based on a comprehensive dataset. A study by Rajesh et al. [15] used HSI to segment
the skin color region under controlled environmental conditions, to enable proper illumination and
reduce the error.
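     To make the feature-matching step of [11] concrete, here is a minimal nearest-neighbor sketch based on Euclidean distance; the feature vectors and dataset below are placeholders for illustration, not the authors' actual descriptors:

```python
import numpy as np

def classify_gesture(feature, dataset_features, dataset_labels):
    """Return the label whose stored feature vector is closest in
    Euclidean distance to the query feature vector."""
    dists = np.linalg.norm(dataset_features - feature, axis=1)
    return dataset_labels[int(np.argmin(dists))]

# Usage with placeholder data: 26 stored A-Z gesture features of length 64.
features = np.random.rand(26, 64)               # stand-in for a real dataset
labels = [chr(ord('A') + i) for i in range(26)]
query = np.random.rand(64)
print(classify_gesture(query, features, labels))
```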
     Another challenge with the skin color method is that the background must not contain any
elements that match skin color. Choudhury et al. [39] suggested a novel hand segmentation based on
combining the frame differencing technique and skin color segmentation, which recorded good results,
but this method is still sensitive to scenes that contain moving objects in the background, such as moving
curtains and waving trees. Stergiopoulou et al. [40] combined motion-based segmentation (a hybrid of
image differencing and background subtraction) with skin color and morphology features to obtain
a robust result that overcomes illumination and complex background problems. Another study by
Khandade et al. [41] used a cross-correlation method to match hand segmentation with a dataset to
achieve better recognition. Karabasi et al. [42] proposed hand gestures for deaf-mute communication
based on mobile phones, which can translate sign language using HSV color space. Zeng et al. [43]
presented a hand gesture method to assist wheelchair users indoors and outdoors using red channel
thresholding with a fixed background to overcome the illumination change. A study by Hsieh et al. [44]
used face skin detection to define skin color. This system can correctly detect skin pixels under low
lighting conditions, and even when the face color is not in the normal range of skin chromaticity.
Another study, by Bergh et al. [45], proposed a hybrid method based on a combination of the histogram
and a pre-trained Gaussian mixture model to overcome lighting conditions. Pansare et al. [46] aligned
two cameras (RGB and TOF) together to improve skin color detection with the help of the depth property of the TOF camera to enhance detection and face background limitations.
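     As a simplified sketch of combining frame differencing with skin color segmentation in the spirit of [39] (assuming OpenCV; the thresholds are illustrative and this is not the authors' exact pipeline):

```python
import cv2

def moving_skin_mask(prev_bgr, curr_bgr, skin_lower, skin_upper, diff_thresh=25):
    """Combine frame differencing with YCrCb skin thresholding so that
    only moving, skin-colored pixels survive."""
    # Motion cue: absolute difference between consecutive grayscale frames.
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    motion = cv2.absdiff(curr_gray, prev_gray)
    _, motion_mask = cv2.threshold(motion, diff_thresh, 255, cv2.THRESH_BINARY)

    # Color cue: skin segmentation in YCrCb space.
    ycrcb = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, skin_lower, skin_upper)

    # Keep pixels satisfying both cues; static skin-like background drops out.
    return cv2.bitwise_and(motion_mask, skin)
```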
2.2.2. Appearance-Based Recognition

     This method depends on extracting image features in order to model the visual appearance of the hand, and comparing these parameters with features extracted from the input image frames. The features are calculated directly from the pixel intensities without a prior segmentation process. The method executes in real time because the 2D image features are easy to extract, and it is considered easier to implement than the 3D model method. In addition, this method can detect various skin tones. Utilizing the AdaBoost learning algorithm, which maintains fixed features such as key points for a portion of a hand and can thus address the occlusion issue [47,48], it can be separated into two models: a motion model and a 2D static model. Table 2 presents a set of research papers that use different segmentation techniques based on appearance recognition to detect the region of interest (ROI).
     A study by Chen et al. [49] proposed two approaches for hand recognition. The first approach focused on posture recognition using Haar-like features, which can describe the hand posture pattern effectively, and used the AdaBoost learning algorithm to speed up the performance and thus the rate of classification. The second approach focused on gesture recognition using a context-free grammar to analyze the syntactic structure based on the detected postures. Another study by Kulkarni and Lokhande [50] used three feature extraction methods, such as a histogram technique, to segment and observe images that contained a large number of gestures, then suggested using edge detection operators such as Canny, Sobel and Prewitt to detect the edges with different thresholds. Gesture classification was performed using a feed-forward back-propagation artificial neural network with supervised learning. Among the limitations reported by the authors is that the histogram technique produces misclassified results, because a histogram can only be used for a small number of gestures that are completely different from each other. Fang et al. [51] used an extended AdaBoost method for hand detection and combined optical flow with the color cue for tracking. They also collected hand color from the neighborhood of the features' mean position using a single Gaussian model to describe hand color in HSV color space, where multiple features were extracted and gesture recognition was performed using palm and finger decomposition; scale-space feature detection was then integrated into gesture recognition in order to counter the aspect-ratio limitation facing most hand gesture learning methods. Licsa'r et al. [52] used a simple background subtraction method for hand segmentation and extended it to handle background changes, in order to face challenges such as skin-like colors and complex, dynamic backgrounds, and then used a boundary-based method to classify hand gestures. Finally, Zhou et al. [53] proposed a novel method to directly extract the fingers, where edges were extracted from the gesture images, the finger central area was obtained from those edges, and fingers were then obtained from the parallel edge characteristics. The proposed system cannot recognize the side view of a hand pose. Figure 7 below shows a simple example of appearance recognition.
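     To illustrate the Haar-feature-plus-AdaBoost detection idea used in [49], here is a minimal sketch built on OpenCV's cascade framework; hand_cascade.xml is a hypothetical pre-trained hand cascade (OpenCV does not ship one), so this is an assumption-laden sketch rather than the authors' implementation:

```python
import cv2

# Hypothetical cascade file: a hand cascade must be trained separately
# (e.g., with opencv_traincascade); only face cascades ship with OpenCV.
cascade = cv2.CascadeClassifier("hand_cascade.xml")

def detect_hands(frame_bgr):
    """Run the boosted Haar cascade over an image pyramid and
    return bounding boxes of detected hand postures."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # normalize lighting before detection
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                    minSize=(40, 40))
```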
     Figure 7. Example of appearance recognition using foreground extraction in order to segment only the ROI, where the object features can be extracted using different techniques such as pattern or image subtraction and foreground and background segmentation algorithms.
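     A loose sketch of the parallel-edge finger extraction idea in [53], approximated here with Canny edges and probabilistic Hough segments grouped into near-parallel pairs; the pairing rule and all thresholds are assumptions for illustration, not the authors' algorithm:

```python
import cv2
import numpy as np

def candidate_fingers(gray, angle_tol=np.deg2rad(10), max_gap=30):
    """Return pairs of near-parallel edge segments as rough finger candidates."""
    edges = cv2.Canny(gray, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                           minLineLength=25, maxLineGap=5)
    if segs is None:
        return []
    segs = segs[:, 0, :].astype(float)             # rows of (x1, y1, x2, y2)
    angles = np.mod(np.arctan2(segs[:, 3] - segs[:, 1],
                               segs[:, 2] - segs[:, 0]), np.pi)
    mids = (segs[:, :2] + segs[:, 2:]) / 2.0
    pairs = []
    for i in range(len(segs)):
        for j in range(i + 1, len(segs)):
            d_ang = abs(angles[i] - angles[j])
            d_ang = min(d_ang, np.pi - d_ang)      # orientation wraps at 180°
            # Two roughly parallel, nearby edges suggest the two sides
            # of a single finger.
            if d_ang < angle_tol and np.linalg.norm(mids[i] - mids[j]) < max_gap:
                pairs.append((segs[i], segs[j]))
    return pairs
```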
Table 1. Set of research papers that have used skin color detection for hand gesture and finger counting application.

| Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Invariant Factor | Distance from Camera |
|---|---|---|---|---|---|---|---|---|---|---|
| [37] | off-the-shelf HD webcam | 16 Mp | Y–Cb–Cr | finger count | maximum distance of centroid of two fingers | 70% to 100% | 14 gestures | HCI | light intensity, size, noise | 150 to 200 mm |
| [38] | computer camera | 320 × 250 pixels | Y–Cb–Cr | finger count | expert system | 98% | 6 gestures | deaf-mute people | heavy light during capturing | – |
| [11] | Fron-Tech E-cam (web camera) | 10 Mp | RGB threshold & Sobel edge detection method | A–Z alphabet hand gesture | feature matching (Euclidean distance) | 90.19% | 26 static gestures | American sign language (ASL) | – | 1000 mm |
| [15] | webcam | 640 × 480 pixels | HIS & distance transform | finger count | distance transform method & circular profiling | 100% according to limitation | 6 gestures | control the slide during a presentation | location of hand | – |
| [39] | webcam | – | HIS & frame difference & Haar classifier | dynamic hand gestures | contour matching difference with the previous frame | – | hand segment | HCI | sensitive to moving background | – |
| [40] | webcam | 640 × 480 pixels | HSV & motion detection (hybrid technique) | hand gestures | (SPM) classification technique | 98.75% | hand segment | HCI | – | – |
| [41] | video camera | 640 × 480 pixels | HSV & cross-correlation | hand gestures | Euclidean distance | 82.67% | 15 gestures | man–machine interface (MMI) | – | – |
| [42] | digital or cellphone camera | 768 × 576 pixels | HSV | hand gestures | division by shape | – | hand segment | Malaysian sign language | some objects have the same skin color & hard edges | – |
| [43] | web camera | 320 × 240 pixels | red channel threshold segmentation method | hand postures | combine information from multiple cues of motion, color and shape | 100% | 5 hand postures | HCI wheelchair control | – | – |
| [44] | Logitech portable webcam C905 | 320 × 240 pixels | normalized R, G, original red | hand gestures | Haar-like directional patterns & motion history image | 93.13% static, 95.07% dynamic | 2 static, 4 dynamic gestures | man–machine interface (MMI) | – | <1 m, 1000–1500 mm, 1500–2000 mm |
| [45] | high resolution cameras | 640 × 480 pixels | HIS & Gaussian mixture model (GMM) & second histogram | hand postures | Haarlet-based hand gesture | 98.24% correct classification rate | 10 postures | manipulating 3D objects & navigating through a 3D model | changes in illumination | – |
| [46] | ToF camera & AVT Marlin color camera | 176 × 144 & 640 × 480 pixels | histogram-based skin color probability & depth threshold | hand gestures | 2D Haarlets | 99.54% | hand segment | real-time hand gesture interaction system | – | 1000 mm |

–: none.
Table 2. A set of research papers that have used appearance-based detection for hand gesture application.

| Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Invariant Factor | Distance from Camera |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [49] | Logitech Quick Cam web camera | 320 × 240 pixels | Haar-like features & AdaBoost learning algorithm | hand posture | parallel cascade structure | above 90% | 4 hand postures | real-time vision-based hand gesture classification | positive and negative hand samples collected by author | – | – |
| [50] | webcam-1.3 | 80 × 64 resized image for training | OTSU & Canny edge detection technique for grayscale image | hand sign | feed-forward back propagation neural network | 92.33% | 26 static signs | American Sign Language | dataset created by author | low differentiation | different distances |
| [51] | video camera | 320 × 240 pixels | Gaussian model describing hand color in HSV & AdaBoost algorithm | hand gesture | palm–finger configuration | 93% | 6 hand gestures | real-time hand gesture recognition method | – | – | – |
| [52] | camera–projector system | 384 × 288 pixels | background subtraction method | hand gesture | Fourier-based classification | 87.7% | 9 hand gestures | user-independent application | ground truth data set collected manually | point coordinates geometrically distorted & skin color | – |
| [53] | monocular web camera | 320 × 240 pixels | combined Y–Cb–Cr & edge extraction & parallel finger edge appearance | hand posture based on finger gesture | finger model | – | 14 static gestures | substantial applications | test data collected from videos captured by web camera | variation in lightness would result in edge extraction failure | ≤ 500 mm |

–: none.
According to the information mentioned in Table 2, the first row indicates the Haar-like feature, which is considered good for analyzing the ROI pattern efficiently. Haar-like features can efficiently analyze the contrast between dark and bright objects within a kernel and can operate faster compared with pixel-based systems. In addition, they are immune to noise and lighting variation because they calculate the gray value difference between the white and black rectangles. The result of the first row is 90%; compared with the single Gaussian model used to describe hand color in HSV color space in the third row, where the recognition rate is 93%, it is slightly lower, although both proposed systems used the AdaBoost algorithm to speed up the system and classification.
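To make the rectangle-difference computation concrete, the following sketch evaluates a two-rectangle Haar-like feature with an integral image. It is an illustrative NumPy implementation written for this review, not code from [49] or [51]; the function names and window sizes are our own.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column so any rectangle sum is O(1)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of pixels in the h x w rectangle with top-left corner (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_two_rect(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: left half minus right half.
    A positive response means the left half is brighter than the right."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)

# Toy usage on a synthetic vertical edge (bright left, dark right).
img = np.zeros((24, 24), dtype=np.uint8)
img[:, :12] = 255
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 24, 24))  # large positive response at the edge
```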
2.2.3. Motion-Based Recognition
Motion-based recognition can be utilized for detection purposes; the object can be extracted through a series of image frames. The AdaBoost algorithm, utilized for object detection, characterization, movement modeling and pattern recognition, is needed to recognize the gesture [16]. The main issue encountered in motion recognition arises when more than one gesture is active during the recognition process; a dynamic background also has a negative effect. In addition, the loss of a gesture may be caused by occlusion among tracked hand gestures or by errors in region extraction from the tracked gesture, and long distance affects the region appearance. Table 3 presents a set of research papers that used different segmentation techniques based on motion recognition to detect the ROI.
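A minimal sketch of extracting a moving object across a series of frames is given below, using simple frame differencing in OpenCV. It is illustrative only, not the pipeline of any surveyed paper; the threshold value and camera index are assumptions, and real systems layer tracking and classification on top of this step.

```python
import cv2

# Frame-difference sketch: the moving hand is segmented by thresholding the
# absolute difference between consecutive grayscale frames of a webcam stream.
cap = cv2.VideoCapture(0)  # default camera; index is an assumption
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev)                  # pixel-wise motion energy
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)     # close small gaps in the mask
    moving = cv2.bitwise_and(frame, frame, mask=mask)
    cv2.imshow("moving region", moving)
    prev = gray
    if cv2.waitKey(1) & 0xFF == 27:                 # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```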
Two stages for efficient hand detection were proposed in [54]. First, the hand is detected in each frame and the center point is used for tracking the hand. Then, in the second stage, a matching model is applied to each type of gesture using a set of features extracted from the motion tracking in order to provide better classification; the main drawback is that skin color is affected by lighting variations, which leads to the detection of non-skin color. A standard face detection algorithm and optical flow computation were used by [55] to give a user-centric coordinate frame in which motion features were used to recognize gestures for classification purposes using the multiclass boosting algorithm. A real-time dynamic hand gesture recognition system based on TOF was offered in [56], in which motion patterns were detected based on hand gestures received as input depth images. These motion patterns were compared with the hand motion classifications computed from the real dataset videos, which do not require the use of a segmentation algorithm. The system provides good results except for the depth range limitation of the TOF camera. In [57], YUV color space was used, with the help of the CAMShift algorithm, to distinguish between background and skin color, and the naïve Bayes classifier was implemented to assist with gesture recognition. The proposed system faces some challenges, such as illumination variation, where light changes affect the result of the skin segment. Another challenge is the degree of gesture freedom, which directly affects the output result when the rotation changes. Next is the hand position capture problem: if the hand appears in the corner of the frame, the dots which must cover the hand do not lie on the hand, which may lead to failure in capturing the user gesture. In addition, hand size differs considerably between humans and may cause a problem for the interaction system. However, the major remaining challenge is skin-like color, which affects the overall system and can abort the result. Figure 8 gives a simple example of hand motion recognition.
Figure 8. Example of motion recognition using frame difference subtraction to extract the hand feature, where a moving object such as the hand is extracted from the fixed background.
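The CAMShift tracker used in [57] is available directly in OpenCV as cv2.CamShift. The sketch below follows the common hue back-projection recipe rather than the exact YUV plus naïve Bayes pipeline of [57], so the initial window coordinates and histogram settings are assumptions made for illustration.

```python
import cv2
import numpy as np

# Hedged sketch of CAMShift-style hand tracking via hue back-projection.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
x, y, w, h = 200, 150, 100, 100          # initial hand window (assumed position)
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# Histogram of hue values inside the initial window models the skin color.
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # CamShift shifts and resizes the window toward the skin-probability mode.
    box, window = cv2.CamShift(back, window, term)
    pts = cv2.boxPoints(box).astype(np.intp)
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow("CamShift", frame)
    if cv2.waitKey(1) & 0xFF == 27:      # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```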
Table 3. A set of research papers that have used motion-based detection for hand gesture application.

| Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Invariant Factor | Distance from Camera |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [54] | off-the-shelf cameras | – | RGB, HSV, Y–Cb–Cr & motion tracking | hand gesture | histogram distribution model | 97.33% | 10 gestures | human–computer interface | data set created by author | other objects moving and background issues | – |
| [55] | Canon GL2 camera | 720 × 480 pixels | face detection & optical flow | motion gesture | leave-one-out cross-validation | – | 7 gestures | gesture recognition system | data set created by author | – | – |
| [56] | time of flight (TOF) SR4000 | 176 × 144 pixels | depth information, motion patterns | motion gesture | motion patterns compared | 95% | 26 gestures | interaction with virtual environments | cardinal directions dataset | depth range limitation | 3000 mm |
| [57] | digital camera | – | YUV & CAMShift algorithm | hand gesture | naïve Bayes classifier | high | unlimited | human and machine system | data set created by author | changed illumination, rotation problem, position problem | – |

–: none.
According to the information mentioned in Table 3, the recognition rate of the system in the first row is 97%, where the hybrid system based on skin detection and motion detection is more reliable for gesture recognition; the moving hand can be tracked using multiple track candidates that depend on a standard deviation calculation for both the skin and motion approaches. Every single gesture is encoded as a chain-code in order to model it, which is a simple model compared with an (HMM), and gestures are classified using a model of the histogram distribution. The proposed system in the third row uses a depth camera based on (TOF), where the motion pattern of the arm model for a human is utilized to define motion patterns; the authors confirm that using the depth information for hand trajectory estimation improves the gesture recognition rate. Moreover, the proposed system has no need for a segmentation algorithm; the system was examined using 2D and 2.5D approaches, where 2.5D performs better than 2D and gives a recognition rate of 95%.
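To illustrate the chain-code idea mentioned above, the sketch below converts a tracked hand trajectory into an 8-direction Freeman chain code. It is a generic example of the encoding written for this review, not the implementation used in [54]; the trajectory points are made up and use math-convention axes (y up).

```python
import math

# 8-direction Freeman chain code for a tracked hand trajectory.
# Direction 0 = east, numbering counter-clockwise in 45-degree steps.
def chain_code(points):
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)          # radians, CCW from east
        codes.append(round(angle / (math.pi / 4)) % 8)
    return codes

# A hand moving right, then diagonally up-right, then straight up:
trajectory = [(0, 0), (5, 0), (10, 5), (10, 12)]
print(chain_code(trajectory))  # [0, 1, 2]
```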
2.2.4. Skeleton-Based Recognition

The skeleton-based recognition specifies model parameters, which can improve the detection of complex features [16]. Various representations of skeleton data for the hand model can be used for classification; they describe geometric attributes and constraints and easily translate features and correlations of the data, in order to focus on geometric and statistical features. The most common features used are the joint orientation, the space between joints, the skeletal joint location, the degree of angle between joints and the trajectories and curvature of the joints. Table 4 presents a set of research papers that use different segmentation techniques based on skeletal recognition to detect the ROI.
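Two of the skeletal features named above, inter-joint distance and the angle at a joint, can be computed directly from joint coordinates. The sketch below is illustrative only, with made-up joint positions rather than output from any particular skeleton tracker.

```python
import numpy as np

# Inter-joint Euclidean distance and the angle at a middle joint
# (e.g., a finger joint between its base and tip). Coordinates are assumed.
def joint_distance(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by segments b->a and b->c."""
    u = np.asarray(a) - np.asarray(b)
    v = np.asarray(c) - np.asarray(b)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

base, middle, tip = (0.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 6.0, 2.0)
print(joint_distance(base, middle))   # 4.0
print(joint_angle(base, middle, tip)) # 135.0 for this bent finger
```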
Hand segmentation using the depth sensor of the Kinect camera, followed by location of the fingertips using 3D connections, Euclidean distance and geodesic distance over hand skeleton pixels to provide increased accuracy, was proposed in [58]. A new 3D hand gesture recognition approach based on a deep learning model using parallel convolutional neural networks (CNN) to process hand skeleton joints' positions was introduced in [59]; the proposed system has a limitation in that it works only with complete sequences. The optimal viewpoint was estimated and the point cloud of the gesture transformed using a curve skeleton to specify topology, then Laplacian-based contraction was applied to specify the skeleton points in [60], where the Hungarian algorithm was applied to calculate the match scores of the skeleton point set; however, the joint tracking information acquired by Kinect is not accurate enough, which gives a result with constant vibration. A novel method based on skeletal features extracted from RGB recorded video of sign language, which
                                                                                      presents           acquiredto
                                                                                                   difficulties      byextracting
                                                                                                                         Kinect is notaccurate
         accurate enough which give a result with constant vibration. A novel method based on skeletal
skeletal data because of occlusions, was offered in [61]. A dynamic hand gesture using depth and
         features extracted from RGB recorded video of sign language, which presents difficulties to extracting
skeletal accurate
          dataset skeletal
                     for a skeleton-based           approach was
                              data because of occlusions,       wasoffered
                                                                       presentedin [61].inA[62],  where
                                                                                             dynamic    handsupervised
                                                                                                               gesture using learning
                                                                                                                                 depth (SVM)
used forandclassification
              skeletal datasetwith for a  linear kernel.approach
                                       a skeleton-based      Anotherwas    dynamic
                                                                               presented   hand   gesture
                                                                                             in [62], where recognition         using Kinect
                                                                                                               supervised learning
sensor depth      metadata
         (SVM) used               for acquisition
                        for classification              and segmentation
                                               with a linear                        which used
                                                               kernel. Another dynamic          hand to    extract
                                                                                                       gesture         orientation
                                                                                                                  recognition    usingfeature,
         Kinect   sensor   depth    metadata     for  acquisition  and    segmentation
where the support vector machine (SVM) algorithm and HMM was utilized for classification and which    used   to  extract  orientation
         feature, where the support vector machine (SVM) algorithm and HMM was utilized for classification
recognition    to evaluate system performance where the SVM bring a good result than HMM in some
         and recognition to evaluate system performance where the SVM bring a good result than HMM in
specification    such elapsed time, average recognition rate, was proposed in [63]. A hybrid method
         some specification such elapsed time, average recognition rate, was proposed in [63]. A hybrid
for handmethod
            segmentation         based on depth
                   for hand segmentation         basedand     colorand
                                                        on depth      data
                                                                         coloracquired       by the
                                                                                 data acquired     byKinect
                                                                                                       the Kinectsensor
                                                                                                                     sensor with
                                                                                                                              withthe
                                                                                                                                    the help of
skeletal help
          dataofwere    proposed
                   skeletal  data were  in [64].    In this
                                             proposed       method,
                                                        in [64].  In thisthe    imagethe
                                                                           method,         threshold    is applied
                                                                                             image threshold            to the to
                                                                                                                   is applied    depth
                                                                                                                                    the frame
         depth   frame   and    the  super-pixel     segmentation     method     is used    to extract
and the super-pixel segmentation method is used to extract the hand from the color frame, then the       the  hand    from   the  color
         frame,
two results   arethen   the two for
                    combined        results   are combined
                                         robust                for robust
                                                    segmentation.      Figure segmentation.
                                                                                   9 show anFigure example9 show on an   example
                                                                                                                     skeleton        on
                                                                                                                                  recognition.
          skeleton recognition.
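The skeletal features named above (inter-joint distances and angles between bone segments) combined with a linear-kernel SVM, as in [62], can be sketched in a few lines. The snippet below is a minimal illustration, not code from any cited paper: it assumes each hand skeleton arrives as an (N, 3) array of joint coordinates (e.g., from a Kinect or RealSense SDK), and all function names are our own.

```python
# Minimal sketch: skeleton feature vectors + linear-kernel SVM classification.
import numpy as np
from sklearn.svm import SVC

def skeleton_features(joints: np.ndarray) -> np.ndarray:
    """Build a feature vector from inter-joint distances and joint angles.
    joints: (N, 3) array of (x, y, z) positions for one hand skeleton."""
    # Pairwise distances between all joints (upper triangle only).
    diffs = joints[:, None, :] - joints[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(joints), k=1)
    dist_feats = dists[iu]

    # Angle at each interior joint between its two adjacent segments,
    # assuming joints are ordered along a skeletal chain.
    v1 = joints[1:-1] - joints[:-2]
    v2 = joints[2:] - joints[1:-1]
    cos = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-8)
    angle_feats = np.arccos(np.clip(cos, -1.0, 1.0))

    return np.concatenate([dist_feats, angle_feats])

def train_gesture_svm(skeletons, labels):
    """Supervised learning with a linear kernel, in the spirit of [62]."""
    X = np.stack([skeleton_features(s) for s in skeletons])
    clf = SVC(kernel="linear")
    clf.fit(X, labels)
    return clf
```

For dynamic gestures, per-frame feature vectors of this kind would typically be concatenated or pooled over a temporal window before classification.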
Figure 9. Example of skeleton recognition using depth and skeleton dataset to represent the hand skeleton model [62].
                                        Table 4. Set of research papers that have used skeleton-based recognition for hand gesture application.
                                              Techniques/                                                                                                                                 Distance
                   Type of                                         Feature             Classify          Recognition     No. of       Application                         Invariant
     Author                     Resolution    Methods for                                                                                              Dataset Type                        from
                   Camera                                        Extract Type         Algorithm             Rate        Gestures        Area                                Factor
                                             Segmentation                                                                                                                                 Camera
                                               Euclidean                                                                                real time
                    Kinect
                                512 × 424      distance &                           skeleton pixels                        hand           hand
       [58]        camera                                          fingertip                                  –                                              –                –              –
                                  pixels        geodesic                               extracted                         tracking       tracking
                 depth sensor
                                                distance                                                                                 method
                                                                                                                                                          Dynamic
                  Intel Real                                    hand-skeletal        convolutional                                                                      only works on
                                                                                                           91.28%      14 gestures    classification       Hand
       [59]      Sense depth        –         skeleton data        joints’          neural network                                                                         complete          –
                                                                                                           84.35%      28 gestures       method         Gesture-14/28
                   camera                                         positions             (CNN)                                                                             sequences
                                                                                                                                                       (DHG) dataset
                                                                                                                                                         ChaLearn          HGR less
                                                                                                                                      hand gesture
                   Kinect       240 × 320    Laplacian-based    skeleton points       Hungarian                                                           Gesture       performance in
       [60]                                                                                                 80%        12 gestures     recognition                                           –
                   camera         pixels       contraction          clouds            algorithm                                                           Dataset       the viewpoint
                                                                                                                                         method
                                                                                                                                                        (CGD2011)        0◦condition
                                                                                                                                                                        difficulties in
                  RGB video                   vision-based      hand and body           skeleton                                          sign                           extracting
       [61]        sequence         –          approach &          skeletal          classification           –        hand gesture    language        LSA64 dataset    skeletal data        –
                   recorded                   skeletal data        features            network                                        recognition                        because of
                                                                                                                                                                         occlusions
                                                                                      supervised
                                                                                                                                                       Create SHREC
                  Intel Real                                                      learning classifier
                                640 × 480      depth and                                                   88.24%      14 gestures    hand gesture     2017 track “3D
       [62]      Sense depth                                    hand gesture        support vector                                                                            –              –
                                  pixels     skeletal dataset                                              81.90%      28 gestures     application     Hand Skeletal
                   camera                                                          machine (SVM)
                                                                                                                                                          Dataset
                                                                                  with a linear kernel
                                                                                                                                         Arabic                               low
                  Kinect v2
                                512 × 424                       dynamic hand                                            10 gesture      numbers         author own       recognition
       [63]        camera                    depth metadata                              SVM               95.42%                                                                            –
                                  pixels                           gesture                                              26 gesture    (0–9) letters       dataset       rate, “O”, “T”
                   sensor
                                                                                                                                          (26)                             and “2”
                 Kinect RGB                                                                                                            Malaysian
       [64]       camera &      640 × 480     skeleton data       hand blob                –                  –        hand gesture       sign               –                –              –
                 depth sensor                                                                                                          language
                                                                                         Table footer: –: none.
     According to the information presented in Table 4, depth cameras provide good segmentation accuracy because they are not affected by lighting variations or cluttered backgrounds; the main limitation is the detection range. The Kinect V1 sensor has an embedded system that returns the information received by the depth sensor as metadata describing the coordinates of human body joints. The Kinect V1 can track up to 20 skeletal joints, which helps to model the hand skeleton, whereas the Kinect V2 sensor can track 25 joints on up to six people simultaneously with full joint tracking, over a detection range of 0.5–4.5 m.
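Because the hand is usually the object nearest the sensor, a band of this depth range can be turned directly into a segmentation mask. The sketch below illustrates the depth-threshold idea behind several entries in Table 5, including the separate-areas finger count of [69]; the band limits and minimum blob area are illustrative values, not taken from the cited papers.

```python
# Minimal sketch: depth-band hand segmentation and separate-area counting.
import cv2
import numpy as np

NEAR_MM, FAR_MM = 500, 800   # assumed hand-to-camera band, cf. Table 5

def segment_hand(depth_mm: np.ndarray) -> np.ndarray:
    """Keep only pixels inside the depth band; depth_mm is a 16-bit
    depth frame in millimetres. Returns a binary uint8 mask."""
    mask = ((depth_mm >= NEAR_MM) & (depth_mm <= FAR_MM)).astype(np.uint8) * 255
    # Morphological opening removes small speckle noise from the mask.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

def count_separate_areas(mask: np.ndarray, min_area: int = 200) -> int:
    """Count connected components above a minimum area, the idea used
    in [69] to derive a finger count from separated regions."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    # Label 0 is the background; filter the remaining blobs by area.
    return sum(1 for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area)
```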
Figure 10. Depth-based recognition: (a) hand joint distance from camera; (b) different feature extraction using Kinect depth sensor.
Table 5. Set of research papers that have used depth-based detection for hand gesture and finger counting application.

| Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Invariant Factor | Distance from Camera |
|---|---|---|---|---|---|---|---|---|---|---|
| [65] | Kinect V1 | RGB 640 × 480; depth 320 × 240 | threshold & near-convex shape | finger gesture | finger–earth mover's distance (FEMD) | 93.9% | 10 gestures | human–computer interaction (HCI) | – | – |
| [68] | Kinect V2 | RGB 1920 × 1080; depth 512 × 424 | local neighbor method & threshold segmentation | fingertip | convex hull detection algorithm | 96% | 6 gestures | natural human–robot interaction | – | (500–2000) mm |
| [69] | Kinect V2 | infrared sensor; depth 512 × 424 | operation on depth and infrared images | finger counting & hand gesture | number of separate areas | – | finger count & two-hand gestures | mouse-movement controlling | – | <500 mm |
| [70] | Kinect V1 | RGB 640 × 480; depth 320 × 240 | depth thresholds | finger gesture | finger counting classifier & name collection & vector matching | 84% one hand; 90% two hands | 9 hand gestures | chatting with speech | – | (500–800) mm |
| [71] | Kinect V1 | RGB 640 × 480; depth 320 × 240 | frame difference algorithm | hand gesture | automatic state machine (ASM) | 94% | hand gesture | human–computer interaction | – | – |
| [72] | Kinect V1 | RGB 640 × 480; depth 320 × 240 | skin & motion detection & Hu moments and orientation | hand gesture | discrete hidden Markov model (DHMM) | – | 10 gestures | human–computer interfacing | – | – |
Table 5. Cont.

| Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Invariant Factor | Distance from Camera |
|---|---|---|---|---|---|---|---|---|---|---|
| [14] | Kinect V1 | depth 640 × 480 | range of depth image | hand gestures 1–5 | kNN classifier & Euclidian distance | 88% | 5 gestures | electronic home appliances | – | (250–650) mm |
| [73] | Kinect V1 | depth 640 × 480 | distance method | hand gesture | – | – | hand gesture | human–computer interaction (HCI) | – | – |
| [74] | Kinect V1 | depth 640 × 480 | threshold range | hand gesture | – | – | hand gesture | hand rehabilitation system | – | (400–1500) mm |
| [75] | Kinect V2 | RGB 1920 × 1080; depth 512 × 424 | Otsu's global threshold | finger gesture | kNN classifier & Euclidian distance | 90% | finger count | human–computer interaction (HCI) | hand not identified if it is not connected with the boundary | (250–650) mm |
| [76] | Kinect V1 | RGB 640 × 480; depth 640 × 480 | depth-based data and RGB data together | finger gesture | distance from the device and shape-based matching | 91% | 6 gestures | finger mouse interface | – | (500–800) mm |
| [77] | Kinect V1 | depth 640 × 480 | depth threshold and K-curvature | finger counting | depth threshold and K-curvature | 73.7% | 5 gestures | picture selection application | fingertips detected even while the hand was moving or rotating | – |
| [78] | Kinect V1 | RGB 640 × 480; depth 320 × 240 | integration of the RGB and depth information | hand gesture | forward recursion & SURF | 90% | hand gesture | virtual environment | – | – |
| [79] | Kinect V2 | depth 512 × 424 | skeletal data stream & depth & color data streams | hand gesture | support vector machine (SVM) & artificial neural networks (ANN) | 93.4% for SVM; 98.2% for ANN | 24 alphabets hand gesture | American Sign Language | – | (500–800) mm |

–: none.
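The K-curvature fingertip detector used with a depth threshold in [77] measures, at each contour point, the angle between the vectors pointing to the points k steps behind and k steps ahead along the contour; a sharp angle marks a fingertip candidate. A minimal sketch, with illustrative values for k and the angle cutoff:

```python
# Minimal sketch: K-curvature fingertip detection on a hand contour.
import numpy as np

def k_curvature_fingertips(contour: np.ndarray, k: int = 15,
                           max_angle_deg: float = 60.0) -> list:
    """contour: (N, 2) array of hand boundary points (e.g., from
    cv2.findContours). Returns indices of fingertip candidates."""
    n = len(contour)
    tips = []
    for i in range(n):
        p_prev = contour[(i - k) % n]   # point k steps behind
        p_next = contour[(i + k) % n]   # point k steps ahead
        v1 = p_prev - contour[i]
        v2 = p_next - contour[i]
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
        if angle < max_angle_deg:       # sharp corner => fingertip candidate
            tips.append(i)
    return tips
```

In practice, neighbouring candidate indices are clustered into one fingertip, and points near the wrist can be rejected using the depth mask.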
Figure 11. 3D hand model interaction with virtual system [83].
                                 Table 6. Set of research papers that have used 3D model-based recognition for HCI, VR and human behavior application.
| Author | Type of Camera | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Type of Error | Hardware Run | Application Area | Dataset Type | Runtime Speed |
|---|---|---|---|---|---|---|---|---|---|
| [80] | RGB camera | network directly predicts the control points in 3D | 3D hand poses, 6D object poses, object classes and action categories | PnP algorithm & single-shot neural network | fingertips 48.4 mm; object coordinates 23.7 mm | real-time speed of 25 fps on an NVIDIA Tesla M40 | framework for understanding human behavior through 3D hand and object interactions | First-Person Hand Action (FPHA) dataset | 25 fps |
| [81] | PrimeSense depth cameras | depth maps | 3D hand pose estimation & sphere model renderings | pose estimation neural network | mean joint error 12.6 mm (stack = 1); 12.3 mm (stack = 2) | – | hand pose estimation using a self-supervision method | NYU Hand Pose Dataset | – |
| [82] | RGB-D camera | single RGB image fed directly to the network | 3D hand shape and pose | networks trained with full supervision | mesh error 7.95 mm; pose error 8.03 mm | Nvidia GTX 1080 GPU | model for estimating 3D hand shape from a monocular RGB image | stereo hand pose tracking benchmark (STB) & Rendered Hand Pose Dataset (RHD) | 50 fps |
| [83] | Kinect V2 camera | segmentation mask from Kinect body tracker | hand | machine learning | marker error 5% on a subset of the frames in each sequence & pixel classification error | CPU only | interactions with virtual and augmented worlds | FingerPaint dataset & NYU dataset used for comparison | high frame rate |
| [84] | raw depth image | CNN-based hand segmentation | 3D hand pose regression pipeline | CNN-based algorithm | 3D joint location error 12.9 mm | Nvidia GeForce GTX 1080 Ti GPU | applications of virtual reality (VR) | dataset of 8000 original depth images created by the authors | – |
| [85] | Kinect V2 camera | bounding box around the hand & hand mask | hand | appearance and the kinematics of the hand | percentage of template vertices over all frames | – | interaction with deformable objects & tracking | synthetic dataset generated with the Blender modeling software | – |
| [86] | RGBD data from 3 Kinect devices | regression-based method & hierarchical feature extraction | 3D hand pose estimation | 3D hand pose estimation via semi-supervised learning | mean error 7.7 mm | NVIDIA TITAN Xp GPU | human–computer interaction (HCI), computer graphics and virtual/augmented reality | ICVL, MSRA & NYU datasets for evaluation | 58 fps |
Table 6. Cont.

| Author | Type of Camera | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Type of Error | Hardware Run | Application Area | Dataset Type | Runtime Speed |
|---|---|---|---|---|---|---|---|---|---|
| [87] | single depth images | depth image | 3D hand pose | 3D point cloud of the hand as network input, outputting heat-maps | mean error distances | Nvidia TITAN Xp GPU | (HCI), computer graphics and virtual/augmented reality | NYU, ICVL & MSRA datasets for evaluation | 41.8 fps |
| [88] | depth images | predicting heat maps of hand joints in detection-based methods | hand pose estimation | dense feature maps through intermediate supervision in a regression-based framework | mean error 6.68 mm; maximal per-joint error 8.73 mm | GeForce GTX 1080 Ti | (HCI), virtual and mixed reality | "HANDS 2017" challenge dataset & first-person hand action | – |
| [89] | RGB-D cameras | – | 3D hand pose estimation | weakly supervised method | mean error 0.6 mm | GeForce GTX 1080 GPU with CUDA 8.0 | (HCI), virtual and mixed reality | Rendered Hand Pose (RHD) dataset | – |

–: none.
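The "Type of Error" column in Tables 6 is dominated by the mean 3D joint distance between predicted and ground-truth poses, reported in millimetres. Assuming predictions and ground truth arrive as (frames, joints, 3) arrays, the metric, and the maximal per-joint variant reported in [88], reduces to a few lines; the array shapes are our assumption:

```python
# Minimal sketch: standard 3D hand pose evaluation metrics.
import numpy as np

def mean_joint_error_mm(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (num_frames, num_joints, 3) arrays in millimetres.
    Mean Euclidean joint distance over all joints and frames."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def max_per_joint_error_mm(pred: np.ndarray, gt: np.ndarray) -> float:
    """Largest per-joint mean error, the variant reported in [88]."""
    per_joint = np.mean(np.linalg.norm(pred - gt, axis=-1), axis=0)
    return float(per_joint.max())
```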
Figure 12. Simple example of a deep learning convolutional neural network architecture.
Table 7. Set of research papers that have used deep-learning-based recognition for hand gesture applications.

| Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Hardware Run |
|---|---|---|---|---|---|---|---|---|---|---|
| [90] | different mobile cameras | HD and 4K | feature extraction by CNN | hand gestures | Adapted Deep Convolutional Neural Network (ADCNN) | training set 100%; test set 99% | 7 hand gestures | (HCI) communication for people affected by stroke | created from recorded video frames | Core i7-6700 CPU @ 3.40 GHz |
| [91] | webcam | – | skin color detection and morphology & background subtraction | hand gestures | deep convolutional neural network (CNN) | training set 99.9%; test set 95.61% | 6 hand gestures | home appliance control (smart homes) | 4800 images collected for training and 300 for testing | – |
| [92] | RGB image | 640 × 480 pixels | no segmentation stage; image fed directly to CNN after resizing | hand gestures | deep convolutional neural network | 97.1% simple backgrounds; 85.3% complex backgrounds | 7 hand gestures | command consumer electronics devices such as mobile phones and TVs | Mantecón et al. dataset for direct testing | GPU with 1664 cores, base clock of 1050 MHz |
| [93] | Kinect | – | skin color modeling combined with convolutional neural network image features | hand gestures | convolutional neural network & support vector machine | 98.52% | 8 hand gestures | – | image information collected by Kinect | CPU E5-1620 v4, 3.50 GHz |
| [94] | Kinect | image size 200 × 200 | skin color (Y–Cb–Cr color space) & Gaussian mixture model | hand gestures | convolutional neural network | average 95.96% | 7 hand gestures | human hand gesture recognition system | image information collected by Kinect | – |
| [95] | recorded video sequences | – | semantic segmentation based on a deconvolution neural network | hand gesture motion | deep convolutional network (LRCN) | 95% | 9 hand gestures | intelligent vehicle applications | Cambridge gesture recognition dataset | Nvidia GeForce GTX 980 graphics |
| [96] | image | original images in the database 248 × 256 or 128 × 128 pixels | Canny operator edge detection | hand gesture | double-channel convolutional neural network (DC-CNN) & softmax classifier | 98.02% | 10 hand gestures | man–machine interaction | Jochen Triesch Database (JTD) & NAO Camera hand posture Database (NCD) | Core i5 processor |
| [97] | Kinect | – | – | skeleton-based hand gesture recognition | neural network based on SPD manifold learning | 85.39% | 14 hand gestures | – | Dynamic Hand Gesture (DHG) dataset & First-Person Hand Action (FPHA) dataset | non-optimized CPU 3.4 GHz |

–: none.
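Several of the pipelines in Table 7 segment the hand by skin color before classification (e.g., [91,93,94]). The sketch below shows Y–Cb–Cr thresholding with morphological clean-up in OpenCV; the threshold values are commonly used illustrative ranges that must be tuned per camera and lighting, not the values of any cited paper:

```python
import cv2
import numpy as np

def skin_mask_ycbcr(bgr: np.ndarray) -> np.ndarray:
    """Binary skin mask from fixed Cr/Cb thresholds, cleaned with morphology."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    # Classic skin range on (Y, Cr, Cb); illustrative values only.
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask

# Usage: keep only skin-colored pixels of a BGR frame.
# hand = cv2.bitwise_and(frame, frame, mask=skin_mask_ycbcr(frame))
```

Thresholding in Y–Cb–Cr rather than RGB is attractive because the chrominance channels (Cb, Cr) are largely separated from luminance, which makes the mask less sensitive to illumination changes, a point made by several of the skin-detection studies cited above.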
3.3. Application Areas of Hand Gesture Recognition Systems
Research into hand gestures has become an exciting and relevant field; it offers a means of natural interaction and reduces the cost of sensing hardware such as data gloves. Conventional interactive methods depend on different devices such as a mouse, keyboard, touch screen, joystick for gaming and consoles for machine control. The following sections describe some popular applications of hand gestures. Figure 13 shows the most common application areas that deal with hand gesture recognition techniques.
Figure 13. Most common application areas of hand gesture interaction systems (the image of Figure 13 is adapted from [12,14,42,76,83,98,99]).
Although this approach can identify a large number of gestures, it has some drawbacks in certain cases, such as missing gestures because of differences in accuracy between classification algorithms. In addition, it takes more time than the first approach because each input must be matched against the dataset, which is costly when the dataset is large. Moreover, a dataset of gestures built for one framework cannot readily be reused by other frameworks.
5. Conclusions
     Hand gesture recognition addresses a shortcoming of existing interaction systems. Controlling devices by hand is more natural, easier, more flexible and cheaper, and there is no need to fix problems caused by hardware devices, since none are required. The previous sections make clear that considerable effort must still be put into developing reliable and robust algorithms, supported by camera sensors whose characteristics suit the common challenges, in order to achieve reliable results. Each technique mentioned above, however, has its advantages and disadvantages and may perform well against some challenges while being inferior against others.
Author Contributions: Conceptualization, A.A.-N. and M.O.; funding acquisition, A.A.-N. and J.C.; investigation, M.O.; methodology, M.O. and A.A.-N.; project administration, A.A.-N. and J.C.; supervision, A.A.-N. and J.C.; writing–original draft, M.O.; writing–review and editing, M.O., A.A.-N. and J.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: The authors would like to thank the staff in Electrical Engineering Technical College, Middle
Technical University, Baghdad, Iraq and the participants for their support to conduct the experiments.
Conflicts of Interest: The authors of this manuscript have no conflicts of interest relevant to this work.
References
1.    Zhigang, F. Computer gesture input and its application in human computer interaction. Mini Micro Syst.
      1999, 6, 418–421.
2.    Mitra, S.; Acharya, T. Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2007,
      37, 311–324. [CrossRef]
3.    Ahuja, M.K.; Singh, A. Static vision based Hand Gesture recognition using principal component analysis.
      In Proceedings of the 2015 IEEE 3rd International Conference on MOOCs, Innovation and Technology in
      Education (MITE), Amritsar, India, 1–2 October 2015; pp. 402–406.
4.    Kramer, R.K.; Majidi, C.; Sahai, R.; Wood, R.J. Soft curvature sensors for joint angle proprioception.
      In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San
      Francisco, CA, USA, 25–30 September 2011; pp. 1919–1926.
5.    Jesperson, E.; Neuman, M.R. A thin film strain gauge angular displacement sensor for measuring finger joint
      angles. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and
      Biology Society, New Orleans, LA, USA, 4–7 November 1988; p. 807.
6.    Fujiwara, E.; dos Santos, M.F.M.; Suzuki, C.K. Flexible optical fiber bending transducer for application in
      glove-based sensors. IEEE Sens. J. 2014, 14, 3631–3636. [CrossRef]
7.    Shrote, S.B.; Deshpande, M.; Deshmukh, P.; Mathapati, S. Assistive Translator for Deaf & Dumb People. Int. J.
      Electron. Commun. Comput. Eng. 2014, 5, 86–89.
8.    Gupta, H.P.; Chudgar, H.S.; Mukherjee, S.; Dutta, T.; Sharma, K. A continuous hand gestures recognition
      technique for human-machine interaction using accelerometer and gyroscope sensors. IEEE Sens. J. 2016, 16,
      6425–6432. [CrossRef]
9.    Lamberti, L.; Camastra, F. Real-time hand gesture recognition using a color glove. In Proceedings of
      the International Conference on Image Analysis and Processing, Ravenna, Italy, 14–16 September 2011;
      pp. 365–373.
10.   Wachs, J.P.; Kölsch, M.; Stern, H.; Edan, Y. Vision-based hand-gesture applications. Commun. ACM 2011, 54,
      60–71. [CrossRef]
11.   Pansare, J.R.; Gawande, S.H.; Ingle, M. Real-time static hand gesture recognition for American Sign Language
      (ASL) in complex background. JSIP 2012, 3, 22132. [CrossRef]
12.   Van den Bergh, M.; Carton, D.; De Nijs, R.; Mitsou, N.; Landsiedel, C.; Kuehnlenz, K.; Wollherr, D.;
      Van Gool, L.; Buss, M. Real-time 3D hand gesture interaction with a robot for understanding directions from
      humans. In Proceedings of the 2011 Ro-Man, Atlanta, GA, USA, 31 July–3 August 2011; pp. 357–362.
13.   Wang, R.Y.; Popović, J. Real-time hand-tracking with a color glove. ACM Trans. Graph. 2009, 28, 1–8.
14.   Desai, S.; Desai, A. Human Computer Interaction through hand gestures for home automation using
      Microsoft Kinect. In Proceedings of the International Conference on Communication and Networks, Xi’an,
      China, 10–12 October 2017; pp. 19–29.
15.   Rajesh, R.J.; Nagarjunan, D.; Arunachalam, R.M.; Aarthi, R. Distance Transform Based Hand Gestures
      Recognition for PowerPoint Presentation Navigation. Adv. Comput. 2012, 3, 41.
16.   Kaur, H.; Rani, J. A review: Study of various techniques of Hand gesture recognition. In Proceedings of
      the 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems
      (ICPEICES), Delhi, India, 4–6 July 2016; pp. 1–5.
17.   Murthy, G.R.S.; Jadon, R.S. A review of vision based hand gestures recognition. Int. J. Inf. Technol. Knowl.
      Manag. 2009, 2, 405–410.
18.   Khan, R.Z.; Ibraheem, N.A. Hand gesture recognition: A literature review. Int. J. Artif. Intell. Appl. 2012, 3,
      161. [CrossRef]
19.   Suriya, R.; Vijayachamundeeswari, V. A survey on hand gesture recognition for simple mouse control.
      In Proceedings of the International Conference on Information Communication and Embedded Systems
      (ICICES2014), Chennai, India, 27–28 February 2014; pp. 1–5.
20.   Sonkusare, J.S.; Chopade, N.B.; Sor, R.; Tade, S.L. A review on hand gesture recognition system. In Proceedings
      of the 2015 International Conference on Computing Communication Control and Automation, Pune, India,
      26–27 February 2015; pp. 790–794.
21.   Garg, P.; Aggarwal, N.; Sofat, S. Vision based hand gesture recognition. World Acad. Sci. Eng. Technol. 2009,
      49, 972–977.
22.   Dipietro, L.; Sabatini, A.M.; Member, S.; Dario, P. A survey of glove-based systems and their applications.
      IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2008, 38, 461–482. [CrossRef]
23.   LaViola, J. A survey of hand posture and gesture recognition techniques and technology. Brown Univ. Provid.
      RI 1999, 29. Technical Report no. CS-99-11.
24.   Ibraheem, N.A.; Khan, R.Z. Survey on various gesture recognition technologies and techniques. Int. J.
      Comput. Appl. 2012, 50, 38–44.
25.   Hasan, M.M.; Mishra, P.K. Hand gesture modeling and recognition using geometric features: A review.
      Can. J. Image Process. Comput. Vis. 2012, 3, 12–26.
26.   Shaik, K.B.; Ganesan, P.; Kalist, V.; Sathish, B.S.; Jenitha, J.M.M. Comparative study of skin color detection
      and segmentation in HSV and YCbCr color space. Procedia Comput. Sci. 2015, 57, 41–48. [CrossRef]
27.   Ganesan, P.; Rajini, V. YIQ color space based satellite image segmentation using modified FCM clustering
      and histogram equalization. In Proceedings of the 2014 International Conference on Advances in Electrical
      Engineering (ICAEE), Vellore, India, 9–11 January 2014; pp. 1–5.
28.   Brand, J.; Mason, J.S. A comparative assessment of three approaches to pixel-level human skin-detection.
      In Proceedings of the 15th International Conference on Pattern Recognition. ICPR-2000, Barcelona, Spain,
      3–7 September 2000; Volume 1, pp. 1056–1059.
29.   Jones, M.J.; Rehg, J.M. Statistical color models with application to skin detection. Int. J. Comput. Vis. 2002, 46,
      81–96. [CrossRef]
30.   Brown, D.A.; Craw, I.; Lewthwaite, J. A som based approach to skin detection with application in real time
      systems. BMVC 2001, 1, 491–500.
31.   Zarit, B.D.; Super, B.J.; Quek, F.K.H. Comparison of five color models in skin pixel classification. In Proceedings
      of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time
      Systems, In Conjunction with ICCV’99 (Cat. No. PR00378). Corfu, Greece, 26–27 September 1999; pp. 58–63.
32.   Albiol, A.; Torres, L.; Delp, E.J. Optimum color spaces for skin detection. In Proceedings of the 2001
      International Conference on Image Processing (Cat. No. 01CH37205), Thessaloniki, Greece, 7–10 October
      2001; Volume 1, pp. 122–124.
33.   Sigal, L.; Sclaroff, S.; Athitsos, V. Estimation and prediction of evolving color distributions for skin
      segmentation under varying illumination. In Proceedings of the IEEE Conference on Computer Vision and
      Pattern Recognition, CVPR 2000 (Cat. No. PR00662). Hilton Head Island, SC, USA, 15 June 2000; Volume 2,
      pp. 152–159.
34.   Chai, D.; Bouzerdoum, A. A Bayesian approach to skin color classification in YCbCr color space.
      In Proceedings of the 2000 TENCON Proceedings. Intelligent Systems and Technologies for the New
      Millennium (Cat. No. 00CH37119), Kuala Lumpur, Malaysia, 24–27 September 2000; Volume 2, pp. 421–424.
35.   Menser, B.; Wien, M. Segmentation and tracking of facial regions in color image sequences. In Proceedings
      of the Visual Communications and Image Processing 2000, Perth, Australia, 20–23 June 2000; Volume 4067,
      pp. 731–740.
36.   Kakumanu, P.; Makrogiannis, S.; Bourbakis, N. A survey of skin-color modeling and detection methods.
      Pattern Recognit. 2007, 40, 1106–1122. [CrossRef]
37.   Perimal, M.; Basah, S.N.; Safar, M.J.A.; Yazid, H. Hand-Gesture Recognition-Algorithm based on Finger
      Counting. J. Telecommun. Electron. Comput. Eng. 2018, 10, 19–24.
38.   Sulyman, A.B.D.A.; Sharef, Z.T.; Faraj, K.H.A.; Aljawaryy, Z.A.; Malallah, F.L. Real-time numerical 0-5
      counting based on hand-finger gestures recognition. J. Theor. Appl. Inf. Technol. 2017, 95, 3105–3115.
39.   Choudhury, A.; Talukdar, A.K.; Sarma, K.K. A novel hand segmentation method for multiple-hand gesture
      recognition system under complex background. In Proceedings of the 2014 International Conference on
      Signal Processing and Integrated Networks (SPIN), Noida, India, 20–21 February 2014; pp. 136–140.
40.   Stergiopoulou, E.; Sgouropoulos, K.; Nikolaou, N.; Papamarkos, N.; Mitianoudis, N. Real time hand detection
      in a complex background. Eng. Appl. Artif. Intell. 2014, 35, 54–70. [CrossRef]
41.   Khandade, S.L.; Khot, S.T. MATLAB based gesture recognition. In Proceedings of the 2016 International
      Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–27 August 2016; Volume 1,
      pp. 1–4.
42.   Karabasi, M.; Bhatti, Z.; Shah, A. A model for real-time recognition and textual representation of malaysian
      sign language through image processing. In Proceedings of the 2013 International Conference on Advanced
      Computer Science Applications and Technologies, Kuching, Malaysia, 23–24 December 2013; pp. 195–200.
43.   Zeng, J.; Sun, Y.; Wang, F. A natural hand gesture system for intelligent human-computer interaction and
      medical assistance. In Proceedings of the 2012 Third Global Congress on Intelligent Systems, Wuhan, China,
      6–8 November 2012; pp. 382–385.
44.   Hsieh, C.-C.; Liou, D.-H.; Lee, D. A real time hand gesture recognition system using motion history image.
      In Proceedings of the 2010 2nd international conference on signal processing systems, Dalian, China, 5–7 July
      2010; Volume 2, pp. V2–394.
45.   Van den Bergh, M.; Koller-Meier, E.; Bosché, F.; Van Gool, L. Haarlet-based hand gesture recognition for 3D
      interaction. In Proceedings of the 2009 Workshop on Applications of Computer Vision (WACV), Snowbird,
      UT, USA, 7–8 December 2009; pp. 1–8.
46.   Van den Bergh, M.; Van Gool, L. Combining RGB and ToF cameras for real-time 3D hand gesture interaction.
      In Proceedings of the 2011 IEEE workshop on applications of computer vision (WACV), Kona, HI, USA,
      5–7 January 2011; pp. 66–72.
47.   Chen, L.; Wang, F.; Deng, H.; Ji, K. A survey on hand gesture recognition. In Proceedings of the 2013
      International conference on computer sciences and applications, Wuhan, China, 14–15 December 2013;
      pp. 313–316.
48.   Shimada, A.; Yamashita, T.; Taniguchi, R. Hand gesture based TV control system—Towards both user-&
      machine-friendly gesture applications. In Proceedings of the 19th Korea-Japan Joint Workshop on Frontiers
      of Computer Vision, Incheon, Korea, 30 January–1 February 2013; pp. 121–126.
49.   Chen, Q.; Georganas, N.D.; Petriu, E.M. Real-time vision-based hand gesture recognition using haar-like
      features. In Proceedings of the 2007 IEEE instrumentation & measurement technology conference IMTC
      2007, Warsaw, Poland, 1–3 May 2007; pp. 1–6.
50.   Kulkarni, V.S.; Lokhande, S.D. Appearance based recognition of american sign language using gesture
      segmentation. Int. J. Comput. Sci. Eng. 2010, 2, 560–565.
51.   Fang, Y.; Wang, K.; Cheng, J.; Lu, H. A real-time hand gesture recognition method. In Proceedings of the
      2007 IEEE International Conference on Multimedia and Expo, Beijing, China, 2–5 July 2007; pp. 995–998.
52.   Licsár, A.; Szirányi, T. User-adaptive hand gesture recognition system with interactive training. Image Vis.
      Comput. 2005, 23, 1102–1114. [CrossRef]
53.   Zhou, Y.; Jiang, G.; Lin, Y. A novel finger and hand pose estimation technique for real-time hand gesture
      recognition. Pattern Recognit. 2016, 49, 102–114. [CrossRef]
54.   Pun, C.-M.; Zhu, H.-M.; Feng, W. Real-time hand gesture recognition using motion tracking. Int. J. Comput.
      Intell. Syst. 2011, 4, 277–286. [CrossRef]
55.   Bayazit, M.; Couture-Beil, A.; Mori, G. Real-time Motion-based Gesture Recognition Using the GPU.
      In Proceedings of the MVA, Yokohama, Japan, 20–22 May 2009; pp. 9–12.
56.   Molina, J.; Pajuelo, J.A.; Martínez, J.M. Real-time motion-based hand gestures recognition from time-of-flight
      video. J. Signal Process. Syst. 2017, 86, 17–25. [CrossRef]
57.   Prakash, J.; Gautam, U.K. Hand Gesture Recognition. Int. J. Recent Technol. Eng. 2019, 7, 54–59.
58.   Xi, C.; Chen, J.; Zhao, C.; Pei, Q.; Liu, L. Real-time Hand Tracking Using Kinect. In Proceedings of the 2nd
      International Conference on Digital Signal Processing, Tokyo, Japan, 25–27 February 2018; pp. 37–42.
59.   Devineau, G.; Moutarde, F.; Xi, W.; Yang, J. Deep learning for hand gesture recognition on skeletal data.
      In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition
      (FG 2018), Xi’an, China, 15–19 May 2018; pp. 106–113.
60.   Jiang, F.; Wu, S.; Yang, G.; Zhao, D.; Kung, S.-Y. Viewpoint-independent hand gesture recognition with Kinect.
      Signal Image Video Process. 2014, 8, 163–172. [CrossRef]
61.   Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Sign language recognition based on hand and body skeletal
      data. In Proceedings of the 2018-3DTV-Conference: The True Vision-Capture, Transmission and Display of
      3D Video (3DTV-CON), Helsinki, Finland, 3–5 June 2018; pp. 1–4.
62.   De Smedt, Q.; Wannous, H.; Vandeborre, J.-P.; Guerry, J.; Saux, B.L.; Filliat, D. 3D hand gesture recognition
      using a depth and skeletal dataset: SHREC’17 track. In Proceedings of the Workshop on 3D Object Retrieval,
      Lyon, France, 23–24 April 2017; pp. 33–38.
63.   Chen, Y.; Luo, B.; Chen, Y.-L.; Liang, G.; Wu, X. A real-time dynamic hand gesture recognition system
      using kinect sensor. In Proceedings of the 2015 IEEE International Conference on Robotics and Biomimetics
      (ROBIO), Zhuhai, China, 6–9 December 2015; pp. 2026–2030.
64.   Karbasi, M.; Muhammad, Z.; Waqas, A.; Bhatti, Z.; Shah, A.; Koondhar, M.Y.; Brohi, I.A. A Hybrid Method
      Using Kinect Depth and Color Data Stream for Hand Blobs Segmentation; Science International: Lahore, Pakistan,
      2017; Volume 29, pp. 515–519.
65.   Ren, Z.; Meng, J.; Yuan, J. Depth camera based hand gesture recognition and its applications in
      human-computer-interaction. In Proceedings of the 2011 8th International Conference on Information,
      Communications & Signal Processing, Singapore, 13–16 December 2011; pp. 1–5.
66.   Dinh, D.-L.; Kim, J.T.; Kim, T.-S. Hand gesture recognition and interface via a depth imaging sensor for smart
      home appliances. Energy Procedia 2014, 62, 576–582. [CrossRef]
67.   Raheja, J.L.; Minhas, M.; Prashanth, D.; Shah, T.; Chaudhary, A. Robust gesture recognition using Kinect:
      A comparison between DTW and HMM. Optik 2015, 126, 1098–1104. [CrossRef]
68.   Ma, X.; Peng, J. Kinect sensor-based long-distance hand gesture recognition and fingertip detection with
      depth information. J. Sens. 2018, 2018, 5809769. [CrossRef]
69.   Kim, M.-S.; Lee, C.H. Hand Gesture Recognition for Kinect v2 Sensor in the Near Distance Where Depth
      Data Are Not Provided. Int. J. Softw. Eng. Its Appl. 2016, 10, 407–418. [CrossRef]
70.   Li, Y. Hand gesture recognition using Kinect. In Proceedings of the 2012 IEEE International Conference on
      Computer Science and Automation Engineering, Beijing, China, 22–24 June 2012; pp. 196–199.
71.   Song, L.; Hu, R.M.; Zhang, H.; Xiao, Y.L.; Gong, L.Y. Real-time 3d hand gesture detection from depth images.
      Adv. Mater. Res. 2013, 756, 4138–4142. [CrossRef]
72.   Pal, D.H.; Kakade, S.M. Dynamic hand gesture recognition using kinect sensor. In Proceedings of the 2016
      International Conference on Global Trends in Signal Processing, Information Computing and Communication
      (ICGTSPICC), Jalgaon, India, 22–24 December 2016; pp. 448–453.
73.   Karbasi, M.; Bhatti, Z.; Nooralishahi, P.; Shah, A.; Mazloomnezhad, S.M.R. Real-time hands detection in
      depth image by using distance with Kinect camera. Int. J. Internet Things 2015, 4, 1–6.
74.   Bakar, M.Z.A.; Samad, R.; Pebrianti, D.; Aan, N.L.Y. Real-time rotation invariant hand tracking using 3D data.
      In Proceedings of the 2014 IEEE International Conference on Control System, Computing and Engineering
      (ICCSCE 2014), Batu Ferringhi, Malaysia, 28–30 November 2014; pp. 490–495.
75.   Desai, S. Segmentation and Recognition of Fingers Using Microsoft Kinect. In Proceedings of the International
      Conference on Communication and Networks, Paris, France, 21–25 May 2017; pp. 45–53.
76.   Lee, U.; Tanaka, J. Finger identification and hand gesture recognition techniques for natural user interface.
      In Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, Bangalore, India,
      24–27 September 2013; pp. 274–279.
77.   Bakar, M.Z.A.; Samad, R.; Pebrianti, D.; Mustafa, M.; Abdullah, N.R.H. Finger application using K-Curvature
      method and Kinect sensor in real-time. In Proceedings of the 2015 International Symposium on Technology
      Management and Emerging Technologies (ISTMET), Langkawai Island, Malaysia, 25–27 August 2015;
      pp. 218–222.
78.   Tang, M. Recognizing Hand Gestures with Microsoft’s Kinect; Department of Electrical Engineering of Stanford
      University: Palo Alto, CA, USA, 2011.
79.   Bamwenda, J.; Özerdem, M.S. Recognition of Static Hand Gesture with Using ANN and SVM. Dicle Univ.
      J. Eng. 2019, 10, 561–568.
80.   Tekin, B.; Bogo, F.; Pollefeys, M. H+ O: Unified egocentric recognition of 3D hand-object poses and interactions.
      In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
      15–20 June 2019; pp. 4511–4520.
81.   Wan, C.; Probst, T.; Van Gool, L.; Yao, A. Self-supervised 3d hand pose estimation through training by fitting.
      In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
      15–20 June 2019; pp. 10853–10862.
82.   Ge, L.; Ren, Z.; Li, Y.; Xue, Z.; Wang, Y.; Cai, J.; Yuan, J. 3d hand shape and pose estimation from a single rgb
      image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach,
      CA, USA, 15–20 June 2019; pp. 10833–10842.
83.   Taylor, J.; Bordeaux, L.; Cashman, T.; Corish, B.; Keskin, C.; Sharp, T.; Soto, E.; Sweeney, D.; Valentin, J.;
      Luff, B. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and
      correspondences. ACM Trans. Graph. 2016, 35, 1–12. [CrossRef]
84.   Malik, J.; Elhayek, A.; Stricker, D. Structure-aware 3D hand pose regression from a single depth image.
      In Proceedings of the International Conference on Virtual Reality and Augmented Reality, London, UK,
      22–23 October 2018; pp. 3–17.
85.   Tsoli, A.; Argyros, A.A. Joint 3D tracking of a deformable object in interaction with a hand. In Proceedings of
      the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 484–500.
86.   Chen, Y.; Tu, Z.; Ge, L.; Zhang, D.; Chen, R.; Yuan, J. So-handnet: Self-organizing network for 3d hand pose
      estimation with semi-supervised learning. In Proceedings of the IEEE International Conference on Computer
      Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6961–6970.
87.   Ge, L.; Ren, Z.; Yuan, J. Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the
      European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 475–491.
88.   Wu, X.; Finnegan, D.; O’Neill, E.; Yang, Y.-L. Handmap: Robust hand pose estimation via intermediate
      dense guidance map supervision. In Proceedings of the European Conference on Computer Vision (ECCV),
      Munich, Germany, 8–14 September 2018; pp. 237–253.
89.   Cai, Y.; Ge, L.; Cai, J.; Yuan, J. Weakly-supervised 3d hand pose estimation from monocular rgb images.
      In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September
      2018; pp. 666–682.
90.   Alnaim, N.; Abbod, M.; Albar, A. Hand Gesture Recognition Using Convolutional Neural Network for
      People Who Have Experienced A Stroke. In Proceedings of the 2019 3rd International Symposium on
      Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 11–13 October 2019;
      pp. 1–6.
91.   Chung, H.; Chung, Y.; Tsai, W. An efficient hand gesture recognition system based on deep CNN.
      In Proceedings of the 2019 IEEE International Conference on Industrial Technology (ICIT), Melbourne,
      Australia, 13–15 February 2019; pp. 853–858.
92.   Bao, P.; Maqueda, A.I.; del-Blanco, C.R.; García, N. Tiny hand gesture recognition without localization via a
      deep convolutional network. IEEE Trans. Consum. Electron. 2017, 63, 251–257. [CrossRef]
93.   Li, G.; Tang, H.; Sun, Y.; Kong, J.; Jiang, G.; Jiang, D.; Tao, B.; Xu, S.; Liu, H. Hand gesture recognition based
      on convolution neural network. Cluster Comput. 2019, 22, 2719–2729. [CrossRef]
94.  Lin, H.-I.; Hsu, M.-H.; Chen, W.-K. Human hand gesture recognition using a convolution neural network.
     In Proceedings of the 2014 IEEE International Conference on Automation Science and Engineering (CASE),
     Taipei, Taiwan, 18–22 August 2014; pp. 1038–1043.
95. John, V.; Boyali, A.; Mita, S.; Imanishi, M.; Sanma, N. Deep learning-based fast hand gesture recognition using
     representative frames. In Proceedings of the 2016 International Conference on Digital Image Computing:
     Techniques and Applications (DICTA), Gold Coast, Australia, 30 November–2 December 2016; pp. 1–8.
96. Wu, X.Y. A hand gesture recognition algorithm based on DC-CNN. Multimed. Tools Appl. 2019, 1–13.
     [CrossRef]
97. Nguyen, X.S.; Brun, L.; Lézoray, O.; Bougleux, S. A neural network based on SPD manifold learning for
     skeleton-based hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and
     Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12036–12045.
98. Lee, D.-H.; Hong, K.-S. Game interface using hand gesture recognition. In Proceedings of the 5th International
     Conference on Computer Sciences and Convergence Information Technology, Seoul, Korea, 30 November–2
     December 2010; pp. 1092–1097.
99. Gallo, L.; Placitelli, A.P.; Ciampi, M. Controller-free exploration of medical image data: Experiencing the
     Kinect. In Proceedings of the 2011 24th international symposium on computer-based medical systems
     (CBMS), Bristol, UK, 27–30 June 2011; pp. 1–6.
100. Zhao, X.; Naguib, A.M.; Lee, S. Kinect based calling gesture recognition for taking order service of elderly
      care robot. In Proceedings of the 23rd IEEE international symposium on robot and human interactive
     communication, Edinburgh, UK, 25–29 August 2014; pp. 525–530.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).