Applying Hand Gesture Recognition For User Guide Application Using Mediapipe
Proceedings of the 2nd International Seminar of Science and Applied Technology (ISSAT 2021). Advances in Engineering Research, volume 207.
ABSTRACT
Hand gesture recognition has become important with the development of Industry 4.0 technology in Human-Computer Interaction (HCI): it gives computers the ability to capture and interpret hand gestures and to execute commands without the user physically touching a device. MediaPipe is a framework with built-in machine learning that offers a ready-made solution for hand gesture recognition. In this research, we develop a simple user guide application using the MediaPipe framework. A user guide is commonly known as technical documentation, or a manual, for a particular system that assists its users. It gives step-by-step descriptions for operating the system and helps reduce user frustration by giving users the means to identify, understand, and resolve technical problems that frequently occur by themselves. In our experiment, we captured real-time images using a Kinect, trained on a variety of hand gesture data, identified each hand gesture, and used the recognized gestures to convey the corresponding information in the user guide application. The user can access user guide information through the hand gestures that have been recognized. We propose hand gesture recognition with MediaPipe in our application to make the user guide more convenient to use and to turn a still-manual user guide into a more interactive application.
1. INTRODUCTION

We are now in the era of Industry 4.0, or the Fourth Industrial Revolution, which calls for automation and computerization realized by consolidating various physical and digital technologies such as sensors, embedded systems, Artificial Intelligence (AI), Cloud Computing, Big Data, Adaptive Robotics, Augmented Reality, Additive Manufacturing (AM), and the Internet of Things (IoT) [1]. Enhanced digital connectivity has made technology a crucial requirement in carrying out our daily activities such as doing tasks or work, shopping, communication, entertainment, and even searching for information or news [2]. Technology increasingly works through machines and advances in interaction that use a broad range of gestures to recognize, communicate, or interact.

A gesture is a form of non-verbal or non-vocal communication that uses body movement to convey a particular message; among the parts of the human body, the hand and face are the most commonly adopted [3]. Gesture-based interaction, introduced by Krueger in the mid-1970s as a new type of Human-Computer Interaction (HCI), has become an attractive area of research. In HCI, building application interfaces that let people communicate naturally with parts of the body has drawn great research attention, especially for the hands, which are the most effective alternative interaction tool considering their ability [4].

Through HCI, recognizing hand gestures can help achieve the desired ease and naturalness of interaction [5]. When we interact with other people, hand movements carry meaning and convey information, ranging from simple hand movements to more complex ones. For example, we can use a hand to point at something (an object or a person), or use different hand shapes and movements expressed through manual articulations, combined with their own grammar and lexicon, as in sign languages. Hence, using hand gestures as an input device integrated with computers can help people communicate more intuitively [5].

Currently, many machine learning frameworks and libraries for hand gesture recognition have been built to make it easier for anyone to develop AI (Artificial Intelligence) based applications. One of them is MediaPipe. The MediaPipe framework was introduced by Google to solve problems using machine learning, with solutions such as Face Detection, Face Mesh, Iris, Hands, Pose, Holistic, Hair Segmentation, Object Detection, Box Tracking, Instant Motion Tracking, Objectron, and KNIFT. MediaPipe helps developers focus on algorithm and model development for an application and then supports deployment with reproducible results across different devices and platforms, which are a few of the advantages of the framework [6].

In this paper, we focus on developing a manual user guide application, improving the application architecture by applying hand gesture recognition using the MediaPipe framework and a Kinect camera to capture the hand poses to be recognized. Using hand gesture recognition makes our user guide application more interactive.

2. RELATED WORK

2.1 Hand Gesture Recognition

Gesture recognition is an essential topic in computer science; it builds technology that aims to interpret human gestures so that anyone can use simple gestures to interact with a device without touching it directly. The entire procedure of tracking gestures, representing them, and converting them into purposeful commands is known as gesture recognition [7]. The aim of hand gesture recognition is to identify explicit hand gestures as input and map their representation to an output command for the device.

In the literature, recognition of hand gestures based on the extracted features is divided into three groups, as follows:

● High-Level Feature-Based Approaches: Aim to figure out the position of the palm and the joint angles, such as the fingertips, joint locations, or anchor points of the palm [8][9][10][11]. The gestures are then defined from these results with a set of rules and conditions on the vectors and joints of the hands [13]. However, collisions or occlusions in the image are difficult to detect after the features are extracted [12], and sensitivity to segmentation performance on 2D hand images [4] is a problem that occurs frequently.

● Low-Level Feature-Based Approaches: Use features that can be extracted quickly and are robust to noise. Ren [14] recognized hand shapes as cluster-based signatures using a novel distance metric called the Finger-Earth Mover's Distance. Starner [15] determined an elliptical bounding region of the hand to implement hand recognition based on principal axes. Yang [16] researched using the optical flow of the hand region as a low-level feature. Low-level feature-based approaches are not efficient against cluttered backgrounds [4].

● 3D Reconstruction-Based Approaches: Use 3D feature models to interpret the hand completely. Research in [17] showed that successfully segmenting the hand by skin color requires similarity and high contrast between the background and the hand, using structured light to obtain 3D depth data. Another work [18] used a stereo camera to track numerous interest points on the surface of the hand, which makes robust 3D reconstruction difficult to handle, even though 3D data contains valuable information that can help remove ambiguity. See [19][20] for more 3D reconstruction-based approaches.

From the literature, there are three hand gesture recognition methods, as follows:

● Machine Learning Approaches: The output comes from a stochastic process, with approaches based on statistical modeling of dynamic gestures such as PCA, HMMs [21][22][23][24], advanced particle filtering [26], and the condensation algorithm [25].

● Algorithmic Approaches: A collection of manually encoded conditions and constraints that define dynamic gestures. Galveia [27] applied a 3rd-degree polynomial equation to determine the dynamic component of hand gestures (creating the 3rd-degree polynomial equation, recognition, reducing the complexity of the equations, and comparison handling against a gesture library).

● Rule-Based Approaches: Suitable for both dynamic and static gestures; these contain a set of pre-encoded rules over the input features [4]. The features of an input gesture are extracted and compared with the encoded rules, which form the flow of
the recognized gestures. A gesture whose input features match a rule is approved and output as a known gesture [28].

2.2 MediaPipe Framework

Today, there are many machine learning frameworks and libraries for hand gesture recognition; one of them is MediaPipe. MediaPipe is a framework designed for implementing production-ready machine learning: it builds pipelines that perform inference over arbitrary sensory data, has published code accompanying its research work, and supports building technology prototypes [6]. In MediaPipe, modular graph components form a perception pipeline, providing inference models, media processing functions, and data transformations [29]. Graphs of operations are also used in other machine learning frameworks such as TensorFlow [30], MXNet [31], PyTorch [32], CNTK [33], and OpenCV 4.0 [34].

Using MediaPipe for hand gesture recognition was researched earlier by Zhang [35], who used a single RGB camera in a real-time AR/VR system that predicts the human hand skeleton; we can also combine MediaPipe with other devices. The MediaPipe pipeline shown in Figure 1 consists of the following components for hand gesture recognition [29][35][36]:

1. A palm detector model that processes the captured image and returns an oriented bounding box of the hand;

2. A hand landmark model that processes the cropped bounding box image and returns 3D hand key points;

3. A gesture recognizer that classifies the 3D hand key points into a discrete set of gestures.

2.2.1 Palm Detector Model

The MediaPipe framework provides a built-in initial palm detector called BlazePalm. Detecting the hand is a complex task, so the first step is to train a palm detector instead of a hand detector and to apply a non-maximum suppression algorithm to the palms, which are modeled with square bounding boxes to avoid other aspect ratios, reducing the number of anchors by a factor of 3-5. Next, an encoder-decoder feature extractor is used for larger scene-context awareness, even for small objects. Lastly, the focal loss is minimized during training to support the large number of anchors resulting from the high scale variance [35][36].

2.2.2 Hand Landmark

The hand landmark model in MediaPipe achieves precise localization of 21 key points with 3D hand-knuckle coordinates inside the detected hand regions, predicting the coordinates directly through regression [35][36]; see Figure 2.

Figure 2 Hand Landmark in MediaPipe [38]

Each hand-knuckle landmark has a coordinate composed of x, y, and z, where x and y are normalized to [0.0, 1.0] by the image width and height, while z represents the depth of the landmark, with the depth at the wrist taken as the origin. The closer a landmark is to the camera, the smaller its value becomes.
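To make the landmark output concrete, the sketch below uses the MediaPipe Hands solution API in Python to pull the 21 normalized (x, y) key points from an image, then applies a small rule-based classifier in the spirit of Section 2.1: a finger is treated as extended when its tip lies above its PIP joint (smaller normalized y, since image y grows downward). The finger-count rule and the `landmarks_from_image` helper are our own illustrative assumptions, not the classifier used in this paper.

```python
# Indices into MediaPipe's 21-point hand model (0 = wrist; 4, 8, 12, 16, 20 = fingertips).
FINGER_TIPS = [8, 12, 16, 20]   # index, middle, ring, pinky tips
FINGER_PIPS = [6, 10, 14, 18]   # the corresponding PIP joints


def count_extended_fingers(landmarks):
    """Count extended fingers from 21 (x, y) pairs normalized to [0.0, 1.0].

    Rule (an illustrative assumption): a finger is extended when its tip has a
    smaller y than its PIP joint. The thumb is ignored to keep the rule simple.
    """
    return sum(
        1
        for tip, pip in zip(FINGER_TIPS, FINGER_PIPS)
        if landmarks[tip][1] < landmarks[pip][1]
    )


def landmarks_from_image(image_bgr):
    """Run the MediaPipe palm detector + hand landmark pipeline on one BGR image.

    Returns the 21 normalized (x, y) pairs for the first detected hand, or None.
    """
    import cv2                 # imported lazily: only needed for live detection
    import mediapipe as mp

    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    return [(lm.x, lm.y) for lm in results.multi_hand_landmarks[0].landmark]
```

A user guide application would then map the returned finger count (or a richer rule set over the same landmarks) to the command that displays the matching help page.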
4. RESULT

To measure the performance of the model, we used a confusion matrix. In Python, the scikit-learn library can be used to build a confusion matrix. Experiment datasets were collected before we used them to predict the hand gestures. The confusion matrix was also used to observe the accuracy achieved by the resulting model.
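A minimal sketch of this evaluation step with scikit-learn, using three made-up gesture labels in place of the ten gestures from the experiment:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground-truth and predicted labels; in the experiment these were
# the ten hand gestures predicted by the MediaPipe-based recognizer.
y_true = ["open", "fist", "point", "open", "fist", "point"]
y_pred = ["open", "fist", "point", "open", "point", "point"]

labels = ["open", "fist", "point"]
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows = true, columns = predicted
acc = accuracy_score(y_true, y_pred)                  # fraction of correct predictions

print(cm)                       # the misclassified "fist" shows up off-diagonal
print(f"accuracy = {acc:.2f}")
```

Off-diagonal cells reveal which gestures are confused with which, which is exactly the per-gesture false-prediction behavior discussed below.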
Using MediaPipe to implement machine learning for hand gesture recognition in a user guide application achieved good performance; see Figure 12 and Figure 13. Figure 12 shows the predictions for the 10 varieties of hand gestures, and we can also see that a few hand gestures could yield false predictions. Figure 13 shows a validation accuracy of 95% for the recognized hand gestures. False predictions of hand gestures lowered the accuracy performance. They were caused by lighting, the distance between the Kinect capturing the picture and the user of the application, and the angle at which the Kinect was placed. Figure 14 shows images of the deployed user guide application using hand gestures as commands to display information.

5. CONCLUSION

Hand gesture recognition has come to play an important role in building efficient human-machine interaction, and its implementation promises wide-ranging uses in the technology industry. MediaPipe, as a machine learning framework, plays an effective role in developing this hand gesture recognition application, whose results show an accuracy performance of 95%. We would like to extend our system further to develop collaboration with other devices and other human body parts, and to experiment with both static and dynamic hand gesture recognition systems.

REFERENCES

[1] Ustundag A, Cevikcan E. Industry 4.0: Managing The Digital Transformation. Springer Series in Advanced Manufacturing, Switzerland. 2018. DOI: https://doi.org/10.1007/978-3-319-57870-5.

[2] Pantic M, Nijholt A, Pentland A, Huang TS. Human-Centred Intelligent Human-Computer Interaction (HCI2): How Far Are We From Attaining It? International Journal of Autonomous and Adaptive Communications Systems (IJAACS), vol. 1, no. 2, 2008, pp. 168-187. DOI: 10.1504/IJAACS.2008.019799.

[3] Hamed Al-Saedi AK, Hassin Al-Asadi AH. Survey of Hand Gesture Recognition Systems. IOP Conference Series: Journal of Physics: Conference Series 1294 042003. 2019. DOI: https://doi.org/10.1088/1742-6596/1294/4/042003.

[4] Ren Z, Meng J, Yuan J. Depth Camera Based Hand Gesture Recognition and its Application in Human-Computer-Interaction. In Proceedings of the 2011 8th International Conference on Information, Communication and Signal Processing (ICICS). Singapore. 2011.

[5] Rautaray SS, Agrawal A. Vision Based Hand Gesture Recognition for Human Computer Interaction: A Survey. Springer Artificial Intelligence Review. 2012. DOI: https://doi.org/10.1007/s10462-012-9356-9.

[6] Lugaresi C, Tang J, Nash H, McClanahan C, et al. MediaPipe: A Framework for Building Perception Pipelines. Google Research. 2019. https://arxiv.org/abs/1906.08172.

[7] Xu Z, et al. Hand Gesture Recognition and Virtual Game Control Based on 3D Accelerometer and EMG Sensors. In Proceedings of IUI '09, 2009, pp. 401-406.

[8] Chua C, Guan H, Ho Y. Model-Based 3D Hand Posture Estimation From a Single 2D Image. Image and Vision Computing vol. 20, 2002, pp. 191-202.

[9] Li Y. Hand Gesture Recognition Using Kinect. 2012.

[10] Panwar M. Hand Gesture Recognition Based on Shape Parameters. In International Conference on Computing, Communication and Applications (ICCCA), 2012.

[11] Maisto M, et al. An Accurate Algorithm for the Identification of Fingertips Using an RGB-D Camera. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2013, pp. 272-283.

[12] Holden E. Visual Recognition of Hand Motion. Ph.D. Thesis, Department of Computer Science, University of Western Australia. 1997.

[13] Cardoso T, Delgado J, Barata J. Hand Gesture Recognition towards Enhancing Accessibility. In 6th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion (DSAI). Procedia Computer Science vol. 67, 2015, pp. 419-429. DOI: https://doi.org/10.1016/j.procs.2015.09.287.

[14] Ren Z, et al. Robust Hand Gesture Recognition Based on Finger-Earth Mover's Distance with a Commodity Depth Camera. 2011.

[15] Starner T, Weaver J, Pentland A. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Trans. on PAMI vol. 20, 1998, pp. 1371-1375.

[16] Yang MH, Ahuja N, Tabb M. Extraction of 2D Motion Trajectories and its Application to Hand Gesture Recognition. IEEE Trans. on PAMI vol. 24, 2002, pp. 1061-1074.

[17] Bray M, Koller-Meier E, Van Gool L. Smart Particle Filtering for 3D Hand Tracking. In Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition. 2004.

[18] Dewaele G, Devernay F, Horaud R. Hand Motion from 3D Point Trajectories and a Smooth Surface Model. In Proceedings of the 8th ECCV. 2004.

[19] Stenger B. Model-Based 3D Tracking of an Articulated Hand. 2001.

[20] Keskin C. Real Time Hand Pose Estimation Using Depth Sensors. In IEEE International Conference on Computer Vision Workshops. 2011.

[21] Lee HK, Kim JH. An HMM-Based Threshold Model Approach for Gesture Recognition. IEEE Trans. on PAMI vol. 21, 1999, pp. 961-973.

[22] Wilson A, Bobick A. Parametric Hidden Markov Models for Gesture Recognition. IEEE Trans. on PAMI vol. 21, 1999, pp. 884-900.

[23] Wu X. An Intelligent Interactive System Based on Hand Gesture Recognition Algorithm and Kinect. In 5th International Symposium on Computational Intelligence and Design. 2012.

[24] Wang Y. Kinect Based Dynamic Hand Gesture Recognition Algorithm Research. In 4th International Conference on Intelligent Human-Machine Systems and Cybernetics. 2012.

[25] Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag. 2001.

[26] Kwok C, Fox D, Meila M. Real-Time Particle Filters. In Proceedings of the IEEE. 2004.

[30] Abadi M, Barham P, Chen J, et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USA, 2016. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

[31] Chen T, Li M, Li Y, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. 2015. https://arxiv.org/pdf/1512.01274.pdf.

[32] Paszke A, Gross S, Chintala S, et al. Automatic Differentiation in PyTorch. In 31st Conference on Neural Information Processing Systems (NIPS), USA, 2017.

[33] Seide F, Agarwal A. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. DOI: https://doi.org/10.1145/2939672.2945397.

[34] Matveev D. OpenCV Graph API. Intel Corporation. 2018.

[35] Zhang F, Bazarevsky V, Vakunov A, et al. MediaPipe Hands: On-Device Real-Time Hand Tracking. Google Research, USA. 2020. https://arxiv.org/pdf/2006.10214.pdf.

[36] MediaPipe: On-Device, Real-Time Hand Tracking. https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html. 2019. Accessed 2021.

[37] Grishchenko I, Bazarevsky V. MediaPipe Holistic: Simultaneous Face, Hand and Pose Prediction on Device. Google Research, USA. 2020. https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html. Accessed 2021.

[38] MediaPipe GitHub: https://google.github.io/mediapipe/solutions/hands. Accessed 2021.