Adaptive Hough Transform With Optimized Deep Learning Followed by Dynamic Time Warping For Hand Gesture Recognition
https://doi.org/10.1007/s11042-021-11469-9
Abstract
Hand gesture is a natural interaction method, and hand gesture recognition is familiar in human–computer interaction. Yet, the variations and the complexity of hand gestures, such as self-structural characteristics, views, and illuminations, make hand gesture recognition a challenging task. Nowadays, advances in the human–computer interaction area have driven interest in dynamic hand gesture segmentation as the basis of gesture recognition systems. Despite considerable practical success, dynamic hand gesture segmentation through webcam vision remains challenging due to light effects, partial occlusion, and complicated environments. Hence, to segment the entire hand gesture region and enhance the segmentation accuracy, this paper develops an improved segmentation and deep learning-based strategy for dynamic hand gesture recognition. The data is gathered from the ISL benchmark dataset, which consists of both static and dynamic images. The initial process of the proposed model is pre-processing, performed by grey scale conversion and histogram equalization. Further, the segmentation of gestures is done by the novel Adaptive Hough Transform (AHT), in which the theta angle is tuned. Once the segmentation of gestures is done, the optimized Deep Convolutional Neural Network (Deep CNN) is used for gesture recognition; its learning rate, epoch count, and hidden neurons are tuned by the same heuristic concept. As the main contribution, the segmentation and classification are enhanced by the hybridization of Electric Fish Optimization (EFO) and the Whale Optimization Algorithm (WOA), called the Electric Fish-based Whale Optimization Algorithm (E-WOA). The training of the optimized Deep CNN is handled by Dynamic Time Warping (DTW) to avoid redundant frames, thus enhancing the performance on dynamic hand gestures. Quantitative measurement is accomplished for evaluating hand gesture segmentation and recognition, which portrays the superior behaviour of the proposed model.
Keywords Dynamic and static hand gestures · Hand gesture recognition · Adaptive Hough transform · Deep convolutional neural network · Electric fish-based whale optimization algorithm · Dynamic time warping
* Manisha Kowdiki
manisha.kowdiki@mitwpu.edu.in
Extended author information available on the last page of the article
1 Introduction
Hand gestures act as a natural way of interacting with other people. Motivated by the sound and vision of human interaction, the utilization of hand gestures is among the most effective and powerful methods in Human–Computer Interaction (HCI). Hand gestures constitute a powerful inter-human communication modality; hence, they are regarded as a convenient and intuitive means of communication between machines and humans. This explains the research interest in the advancement and development of hand gesture technologies. The majority of the traditional solutions for gesture recognition are modeled for dynamic [19, 13] or static gestures [4, 11, 34]. There exist only a few solutions that handle both dynamic and static gestures [32] simultaneously. The majority of the present solutions handle two hands [24] or a single hand [29]; there exist no solutions for recognizing and tracking gestures from multiple hands. The outcomes are mostly reported as average recognition rates.
The DTW algorithm matches data sequences of distinct lengths, selecting and storing the minimum cumulative cost recursively [8, 23]. DTW is an effective classification technique, and most research in this field aims to enhance its accuracy and speed. Yet, the technique is limited for the following reasons. (1) The existing DTW technique works in an image-to-image manner when calculating the distances among samples, which lessens the generalization ability beyond the training samples. (2) In general, DTW techniques model objects as holistic time-series curves; local patterns are not examined alongside the global view, and this reliance on global features lessens flexibility [37].
Nowadays, deep learning approaches such as RNNs and CNNs have offered reasonable outcomes in computer vision research, and work is still ongoing to enhance their use in gesture recognition [39]. During gesture recognition, the main challenge faced by deep learning is the efficient representation of gesture movement. Deep learning-oriented approaches need a huge database for training the inputs efficiently enough to attain sufficient outcomes during testing. The entire processing is done inside the hidden layers; hence, researchers face a challenge in investigating the training process [12]. By contrast, the flexibility of DTW and its need for only a small database during training make it a good tool for the matching process, and it therefore acts as a flexible technique for analysing the extracted features. At present, accelerometer-oriented gesture recognition technologies depend on Dynamic Time Warping (DTW) [27, 40], Fuzzy Neural Networks (FNN) [5], Hidden Markov Models (HMM) [30], etc. Even though they offer some advantages, they are limited by spatial attitude variation of the mobile phone, accelerometer specifications, individual differences among users, and various other problems.
The major contributions of this paper are as follows:
• To recognize the gestures in the final step using the optimized Deep CNN, in which the enhancement is made by optimizing its learning rate, epoch count, and hidden neuron count using the proposed E-WOA, thus improving the recognition accuracy.
• To train the optimized Deep CNN using the DTW technique to avoid redundant frames and improve performance, and to compare the proposed E-WOA-Deep CNN with several optimization algorithms and machine learning algorithms to determine its superiority on both static and dynamic images.
The organization of the paper is as follows: Section 1 provides the introduction to the hand gesture recognition process. The literature related to hand gesture recognition methods is described in Section 2. The dataset description and pre-processing steps involved in static and dynamic hand gesture recognition are explained in Section 3. Section 4 portrays the improved gesture segmentation for static and dynamic hand gesture recognition. Section 5 describes the intelligent static and dynamic hand gesture recognition using deep learning with DTW. Section 6 provides the results and discussions. The paper is concluded in Section 7.
2 Literature Survey
2.1 Related Works
In 2016, Plouffe and Cretu [25] detailed the development of a natural gesture user interface based on depth data gathered using a Kinect sensor. The hand was assumed to be the nearest object. A novel algorithm enhanced the scanning time for recognizing the initial pixel, and a directional search algorithm permitted the identification of the complete hand contour. The fingertips were located using the k-curvature algorithm, and the gesture candidates were chosen by DTW. The observed gesture was compared with a prerecorded reference gesture series. This method exceeded the majority of the solutions in terms of static recognition and was comparable with respect to dynamic and static recognition of the sign language alphabet and familiar signs. The solution handled several hands inside the interest space. The evaluated areas comprised the natural control of a software interface and gesture and sign digit interpretation.
In 2016, Srivastava and Sinha [31] revealed a quaternion-oriented QDTW approach that characterized distinct hand/arm gestures and movements. The case study was related to the outdoor tennis game. A novel technique was developed for training several tennis shots; it also assessed the consistency and shots of a player. The accuracy was enhanced for both detection and classification. In 2018, Tang et al. [33] developed a Structured DTW technique for continuous hand trajectory recognition. An automatic continuous trajectory segmentation technique joined the velocity and template information for spotting the starting and finishing points. Distinct weights were assigned on the basis of the structured information. It was demonstrated on the Continuous Letter Trajectory (CLT) database and was robust in terms of diversity.
In 2016, Cheng et al. [6] defined an Image-to-Class DTW technique for recognizing both 3D hand trajectory gestures and 3D static hand gestures. The approach was twofold. Initially, it calculated the image-to-class DTW distance, by which better generalization ability was attained. In the second step, fingerlets were developed for the static gesture representation, and strokelets were proposed for the trajectory gesture representation. The DTW distance was calculated between a gesture category and a data sample. It was demonstrated on distinct 3D hand gesture datasets; the UESTC-HTG was gathered by means of a Kinect device. The recognition accuracy was enhanced on both trajectory gestures and static gestures.
In 2015, Wang and Li [35] developed a novel accelerometer-oriented gesture recognition system. The data collection was delimited automatically using the acceleration waveform, which handled the problems produced by the amplitude range. The angle offset was minimized using coordinate transformation theory. During training, the exemplars and clusters were extracted by Affinity Propagation (AP) and DTW. The classification enhanced the resolution, and the system returned better performance on the Android platform.
In 2018, Choi and Kim [7] introduced a modified DTW algorithm that differentiated gesture-position sequences on the basis of the gestural movement direction, since standard comparison did not take into account the two-dimensional behaviour of the user's movement; hence, the sequence comparison needed to be enhanced. Similar gestures were chosen by means of a filtering process, and the difference was computed using the ratio of the proportional distance and the Euclidean distance. The recognition-decline issue was handled by the chosen spline interpolation. The experiment was conducted on public databases such as G3D and MSRC-12, which demonstrated the enhanced performance of the proposed method.
In 2020, Ameur et al. [1] developed a dynamic hand gesture recognition technique based on touchless hand motions over a Leap Motion device. A bidirectional LSTM and a basic unidirectional LSTM were exploited separately. The temporal and spatial dependencies were considered among the network layers and the Leap Motion data during the backward and forward passes. This model was tested on the RIT dataset and the LeapGestureDB dataset and achieved better performance with respect to computational complexity and accuracy.
In 2013, Angel et al. [2] addressed a probability-oriented DTW for gesture recognition. A Gaussian-oriented probabilistic model was constructed using distinct samples, and the DTW cost was then redefined on the basis of this novel method. It was tested on challenging databases; from the analysis, better gesture recognition was attained on RGB-D data.
In 2021, Lv [15] proposed a somatosensory sensor for implementing gesture recognition, a capability that is very important in human–computer interaction. Initially, the gesture was converted into a gesture sequence consisting of micro-gestures that describe the different directions. Then, the gesture sequences were compared with gesture templates, and the gesture was identified from the pattern matching result. From the results, more than 92% of gestures could be identified by the proposed recognition method.
In 2021, Blazkiewicz [3] aimed to quantify the degree of asymmetry of kinematic and kinetic parameters caused by different ankle orthosis settings, using Dynamic Time Warping (DTW). Barefoot gait and gait with four different walker settings were tested in eighteen healthy persons. Kinematic and kinetic parameters were measured using the Vicon system and Kistler plates. From the results, the orthosis position of this study fulfils its protective function, but the 15DF gait setting can lead to overload of the knee and hip joints.
2.2 Review
Hand gesture recognition is an important area in language technology and computer science that aims to interpret hand gestures through mathematical algorithms; gestures can emerge from any state or bodily motion. It can be accomplished using several methods, such as DTW, deep learning, and the Hough transform. But drawbacks exist: the recognition is not always precise, long conversations are not possible, information gets distorted, and understanding is complex. Table 1 lists the features and the challenges of the existing models associated with hand gesture recognition. The k-curvature algorithm [25] can be employed in the natural control of a software application and is utilized for the real-time interpretation of familiar gestures and sign digits; but it does not speed up gesture recognition using a distinct version of DTW. QDTW [31] offers better accuracy and can be utilized in outdoor swing-oriented sports and indoor gaming; still, it does not propose a ranking system for a player. SDTW [33] segments the trajectories automatically by joining the SVM classifier and the DTW algorithm and also improves the significance related to the structure information; yet it does not minimize the overlap problem between complex and simple letter trajectories. I2C-DTW [6] attains better generalization ability and is also useful when multiple users appear together; but it does not automatically learn the few parameters needed for handling multiple-user gesture recognition and continuous gesture recognition. MVSAMP [35] is more robust to angle offset and waveform distortion and also offers better performance in user-independent and user-dependent recognition; still, the offset produced by the yaw angle is not resolved. The modified DTW algorithm [7] alleviates the matching errors and also provides a lower learning pressure and a simple calculation process; yet it is very time-consuming. HBU-LSTM [1] enhances the model performance by considering the temporal and spatial dependencies and effectively classifies the input data attained from the LMC; but it cannot be implemented on GPU. Probability-based DTW [2] benefits from both the temporal warping ability and the generalization ability and can also solve multiple deformations in data, but the gesture-discriminative features are not attained. The Kinect sensor approach [15] is used for checking whether the threshold is small, but its computational cost is very high. DTW [3] has a simple calculation process and a lower learning pressure, but it also has a very high computational cost. Thus, it is required to develop novel deep learning methods that generate effective performance for hand gesture recognition while handling both dynamic and static images.
3.1 Dataset Description
The dataset used here is the benchmark IIITA-ROBITA Indian Sign Language (ISL) Ges-
ture Database. The recognition and interpretation of ISL gestures are becoming an inter-
esting area of research in the area of Human–Robot Interaction. The IIITA-ROBITA ISL
Gesture Database is offered by the Indian Institute of Information Technology Allahabad
(IIITA)-ROBITA to the gesture recognition researchers for generating future enhancement
in this area [21, 22]. The data was generated at the AI and Robotics Lab, IIIT-Allahabad, from July 2009.
Table 1 Features and challenges of state-of-the-art hand gesture recognition methods
Plouffe and Cretu [25], k-curvature algorithm. Features: can be used for the real-time interpretation of familiar gestures and sign digits; can be employed in the natural control of a software application. Challenges: the gesture recognition is not speeded up by a distinct version of DTW.
Srivastava and Sinha [31], QDTW. Features: can be utilized in outdoor swing-oriented sports and indoor gaming; provides better accuracy. Challenges: a ranking system is not proposed for a player.
Tang et al. [33], SDTW. Features: the significance related to the structure information is improved; the trajectories are segmented automatically by joining the SVM classifier and the DTW algorithm. Challenges: the overlap problem between complex and simple letter trajectories is not minimized.
Cheng et al. [6], I2C-DTW. Features: helpful when multiple users appear together; attains better generalization ability. Challenges: the few parameters are not learned automatically for handling multiple-user gesture recognition and continuous gesture recognition.
Wang and Li [35], MVSAMP. Features: provides better performance in user-independent and user-dependent recognition; more robust to angle offset and waveform distortion. Challenges: cannot resolve the offset produced by the yaw angle.
Choi and Kim [7], Modified DTW algorithm. Features: offers a lower learning pressure and a simple calculation process; the matching errors are alleviated. Challenges: very time-consuming.
Ameur et al. [1], HBU-LSTM. Features: the input data attained from the LMC is effectively classified; the model performance is enhanced with consideration of the temporal and spatial dependencies. Challenges: cannot be implemented on GPU.
Angel et al. [2], Probability-based DTW. Features: can handle multiple deformations in data; benefits from both the temporal warping ability and the generalization ability. Challenges: the gesture-discriminative features are not attained.
Lv [15], Kinect sensor. Features: used for checking whether the threshold is small. Challenges: the computational cost is very high.
Blazkiewicz [3], DTW. Features: a simple calculation process and a lower learning pressure. Challenges: very high computational cost.
The dataset is composed of 23 distinct gestures captured at 320 × 240 pixels and 30 fps with a Sony Handycam. A constant background is maintained under several light illumination conditions. The ISL gesture dataset consists of sequences of RGB frames for 23 isolated ISL gestures. Some sample static as well as dynamic images used here for hand gesture recognition are displayed in Figs. 1 and 2.
3.2 Image Pre‑processing
In the proposed hand gesture recognition model, the image pre-processing is performed using grey scale conversion and histogram equalization. In the static type, images are considered for processing, and in the dynamic type, video frames are utilized. Assume the database used for recognition is $QA = \{MZ_{cy}\}$, in which $cy = 1, 2, \cdots, CY$ and $CY$ represents the total number of hand gesture images or videos used for recognition. The description of the two methods adopted in the pre-processing phase is given below.
Fig. 2 Sample dynamic images of ISL dataset for different frames for different signs
3.2.1 Grey scale conversion

The grey scale image is formed by passing the colour intensity $EQJQ$ of all pixels in the colour image through the trees of a random forest. At every tree node, the group of pixels is divided according to the stored binary test $\varphi^{*}$ and is sent to the right or left child node until a leaf node is met. The path of the $iq$-th pixel through all trees of the forest ends in a group of leaf nodes $LFQ_{iq}$. The values stored in $LFQ_{iq}$ comprise the covariance matrix $\Sigma_{lq}$ and the leaf node value $\widehat{IQ}_{lq}$. Using these values, the RGB colours of the $iq$-th pixel are transformed to the grey scale value $GVQ_{iq}$ as in Eq. (1):

$$GVQ_{iq} = \sum_{lq \in LFQ_{iq}} \omega_{lq}\, dec\!\left(\widehat{IQ}_{lq}\right) \qquad (1)$$
3.2.2 Histogram equalization
The histogram equalization approach [9] varies the intensity distribution of an image, thereby improving its contrast. Assume $MZ_{cy}$ is the given image, represented by a $kq_{lq} \times kq_{mq}$ grey matrix with pixel intensities within 0 to 1, and the possible intensity value count is represented by $PIVQ$, which is typically equal to 256. The normalized histogram $NHQ$ of $MZ_{cy}$, having a bin for every possible intensity $heq$, is given in Eq. (2):

$$NHQ = \frac{\text{number of pixels with intensity } heq}{\text{total number of pixels}} \qquad (2)$$

Hence, the final pre-processed image obtained using histogram equalization is represented by $MZ_{cy}^{his}$, and this is further employed for the process of segmentation.
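For concreteness, the two pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration assuming OpenCV is available; it approximates the forest-based grey scale conversion of Eq. (1) with the standard colourimetric conversion, so it is not the authors' exact implementation, and the file path is a placeholder.

```python
import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray) -> np.ndarray:
    """Grey-scale conversion followed by histogram equalization.

    A minimal sketch of the pre-processing stage; the paper's
    random-forest decolourisation (Eq. 1) is approximated here by
    OpenCV's standard BGR-to-grey conversion.
    """
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)   # grey scale conversion
    equalized = cv2.equalizeHist(grey)                   # spread out frequent intensities
    return equalized

# usage on a single hand gesture image (path is illustrative)
img = cv2.imread("isl_gesture.jpg")
pre = preprocess(img)
```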
Nowadays, a major challenging task in computer vision is hand gesture recognition. Present methodologies have revealed preliminary outcomes on simple scenarios, yet they are far from human performance. Owing to the huge count of potential applications that include human gesture recognition in fields such as clinical assistance, sign language recognition, or surveillance, among others, there exists a vast active research community handling this problem. Hand gesture recognition is an illustration of sequential learning. The major problem arises from the distinct temporal duration of the data sequences, which may even consist of fundamentally distinct groups of component elements. Two major families of approaches are associated with this problem: methods like Conditional Random Fields (CRF) or Hidden Markov Models (HMM), which are generally used for handling the problem from a probabilistic viewpoint in the case of classification problems, and key pose techniques for hand gesture recognition. Dynamic programming-oriented algorithms are utilized for both the clustering and the alignment of temporal series; the most familiar dynamic programming technique utilized for hand gesture recognition is Dynamic Time Warping (DTW). Yet, applying these methods in complex settings becomes a challenging task owing to highly changing environmental conditions. Familiar problems are partial occlusions, illumination variations, unexpected object appearance, speed, human action spontaneity, human movement continuity, background influence, the wide variety of human pose configurations, and distinct viewpoints. These impacts produce dramatic variations in the specific gesture description, thereby producing great intra-class variability. The architecture of the proposed hand gesture recognition model is displayed in Fig. 3.
The proposed hand gesture recognition model is composed of four phases: data collection, pre-processing, segmentation, and recognition. In the initial phase, the data is gathered from the standard ISL dataset, which consists of static as well as dynamic images. After collecting the data, the pre-processing phase begins. It is done to enhance the image, suppressing unwanted distortions and improving the features of the image that are significant for further processing. Here, the pre-processing is accomplished using grey-scale conversion and histogram equalization. Pre-processing is useful to expose how the data should be structured on the basis of domain knowledge. In the grey scale conversion, the RGB values are converted into grey scale values, which measure the intensity of light present in images; less information is carried in each pixel, which makes the processing simpler. The histogram equalization is used to enhance the contrast present in the images by efficiently spreading out the most frequent intensity values. Once the pre-processed image is attained, it is subjected to the third phase of segmentation. Here, the AHT is used to perform the segmentation process. The Hough transform is used to find imperfect instances of objects within a specific class of shapes. An improvement is made in the Hough transform by optimizing the theta parameter using the proposed E-WOA; hence, the segmentation is called adaptive. The Hough transform is very useful when detecting lines with short breaks in images due to noise. This segmented image is subjected to the final recognition phase, in which the recognition is performed by the optimized Deep CNN. Here, an enhancement is made in the Deep CNN by optimizing its learning rate, epoch count, and hidden neuron count with the same proposed E-WOA. E-WOA has the capability of avoiding local optima and is also useful for solving unconstrained and constrained problems. The learning rate is a tuning parameter defining the step size at every iteration. An epoch represents the count of passes over the complete training dataset that the algorithm has taken. The hidden neurons accomplish the nonlinear transformations of the inputs subjected to the network. The training of the Deep CNN is done with the DTW to avoid redundant frames and improve the performance when dealing with dynamic data. The implementation of the Deep CNN is very fast. DTW permits comparisons of two time-series sequences having differing speeds and lengths; it alleviates the matching errors and also provides a lower learning pressure. In the final step, the Deep CNN returns the recognized gesture as output. Owing to the advantages of these processing steps, the gesture segmentation and recognition are performed efficiently.
Fig. 3 Architecture of the proposed hand gesture recognition model: pre-processing (grey-scale conversion and histogram equalization), segmentation (AHT with the theta angle tuned by the proposed E-WOA), and recognition (optimized Deep CNN with learning rate, epoch count, and hidden neuron count tuned by E-WOA, trained with DTW) yielding the recognized gesture
The Hough transform [20] performs reliably even in the availability of noise. It acts as a robust tool for extracting features like ellipses, circles, or straight edges. The primitives are described by polygons and are defined in a parametric format. It is employed as one step in a processing chain. It has the capability to find, quantify, and extract shapes, recognizing those features and shapes even for incomplete or broken outlines or noise-corrupted thresholded images. The shape of interest is transformed into its parameter space. A line in a Cartesian coordinate system $(x_h, y_h)$ is defined as in Eq. (4):
$$y_h = m_h x_h + b_h \qquad (4)$$

Here, the term $b_h$ describes its intercept with the $y_h$ axis, and the constant $m_h$ denotes the slope. Every line is characterized uniquely by the constants $b_h$ and $m_h$; hence, any line is described by a point in a coordinate system $(b_h, m_h)$. Conversely, any point $(x_h, y_h)$ is linked with a group of values for $b_h$ and $m_h$, and hence Eq. (4) is rewritten as in Eq. (5):

$$m_h = \frac{y_h}{x_h} - \frac{1}{x_h}\, b_h \qquad (5)$$
Every point $(x_h, y_h)$ is defined by a line in $(m_h, n_h)$ space. The values present in $(m_h, n_h)$ space are not defined for vertical lines. When a group of points $(yh_{kh}, xh_{kh})$ lying on a line defined by $y_h = MH \cdot x_h + BH$ is converted into $(m_h, n_h)$ space, known as parameter space or Hough space, every point is defined by a line in Hough space as in Eq. (6):

$$m_h = \frac{yh_{kh}}{xh_{kh}} - \frac{1}{xh_{kh}}\, b_h \qquad (6)$$
Assume that all such lines meet at one point $(MH, NH)$. The Hough transform can also use polar coordinates, as in Eq. (7):

$$\rho = x_h \cos\theta + y_h \sin\theta \qquad (7)$$
Here, the angle of the line with the horizontal axis is defined by $\theta$, the minimum distance to the origin is defined by $\rho$, and these are linked to $b_h$ and $m_h$ via Eq. (8) and Eq. (9):

$$m_h = -\frac{\cos\theta}{\sin\theta} \qquad (8)$$

$$b_h = \frac{\rho}{\sin\theta} \qquad (9)$$
The Hough space is bi-dimensional, having coordinates $\theta$ and $\rho$. The initial step in the Hough transform method is the formation of a 2D parameter space. Every element of this matrix corresponds to a straight line, so the parameter matrix describes only a finite number of lines. Next, a counter is incremented for every point that votes for an element of the parameter space; hence, the parameter matrix is generally known as an accumulator. The counters are initialized to zero before initiating the transformation. Here, the improvement is made in the Hough transform by optimizing its $\theta$ value with the proposed E-WOA; hence, it is called AHT. The main objective of the proposed E-WOA-based gesture segmentation is to maximize the accuracy by optimizing the theta value of the Hough transform. This characteristic is defined in Eq. (10):

$$Obn_1 = \arg\max_{\{\theta\}} (Acr) \qquad (10)$$
Here, the term $Obn_1$ denotes the objective function for the gesture segmentation, $\theta$ denotes the theta angle of the Hough transform that is to be optimized by the proposed E-WOA, and $Acr$ denotes the accuracy. Accuracy is defined as "the degree to which the result of a measurement conforms to the correct value or a standard". The formula for accuracy is defined in Eq. (11):

$$Acr = \frac{p_t + n_t}{p_t + n_t + p_f + n_f} \qquad (11)$$

Here, the terms $p_t$, $p_f$, $n_t$, and $n_f$ denote the true positive, false positive, true negative, and false negative, respectively. The bounding limit of the theta value lies between 0.01 and 0.10. The solution encoding of the E-WOA-based gesture segmentation is portrayed in Fig. 4.
The major drawback of the Hough transform is that it produces misleading outcomes when objects appear to be aligned by chance, and the detected lines are infinite lines defined by their $(m_h, n_h)$ values rather than finite lines with defined end points. Hence, to overcome these drawbacks, the parameter $\theta$ is optimized in the Hough transform. The AHT is tolerant of gaps in the edges, unaffected by occlusion in the image, and also not affected by noise. Thus, the final AHT-based gesture segmented image is represented as $MZ_{cy}^{hough}$.

Fig. 4 Fitness evaluation of the E-WOA-based gesture segmentation: the accuracy of the segmented output is checked; if it is not high, the solution is updated by E-WOA, and the optimized segmented output is returned at termination
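To make the tunable parameter concrete, a minimal sketch of exposing the theta resolution of the line Hough transform to the optimizer is given below, assuming OpenCV. The probabilistic Hough variant and the `accuracy_fn` oracle are illustrative assumptions rather than the authors' exact pipeline.

```python
import cv2
import numpy as np

def hough_segment(pre_img: np.ndarray, theta_res: float) -> np.ndarray:
    """Detect line segments after Canny edge detection.

    theta_res is the angular resolution (radians) of the Hough
    accumulator; in the proposed AHT this is the value tuned by
    E-WOA within its bounding limits (0.01 to 0.10).
    """
    edges = cv2.Canny(pre_img, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=theta_res, threshold=50,
                            minLineLength=20, maxLineGap=5)
    mask = np.zeros_like(pre_img)
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            cv2.line(mask, (x1, y1), (x2, y2), 255, 2)   # draw detected segments
    return mask

def fitness(theta_res: float, images, accuracy_fn) -> float:
    """Objective Obn1 of Eq. (10): accuracy of the segmented output.

    accuracy_fn is a hypothetical oracle comparing the segmented
    masks against ground-truth masks.
    """
    masks = [hough_segment(img, theta_res) for img in images]
    return accuracy_fn(masks)
```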
4.2 Proposed E‑WOA
The proposed E-WOA is used to optimize the theta parameter of the Hough transform in the segmentation phase and the learning rate, epoch count, and hidden neuron count of the Deep CNN in the recognition phase. The WOA [17] mimics humpback whales and is motivated by the bubble-net hunting method. After identifying the best search agent, the remaining search agents update their positions toward the best search agent, as depicted in Eqs. (12) and (13):
$$\vec{DF} = \left| \vec{CF} \cdot \vec{XA}^{*}(iaja) - \vec{XA}(iaja) \right| \qquad (12)$$

$$\vec{XA}(iaja + 1) = \vec{XA}^{*}(iaja) - \vec{AF} \cdot \vec{DF} \qquad (13)$$

Here, the position vector of the best solution is represented by $\vec{XA}^{*}$, element-by-element multiplication is represented by '$\cdot$', the current iteration is represented by $iaja$, the position vector is represented by $\vec{XA}$, the coefficient vectors are represented by $\vec{AF}$ and $\vec{CF}$, and the absolute value is represented by $|\cdot|$. The coefficient vectors are measured as in Eqs. (14) and (15):

$$\vec{AF} = 2\vec{af} \cdot \vec{ra} - \vec{af} \qquad (14)$$

$$\vec{CF} = 2 \cdot \vec{ra} \qquad (15)$$
In the above equations, $\vec{af}$ is decreased from 2 to 0 and $\vec{ra}$ represents a random vector in $[0, 1]$. Two approaches are defined to describe the bubble-net characteristics, called the exploitation phase of the humpback whales. In the first approach, called the shrinking encircling mechanism, the term $\vec{af}$ is decreased. In the second approach, called spiral updating position, the distance is measured between the whale positioned at $(XA, YF)$ and the prey positioned at $(XA^{*}, YF^{*})$. This behaviour is shown in Eq. (16):

$$\vec{XA}(iaja + 1) = \vec{DF}' \cdot e^{bf \cdot lf} \cdot \cos(2\pi lf) + \vec{XA}^{*}(iaja) \qquad (16)$$

Here, the term $lf$ represents a random number, $bf$ represents a constant, '$\cdot$' represents element-by-element multiplication, and $\vec{DF}' = \left| \vec{XA}^{*}(iaja) - \vec{XA}(iaja) \right|$ represents the distance of the $ia$-th whale to the prey. The spiral updating position is modelled as in Eq. (17):

$$\vec{XA}(iaja + 1) = \begin{cases} \vec{XA}^{*}(iaja) - \vec{AF} \cdot \vec{DF} & \text{if } pf < 0.5 \\ \vec{DF}' \cdot e^{bf \cdot lf} \cdot \cos(2\pi lf) + \vec{XA}^{*}(iaja) & \text{if } pf \geq 0.5 \end{cases} \qquad (17)$$
In the above equation, the random number in $[0, 1]$ is represented by $pf$. The prey is searched in a random manner. The search for prey is the exploration phase, based on the variation of the $\vec{AF}$ vector. The search agent in the exploration phase is updated on the basis of a randomly selected search agent rather than the best search agent attained so far. This mechanism, together with $|\vec{AF}| > 1$, performs the exploration and permits a global search, as shown in Eqs. (18) and (19):

$$\vec{DF} = \left| \vec{CF} \cdot \vec{XA}_{rand} - \vec{XA} \right| \qquad (18)$$

$$\vec{XA}(iaja + 1) = \vec{XA}_{rand} - \vec{AF} \cdot \vec{DF} \qquad (19)$$
The WOA offers several advantages, such as better exploitation, exploration, convergence behaviour, and local optima avoidance. But it suffers from some shortcomings, such as weak search space exploration. Hence, to overcome the shortcomings, EFO is integrated into it, and the resulting algorithm is called E-WOA. The EFO has several advantages, such as highly compelling convergence ability and better global searchability. The EFO algorithm [38] is motivated by the communication characteristics and prey location of the electric fish. The passive and active electrolocation ability of these fishes makes them the best candidate for handling the global as well as the local search. Electric fish constitute only a fraction of all fish species; they live in muddy water and are nocturnal. They possess a species-specific ability called electrolocation, which serves as the distinguishing ability for locating obstacles and prey. They are classified as weakly or strongly electric fish on the basis of the electric field strength produced. Strongly electric fish use the electrolocation ability for offensive purposes, whereas weakly electric fish employ it to detect objects, communicate, and navigate. The behaviours that rely on self-organization are multiple interactions, fluctuation, negative feedback, and positive feedback. The intelligent characteristics of electric fish are modelled as Electric Organ Discharge (EOD) amplitude, EOD frequency, passive electrolocation, and active electrolocation.
In general, for the traditional WOA, if (pf < 0.5), it checks whether (|AF| < 1). If it is
satisfied, then the current search agent is updated using Eq. (12), and if (|AF| ≥ 1), the
current search agent position is updated using Eq. (19). But, in the proposed E-WOA, if
(pf < 0.5), it checks the condition whether (|AF| ≥ 1). If this condition is fulfilled, then the
current search agent position is updated using Eq. (19). Otherwise, if (|AF| < 1), then the
update takes place using EFO as in Eq. (20).
$$XA_{iaja}^{cand} = XA_{iaja} + \phi \left( XA_{kaja} - XA_{iaja} \right) \qquad (20)$$

Here, the term $ka$ denotes a randomly selected individual from the neighbour group of the $ia$-th individual, the dimension is defined by $ja \mid ja \in \{1, 2, \cdots, da\}$, the candidate location of the $ia$-th individual is defined by $XA_{iaja}^{cand}$, and a random number produced from a uniform distribution is defined by $\phi \in [-1, 1]$. In the other case, it checks whether $(pf \geq 0.5)$, and if this condition is satisfied, then the current search agent position of WOA is updated using Eq. (16). The pseudocode of the proposed E-WOA is shown in Algorithm 1, and the flowchart of the developed E-WOA is depicted in Fig. 5.
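Under the equations above, a single E-WOA position update can be sketched in NumPy as follows. This is a simplified illustration, not the authors' full implementation: the EFO neighbour group is reduced to a random population member, the constant bf is set to 1, and the magnitude test on the vector AF uses its Euclidean norm.

```python
import numpy as np

def e_woa_step(pop, best, af_scalar, rng):
    """One E-WOA position update over a population matrix.

    pop      : (n_agents, dim) current positions
    best     : (dim,) best solution XA* found so far
    af_scalar: the control value af, decreased from 2 to 0 over iterations
    """
    n, dim = pop.shape
    new_pop = np.empty_like(pop)
    for i in range(n):
        pf = rng.random()
        AF = 2 * af_scalar * rng.random(dim) - af_scalar      # Eq. (14)
        CF = 2 * rng.random(dim)                              # Eq. (15)
        if pf < 0.5:
            if np.linalg.norm(AF) >= 1:                       # exploration, Eqs. (18)-(19)
                rand = pop[rng.integers(n)]
                DF = np.abs(CF * rand - pop[i])
                new_pop[i] = rand - AF * DF
            else:                                             # EFO move, Eq. (20)
                ka = pop[rng.integers(n)]                     # random neighbour (simplified)
                phi = rng.uniform(-1, 1, dim)
                new_pop[i] = pop[i] + phi * (ka - pop[i])
        else:                                                 # spiral update, Eq. (16), bf = 1
            lf = rng.uniform(-1, 1)
            DFp = np.abs(best - pop[i])
            new_pop[i] = DFp * np.exp(lf) * np.cos(2 * np.pi * lf) + best
    return new_pop

rng = np.random.default_rng(0)
pop = rng.random((10, 3))       # population size 10, as in the experiments
best = pop[0]
pop = e_woa_step(pop, best, af_scalar=1.5, rng=rng)
```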
A hybrid algorithm [16] joins two or more algorithms for handling a critical problem by selecting one algorithm or switching among them over the course of the run. This is usually performed to combine the desirable features of each, such that the new algorithm is better than the individual components. It is modelled to provide better performance when compared with the individual algorithms. It exploits the good properties of distinct techniques by applying them to the sub-problems they can handle effectively. It can also solve multi-objective optimization engineering problems having inequality constraints.
While considering the dynamic image (video), the training as well as the testing involves a number of frames that offer distinct information. Therefore, frames carrying repeated information, which differ minimally from the earlier frame, are eradicated.
DTW [26] model measures the disparity between two data series that are attained at dif-
ferent times. A matrix comprising of the Euclidean distances at aligned points above the
two series is utilized for computing the minimal cost between the two series. Further, the
direction of the shortest path selection is associated with specific regulations and rules.
In particular, the movement is lessened to diagonal, horizontal, and vertical directions. A
weight is associated with these directions. The shortest path is limited in the case of thresh-
old to the two series. These are needed to be equivalent ones. Thus, the measurement of the
distance between the two frames helps to remove the repeated frames based on the DTW
concept.
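A minimal NumPy sketch of this DTW cost and the resulting redundant-frame filter is shown below. The per-frame feature representation (each frame's rows treated as a series) and the redundancy threshold are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Classic DTW with diagonal, horizontal, and vertical moves.

    seq_a, seq_b are 2D arrays (time steps x feature dim), e.g. a
    frame's row-wise feature profile.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # Euclidean local cost
            cost[i, j] = d + min(cost[i - 1, j],              # vertical
                                 cost[i, j - 1],              # horizontal
                                 cost[i - 1, j - 1])          # diagonal
    return float(cost[n, m])

def drop_redundant_frames(frames, threshold=1.0):
    """Keep a frame only if it differs enough from the previously kept one."""
    kept = [frames[0]]
    for f in frames[1:]:
        if dtw_distance(kept[-1], f) > threshold:   # threshold is illustrative
            kept.append(f)
    return kept
```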
The optimized Deep CNN is used for gesture recognition, in which the improvement is made by optimizing its learning rate, epoch count, and hidden neuron count with the proposed E-WOA. The Deep CNN [28] is a feedforward network in which information flows in a single direction; these are biologically inspired networks. CNN architectures are composed of pooling and convolutional layers grouped in the form of modules. The modules are arranged one above the other to generate a deep model. One or several fully connected layers are fed by these representations, and the final fully connected layer produces the class label as output.
5.2.1 Convolutional layers
These act as feature extractors; hence, feature representations of the input images are learned. The neurons are arranged in the form of feature maps. Every neuron has a receptive field that is linked to its neighbourhood in the earlier layer through a group of trainable weights called a filter bank. A new feature map is computed by convolving the inputs with the learned weights, and the outcomes are passed through a nonlinear activation function. The neurons of a feature map are constrained to share equal weights; different feature maps have distinct weights, and hence it is possible to extract various features at every location. The $kz$-th output feature map $YZ_{kz}$ is measured as in Eq. (21):

$$YZ_{kz} = f\left( WZ_{kz} * xz \right) \qquad (21)$$

Here, the convolutional filter is represented by $WZ_{kz}$, the input image is represented by $xz$, the sign $*$ represents the 2D convolution operator, and $f(\cdot)$ defines the nonlinear activation function. Nonlinear features are extracted by the nonlinear activation functions. Earlier methods used hyperbolic tangent and sigmoid functions; Rectified Linear Units (ReLUs) are very familiar these days.
5.2.2 Pooling layers
This layer minimizes the spatial resolution of the feature maps. Average pooling propagates the average of the input values to the next layer, whereas max-pooling aggregation layers propagate the maximum value inside a receptive field, choosing the largest element in every receptive field as in Eq. (22):

$$YZ_{kzizjz} = \max_{(pz,\, qz)\, \in\, R_{izjz}} xz_{kzpzqz} \qquad (22)$$

In the above equation, the pooling operation output is represented by $YZ_{kzizjz}$, the element at the location $(pz, qz)$ is represented by $xz_{kzpzqz}$, and $R_{izjz}$ denotes the receptive field around position $(iz, jz)$.
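Equations (21) and (22) can be made concrete with a short NumPy sketch of one convolutional feature map followed by ReLU activation and 2 × 2 max pooling; the random input and filter below are placeholders, not learned weights.

```python
import numpy as np

def conv2d_valid(xz: np.ndarray, wz: np.ndarray) -> np.ndarray:
    """Single-channel 'valid' 2D convolution, i.e. WZ_kz * xz in Eq. (21)."""
    kh, kw = wz.shape
    H, W = xz.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(xz[i:i + kh, j:j + kw] * wz)
    return out

def relu(a: np.ndarray) -> np.ndarray:
    """Nonlinear activation f(.) of Eq. (21)."""
    return np.maximum(a, 0.0)

def max_pool2x2(a: np.ndarray) -> np.ndarray:
    """Propagate the largest element of every 2x2 receptive field (Eq. 22)."""
    H, W = a.shape
    a = a[:H - H % 2, :W - W % 2]                  # crop to even size
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

xz = np.random.rand(8, 8)          # toy input image
wz = np.random.rand(3, 3)          # convolutional filter (random placeholder)
feature_map = max_pool2x2(relu(conv2d_valid(xz, wz)))
```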
5.2.3 Fully connected layers

Various convolutional and pooling layers are stacked above each other to extract increasingly abstract feature representations. The fully connected layers that follow accomplish the function of high-level reasoning. For classification problems, a softmax operator is utilized on top of the DCNN.
5.2.4 Training
The free parameters of the CNN are adjusted by learning algorithms to achieve the necessary network output. The familiar algorithm used here is backpropagation, which measures the gradient of an objective; the parameters are adjusted to reduce the errors. A problem that occurs in training is overfitting, which damages the capability of the model to generalize to unseen data and remains a major challenge addressed through regularization. DCNNs need various hyperparameters, such as the epoch count for running the model and the learning rate. Batch normalization permits higher learning rates. The learning rate governs the amount by which the weights are updated in the process of training; it is a configurable hyperparameter employed in the training of the Deep CNN and takes a small positive value, usually in the range between 0 and 1. An epoch describes the number of passes over the whole training dataset that the optimization algorithm has completed. If the batch size equals the entire training dataset, then the epoch count is nothing but the iteration count.
The major objective of the proposed E-WOA-based classification is to maximize the precision by optimizing the learning rate, epoch count, and hidden neuron count of the Deep CNN. This behaviour is mathematically modelled as in Eq. (23):

$$Obn_2 = \arg\max_{\{LR,\, EC,\, HNC\}} (Prec) \qquad (23)$$

In the above equation, the term $Obn_2$ represents the objective function for the classification; $LR$ denotes the learning rate, $EC$ denotes the epoch count, and $HNC$ denotes the hidden neuron count of the Deep CNN that are to be optimized by the proposed E-WOA; and $Prec$ denotes the precision. Precision is defined as "the ratio of positive observations that are predicted exactly to the total number of observations that are positively predicted". The formula for precision is defined in Eq. (24):

$$Prec = \frac{p_t}{p_t + p_f} \qquad (24)$$

Here, the terms $p_t$ and $p_f$ denote the true positive and false positive, respectively. The bounding limit of the learning rate lies between 0.01 and 0.09, the epoch count lies between 5 and 10, and the hidden neuron count lies between 5 and 256. The architectural representation of the proposed optimized Deep CNN-based recognition is depicted in Fig. 6.
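A minimal sketch of a Deep CNN exposing the three tuned hyperparameters is given below, assuming Keras/TensorFlow. The input shape (240 × 320 grey images) and class count (23 ISL gestures) follow the dataset description; the layer layout itself is an illustrative assumption, not the authors' exact architecture.

```python
import tensorflow as tf

def build_deep_cnn(learning_rate: float, hidden_neurons: int,
                   input_shape=(240, 320, 1), n_classes=23):
    """Deep CNN with the E-WOA-tuned hyperparameters as arguments."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(hidden_neurons, activation="relu"),  # tuned in [5, 256]
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),  # tuned in [0.01, 0.09]
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# candidate hyperparameters proposed by E-WOA (within the stated bounds)
model = build_deep_cnn(learning_rate=0.01, hidden_neurons=128)
# model.fit(x_train, y_train, epochs=epoch_count)  # epoch count tuned in [5, 10]
```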
In the traditional Deep CNN, the orientation and position of the objects are not encoded
into their predictions. It does not have the capability to be spatially invariant to the input
data. It also entirely loses the information regarding the position and composition of the
components. Hence, the optimized Deep CNN is proposed to overcome these drawbacks. This
optimized Deep CNN can minimize the losses and offer the most accurate outcomes.
6.1 Experimental setup
The proposed E-WOA-Deep CNN-based hand gesture segmentation and recognition was implemented in Python, and the experiments were executed. The dataset was gathered from the ISL dataset, which consists of static as well as dynamic images. The population size considered was 10, and the maximum number of iterations was 25. The proposed E-WOA-Deep CNN was compared with various optimization algorithms such as EFO-Deep CNN [38], WOA-Deep CNN [17], GWO-Deep CNN [18], and PSO-Deep CNN [36], and with distinct machine learning algorithms like DH-GWO-NN [22], Deep CNN [28], RNN [14], and VGG16 [10], in terms of performance measures such as "accuracy, sensitivity, specificity, precision, FPR, FNR, FDR, NPV, F1 Score and MCC".
6.2 Performance measures
(c) Specificity: "the number of true negatives, which are determined precisely".

$$Spe = \frac{n_t}{n_t + p_f} \qquad (26)$$

(f) FNR: "the proportion of positives which yield negative test outcomes with the test".

$$FNR = \frac{n_f}{n_f + p_t} \qquad (28)$$

(g) NPV: "probability that subjects with a negative screening test truly don't have the disease".

$$NPV = \frac{n_t}{n_t + n_f} \qquad (29)$$

(h) FDR: "the number of false positives in all of the rejected hypotheses".

$$FDR = \frac{p_f}{p_f + p_t} \qquad (30)$$

(i) F1 Score: "harmonic mean between precision and recall. It is used as a statistical measure to rate performance".

$$F1Scre = \frac{2 \cdot Sens \cdot Prec}{Prec + Sens} \qquad (31)$$

(j) MCC: "correlation coefficient computed by four values".

$$MCC = \frac{p_t \times n_t - p_f \times n_f}{\sqrt{(p_t + p_f)(p_t + n_f)(n_t + p_f)(n_t + n_f)}} \qquad (32)$$
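Given the confusion counts pt, pf, nt, and nf, the measures above reduce to a few lines of Python; the sketch below mirrors Eqs. (11), (24), and (26)-(32), and the example counts are arbitrary.

```python
import math

def metrics(pt: int, pf: int, nt: int, nf: int) -> dict:
    """Derive the reported measures from confusion-matrix counts."""
    sens = pt / (pt + nf)                       # sensitivity (recall)
    prec = pt / (pt + pf)                       # precision, Eq. (24)
    denom = math.sqrt((pt + pf) * (pt + nf) * (nt + pf) * (nt + nf))
    return {
        "accuracy":    (pt + nt) / (pt + nt + pf + nf),   # Eq. (11)
        "sensitivity": sens,
        "specificity": nt / (nt + pf),                    # Eq. (26)
        "precision":   prec,
        "FPR":         pf / (pf + nt),
        "FNR":         nf / (nf + pt),                    # Eq. (28)
        "NPV":         nt / (nt + nf),                    # Eq. (29)
        "FDR":         pf / (pf + pt),                    # Eq. (30)
        "F1":          2 * sens * prec / (prec + sens),   # Eq. (31)
        "MCC":         (pt * nt - pf * nf) / denom,       # Eq. (32)
    }

print(metrics(pt=90, pf=10, nt=85, nf=15))
```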
6.3 Segmentation analysis
The experimental outcomes of the hand gesture recognition are displayed in Figs. 7 and 8, which include the original images, pre-processed images, Canny edge detection images, and Hough transformed images.

The performance analysis of the developed and traditional heuristic-oriented AHT and Deep CNN for hand gesture recognition using static and dynamic images, obtained by changing the learning percentages for the different performance measures, is depicted in Figs. 9 and 10. The positive measures reveal an increased result, and negative measures reveal a decreased result, defining the superiority of the proposed E-WOA-Deep CNN.
Fig. 7 Experimental outcomes of pre-processing and segmentation for hand gesture recognition in terms of
static images
Fig. 8 Experimental outcomes of pre-processing and segmentation for hand gesture recognition in terms of
dynamic images
From Fig. 9a, the accuracy of the E-WOA-Deep CNN at 65% learning percentage is 1.68%, 0.52%, 1.36%, and 1.25% higher than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Considering Fig. 9b, at 65% learning percentage, the F1 Score of the E-WOA-Deep CNN is 0.92%, 0.31%, 0.82%, and 0.72% better than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN.

Fig. 9 Performance analysis of the developed and traditional heuristic-oriented AHT and Deep CNN for hand gesture recognition using static images by changing the learning percentages for the measures a accuracy, b F1 score, c FDR, d FNR, e FPR, f MCC, g NPV, h precision, i sensitivity, and j specificity

Fig. 10 Performance analysis of the developed and traditional heuristic-oriented AHT and Deep CNN for hand gesture recognition using dynamic images by changing the learning percentages for the measures a accuracy, b F1 score, c FDR, d FNR, e FPR, f MCC, g NPV, h precision, i sensitivity, and j specificity
In Fig. 9c, the FDR of the E-WOA-Deep CNN at 85% learning percentage is 18.34%, 12.10%, 9.80%, and 13.21% better than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Considering Fig. 9d, at 65% learning percentage, the FNR of E-WOA-Deep CNN is 8.28%, 21.05%, 44.44%, and 42.31% improved over EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Similarly, in the case of Fig. 10a, at 65% learning percentage, the accuracy of the E-WOA-Deep CNN is 0.63%, 0.21%, 0.84%, and 0.10% better than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. From Fig. 10e, the FPR at 85% learning percentage is 2.20%, 1.16%, 1.39%, and 0.54% superior to EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. In the case of Fig. 10f, at 55% learning percentage, the MCC of the E-WOA-Deep CNN is 15.38%, 9.76%, 2.17%, and 4.26% improved over EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Moreover, in Fig. 10g, the NPV at 75% learning percentage is 1.79%, 4.26%, 0.41%, and 3.89% higher than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Hence, better performance is achieved by the proposed E-WOA-Deep CNN for both the static and the dynamic images than by all the traditional algorithms.
The performance analysis over several traditional machine learning algorithms in terms of distinct performance measures is displayed in Figs. 11 and 12, respectively. The measures reveal better outcomes in the case of the proposed E-WOA-Deep CNN. From Fig. 11a, at 85% learning percentage, the accuracy of E-WOA-Deep CNN is 1.06%, 2.15%, and 3.26% better than CNN, RNN, and VGG16. Considering Fig. 11b, at 85% learning percentage, the F1 Score of E-WOA-Deep CNN is 6.38%, 5.26%, and 4.17% higher than CNN, RNN, and VGG16. In the case of Fig. 11c, the FDR at 85% learning percentage is 13.33%, 18.75%, and 1.47% improved over CNN, RNN, and VGG16. Considering Fig. 11d, the FNR at 75% learning percentage is 17.65%, 28.57%, and 25.93% superior to CNN, RNN, and VGG16. Further, in Fig. 12a, the accuracy at 55% learning percentage is 2.17%, 3.33%, and 1.08% superior to CNN, RNN, and VGG16. Considering Fig. 12e, the FPR at 45% learning percentage is 2.0%, 10.71%, and 4.17% better than CNN, RNN, and VGG16. From Fig. 12f, the MCC at 35% learning percentage is 16.28%, 19.05%, and 2.04% better than CNN, RNN, and VGG16. Similarly, in the case of Fig. 12g, the NPV at 85% learning percentage is 2.04%, 4%, and 2.13% better than CNN, RNN, and VGG16. Hence, in the machine learning comparison, the proposed E-WOA-Deep CNN performs better than the state-of-the-art methods on both the static and dynamic images.
The overall algorithmic analysis of several optimization algorithms against the proposed E-WOA-Deep CNN is listed in Tables 2 and 3 for the static and the dynamic images, respectively. The outcomes hold better for the proposed E-WOA-Deep CNN. In the case of Table 2 for static images, the accuracy of the E-WOA-Deep CNN is 1.02%, 1.14%, 1.02%, and 0.76% better than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Additionally, the FPR of the E-WOA-Deep CNN is 1.16%, 0.71%, 1.82%, and 0.20% superior to EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN.
Fig. 11 Performance analysis of the developed and traditional machine learning algorithms for hand gesture
recognition using static images by changing the learning percentages for the measures a accuracy, b F1
score, c FDR, d FNR, e FPR, f MCC, g NPV, h precision, i sensitivity, and j specificity
Fig. 12 Performance analysis of the developed and traditional machine learning algorithms for hand gesture
recognition using dynamic images by changing the learning percentages for the measures a accuracy, b F1
score, c FDR, d FNR, e FPR, f MCC, g NPV, h precision, i sensitivity, and j specificity
Table 2 Overall performance analysis of developed and traditional heuristic-oriented AHT and Deep CNN
models for hand gesture recognition in static format
Performance measures | WOA-Deep CNN [17] | PSO-Deep CNN [36] | EFO-Deep CNN [38] | GWO-Deep CNN [18] | E-WOA-Deep CNN
Considering Table 3, for the dynamic images, the accuracy of the E-WOA-Deep CNN is 0.56%, 0.65%, 1.18%, and 0.92% higher than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Similarly, the MCC of the E-WOA-Deep CNN for the dynamic images is 2.76%, 1.57%, 11.63%, and 4.26% higher than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Thus, the proposed E-WOA-Deep CNN produced better outcomes for both the static and the dynamic images than all the traditional methods.
The classifier analysis of several machine learning algorithms against the proposed E-WOA-Deep CNN for both the static and dynamic images is given in Tables 4 and 5, respectively.
Table 3 Overall performance analysis of developed and traditional heuristic-oriented AHT and Deep CNN
models for hand gesture recognition in dynamic format
Performance measures | EFO-Deep CNN [38] | GWO-Deep CNN [18] | PSO-Deep CNN [36] | WOA-Deep CNN [17] | E-WOA-Deep CNN
Table 4 Overall classifier analysis of developed and traditional machine learning algorithms for hand ges-
ture recognition in a static format
Performance measures | RNN [14] | VGG16 [10] | Deep CNN [28] | DH-GWO-NN [22] | E-WOA-Deep CNN
The outcomes are superior for the proposed E-WOA-Deep CNN. Considering Table 4, the accuracy of E-WOA-Deep CNN is 2.20%, 3.05%, and 3.16% better than DH-GWO-NN, Deep CNN, VGG16, and RNN. The precision for static images is 4.19%, 0.30%, 0.23%, and 0.01% higher than DH-GWO-NN, Deep CNN, VGG16, and RNN. Similarly, the accuracy for dynamic images is 8.26%, 0.74%, 1.04%, and 1.03% superior to DH-GWO-NN, VGG16, RNN, and Deep CNN. Further, the specificity for dynamic images is 49.13%, 0.93%, 0.28%, and 1.19% improved over DH-GWO-NN, VGG16, RNN, and Deep CNN. Therefore, the classifier analysis outcomes hold good for the proposed E-WOA-Deep CNN for both the static and dynamic images when compared with several traditional machine learning algorithms.
Table 5 Overall classifier analysis of developed and traditional machine learning algorithms for hand ges-
ture recognition in dynamic format
Performance measures | Deep CNN [28] | RNN [14] | VGG16 [10] | DH-GWO-NN [22] | E-WOA-Deep CNN
7 Conclusion
This paper has proposed an improved segmentation and deep learning-based strategy for dynamic hand gesture recognition. The data was collected from the ISL benchmark dataset, which consists of both static and dynamic images. The pre-processing was accomplished by grey scale conversion and histogram equalization. The segmentation of gestures was performed by the AHT, with the theta angle tuned using the E-WOA. Next, the optimized Deep CNN was used for gesture recognition, where the learning rate, epoch count, and number of hidden neurons were optimized using the proposed E-WOA. The training of the optimized Deep CNN was done with the DTW, which avoided redundant frames and thus improved performance. From the analysis, the accuracy for static images of the E-WOA-Deep CNN was 1.02%, 1.14%, 1.02%, and 0.76% better than EFO-Deep CNN, WOA-Deep CNN, GWO-Deep CNN, and PSO-Deep CNN. Similarly, the accuracy for dynamic images was 8.26%, 0.74%, 1.04%, and 1.03% superior to DH-GWO-NN, VGG16, RNN, and Deep CNN. Therefore, the outcomes clearly demonstrated that the proposed E-WOA-Deep CNN yields superior results for gesture recognition for both static and dynamic images compared with the state-of-the-art methods.
References
1. Ameur S, Ben Khalifa A, Bouhlel MS (2020) A novel hybrid bidirectional unidirectional LSTM network for
dynamic hand gesture recognition with Leap Motion. Entertain Comput 35:100373
2. Bautista MA, Hernandez-Vela A, Ponce V, Perez-Sala X, Baro X, Pujol O, Angulo C, Escalera S (2013) Probability-based dynamic time warping for gesture recognition on RGB-D data. Lecture Notes Comput Sci 7854:126–135
3. Blazkiewicz M, Lann Vel Lace K, Hadamus A (2021) Gait symmetry analysis based on dynamic time
warping. Symmetry 13(5):836
4. Chen Q, Georganas ND, Petriu EM (2008) Hand gesture recognition using Haar-like features and a
stochastic context-free grammar. IEEE Trans Instrum Meas 57(8):1562–1571
5. Cheng H, Yang L, Liu Z (2016) Survey on 3d hand gesture recognition. IEEE Trans Circuits Syst
Video Technol 26(9):1659–1673
6. Cheng H, Dai Z, Liu Z, Zhao Y (2016) An image-to-class dynamic time warping approach for both 3D
static and trajectory hand gesture recognition. Pattern Recogn 55:137–147
7. Choi H-R, Kim TY (2018) Modified dynamic time warping based on direction similarity for fast ges-
ture recognition. Pattern Recogn 2018:9
8. Dardas NH, Georganas ND (2011) Real-time hand gesture detection and recognition using bag-of-
features and support vector machine techniques. IEEE Trans Instrum Meas 60(11):3592–3607
9. Dorothy R, Joany RM, Rathish J, Santhana Prabha S, Rajendran S, Joseph S (2015) Image enhancement
by Histogram equalization. Int J Nano Corros Sci Eng 2:21–30
10. Guan Q, Wang Y, Ping B, Li D, Du J, Qin Y, Lu H, Wan X, Xiang J (2019) Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study. J Cancer 10(20):4876–4882
11. Hsieh C-C, Liou D-H (2015) Novel Haar features for real-time hand gesture recognition using SVM. J
Real-Time Image Process 10(2):357–370
12. Ibañez R, Soria Á, Teyseyre A, Rodríguez G, Campo M (2017) Approximate string matching: a lightweight
approach to recognize gestures with kinect. Pattern Recogn 62:73–86
13. Kollorz K, Penne J, Hornegger J, Barke A (2008) Gesture recognition with a time-of-flight camera. Int
J Intell Syst Technol Appl 5(3–4):334–343
14. Li F, Liu M (2019) A hybrid convolutional and recurrent neural network for hippocampus analysis in
Alzheimer’s disease. J Neurosci Methods 323:108–118
15. Lv W (2021) Gesture recognition in somatosensory game via kinect sensor. Internet Technol Lett.
https://doi.org/10.1002/itl2.311
16. Marsaline Beno M, Valarmathi IR, Swamy SM, Rajakumar BR (2014) Threshold prediction for segmenting
tumour from brain MRI scans. Int J Imaging Syst Technol 24(2):129–137
17. Mirjalili S, Lewis A (2016) The whale optimization algorithm. Adv Eng Softw 95:51–67
18. Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
19. Mitra S, Acharya T (2007) Gesture recognition: a survey. IEEE Trans Syst Man Cybernet C Appl Rev
37(3):311–324
20. Murillo-Bracamontesa EA, Martinez-Rosas ME, Miranda-Velasco MM, Martinez-Reyes HL, Martinez-
Sandoval JR, Cervantes-de-Avila H (2012) Implementation of Hough transform for fruit image segmenta-
tion. In: International meeting of electrical engineering research ENIINVIE 2012, vol 35, pp 230–239
21. Nandy A, Mondal S, Prasad JS, Chakraborty P, Nandi GC (2010) Recognizing & interpreting Indian
sign language gesture for human robot interaction. In: The proceeding of ICCCT-10, IEEE Xplore
Digital Library, pp 712–717
22. Nandy A, Mondal S, Prasad JS, Chakraborty P, Nandi GC (2010) Recognition of isolated indian sign
language gesture in real time. In: Das VV et al (eds) Information processing and management, LNCS-
CCIS, vol 70. Springer, Berlin, pp 102–107
23. Palacios JM, Sagüés C, Montijano E, Llorente S (2013) Human–computer interaction based on hand gestures using RGB-D sensors. Sensors 13(9):11842–11860
24. Pedersoli F, Benini S, Adami N, Leonardi R (2014) XKin: An open source framework for hand pose
and gesture recognition using Kinect. Vis Comput 30(10):1107–1122
25. Plouffe G, Cretu A-M (2016) Static and dynamic hand gesture recognition in depth data using dynamic
time warping. IEEE Trans Instrum Meas 65(2):305–316
26. Plouffe G, Cretu A (2016) Static and dynamic hand gesture recognition in depth data using dynamic
time warping. IEEE Trans Instrum Meas 65(2):305–316
27. Poularakis S, Katsavounidis I (2016) Low-complexity hand gesture recognition system for continuous
streams of digits and letters. IEEE Trans Cybernet 46(9):2094–2108
28. Rawat W, Wang Z (2017) Deep convolutional neural networks for image classification: a comprehensive
review. Neural Comput 29:2352–2449
29. Ren Z, Yuan J, Meng J, Zhang Z (2013) Robust part-based hand gesture recognition using Kinect sensor.
IEEE Trans Multimedia 15(5):1110–1120
30. Ren Z, Yuan J, Meng J et al (2013) Robust part-based hand gesture recognition using kinect sensor.
IEEE Trans Multimedia 15(5):1110–1120
31. Srivastava R, Sinha P (2016) Hand movements and gestures characterization using quaternion dynamic
time warping technique. IEEE Sens. J 16(5):1333–1341
32. Tang M (2011) Recognizing hand gestures with Microsoft’s Kinect. Stanfordedu 14(4):303–313
33. Tang J, Cheng H, Zhao Y, Guo H (2018) Structured dynamic time warping for continuous hand trajectory
gesture recognition. Pattern Recogn 80:21–31
34. Várkonyi-Kóczy AR, Tusor B (2011) Human–computer interaction for smart environment applications
using fuzzy hand posture and gesture models. IEEE Trans Instrum Meas 60(5):1505–1514
35. Wang H, Li Z (2015) Accelerometer-based gesture recognition using dynamic time warping and sparse
representation. Multimedia Tools Appl 75:8637–8655
36. Wang D, Tan D, Liu L (2018) Particle swarm optimization algorithm: an overview. Soft Comput 22:387–408
37. Yao Y, Fu Y (2014) Contour model-based hand-gesture recognition using the Kinect sensor. IEEE
Trans Circuits Syst Video Technol 24(11):1935–1944
38. Yilmaz S, Sen S (2020) Electric fish optimization: a new heuristic algorithm inspired by electroloca-
tion. Neural Comput Appl 32:11543–11578
39. Yoon H-S, Soh J, Bae YJ, Yang HS (2001) Hand gesture recognition using combined features of location, angle and velocity. Pattern Recogn 34(7):1491–1501
40. Zhou Y, Jiang G, Lin Y (2016) A novel finger and hand pose estimation technique for real-time hand
gesture recognition. Pattern Recogn 49:102–114
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.