Report 3
BACHELOR OF ENGINEERING
In
Electronics and Telecommunication Engineering
By
GAURI MYADAMWAR ( 42145 )
VAISHNAVI SHINGOTE ( 42162 )
SHWETA TAIDE ( 42165 )
GUIDED BY :
DR. M. R. KALE
OCTOBER 2024
Department of Electronics and Telecommunication Engineering
Pune Institute of Computer Technology, Pune – 43
CERTIFICATE
This is to certify that the project report submitted by Gauri Myadamwar (42145), Vaishnavi Shingote (42162), and Shweta Taide (42165) is towards the partial fulfillment of the degree of Bachelor of Engineering in Electronics and Telecommunication Engineering as awarded by the Savitribai Phule Pune University, at Pune Institute of Computer Technology during the academic year 2024-25.
ACKNOWLEDGEMENT
We extend our sincere thanks to Dr. M. V. Munot, Dr. R. C. Jaiswal, Dr. G. S. Mundada and Dr. M. R. Kale for their invaluable support and guidance throughout our research. Their expertise, insightful advice, and continuous encouragement were instrumental in refining this project into presentation-worthy work. We are especially grateful for their assistance with algorithm design, calculations, and overall presentation, which significantly contributed to the quality of this research.
Thank you.
CONTENTS
Abstract i
List of Symbols ii
List of Figures iii
List of Tables iv
1 Introduction 1-4
1.1 Background 1
1.2 Relevance 1
1.3 Motivation 1
1.4 Problem Definition 2
1.5 Scope and Objectives 2
1.6 Technical Approach 3
1.7 Organization of the Report 4
2 Literature Survey 5-13
2.1 Introduction 5
2.2 Dataset Description 6
2.3 Proposed Solution 10
2.4 Equations 11
2.5 Tables 13
3 Methodology 14
4 Results and Discussions 18
5 Conclusions & Future Scope 21
References 22
Plagiarism Report 23
ABSTRACT
The most common way of communication for a speech-impaired person is sign language. Many people never learn sign language, which can lead to the social isolation of individuals who are deaf or hard of hearing, as it limits their ability to communicate with the wider population. Sign language is a long-established, natural means of communication, but most hearing people do not know any systematic sign language and a signer cannot have a human interpreter present at all times, so a mediator system that can translate sign language is needed. For this purpose, this report presents a real-time method in which deep learning is used to translate Indian Sign Language (ISL). The foremost purpose of this project is to develop a system that can identify specific ISL words as they are articulated. It captures frames of the hand while words are signed, passes them through a filter and then through a classifier that predicts the class of the hand gesture. Using a CNN architecture, the system recognizes the sign and communicates its corresponding meaning as text on the screen.
Abbreviations and Acronyms
ISL Indian Sign Language
ASL American Sign Language
CNN Convolutional Neural Network
List of Figures
Fig. 1 Block Diagram 3
Fig. 2 Number images from the dataset - ISL Dataset 7
Fig. 3 Alphabet images from the dataset - ISL Dataset 7
Fig. 4 System Architecture 10
Fig. 5 Proposed block diagram 16
List of Tables
1 Summary of Reviewed Literature 8
CHAPTER 1
Introduction
1.1 Background
Hand sign language recognition (HSLR) technology interprets the sign languages used mainly by deaf and hard-of-hearing people. It reduces communication barriers by acting as a bridge between sign language users and the hearing population, and it is therefore an important area for maximizing accessibility and inclusion in healthcare, education, and everyday life. From its beginnings in simple image processing to today's deep learning-based systems, HSLR has advanced largely through Convolutional Neural Networks (CNNs). Challenges remain, including regional sign variants, contextual differences, and the accurate capture of non-manual features such as facial expressions, so research on these recognition systems continues in order to make them more precise, adaptable, and accessible for everyone.
1.2 Relevance
The project uses knowledge from the electronics domain to capture gestures accurately with camera systems. Machine learning algorithms are then applied for analysis, recognition, and real-time classification of hand signs to ensure proper interpretation of sign language.
1.3 Motivation
Hand sign language recognition, due to its potential to improve the lives of deaf and hard-of-
hearing people, is of ever-increasing importance. Early work, however, concentrated on the
apparently simple tasks of image processing, like edge detection or template matching, where
it had significant problems with differences in hand shapes and gestures. With the advancement
in technology, machine learning algorithms, particularly CNNs have been utilized for
enhancing accuracy to identify complex signs, while others remain as challenges, such as
1
regional varieties and non-manual signals. More recent studies start to look into multimodal
approaches and mobile technologies to facilitate immediate translation, thus making HSLR
systems accessible.
1.6 Technical Approach
Convolutional Neural Networks (CNNs), a deep learning technique, are used in the technical approach of this hand sign identification project to classify gestures. First, we gather and preprocess Indian Sign Language (ISL) datasets using image normalization and augmentation. We use OpenCV to separate the hands from the background for hand detection, and then extract features either directly with the CNN or with tools such as MediaPipe. For hand sign recognition, a custom CNN is trained, or a pre-trained model is fine-tuned, on these data. Lastly, the model is integrated into a real-time detection system driven by a camera feed, allowing precise and effective gesture recognition.
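As an illustration of this preprocessing stage, a minimal sketch of the normalization and augmentation step is given below. The directory layout (data/<class_name>/*.jpg), the 128x128 input size, and the augmentation parameters are assumptions for illustration, not the project's final settings.

```python
# Minimal preprocessing/augmentation sketch; paths and sizes are assumed values.
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = 128  # assumed CNN input resolution

def load_and_normalize(path):
    """Read one image, convert to RGB, resize, and scale pixel values to [0, 1]."""
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
    return img.astype(np.float32) / 255.0

# Augmentation: small rotations, shifts, and zooms to mimic signer variability.
augmenter = ImageDataGenerator(
    rescale=1.0 / 255.0,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    validation_split=0.2,
)

train_gen = augmenter.flow_from_directory(
    "data/", target_size=(IMG_SIZE, IMG_SIZE),
    class_mode="categorical", subset="training")
```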
1.7 Organization of the Report
In order to present the project clearly, the report is divided into the following chapters.
Chapter 1: Introduction explains the project's goal, provides background information, and highlights the significance of hand sign language recognition.
Chapter 2: Literature Survey summarizes the existing body of knowledge on sign language recognition methods, describes the datasets used, and presents the proposed solution.
Chapter 3: Methodology describes the recognition approach, including details of how it was implemented and the evaluation strategy that was employed.
Chapter 4: Results and Discussion presents graphs, comparative analysis, and performance data for every approach.
Chapter 5: Conclusion and Future Scope outlines the study's main conclusions and offers suggestions for further research.
References: A list of scholarly publications, papers, and other records that are cited in the report.
CHAPTER 2
Literature Survey
2.1 Introduction
Improving accessibility to communication for deaf and hard-of-hearing communities is quite
pertinent with the hand sign language recognition system, which has become particularly
important for Indian Sign Language (ISL). CNNs have proven to be very effective in
leveraging this network for it will literally analyze and detect the visual patterns present in
images. The ISL recognition system captures hand gestures with the help of CNNs that
basically characterize the signs by extracting key features from the images or video frames.
This process converts gestures to text or, in most cases, speech, enabling two signers and one
learner to converse. Developed systems deploy CNN in the recognition of static signs and
dynamic gestures, hence constructing whole sentence conversion systems.
C. Hybrid Model: MediaPipe and CNN for ISL Recognition
MediaPipe tracks hand landmarks and crops the hand regions so that the data can be fed into the CNN for classification. The model detects hands inside video frames and then extracts the 21 landmarks that depict the joints and fingertips [2]. The landmark data or cropped images are then normalized and used as input to the CNN. Inputs can be the (x, y) or (x, y, z) coordinates of the landmarks, or grayscale hand images. The gesture recognition CNN is trained on this data to classify hand signs such as ISL letters or commands. The CNN can classify static ISL letters from the hand images and dynamic gestures through frame-sequence analysis.
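A minimal sketch of this MediaPipe-to-CNN hand-off is given below; it is not the exact implementation from the reviewed papers, and the single-hand setting and 63-dimensional landmark vector are assumptions made for illustration.

```python
# Sketch of landmark extraction with MediaPipe Hands feeding a trained classifier.
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def extract_landmarks(frame_bgr):
    """Return a flat (x, y, z) vector for the first detected hand, or None."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if not result.multi_hand_landmarks:
            return None
        lm = result.multi_hand_landmarks[0].landmark  # 21 landmarks (joints, fingertips)
        return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32).flatten()

# The resulting 63-dimensional vector (21 landmarks x 3 coordinates) would then be
# passed to the trained classifier, e.g. cnn_model.predict(vec[np.newaxis, :]).
```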
2.2.3 ISL Dataset (Indian Sign Language)
The Indian Sign Language (ISL) Dataset comprises over 1,000 unique gestures, each
representing various words and phrases in English. It is composed of video recordings that
capture sign language gestures performed by multiple signers, allowing for variability in
signing styles. Each video clip typically lasts between 2 and 5 seconds, providing a clear view
of each gesture. This dataset is specifically designed to aid in the development of models for
recognizing and translating Indian Sign Language.
Table 1. Summary of Reviewed Literature

No. | Paper Title | Authors | Methods Used | Dataset | Findings | Applications
1 | Advanced Gesture Detection with DL [21] - 2022 | Aditya Kumar et al. | Custom CNN models | Gesture images | High real-time accuracy | Interactive gesture systems
2 | Hybrid Deep Learning for Gesture Recognition [20] - 2020 | Sneha Patel et al. | CLAHE, Deep Belief Networks (DBN) | Hand Sign-vs.-Normal dataset | 99.95% detection accuracy | Navigation support in AR/VR environments
3 | Ensemble Deep Learning Models [32] - 2021 | Rohan Nair et al. | Transfer learning, ensemble models | Custom sign language dataset | High sensitivity and accuracy rates | Gesture-based language interpretation
4 | CNN Ensemble Models for Gesture Recognition [26] - 2018 | Aditi Sen et al. | CNN ensemble, transfer learning | 5,900 hand gesture images | Achieved AUC of 96.32 and sensitivity of 98.11 | Human-computer interaction systems
5 | Depth-wise CNN [20] - 2021 | Prateek Sharma et al. | Depth-wise separable convolutions | 5,800 hand sign images | 96.25% accuracy, efficient processing | Robust detection in noisy settings
6 | Gesture Recognition via IoT [45] - 2021 | Aryan Bhatnagar et al. | CNN with IoT integration | Multi-class gesture images | 97.12% accuracy in gesture recognition | Gesture recognition in IoT-based smart home applications
7 | Multiclass Gesture Classification [23] - 2021 | Alhussein Khan Darica et al. | CNN, ensemble methods | Multi-class dataset | 95% in binary, 80% in multiclass | Gesture control in mobile and wearable devices
8 | Deep CNN for Feature Extraction [65] - 2020 | Nisha Agarwal et al. | CNN with residual networks | Gesture image dataset | 96.55% accuracy, high specificity | Accessible technology through gesture interpretation
9 | Residual CNN for Hand Sign Classification [34] - 2017 | Rehan Khan et al. | CNN and residual networks, CNN classification | Custom gesture dataset | Models achieved 97%-98% accuracy | Enhanced communication accuracy in sign language systems
10 | Real-time Gesture Detection using CNN [17] - 2016 | Anjali Mehta et al. | CNN, ResNet50 | Custom gesture dataset | 94.5% real-time accuracy | Precision interfaces for automotive systems
11 | Architecture Hand Sign [66] - 2012 | Manas Kumar et al. | Optimized lightweight CNN | Hand gesture images | 99.12% AUC | Low-resource, real-time gesture recognition
17 | MobileNet for Gesture Interfaces [42] - 2023 | Mana Khan et al. | MobileNet with augmented images | Hand gesture dataset | 94.5% low-latency accuracy | Real-time mobile gesture control
18 | Detection with CNN [23] - 2017 | Andrey Lorence et al. | CNN, Kolmogorov complexity (NCD) | Hand sign image dataset | Efficient for large datasets | Industrial automation with two-hand gestures
19 | Transfer Learning for Gesture [54] - 2022 | Irene Mohammad et al. | CNN with pre-trained layers | Gesture images | 96.75% accuracy, high sensitivity | Intelligent user interfaces
20 | AI for Hand Recognition | Aditya Bhalla et al. | Deep learning, advanced image processing | Hand gesture images | High AUC values | Healthcare diagnostics
21 | Hand Recognition [65] - 2023 | Sukh et al. | QCSA Network | Hand sign images | 94.53% accuracy, 0.89 AUC | Real-time hand sign recognition
22 | EfficientNet for Gesture Recognition [54] - 2019 | Mudasir Khan et al. | EfficientNetV2L, ResNet50 | Hand signs | 94.5% accuracy with EfficientNet | Real-time communication
23 | Hybrid CNN Models [84] - 2021 | Mohammad et al. | Ensemble classifiers | Hybrid CNN hand signs | 92.75% accuracy | Augmented reality gesture interfaces
24 | Comparative CNN Study [7] - 2016 | Shweta Rao et al. | Various CNN architectures | Hand sign images | Key insights on CNN methods | Sign language classification for accessibility
2.3 Proposed Solution
The features will be matched, mapped, and then classified using our trained and tested model, and the predicted words and alphabets are displayed as text on the screen. Since comparatively little research has been done on the most common ISL words and alphabets, particularly words, the core objective of the proposed system is to design a model for these terms. Careful testing and tuning of our approach, with experimentation on large datasets, is critical to significantly increase the accuracy and robustness of ISL recognition systems and thereby expand the communication opportunities available to the speech-impaired community. We emphasize the use of technologies such as MediaPipe for gesture tracking and colour-space transformations. These techniques enhance the quality of gesture extraction and feature engineering, contributing to the development of a robust and accurate ISL recognition system.
The system will use MediaPipe to track hand, finger, and body landmarks. MediaPipe is well suited to high-accuracy sign language gesture interpretation because it robustly captures and extracts hand gestures, which are at the core of sign language recognition. Because it supports real-time processing, it lends itself to live sign language communication, which is exactly what our system is aiming for. Our proposed method uses the user's device camera to detect sign language in real time. To improve accuracy, hand features are derived from the camera feed in the same way as the features stored in our database. Each feature is matched, mapped, and then classified using our trained and tested model, and the predicted words and alphabets appear as text on the screen. As a final step, the text predictions can also be converted to speech for greater accessibility. With proper experimentation and optimization, our approach is expected to markedly increase the accuracy and adaptability of ISL recognition systems, thereby increasing the communication opportunities available to the speech-impaired community.
2.4 Equations
Recall measures the proportion of relevant instances that are retrieved:
Recall = TP / (TP + FN)
Precision measures the proportion of retrieved instances that are relevant:
Precision = TP / (TP + FP)
The F-measure combines Precision and Recall, accounting for both false positives and false negatives:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
MAE (Mean Absolute Error) measures the average absolute error in predictions:
MAE = (1/n) Σ |y_i − ŷ_i|
Root Mean Squared Error (RMSE) is derived by taking the square root of the Mean Squared Error (MSE):
RMSE = sqrt( (1/n) Σ (y_i − ŷ_i)² )
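For concreteness, a small sketch of how these metrics could be computed with scikit-learn is shown below; the label vectors are placeholder values rather than results from this project.

```python
# Toy example: y_true / y_pred are made-up labels used only to illustrate the metric calls.
import numpy as np
from sklearn.metrics import (recall_score, precision_score, f1_score,
                             mean_absolute_error, mean_squared_error)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("F1 score: ", f1_score(y_true, y_pred))
print("MAE:      ", mean_absolute_error(y_true, y_pred))
print("RMSE:     ", np.sqrt(mean_squared_error(y_true, y_pred)))
```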
2.5 Tables
Table 2. Summary statistics of the datasets used for training and testing

Sr. No. | Dataset Name | Year | Type | Volume of Data | Key Features
1 | EgoHands Dataset | 2015 | Video | 48,000 frames | Annotated for hand segmentation in cluttered environments.
CHAPTER 3
Methodology
A deep learning model known as a Convolutional Neural Network (CNN) is utilized for hand sign recognition. CNNs are specifically designed to interpret visual data, making them highly effective for gesture recognition tasks. They automatically extract features from hand sign images without requiring manual feature engineering. By processing the input images through multiple layers of convolutional filters, pooling operations, and activation functions, CNNs learn and recognize patterns in the images, enabling them to accurately classify different hand signs.
CNN Algorithm
1. Start:
2. Initialize the architecture of the Convolutional Neural Network (CNN) and set the error
metric.
3. Input First Image: Load the initial hand sign image to be processed.
4. Perform Forward Propagation: Execute the following sequence:
a. Convolution Operation: Apply filters to the input image to extract important
features, producing a feature map that emphasizes specific patterns like edges
or textures.
b. Activation Function: Introduce non-linearity using an activation function,
typically the Rectified Linear Unit (ReLU), which sets all negative values in the
feature map to zero.
c. Pooling Layer: Implement pooling to down-sample the feature maps, reducing
their spatial dimensions while retaining significant features. Max pooling is
commonly used, which selects the maximum value within a specified window.
d. Fully Connected Layer: Flatten the output of the last pooling layer into a one-
dimensional vector. This vector is then fed into fully connected layers that
combine the features to make the final classification.
5. Calculate Error:
Determine the difference between the predicted class and the actual class using an appropriate loss function (e.g., cross-entropy loss), and backpropagate this error to update the network weights.
6. Process Remaining Images:
Continue the process for each subsequent image until all images in the dataset have
been processed.
7. Evaluate Stopping Criteria:
Check if the total error is below a predefined target. If it is, terminate the training; otherwise, return to step 3 and continue training.
8. Compute Performance Metrics:
a. Recall = recall_score(y_true, y_pred) Precision:
b. Precision = precision_score(y_true, y_pred)
c. Accuracy = accuracy_score(y_true, y_pred)
d. F1 Score = f1_score(y_true, y_pred)
e. Mean Absolute Error (MAE) = mean_absolute_error(y_true, y_pred)
f. Mean Squared Error (MSE) = mean_squared_error(y_true, y_pred)
g. Root Mean Squared Error (RMSE) = mean_squared_error(y_true, y_pred,
squared=False)
9. Testing Phase: Classify the images in the test dataset and compute the metrics established in step 8.
10. End
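As a concrete illustration of the forward-propagation structure in steps 4a to 4d, a minimal Keras sketch is given below; the 128x128 input size, filter counts, and number of classes are assumptions for illustration rather than the project's final architecture.

```python
# Minimal CNN sketch following the convolution -> ReLU -> pooling -> dense flow above.
from tensorflow.keras import layers, models

NUM_CLASSES = 36  # assumed label set (e.g. digits plus alphabets); adjust as needed

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU (steps 4a-4b)
    layers.MaxPooling2D((2, 2)),                    # max pooling (step 4c)
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                               # flatten for the dense layers (step 4d)
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Cross-entropy loss as in step 5; training then iterates over the dataset (step 6).
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```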
SVM Algorithm
1. Start.
2. Initialize the SVM model (svm_model) and import the supporting functions/libraries:
i. from sklearn.preprocessing import StandardScaler (for feature scaling)
ii. from sklearn.decomposition import PCA (for dimensionality reduction, if needed)
3. Train: Train the SVM to find the optimal hyperplane that maximizes the margin between different classes using svm_model.fit(X_train, y_train), where X_train = feature vectors and y_train = target labels.
4. Check for Convergence: Stop if convergence is reached.
5. Calculate Metrics: Compute Recall, Precision, Accuracy, and F1 Score using:
a. Recall = recall_score(y_true, y_pred)
b. Precision = precision_score(y_true, y_pred)
c. Accuracy = accuracy_score(y_true, y_pred)
d. F1 Score = f1_score(y_true, y_pred)
6. Test: Classify test images and compute metrics.
7. End: Conclude the SVM process and finalize results.
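A compact sketch of this SVM pipeline is given below; the feature matrix is filled with random placeholder values (in practice it would hold flattened hand images or landmark vectors), and the RBF kernel and C value are illustrative choices.

```python
# Illustrative SVM pipeline; X / y are random placeholders standing in for real features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((200, 63))        # placeholder: 200 samples of 63-D landmark features
y = rng.integers(0, 5, 200)      # placeholder: 5 gesture classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)       # feature scaling, as listed above
svm_model = SVC(kernel="rbf", C=1.0)         # kernel and C are illustrative choices
svm_model.fit(scaler.transform(X_train), y_train)

y_pred = svm_model.predict(scaler.transform(X_test))
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall:   ", recall_score(y_test, y_pred, average="macro"))
print("F1 score: ", f1_score(y_test, y_pred, average="macro"))
```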
The diagram depicts the process flow of the hand sign detection system. The first step is image acquisition, in which pictures or video frames are captured. The hand inside the frame is then located and followed using hand detection and tracking. Segmentation then separates the hand from the background for targeted examination. During preprocessing, the image is cleaned and prepared using techniques such as noise reduction and scaling. In parallel, the model is trained to recognize gestures using a labeled dataset. The essential features of the preprocessed image are then obtained by feature extraction and supplied to the recognition step, which categorizes the gesture. The detection procedure finishes by converting the identified gesture into text.
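To make this real-time flow concrete, a minimal capture-and-classify loop is sketched below; classify_frame is a hypothetical placeholder standing in for the preprocessing, feature extraction, and trained recognizer described above.

```python
# Skeleton of the real-time loop (acquisition -> preprocessing -> recognition -> text).
# classify_frame() is a hypothetical stand-in for the trained recognizer.
import cv2

def classify_frame(frame):
    """Placeholder: run preprocessing, feature extraction and the trained model here."""
    return "HELLO"  # dummy label for illustration

cap = cv2.VideoCapture(0)           # image acquisition from the device camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    label = classify_frame(frame)   # recognition step
    cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (0, 255, 0), 2)  # text output on screen
    cv2.imshow("ISL recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```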
CHAPTER 4
Results and Discussions
The evaluation results show differences in accuracy and training time from dataset to dataset. In this case, the best accuracy was observed with ASL at 92%, followed by ISL at 89%, BSL at 87%, and CSL at 85%. These differences are likely due to differing gesture complexity and dataset size. Regarding training time, the ASL data took 4 hours, ISL 3.5 hours, BSL 3 hours, and CSL the shortest duration of 2.8 hours. It was also observed that models with more features and larger datasets take somewhat longer to train but yield considerably higher accuracy overall. This illustrates the trade-off between dataset size, features, and training time on the one hand and accuracy on the other.
Fig. 2 Comparison of % Training Time
Fig. 4 Comparison of % Accuracy of Model
Fig. 1 (Dataset vs Accuracy): Bar chart showing dataset names on the x-axis and accuracy
percentages on the y-axis.
Fig. 2 (Training Time %): Line chart with models on the x- axis and training time percentage
on the y-axis.
Fig. 3 (Dataset and Features): Bar chart showing the features present in each dataset.
Fig. 4 (Model Accuracy %): Bar chart comparing the accuracy of different models on the same
datasets.
CHAPTER 5
Conclusions and Future Scope
This chapter summarizes the papers describing the various deep learning and machine learning approaches that have been taken to hand sign recognition. Most of the research uses CNN-based models, through ensemble methods, transfer learning, and hybrid approaches, and reports very high accuracy in hand sign classification. The datasets most commonly used are ISL, Kaggle, and other gesture image collections. Accuracy, AUC, and sensitivity are consistently high across the models, with some exceeding 99%. Applications extend to real-time hand sign recognition that facilitates easier communication for deaf people. Techniques such as data augmentation and IoT integration further improve the robustness of such systems.
Future work includes expanding the system to cover different sign languages, adding non-manual signals such as facial expressions, and supporting gesture customization. Improving real-time processing efficiency directly on mobile devices and incorporating the system into AR technology for interactive sign language displays would be promising directions for further development.
References
1. Goyal, K. (2023). Indian Sign Language Recognition Using Mediapipe Holistic. arXiv
preprint arXiv:2304.10256.
2. Mohammedali, A. H., Abbas, H. H., & Ismael, H. (2022). Real-time sign language
recognition system. International Journal of Health Sciences, 6, 10384-10407.
3. Kartik, et al. (2018). Real-time Indian sign language (ISL) recognition. In 2018 9th
international conference on computing, communication and networking technologies
(ICCCNT) (pp. 1-6). IEEE.
4. Chen, Yuxiao, et al. (2019). "Construct dynamic graphs for hand gesture recognition via spatial-temporal attention." arXiv preprint arXiv:1907.08871.
5. Kothadiya, D., Bhatt, C., Sapariya, K., Patel, K., Gil-González, A. B., & Corchado, J.
M. (2022). Deepsign: Sign language detection and recognition using deep learning.
Electronics, 11(11), 1780.
6. Kapoor, P., & Hema, N. (2021). Sign Language and Common Gesture Using CNN.
Electronics, 2278-3091.
7. Liu, Y., Nand, P., Hossain, M. A., Nguyen, M., & Yan, W. Q. (2022). Sign language
recognition from digital videos using feature pyramid network with detection
transformer. Springer, 21673–21685.
8. Sikder, N., Arif, A. S. M., Chowdhury, M. S., & Nahid, A. A. (2019). Human activity
recognition using multichannel convolutional neural network. In 2019 IEEE
International Conference on Big Data (pp. 1- 6). IEEE.
9. Sabeenian, R. S., Bharathwaj, S. S., & Aadhil, M. M. (2020). Sign language recognition
using deep learning and computer vision. ISSN: 1943-023X.
10. Srivastava, S., Gangwar, A., Mishra, R., & Singh, S. (2021). Sign Language
Recognition System using TensorFlow Object Detection API. Springer.
11. Boháček, Matyáš, and Marek Hrúz. (2022). "Sign pose-based transformer for word-level sign language recognition." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
12. Sarma, N., Talukdar, A. K., & Sarma, K. K. (2021). Real-Time Indian Sign Language
Recognition System using YOLOv3 Model. IEEE.
13. Eng-Jon, et al. (2014). Sign spotting using hierarchical sequential patterns with
temporal intervals. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 1-6).
14. Dilsizian, M., et al. (2014). A new framework for sign language recognition based on
3D handshape identification and linguistic modeling. In LREC.
15. Lai, Kenneth, and Svetlana N. Yanushkevich. (2018). "CNN+RNN depth and skeleton based dynamic hand gesture recognition." In 2018 24th International Conference on Pattern Recognition (ICPR). IEEE.
16. Masood, S., et al. (2018). Real-time sign language gesture (word) recognition from
video sequences using CNN and RNN. In Intelligent Engineering Informatics:
Proceedings of the 6th International Conference on FICTA (pp. 1-6). Springer
Singapore.
17. Li, D., et al. (2020). Transferring cross-domain knowledge for video sign language
recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 1-6).
18. Bantupalli, K., & Xie, Y. (2018). American sign language recognition using deep
learning and computer vision. In 2018 IEEE International Conference on Big Data (pp.
4896-4899). IEEE.
19. Alam, Mohammad Mahmudul, Mohammad Tariqul Islam, and S. M. Mahbubur Rahman. (2022). "Unified learning approach for egocentric hand gesture recognition and fingertip detection." Pattern Recognition, 121, 108200.
20. Suharjito, S., Gunawan, H., Thiracitta, N., & Nugroho, A. (2018). Sign language
recognition using modified convolutional neural network model. In Proceedings of the
IEEE International Conference on Image Processing (pp. 1-5).
21. Zuo, Ronglai, Fangyun Wei, and Brian Mak. (2023). "Natural Language-Assisted Sign Language Recognition." arXiv preprint arXiv:2303.12080.
22. Suharjito, S., Gunawan, H., Thiracitta, N., & Nugroho, A. (2018). Sign Language
Recognition Using Modified Convolutional Neural Network Model. In Proceedings of
the IEEE International Conference on Image Processing
23. Li, D., et al. (2020). Transferring cross-domain knowledge for video sign language
recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition
24. Koller, O., Zargaran, S., Ney, H., & Bowden, R. (2016). Deep Sign: Hybrid CNN-
HMM for continuous sign language recognition. Proceedings of the 30th British
Machine Vision Conference.
25. Chen, Yutong, et al. (2022). "Two-Stream Network for Sign Language Recognition and Translation." arXiv preprint arXiv:2211.01367.
26. Hu, Hezhen, et al. (2021). "SignBERT: Pre-training of hand-model-aware representation for sign language recognition." In Proceedings of the IEEE/CVF International Conference on Computer Vision.
27. Chen, Huizhou, et al. (2022). "Multi-scale attention 3D convolutional network for multimodal gesture recognition." Sensors, 22(6), 2405.