Sample Project Report

"Sign Language Translator – Using Gesture Segmentation and CNN to classify Sign Language"

A project report submitted for the degree of
BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering

Submitted by
Debaleen Das Spandan (12500117078)
Devashish Roy (12500117076)
Acquib Javed (12500117112)

Under the guidance of
Mr. Prasenjit Maji
Asst. Professor
Department of CSE
Durgapur, W.B.

Department of Computer Science and Engineering
Durgapur, W.B.
CERTIFICATE OF APPROVAL
The project entitled "Sign Language Translator – Using Gesture Segmentation and CNN to classify Sign Language", submitted by Debaleen Das Spandan (12500117078), Devashish Roy (12500117076) and Acquib Javed (12500117112) under the guidance of Asst. Professor Mr. Prasenjit Maji, is hereby approved as a creditable study of an engineering subject, sufficient to warrant its acceptance as a prerequisite to obtaining the degree for which it has been submitted. It is understood that by this approval the undersigned do not necessarily endorse or approve any statement made, opinion expressed or conclusion drawn therein, but approve the project only for the purpose for which it is submitted.
______________________________ ______________________________
Mr. Prasenjit Maji Prof. Sk. Abdul Rahim
Asst. Prof. H.O.D.
Dept of CSE Dept. of CSE
Department of Computer Science and Engineering
Durgapur, W.B.
UNDERTAKING
We, Debaleen Das Spandan (12500117078), Devashish Roy (12500117076) and Acquib Javed (12500117112), B.Tech, 7th Semester (Computer Science and Engineering), hereby declare that our project entitled "Sign Language Translator – Using Gesture Segmentation and CNN to classify Sign Language" is our own contribution. The work and ideas of other people utilized in this report have been properly acknowledged and cited in the references. We undertake total responsibility if traces of plagiarism are found at any later stage.
__________________________
Debaleen Das Spandan
12500117078
__________________________
Devashish Roy
12500117076
__________________________
Acquib Javed
12500117112
ACKNOWLEDGEMENT
We would like to thank our respected HOD, Prof. Sk. Abdul Rahim, for giving us the opportunity to work on the topic of our choice, "Sign Language Translator – Using Gesture Segmentation and CNN to classify Sign Language". We would also like to thank our project guide, Asst. Prof. Mr. Prasenjit Maji, whose valuable guidance has helped us complete this project. His suggestions and instructions have been a major contribution towards the completion of this project.
We would also like to express our gratitude towards our friends and everyone who helped in every little way by giving suggestions. We are also thankful to the college for providing the necessary resources for the project.
Table of Contents
ABSTRACT
1. INTRODUCTION
REFERENCES
Table of Figures
List of Tables
List of Abbreviations
Abbreviation Full-form
ASL American Sign Language
CNN Convolutional Neural Network
Conv. Convolutional Layer
DFD Data Flow Diagram
HCI Human Computer Interaction
ISL Indian Sign Language
SDK Software Development Kit
SLT Sign Language Translation
STMC Spatial-Temporal Multi-Cue Network
UI User Interface
ABSTRACT
Technology is changing the world rapidly. Research in Artificial Intelligence and Computer Vision has addressed and solved problems that were seen as science fiction a few decades ago. This project aims to develop a system that can translate sign language to text. In order to fully understand this discussion, one must understand a few basic concepts about sign language. First, sign languages are not international; many countries have their own unique sign languages. Secondly, signing is a two-way process involving both receptive skills and expressive skills. Receptive skills refer to reading signs, and expressive skills refer to rendering or making signs. More progress has been made in computers rendering signs than in reading them. This project focuses on the latter, i.e., reading signs and then translating them to text. Through this project we propose an architecture for a system that will be able to translate sign language to a textual representation. The proposed system is modular enough to adapt to changes in its parts, so that it can work with different sign languages and different procedures to segment and classify them in order to translate them to text.
1. INTRODUCTION
Image classification can be formalised as a mapping f(x): x → Δ, where f(·) is a function assigning a pixel vector x to a single class in the set of classes Δ.
In order to classify a set of data into different classes or categories, the relationship between the data and the classes into which they are classified must be well understood. Classification techniques were originally developed out of research in the field of pattern recognition.
Important aspects of accurate classification are:
Learning Techniques
Feature sets
Supervised Learning: A guided learning process designed to form a mapping from one
set of variables (data) to another set of variables (information classes).
Unsupervised Learning: An unguided learning process involving exploration of the data
space to discover scientific laws underlying the data distribution.
Features are attributes of the data elements on the basis of which the elements are assigned to various classes. They can be qualitative or quantitative. Some examples of features are the absence or presence of an object, the color profile, and information collected from sensors, among many others.
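As a toy illustration only (not part of the project; the feature values and class labels below are invented), the following sketch shows how a supervised learner forms a mapping from hand-crafted feature vectors to information classes:

```python
# Toy supervised classification: invented feature vectors
# [object present (0/1), mean colour value, sensor reading] are mapped
# to invented class labels by a k-nearest-neighbour classifier.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 0.80, 3.1],
           [0, 0.20, 1.2],
           [1, 0.75, 2.9],
           [0, 0.30, 1.0]]
y_train = ["class_A", "class_B", "class_A", "class_B"]

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)                 # learn the data -> class mapping
print(clf.predict([[1, 0.70, 3.0]]))      # prints ['class_A']
```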
A Convolutional Neural Network (CNN) is a type of deep learning model for processing data that has a grid pattern, such as images. It is inspired by the organization of the animal visual cortex [1], [2] and is designed to automatically and adaptively learn spatial hierarchies of features, from low-level to high-level patterns. A CNN is a mathematical construct that is typically composed of three types of layers (or building blocks): convolution, pooling, and fully connected layers. The first two, convolution and pooling layers, perform feature extraction, whereas the third, a fully connected layer, maps the extracted features into the final output, such as a classification. A convolution layer plays a key role in a CNN; it is composed of a stack of mathematical operations, such as convolution, a specialized type of linear operation [3]. Digital images are stored as arrays of numbers ranging from 0 to 255, as shown in Figure 1. A small array of parameters called a kernel, which is an optimizable feature extractor, is applied at each image position. This makes the CNN highly efficient for image classification. The process of optimizing the parameters is called training, and it is performed so as to minimize the difference between the outputs and the ground-truth labels, using optimization algorithms such as gradient descent with backpropagation, among many others. Figure 2 shows an overview of a convolutional neural network (CNN) architecture and the training process.
Figure 2 A typical CNN architecture
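As a minimal sketch of this training process (the tiny model, shapes and learning rate below are placeholders, not the model used in this project), one gradient-descent update that reduces the gap between outputs and ground-truth labels looks like this:

```python
# One gradient-descent step: compute the loss between predictions and
# ground-truth labels, backpropagate, and update the kernel parameters.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu"),    # convolution
    tf.keras.layers.MaxPooling2D((2, 2)),                    # pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(29, activation="softmax"),         # fully connected
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)      # gradient descent
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

images = tf.random.uniform((4, 32, 32, 1))                   # dummy batch
labels = tf.constant([0, 1, 2, 3])                           # dummy ground truth

with tf.GradientTape() as tape:
    predictions = model(images, training=True)
    loss = loss_fn(labels, predictions)                      # output vs. ground truth
grads = tape.gradient(loss, model.trainable_variables)       # backpropagation
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print(float(loss))
```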
Figure 3 shows the working of a convolution operation. A kernel is applied across the input
tensor, and an element-wise product between each element of the kernel and the input tensor
is calculated at each location and summed to obtain the output value in the corresponding
position of the output tensor, called a feature map.
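A minimal sketch of this operation (with invented values) is shown below: the kernel slides over the input, and at each position the element-wise product with the overlapped patch is summed into the feature map.

```python
# Plain-NumPy convolution (no padding, stride 1) over an invented 4x4 input.
import numpy as np

image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)      # small optimizable filter

kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1
out_w = image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise product, then sum

print(feature_map)   # the resulting 3x3 feature map
```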
1.4 Project Overview
Gesture recognition is the process by which gestures formed by a user are interpreted by the computer, or serve as elements of a sign language to convey meaning. Hand gestures provide a significant means of communication in daily human interaction and have been widely explored in Human-Computer Interaction (HCI) studies. In our daily life, hand gestures play an important part in human communication; they provide the most important means of non-verbal interaction among people, even more so for people who use sign language as their regular mode of communication. Gestures provide an expressive means of interaction among people and include hand postures and dynamic hand movements. A static finger configuration without hand movement is called a hand posture, whereas a dynamic hand movement consists of a hand gesture with or without finger motion. The gesture segmentation problem is introduced as the first step towards visual gesture recognition, i.e., the detection, analysis and recognition of gestures from sequences of real images. Sign language translation can be achieved via classification of these gestures: with a predefined set of gestures and their corresponding labels, a CNN can be trained to classify each gesture to its corresponding label.
This project aims to develop a system that can translate sign language to text. In order to fully understand this discussion, one must understand a few basic concepts about sign language. First, sign languages are not international; many countries have their own unique sign languages. Secondly, signing is a two-way process involving both receptive skills and expressive skills. Receptive skills refer to reading signs, and expressive skills refer to rendering or making signs. More progress has been made in computers rendering signs than in reading them. This project focuses on the latter, i.e., reading signs and then translating them to text. Through this project we propose an architecture for a system that will be able to translate sign language to a textual representation.
2. LITERATURE REVIEW
The development of a sign language translator system is closely related to the advancement of computer technologies and their applications in the fields of sign language and image recognition. In this chapter, some studies related to the proposed sign language translator architecture are discussed.
2.1 Sign Language Translation
Sign language translation and real-time classification of sign language have presented numerous difficulties. P. Escudeiro et al., in their work, created a bidirectional model that allows deaf and hard of hearing people to improve their integration into mainstream education [4]. Madhuri Y. et al. created a mobile solution to translate sign language [5]. Yin K. et al. created a novel state-of-the-art transformer model for video-to-text translation [6]. Pugeault N. et al. created an interactive UI for sign language translation [7]. Badhe P. et al. came up with an algorithm for translating Indian Sign Language to an English textual representation [8]. Different datasets and technologies were used in these studies: Yin K. et al. and Badhe P. et al. used video datasets, P. Escudeiro et al. used a Portuguese Sign Language dataset, and both P. Escudeiro et al. and Pugeault N. et al. used Microsoft Kinect technology. An overview of these studies is provided in Table 1.
Table 1 Overview of studies on sign language translation
Article | Dataset Used | Technology Used | Solution Provided | Nature of Solution
We have chosen the American Sign Language (ASL) alphabet dataset as our training data. The training set contains 87,000 images of 200x200 pixels. There are 29 classes, of which 26 are for the letters A-Z and 3 are for SPACE, DELETE and NOTHING. These 3 classes are very helpful in real-time applications and classification. The test set contains a mere 29 images, to encourage the use of real-world test images. The dataset was collected from Kaggle [9]. The images in this dataset were created by taking multiple pictures of the various classes, with variations in terms of the person signing the alphabet as well as the background and lighting conditions. It is important to note that the letters J and Z are motion letters in ASL; however, since this dataset was meant for image classification, static images capturing different frames of the motion are used for these letters. Figure 4 shows the diagrammatic representation of ASL and Figure 5 shows a sample of images from the dataset.
Figure 4 ASL Diagrammatic Representation
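As a hedged sketch of how such a directory-per-class dataset could be loaded (the local path, batch size and validation split below are assumptions, not details given in this report):

```python
# Load the Kaggle ASL alphabet training images, one folder per class
# (A-Z, SPACE, DELETE, NOTHING), into a TensorFlow dataset.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_alphabet_train/",       # hypothetical local path to the Kaggle data
    labels="inferred",
    label_mode="int",
    image_size=(200, 200),       # the dataset images are 200x200 pixels
    batch_size=64,               # assumed batch size
    validation_split=0.1,        # assumed hold-out fraction
    subset="training",
    seed=42,
)
print(train_ds.class_names)      # expect 29 class names
```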
3.2 First Approach: Using simple CNN architecture
At first, we converted the RGB images to grayscale, which changed the shape of the images from 200x200x3 to 200x200. We then resized the images to 32x32 to reduce the training time required. We also applied a Gaussian kernel filter to the images to remove Gaussian noise. Then we iterated over the images from the NOTHING class to compute a weighted average, accumulate it and update the background model, in order to obtain a filter for separating the background and the hand in the training images. Figure 6 shows the preprocessed image obtained from the above-mentioned method.
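A sketch of this preprocessing with OpenCV is shown below; the blur kernel size, accumulation weight, threshold and the `asl_alphabet_train/NOTHING` path are assumptions, and only the overall steps follow the description above:

```python
# Grayscale conversion, resizing to 32x32, Gaussian blurring, and a running
# weighted average over NOTHING-class images to build a background model.
import glob
import cv2
import numpy as np

def preprocess(img_bgr):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)   # 200x200x3 -> 200x200
    small = cv2.resize(gray, (32, 32))                 # reduce training time
    return cv2.GaussianBlur(small, (5, 5), 0)          # remove Gaussian noise

background = None
for path in glob.glob("asl_alphabet_train/NOTHING/*.jpg"):  # hypothetical path
    frame = preprocess(cv2.imread(path)).astype("float32")
    if background is None:
        background = frame.copy()
    else:
        cv2.accumulateWeighted(frame, background, 0.5)  # update background model

def segment_hand(frame, threshold=25):
    # Differencing against the accumulated background separates hand from background.
    diff = cv2.absdiff(background.astype("uint8"), frame.astype("uint8"))
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return mask
```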
We used a simple convolutional network as our model. The model was composed of 4 convolutional (Conv) layers, each followed by a max-pooling layer. The output of this Conv-max-pool stack was then fed into a batch normalization layer, after which a dropout of 50% was applied. The output of the dropout layer was passed to a dense network to obtain the classification.
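The sketch below reconstructs this architecture in Keras; the filter counts, kernel sizes and dense width are assumptions, while the layer ordering (4x Conv + max-pool, batch normalization, 50% dropout, dense classifier) follows the description above:

```python
# Reconstruction (hyper-parameters assumed) of the first-approach CNN:
# four Conv + MaxPool blocks, BatchNormalization, 50% dropout, dense head.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(29, activation="softmax"),   # 29 ASL classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```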
In this approach, our model reached a training accuracy of 99.14% and a testing accuracy of 97.52%. The model performed well on static images. However, during the live prediction stage, the model performed poorly, which led us to take up another approach. Table 2 shows the accuracy report of our first approach. Figures 7 and 8 show the accuracy graph and a sample of static predictions, respectively.
4. PROPOSED METHOD
We propose an architecture which can capture sign language through a camera and translate it to a textual representation in real time. From Figure 9, we can see that the system is made up of the following components:
Camera
Training Dataset
Trained Model
Gesture Segments
The model is trained using the training data. Brief descriptions of each component are given below:
Camera: The camera is an external entity which will be used as an input device for the system. It will capture the live feed and pass it frame by frame to the system.
Training Dataset: The training dataset is the data used to train the Convolutional Neural Network.
Trained Model: The trained CNN will take the live feed as input to classify each gesture and predict the recognized text. This is a vital component of the system.
Gesture Segments: The live feed from the camera will be transformed into gesture segments to separate the background from the hand gestures made. These gesture segments will then be cropped and scaled accordingly, and passed to the trained model as input images.
Figures 10 and 11 show the level-0 Data Flow Diagram (DFD) and the level-1 DFD of the system, respectively. From these two figures, we can see that the training data is used to train the model. The live feed from the camera goes through a gesture segmentation process, which provides the segmented gestures; these in turn pass through a cropping and rescaling process to provide the input data for the trained model. The trained model then classifies the gestures and predicts the recognized text, which is the output of the system.
Figure 9 Proposed System Architecture
Figure 11 Level-1 Data Flow Diagram of the proposed system
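A hedged sketch of this pipeline is shown below; the region-of-interest coordinates, the saved-model path and the `segment_gesture` helper are placeholders for whatever segmentation procedure is plugged in, not details fixed by this report:

```python
# Live-feed loop: capture a frame, segment the gesture, crop and rescale it,
# and let the trained CNN predict the recognized text.
import cv2
import numpy as np
import tensorflow as tf

CLASS_NAMES = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["SPACE", "DELETE", "NOTHING"]
model = tf.keras.models.load_model("sign_language_cnn.h5")   # hypothetical saved model

def segment_gesture(frame):
    # Placeholder segmentation: grayscale + blur; a real system would apply
    # background subtraction here to isolate the hand.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)

cap = cv2.VideoCapture(0)                                    # camera: external input entity
while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:300, 100:300]                            # assumed region of interest
    segment = segment_gesture(roi)
    inp = cv2.resize(segment, (32, 32)).astype("float32") / 255.0
    probs = model.predict(inp.reshape(1, 32, 32, 1), verbose=0)
    print("Recognized:", CLASS_NAMES[int(np.argmax(probs))]) # recognized text output
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```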
REFERENCES
[1] D. Hubel and T. Wiesel, “Receptive fields and functional architecture of monkey striate
cortex.,” The Journal of Physiology, vol. 195, no. 1, pp. 215-243, 1968.
[4] P. Escudeiro, “Virtual Sign – A Real Time Bidirectional Translator of Portuguese Sign
Language,” in Procedia Computer Science, 2015.
[5] Y. Madhuri, G. Anitha and M. Anburajan, “Vision-based sign language translation
device,” in International Conference on Information Communication and Embedded
Systems (ICICES), Chennai, India, 2013.
[6] K. Yin and J. Read, "Better Sign Language Translation with STMC-Transformer," arXiv preprint arXiv:2004.00588, 2020.
[7] N. Pugeault and R. Bowden, “Spelling it out: Real-time ASL fingerspelling recognition,”
in 2011 IEEE International Conference on Computer Vision Workshops (ICCV
Workshops), 2011.
[8] P. C. Badhe and V. Kulkarni, “Indian sign language translator using gesture recognition
algorithm,” in 2015 IEEE International Conference on Computer Graphics, Vision and
Information Security (CGVIS), 2015.