
A Project Report On

"Sign Language Translator – Using Gesture Segmentation and CNN to classify Sign Language"

SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE AWARD OF
THE DEGREE OF

BACHELOR OF TECHNOLOGY
in

Computer Science and Engineering

Submitted by

Debaleen Das Spandan (12500117078)

Devashish Roy (12500117076)

Acquib Javed (12500117112)

Under the esteemed guidance of

Mr. Prasenjit Maji

Asst. Professor

Department of CSE

Department of Computer Science and Engineering

Bengal College of Engineering and Technology

Durgapur, W.B.

CERTIFICATE OF APPROVAL

The project entitled "Sign Language Translator – Using Gesture Segmentation and CNN to
classify Sign language." submitted by Debaleen Das Spandan (12500117078), Devashish Roy
(12500117076) and Acquib Javed (12500117112), under the guidance of Mr. Prasenjit Maji,
Asst. Professor, is hereby approved as a creditable study of an engineering subject, sufficient to
warrant its acceptance as a prerequisite to obtaining the degree for which it has been submitted.
It is understood that by this approval the undersigned do not necessarily endorse or approve any
statement made, opinion expressed or conclusion drawn therein, but approve the project only
for the purpose for which it is submitted.

______________________________ ______________________________
Mr. Prasenjit Maji Prof. Sk. Abdul Rahim
Asst. Prof. H.O.D.
Dept of CSE Dept. of CSE


UNDERTAKING

We, Debaleen Das Spandan (12500117078), Devashish Roy (12500117076) and Acquib Javed
(12500117112), B. Tech, 7th Semester (Computer Science and Engineering), hereby declare
that our project entitled "Sign Language Translator – Using Gesture Segmentation and CNN
to classify Sign language." is our own contribution. The work and ideas of other people which
are utilized in this report have been properly acknowledged and mentioned in the references.
We undertake total responsibility if traces of plagiarism are found at any later stage.

__________________________

Debaleen Das Spandan

12500117078

__________________________

Devashish Roy

12500117076

__________________________

Acquib Javed

12500117112

ACKNOWLEDGEMENT

We would like to thank our respected HOD, Prof. Sk. Abdul Rahim, for giving us the opportunity
to work on the topic of our choice, "Sign Language Translator – Using Gesture Segmentation
and CNN to classify Sign language.". We would especially like to thank our project guide,
Asst. Prof. Mr. Prasenjit Maji, whose valuable guidance has helped us to complete this project.
His suggestions and instructions have served as the major contribution towards the completion
of this project.

We would also like to express gratitude towards our friends and everyone who helped in every
little way by giving suggestions. We are also thankful to the college for providing the necessary
resources for the project.

Table of Contents

Table of Figures
List of Tables
List of Abbreviations
ABSTRACT
1. INTRODUCTION
    1.1 Image Classification
    1.2 Convolutional Neural Network (CNN)
    1.4 Project Overview
    1.5 Project Objective
2. LITERATURE REVIEW
    2.1 Sign Language Translation
3. REPORT ON PRESENT INVESTIGATION
    3.1 Data Description
    3.2 First Approach: Using simple CNN architecture
        3.2.1 Data Preprocessing
        3.2.2 Training using Simple CNN
        3.2.3 Outcome of First Approach
4. PROPOSED METHOD
REFERENCES

Table of Figures

Figure 1 A computer sees an image as an array of numbers
Figure 2 A typical CNN architecture
Figure 3 A convolution operation
Figure 4 ASL Diagrammatic Representation
Figure 5 Sample from Dataset
Figure 6 Preprocessed Images of First Approach
Figure 7 Static prediction using model of First Approach
Figure 8 Accuracy graph for First Approach
Figure 9 Proposed System Architecture
Figure 10 0-Level Data Flow Diagram of the proposed system
Figure 11 1-level Data Flow Diagram of the proposed system

List of Tables

Table 1 Studies on Sign Language Translation
Table 2 Accuracy report of First Approach

List of Abbreviations

Abbreviation Full-form
ASL American Sign Language
CNN Convolutional Neural Network
Conv. Convolutional Layer
DFD Data Flow Diagram
HCI Human Computer Interaction
ISL Indian Sign Language
SDK Software Development Kit
SLT Sign Language Translation
STMC Spatial-Temporal Multi-Cue Network
UI User Interface

ABSTRACT
Technology is changing the world rapidly. Research in Artificial Intelligence and Computer
Vision has addressed and solved problems that were seen as science fiction a few decades ago.
This project aims to develop a system that can translate sign language to text. In order to fully
understand this discussion, one must understand a few basic concepts about sign language.
First, sign languages are not international; many countries have their own unique sign
languages. Secondly, signing is a two-way process: it involves both receptive skills and
expressive skills. Receptive skills refer to reading the signs and expressive skills refer to
rendering or making the signs. More progress has been made in terms of computers rendering
signs than reading them. This project focuses on the latter, i.e., the reading of signs and then
translating them to text. Through this project we want to propose an architecture for a system
that will be able to translate sign language to a textual representation. The proposed system is
modular enough to adapt to changes in its parts, so that it can work with different sign
languages and different procedures to segment and classify them in order to translate them to
text.

1. INTRODUCTION

1.1 Image Classification

Image Classification is a fundamental task that attempts to comprehend an entire image as a


whole. The goal is to classify the image by assigning it to a specific label. Typically, Image
Classification refers to images in which only one object appears and is analyzed. In contrast,
object detection involves both classification and localization tasks, and is used to analyze more
realistic cases in which multiple objects may exist in an image. In other words, image
classification can be defined as a process of assigning pixels in the image to categories or
classes of interest. It is essentially a process of mapping numbers to symbols. The function
given below expresses the concept of image classification.

f(x): x → Δ, where

 x ∈ R^n (R^n = the set of real n-vectors),
 Δ = {c1, c2, c3, ..., cL},
 n = number of bands, and
 L = number of classes.

f(.) is a function assigning a pixel vector x to a single class in the set of classes Δ.
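
To make this mapping concrete, the following is a small illustrative Python sketch (not part of the project) of a classifier f(.) that assigns a feature vector x to one of L classes; the class names, vectors and the nearest-mean rule are purely hypothetical.

import numpy as np

classes = ["A", "B", "C"]                        # Delta = {c1, c2, c3}, so L = 3
class_means = np.array([[0.1, 0.2],              # representative vector for class "A"
                        [0.5, 0.6],              # representative vector for class "B"
                        [0.9, 0.8]])             # representative vector for class "C"

def f(x):
    """Assign the pixel/feature vector x (here n = 2) to the nearest class mean."""
    distances = np.linalg.norm(class_means - x, axis=1)
    return classes[int(np.argmin(distances))]

print(f(np.array([0.45, 0.55])))                 # prints "B"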

In order to classify a set of data into different classes or categories, the relationship between
the data and the classes into which they are classified must be well understood. Classification
techniques were originally developed out of research in the field of Pattern Recognition.
Important aspects of accurate classification are:

 Learning Techniques
 Feature sets

Learning Techniques can be further classified into

 Supervised Learning: A guided learning process designed to form a mapping from one
set of variables (data) to another set of variables (information classes).
 Unsupervised Learning: An unguided learning process involving exploration of the data
space to discover scientific laws underlying the data distribution.

Features are attributes of the data elements based on which the elements are assigned various
classes. They can be qualitative or quantitative. Some examples of features are the absence or
presence of an object, color profile, and information collected from sensors, among many more.

1.2 Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a type of deep learning model for processing data
that has a grid pattern, such as images. It is inspired by the organization of the animal visual
cortex [1], [2] and designed to automatically and adaptively learn spatial hierarchies of features,
from low-level to high-level patterns. A CNN is a mathematical construct that is typically
composed of three types of layers (or building blocks): convolution, pooling, and fully connected
layers. The first two, convolution and pooling layers, perform feature extraction, whereas the
third, a fully connected layer, maps the extracted features into the final output, such as a
classification. A convolution layer plays a key role in a CNN; it is composed of a stack of
mathematical operations, such as convolution, a specialized type of linear operation [3]. Digital
images are stored as arrays of numbers ranging from 0 to 255, as shown in Figure 1. A small
array of parameters called a kernel, which is an optimizable feature extractor, is applied at each
image position. This makes CNNs highly efficient for image classification. The process of
optimizing the parameters is called training, and it is performed so as to minimize the difference
between the outputs and the ground-truth labels, using optimization algorithms such as gradient
descent with backpropagation, among many others. Figure 2 shows an overview of a
convolutional neural network (CNN) architecture and the training process.

Figure 1 A computer sees an image as an array of numbers.


The matrix on the right contains numbers between 0 and 255, each of which corresponds to the
pixel brightness in the left image. Both are overlaid in the middle image.

Figure 2 A typical CNN architecture

Figure 3 shows the working of a convolution operation. A kernel is applied across the input
tensor, and an element-wise product between each element of the kernel and the input tensor
is calculated at each location and summed to obtain the output value in the corresponding
position of the output tensor, called a feature map.

Figure 3 A convolution operation.
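
The following is a minimal NumPy sketch of this operation, included only for illustration: the kernel slides over a toy input, and at each position the element-wise products are summed to give one value of the output feature map. The input and kernel values are arbitrary.

import numpy as np

def conv2d(image, kernel):
    """Valid (no padding, stride 1) convolution as used for CNN feature extraction."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)         # toy 5x5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])             # toy 2x2 kernel
print(conv2d(image, kernel))                             # 4x4 feature map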

1.4 Project Overview

Gesture Recognition is the process by which the gestures formed by a user are recognized by a
computer; such gestures are also the elements of sign languages used to convey meaning. Hand
gestures have provided a significant means of communication in daily human interaction and
have been widely explored in Human-Computer Interaction (HCI) studies. In our daily life,
hand gestures play an important part in human communication. They provide the most
important means of non-verbal interaction among people, even more so for people who use
sign language as their regular mode of communication. Gestures provide an expressive means
of interaction among people and include hand postures and dynamic hand movements. A static
finger configuration without hand movement is called a hand posture, whereas a dynamic hand
movement consists of a hand gesture with or without finger motion. The gesture segmentation
problem is introduced as the first step towards visual gesture recognition, i.e., the detection,
analysis and recognition of gestures from sequences of real images. Sign language translation
can be achieved via classification of these gestures. With a predefined set of gestures and their
corresponding labels, a CNN can be trained to classify each gesture to its corresponding label.

1.5 Project Objective

This project aims to develop a system that can translate Sign Language to Text. In order to
fully understand this discussion, one must understand a few basic concepts about sign language.
At first, we need to understand that Sign languages are not international. Many countries have
their unique sign languages. Secondly, signing is a two-way process. It involves both receptive
skills and expressive skills. Receptive skills refer to reading the signs and expressive skills refer
to rendering or making the signs. More progress has been made in terms of computers rendering
sign than reading them. This project focuses on the former i.e., the reading of signs and then
translating them to text. Through this project we want to propose an architecture for a system
that will be able to translate sign language to textual representation.

2. LITERATURE REVIEW
The development of a sign language translator system is very closely related to the
advancement of computer technologies and their applications in the field of sign languages and
image recognition. In this chapter, some studies related to proposed sign language translator
architecture will be discussed.

2.1 Sign Language Translation

Sign Language Translation and real-time classification of sign language have presented
numerous difficulties. P. Escudeiro et al. created a bidirectional model that allows deaf and
hard of hearing people to improve their integration into mainstream education [4]. Madhuri Y.
et al. created a mobile solution to translate sign language [5]. Yin K. et al. created a novel
state-of-the-art Transformer model for video-to-text translation [6]. Pugeault N. et al. created
an interactive UI for sign language translation [7]. Badhe P. et al. came up with an algorithm
for translating Indian Sign Language to an English textual representation [8]. Different datasets
and technologies were used in these studies. Yin K. et al. and Badhe P. et al. used video
datasets. P. Escudeiro et al. used a Portuguese Sign Language dataset. P. Escudeiro et al. and
Pugeault N. et al. both used Microsoft Kinect technology in their studies. An overview of these
studies is provided in Table 1.
Table 1 Studies on Sign Language Translation

Article: Virtual Sign Translator [4]
  Dataset used: Portuguese Writing (LEP) and Portuguese Sign Language (LGP)
  Technology used: Microsoft Kinect SDK and Machine Learning
  Solution provided: VS Model
  Nature of solution: A model that allows deaf and hard of hearing people to improve their
  integration into mainstream education, and a virtual reality environment to translate
  Portuguese Sign Language to Portuguese text.

Article: Vision-Based Sign Language Translation Device [5]
  Dataset used: NA
  Technology used: LABVIEW software
  Solution provided: Mobile Vision-Based Sign Language Translation Device
  Nature of solution: Application software.

Article: Better Sign Language Translation with STMC-Transformer [6]
  Dataset used: PHOENIX-Weather 2014T (Camgoz et al., 2018) and ASLG-PC12 (Othman and
  Jemni, 2012)
  Technology used: Weight tying, transfer learning, and ensemble learning in SLT
  Solution provided: A novel STMC-Transformer model for video-to-text translation; the first
  successful application of Transformers to SLT.
  Nature of solution: Transformer model.

Article: Spelling it out: Real-time ASL fingerspelling recognition [7]
  Dataset used: American Sign Language
  Technology used: Microsoft Kinect, OpenNI+NITE framework
  Solution provided: Interactive hand shape recognition network and user interface for American
  Sign Language (ASL) finger-spelling.
  Nature of solution: Interactive UI for Sign Language Translation.

Article: Indian sign language translator using gesture recognition algorithm [8]
  Dataset used: Self-created Indian Sign Language dataset
  Technology used: Video processing and combinational algorithm
  Solution provided: An algorithm that will translate ISL into English.
  Nature of solution: A translator system.

3. REPORT ON PRESENT INVESTIGATION

3.1 Data Description

We have chosen an American Sign Language (ASL) dataset as our training data. The training
set contains 87,000 images of 200x200 pixels. There are 29 classes, of which 26 are for the
letters A-Z and 3 are for SPACE, DELETE and NOTHING. These three classes are very helpful
in real-time applications and classification. The test set contains a mere 29 images, to encourage
the use of real-world test images. The dataset was collected from Kaggle [9]. The images in
this dataset were created by taking multiple pictures of each class, with variations in the person
signing the alphabet as well as the background and lighting conditions. It is important to note
that the letters J and Z are motion letters in ASL; however, since this dataset is meant for image
classification, static images of different frames of the motion are used for these letters. Figure 4
shows the diagrammatic representation of ASL and Figure 5 shows a sample of images from
the dataset.

Figure 4 ASL Diagrammatic Representation

Figure 5 Sample from Dataset
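
As an illustration of how such a directory-per-class dataset can be loaded, the following is a minimal TensorFlow/Keras sketch. The local path "asl_alphabet_train/" is an assumption and must point to a downloaded copy of the Kaggle dataset, with one sub-folder per class.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "asl_alphabet_train/",       # assumed local path: one sub-folder per class (A-Z, SPACE, DELETE, NOTHING)
    labels="inferred",
    label_mode="int",
    image_size=(200, 200),       # images in this dataset are 200x200 pixels
    batch_size=32,
)
print(train_ds.class_names)      # expected to list the 29 class names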

3.2 First Approach: Using simple CNN architecture

3.2.1 Data Preprocessing

At first, we converted the RGB images to grayscale, which changed the shape of the images
from 200x200x3 to 200x200. We then resized the images to 32x32 to reduce the training time
required. We also applied a Gaussian kernel filter to the images to remove Gaussian noise.
Finally, we iterated over the images from the NOTHING class to compute a running weighted
average, accumulating it to update the background model, in order to obtain a filter for
separating the hand from the background in the training images. Figure 6 shows a preprocessed
image obtained from the above-mentioned method.
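
A sketch of these preprocessing steps using OpenCV is given below; the file names and the accumulation weight (0.5) are illustrative assumptions, not values taken from the report.

import cv2
import numpy as np

def preprocess(path):
    img = cv2.imread(path)                           # 200x200x3 BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # 200x200 grayscale
    small = cv2.resize(gray, (32, 32))               # shrink to reduce training time
    return cv2.GaussianBlur(small, (5, 5), 0)        # remove Gaussian noise

# Build a running-average background model from NOTHING-class images (paths are hypothetical).
background = None
for path in ["nothing_001.jpg", "nothing_002.jpg"]:
    frame = preprocess(path).astype("float")
    if background is None:
        background = frame.copy()
    else:
        cv2.accumulateWeighted(frame, background, 0.5)   # weighted average accumulation

# Separate the hand from the background by differencing and thresholding a training image.
hand = preprocess("A_001.jpg")                           # hypothetical sample image
diff = cv2.absdiff(hand, cv2.convertScaleAbs(background))
_, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)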

3.2.2 Training using Simple CNN

We used a simple convolutional network as our model. The model was composed of four
Convolutional layers (Conv), each followed by a Max-Pool layer. The output of this Conv-Max-
Pool stack was then fed into a Batch Normalization layer, after which a dropout of 50% was
applied. The output of the dropout layer was passed to a dense network to obtain the
classification.
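
A hedged Keras sketch of this architecture is shown below. The filter counts, kernel sizes and dense-layer width are assumptions, since the report does not specify them; only the overall structure (four Conv-MaxPool blocks, batch normalization, 50% dropout, dense classifier over the 29 classes) follows the description above.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                  # 32x32 grayscale input
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(256, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),                              # 50% dropout
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(29, activation="softmax"),           # 29 ASL classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()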

3.2.3 Outcome of First Approach

In this approach, our model reached a training accuracy of 99.14% and a testing accuracy of
97.52%. The model performed well on static images. However, during the live prediction stage,
the model performed poorly which led us to take up another approach. Table 2 shows the

Figure 6 Preprocessed Images of First Approach.

Page | 8
accuracy report of our first approach. Figure 7 and figure 8 shows the accuracy graph and a
sample of static predictions respectively.

Figure 7 Static prediction using model of First Approach

Figure 8 Accuracy graph for First Approach


Table 2 Accuracy report of First Approach

Training/Testing    Accuracy (%)    Model Loss

Training            99.14           0.0260

Testing             97.52           0.07963

4. PROPOSED METHOD
We propose an architecture that can take sign language captured by a camera and translate it
to a textual representation in real time. From Figure 9, we can see that the system is made up
of the following components:

 Camera
 Training Dataset
 Trained Model and
 Gesture Segments

The model is trained using the training data. A brief description of each component is given
below:

Camera: The camera is an external entity which will be used as an input for the system. It will
capture the live feed and pass the feed frame by frame to the system.

Training Dataset: The training dataset is the data used to train the Convolutional Neural
Network.

Trained Model: The trained CNN will take the live feed as input and classify each frame to
predict the corresponding text. This is a vital component of the system.

Gesture Segments: The live feed from the camera will be transformed into gesture segments to
separate the hand gestures from the background. These gesture segments will then be cropped
and scaled accordingly and passed to the Trained Model as input images.

Figures 10 and 11 show the level-0 Data Flow Diagram (DFD) and the level-1 DFD of the
proposed system, respectively. From these two figures, we can see that the training data will be
used to train the model. The live feed from the camera will go through a gesture segmentation
process, which will provide the segmented gestures; these will in turn pass through a cropping
and rescaling process to provide the input data for the trained model. The trained model will
then classify the gestures and predict the recognized text, which is the output of the system.
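
To make this data flow concrete, the following is a hedged sketch of the run-time loop implied by the DFDs: frames from the camera are segmented, cropped/rescaled and passed to the trained model, which outputs the recognized text. The saved-model file name and the simple Otsu-threshold segmentation used here are placeholders, not the actual segmentation procedure of the proposed system.

import cv2
import numpy as np
import tensorflow as tf

LABELS = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["SPACE", "DELETE", "NOTHING"]
model = tf.keras.models.load_model("sign_cnn.h5")        # hypothetical saved trained model

def segment_and_scale(frame):
    """Placeholder gesture segmentation: grayscale, blur, threshold, rescale to 32x32."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    _, mask = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.resize(mask, (32, 32)).astype("float32") / 255.0

cap = cv2.VideoCapture(0)                                # camera: external input to the system
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x = segment_and_scale(frame)[np.newaxis, ..., np.newaxis]   # shape (1, 32, 32, 1)
    probs = model.predict(x, verbose=0)
    text = LABELS[int(np.argmax(probs))]                 # recognized text: output of the system
    cv2.putText(frame, text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Sign Language Translator", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()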

Figure 9 Proposed System Architecture

Figure 10 0-Level Data Flow Diagram of the proposed system

Figure 11 1-level Data Flow Diagram of the proposed system

REFERENCES

[1] D. Hubel and T. Wiesel, “Receptive fields and functional architecture of monkey striate
cortex.,” The Journal of Physiology, vol. 195, no. 1, pp. 215-243, 1968.

[2] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4,
pp. 193-202, 1980.

[3] R. Yamashita, M. Nishio, R. K. G. Do and K. Togashi, “Convolutional neural networks:
an overview and application in radiology,” Insights into Imaging, vol. 9, no. 4, pp. 611-629,
2018.

[4] P. Escudeiro, “Virtual Sign – A Real Time Bidirectional Translator of Portuguese Sign
Language,” in Procedia Computer Science, 2015.

[5] Y. Madhuri, G. Anitha and M. Anburajan, “Vision-based sign language translation
device,” in International Conference on Information Communication and Embedded
Systems (ICICES), Chennai, India, 2013.

[6] K. Yin and J. Read, Better Sign Language Translation with STMC-Transformer, eprint
arXiv:2004.00588, 2020.

[7] N. Pugeault and R. Bowden, “Spelling it out: Real-time ASL fingerspelling recognition,”
in 2011 IEEE International Conference on Computer Vision Workshops (ICCV
Workshops), 2011.

[8] P. C. Badhe and V. Kulkarni, “Indian sign language translator using gesture recognition
algorithm,” in 2015 IEEE International Conference on Computer Graphics, Vision and
Information Security (CGVIS), 2015.

[9] Akash, “ASL Alphabet | Kaggle,” Kaggle, [Online]. Available:
https://www.kaggle.com/grassknoted/asl-alphabet. [Accessed 2021].

