
Automatic Speech Recognition Using Deep Learning

Algorithm
(Thesis submitted in partial fulfillment of the requirements for the award of the degree of)
M.TECH
IN
ELECTRONICS AND COMMUNICATION ENGINEERING

SUBMITTED BY
TAHIRA RASHID
Roll No. 20320362014
M Tech ECE 2020 Batch

Under the Supervision of


DR GURINDER KAUR SODHI
HOD, Electronics and Communication Deptt.
DBU Mandi Gobindgarh

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
Desh Bhagat University, Mandi Gobindgarh
Punjab India
(2021)

DECLARATION
I hereby declare that the thesis entitled “Automatic Speech Recognition Using Deep
Learning Algorithm” submitted by me in partial fulfillment of the requirements for the award of
the degree of Master of Technology (Electronics and Communication) of Desh Bhagat University
is a record of my own work carried out under the supervision and guidance of Dr. Gurinder Kaur
Sodhi, HOD, Electronics and Communication Engineering Department, Desh Bhagat
University, Mandi Gobindgarh, Punjab.

To the best of my knowledge, this thesis has not been submitted to Desh Bhagat University or
any other University or Institute for the award of any degree.

Date: Tahira Rashid


Place: DBU, Mandi Gobindgarh Roll no. 20320362014
M.Tech ECE 2020 Batch
DBU, Mandi Gobindgarh

CERTIFICATE
I hereby certify that the work which is being presented in the thesis, entitled
“Automatic Speech Recognition Using Deep Learning Algorithm”, in fulfillment of
the requirement for the award of the degree of Master of Technology in the Faculty of
Engineering and submitted to Desh Bhagat University, Mandi Gobindgarh, is an
authentic record of my own work carried out under the supervision of Dr. Gurinder
Kaur Sodhi, HOD, ECE Department, Desh Bhagat University. The matter embodied in
this thesis has not been submitted by me for the award of any other degree of this or any
other University/Institute.

(Signature of Candidate)

Tahira Rashid
This is to certify that the above statement made by the candidate is correct to the best
of our knowledge.

(Supervisor)                                        (Head of Department)
Deptt. of ECE                                       Deptt. of ECE

The External examination of __________________ (Research Scholar) has been held on


_________________.

Sign. of Supervisor(s)          Sign. of Head of the Department          Sign. of External Examiner

ACKNOWLEDGEMENT

I wish to express my heartfelt gratitude to my guide Dr. Gurinder Kaur Sodhi, HOD,
Department of Electronics and Communication Engineering, Desh Bhagat University,
Mandi Gobindgarh for her invaluable encouragement and advice during every stage of
this thesis work.

I also wish to express my sincere thanks and deep sense of gratitude to my esteemed H.O.D
ECE Department, Desh Bhagat University, Mandi Gobindgarh for her intellectual support
throughout this thesis work.

I would like to express my sincere gratitude to all faculty members of ECE Engineering
department for their intellectual support throughout this thesis work.

I am also thankful to other technical and supporting staff of Electronics Department for
providing me the necessary help whenever it was required. I also place my appreciation
for the cooperation of library and academic cell staff members of Desh Bhagat
University, Mandi Gobindgarh.

Last but not least, I would like to extend my thanks to my parents, my elder sister and
younger brother, and my grandparents for encouraging me with continuous words of
appreciation right from the beginning to the end of this thesis work.

Tahira Rashid
Roll no. 20320362014
M.Tech (ECE) 2020 Batch
DBU, Mandi Gobindgarh

ABSTRACT
In artificial intelligence and machine learning, automatic speech recognition is a prominent research topic,
with the objective of building machines that can converse with people through speech. Speech is a
data-rich mode of communication that incorporates both linguistic and paralinguistic data.
Emotion is an excellent example of non-verbal information that is partially expressed through
speech. Human-machine communication is made more natural and easy by developing
machines that can absorb non-linguistic information like emotion. This study looked into the
usefulness of convolutional neural networks in identifying voice emotions. The networks' input
features were wide-band spectrograms of voice samples. Voice signals created by performers
while playing out a certain mood were used to train the networks. English-language voice
datasets were used to train and assess our models. Each database's training data was subjected
to two levels of augmentation. The dropout method was utilized to regularize the networks. According
to our findings, the gender-agnostic, language-agnostic CNN models achieved a state-of-the-art
accuracy of 83 percent, outperformed previously reported results in the literature of about 64 percent,
and equaled or even exceeded human performance on benchmark databases. Future study
should focus on the ability of deep learning models to discern speech
emotion from real-world speech data.

TABLE OF CONTENTS

Chapter 1 INTRODUCTION

1.1 Automatic Speech Recognition

Chapter 2 Literature Review

Chapter 3 Methodology

3.1 Multi-layer Perceptron Networks

3.2 Gradient Descent

3.3 Convolutional Neural Networks

3.4 Convolutional Layer

3.5 Pooling Layer

3.6 Fully Connected Layer

3.6.1 Rectified Linear Unit

3.6.2 Mini-batch Learning

3.6.3 Dropout

Chapter 4 System Setup

4.1 Preprocessing

4.1.1 Training and Test Sets

4.1.2 Architecture

Chapter 5 Results and Discussion

Chapter 6 Conclusion and Future Scope

REFERENCES

PLAGIARISM REPORT

PUBLICATION

LIST OF FIGURES

3.1 Linear threshold unit

3.2 Multi-layer perceptron

3.3 Convolution of a 3x3 image

3.4 Pooling of a 3x3 input using a 2x2 kernel

3.5 Sigmoid activation function

3.6 Rectified linear function

4.1 Wide-band spectrogram

4.2 The baseline architecture of the CNN

CHAPTER 1
INTRODUCTION
Over the last several decades, several studies have been undertaken on understanding the human
brain and designing systems that replicate human intelligence [1, 2, 3, 4, 5, 6]. The human brain
is a complicated organ that has long inspired artificial intelligence (AI) research. The neural
networks of the human brain are exceptionally capable of acquiring high-level abstract
conceptions from low-level sensory input. Language acquisition, voice comprehension, and
facial recognition are just a few examples of the human brain's amazing capacity to acquire
high-level ideas. AI's main goal is to develop intelligent systems that can generate plausible
thoughts and activities that are equivalent to human intellect and performance [1]. AI sub-fields
include a number of distinct research areas. Computer vision, natural language processing,
automated reasoning, robotics, machine listening, and machine learning are some of the core
topics of AI research. The development of machines that can communicate with humans by
understanding speech opens the way for the construction of artificial intelligence systems. One
of the most difficult tasks for the human brain is to understand speech. Humans communicate
most naturally and easily through speech.

The objective of producing robots that can speak with humans through voice has been a popular
topic in AI research [7, 8]. Early ASR systems largely relied on grammatical aspects of speech
to understand spoken utterances [9, 10, 11, 12].

Speech is a data-rich mode of communication that incorporates both linguistic and
paralinguistic data. The most essential extralinguistic information conveyed by speech includes
identity, gender, purpose, mood, and emotion, although these have received less attention in the
standard ASR paradigm [13]. The human brain uses all linguistic and paralinguistic information
to understand the underlying meaning of utterances and to communicate effectively. Indeed,
any impairment in paralinguistic feature perception has a detrimental influence on
communication quality. Children who are unable to understand their peers' emotional states are
considered to have poor social skills and, in certain cases, psychopathological symptoms [14,
15]. This underlines the need of recognizing the emotional states of speech in order to properly
communicate. As a result, developing robots capable of interpreting non-linguistic data such as
emotion is crucial for producing clear, effective, and human-like communication.

Researchers have been exploring emotion recognition for years. The detection of emotion from
facial expressions and biological data such as heart beats or skin resistance was the first basis
of research in emotion identification [16]. Emotion recognition from audio signals has recently
attracted a lot of attention. The conventional approach to this problem assumed that auditory
features and emotion were connected. In other words, to express emotion, acoustic and prosodic
speech signal correlates such as speaking tempo, intonation, energy, formant frequencies,
fundamental frequency (pitch), intensity (loudness), duration (length), and spectral
characteristic (timbre) are employed [13, 17]. Several machine learning techniques have been
investigated for classifying emotions based on their acoustic correlates in spoken utterances.

HMMs, Gaussian mixture models, closest neighbor classifiers, linear discriminant classifiers,
artificial neural networks, and support vector machines are just a few of the extensively used
approaches for identifying emotions based on their auditory properties of interest [13, 17, 18,
19, 20]. Feature extraction techniques and the important features for each emotion impact the
performance of these classifiers. Because the acoustic correlates of emotion in speech signals
fluctuate amongst speakers, genders, languages, and cultures [13], there is no broad agreement
on them [21, 22]. This results in a range of "hand-crafted" features, depending on the speech
corpus. Using deep learning models is one apparent way to solve this problem. Emotion
recognition in speech is a multimodal process. Despite the fact that speech may convey a lot of
emotional information, it is insufficient for identifying human affective states in everyday
situations. Other modalities, such as visual or verbal, can also assist in providing the information
needed for emotion detection. In addition to speech, people use paralinguistic signs such as
facial expression, body language, semantics, and context to determine emotions in others.

Body language is considered to transmit 55 percent of the message when people interact with
one another [23]. In this work we looked only into speech and not other modes of communication. As a
result, the databases used in this study were created by actors, who used exaggeration to
communicate emotions, perhaps compensating for the lack of information provided by other
modalities. This gives us more control in evaluating how successful deep learning models are than
using real-life utterances. On the other hand, the wide availability of acted speech
benchmarks in the speech community allows us to compare our findings to previously studied
models. Convolutional neural networks were tested in identifying speech emotions within and
across British and American English using acted voice datasets. The use of wide-band
spectrograms rather than narrow-band spectrograms, as well as the analysis of the impact of
data augmentation on the accuracy of our models across widely-used benchmark databases, are
unique contributions of this research. According to our findings, wide-band spectrograms and
data augmentation enabled CNNs to achieve state-of-the-art accuracy and surpass humans.
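To make this preprocessing step concrete, the short sketch below computes a wide-band spectrogram (a short analysis window, hence good time resolution) from a speech file. It is only an illustration assuming the librosa library; the file name, sampling rate, window length, and hop length are illustrative assumptions rather than the exact settings used in this thesis.

# Minimal sketch (not the thesis pipeline): wide-band spectrogram of a speech file.
# Assumes librosa and matplotlib are installed; file name and parameters are illustrative.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("sample.wav", sr=16000)        # load and resample to 16 kHz
n_fft = 64                                          # ~4 ms window: wide-band analysis
hop = n_fft // 2
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming"))
S_db = librosa.amplitude_to_db(S, ref=np.max)       # log-magnitude in dB

librosa.display.specshow(S_db, sr=sr, hop_length=hop, x_axis="time", y_axis="hz")
plt.title("Wide-band spectrogram (illustrative)")
plt.savefig("spectrogram.png", dpi=100)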

Deep neural networks (DNNs) and deep learning (LeCun et al., 2015) approaches have recently
yielded state-of-the-art performance across a range of human and artificial intelligence (AI)
tasks, such as speech recognition (Dahl et al., 2012), natural language processing (NLP)
(Sutskever et al., 2014), and object recognition (Krizhevsky et al., 2012).
These models include many layers of nonlinear processing units, allowing complex data
to be accurately represented. Deep learning may generate automated representations in a multi-
layer architecture directly from the raw data representation. As a consequence, this general-purpose
learning approach may use raw data like signal spectrograms and image pixels directly. On the
other hand, traditional machine learning algorithms rely largely on proper data representation
choices, sometimes known as features. Domain-specific knowledge and engineering (Chiang et
al., 2009; Forman, 2003) are extensively used to generate helpful features for a given activity.
In practice, feature engineering may have advantages; nevertheless, the time spent designing
features limits the breadth of AI applications.

The notion of training multi-layer networks to replace hand-designed features was proposed
around the end of the 1950s (Rosenblatt, 1958; Selfridge, 1958). The introduction of error back-
propagation algorithms (Rumelhart et al., 1988) in the 1980s allowed multi-layer models to be
trained using a simple stochastic gradient descent approach.

Due to computer resource constraints, early DNN models (LeCun et al., 1990; Waibel et al.,
1989) were usually evaluated in simple and small settings. Thanks to recent advances in
computer resources, particularly graphics processing units (GPUs), big DNN models on
massive datasets may now be optimized rapidly and efficiently, which varies from previous
studies. There has been research into very deep neural networks with tens of layers, such as the
VGG (Simonyan and Zisserman, 2014) and residual networks (He et al., 2016). In addition,
several network variants have been proposed to model various types of data;
convolutional neural networks (CNNs) (LeCun et al., 1998; Krizhevsky et al., 2012) and recurrent
neural networks (RNNs) (Bengio et al., 2003) are two such types of neural
network. Although DNN models have shown promising outcomes, they are not without
problems. DNNs are prone to overfitting to training data, which limits their capacity to
generalize to unknown input. To avoid over-fitting, regularisation techniques are widely used
during training. Two of these are the weight decay and dropout (Srivastava et al., 2014)
techniques. Additionally, natively trained DNNs are frequently referred to as "black boxes,"
because their representations are distributed and difficult to understand directly. The scope of
future network regularisation and model post-modification is limited by this issue.

1.1 Automatic Speech Recognition

Automatic speech recognition (ASR), often known as voice to text, was one of the first AI
research goals. The most natural way for people to communicate is through speech, sometimes
known as spoken language. ASR systems can provide a more convenient and user-friendly
platform for human-computer interaction. Furthermore, machine voice processing and
interpretation contributes to the ultimate goal of artificial intelligence. In 1952, Bell Laboratories
developed the first ASR system, a digit recognizer (Davis et al., 1952). Pattern
matching (Itakura, 1975) and the acoustic-phonetic method (Hemdal and Hughes, 1967) were two
early speech recognition approaches that relied on rule-based and knowledge-based heuristics.
In the 1970s, hidden Markov models (HMMs), especially Gaussian mixture
model HMMs (GMM-HMMs), were utilized for the first time in speech recognition (Baker,
1975; Jelinek, 1976). Since then, statistical approaches have dominated this field of study.
Given a sequence of acoustic features x_{1:T} of length T, derived from a raw audio
stream, the system searches for the word sequence ω that maximizes the posterior probability P(ω | x_{1:T}).

The acoustic model is p(x_{1:T} | ω), whereas the language model is P(ω). A
generative framework is used to define the ASR system in this arrangement. HMMs are used
to estimate the acoustic model, p(x_{1:T} | ω). In the decades afterwards, state tying (Young et al.,
1994), discriminative training (Povey and Woodland, 2002), and speaker/noise adaptation have
all been proposed as valuable enhancements to the HMM architecture (Gales, 1998; Leggetter
and Woodland, 1995). Meanwhile, ASR systems have advanced from identifying single words
to continuous speech with a large vocabulary, as well as from clean environments to complex
scenarios such as telephone calls (Godfrey et al., 1992).
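For reference, the generative formulation described above can be written out explicitly. This is the standard maximum a posteriori decision rule, stated here for completeness rather than quoted from the thesis:

ω̂ = argmax_ω P(ω | x_{1:T}) = argmax_ω p(x_{1:T} | ω) P(ω)

That is, the recognizer selects the word sequence ω whose acoustic-model likelihood p(x_{1:T} | ω), weighted by the language-model prior P(ω), is highest.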

Recent advances in combining deep learning and HMMs have considerably enhanced the
performance of ASR systems (Dahl et al., 2012; Deng et al., 2013; Hinton et al., 2012; Seide et
al., 2011b). Discriminative models, often known as end-to-end models, have been investigated
instead (Cho et al., 2014; Graves et al., 2006; Sutskever et al., 2014). In these models, neural
networks are used to model the conditional probability P(ω | x_{1:T}), which is directly related to
the decision rule. Over the previous half-century, advances in speech recognition technology
have been significant. Such technology has begun to change human lifestyles and contribute to
civilization's growth.

CHAPTER 2

Literature Review
Deep learning is an advanced machine learning technique for dealing with massive databases
and complex systems. A flurry of new algorithms arose with the introduction of deep learning,
decreasing the need for "hand-crafted" attributes prior to classification [24]. To put it another
way, deep learning models may learn low-level properties from training data in their lower
layers and then construct high-level representation in the top layers depending on the previous
levels. As a result, deep learning algorithms can extract the attributes automatically. Deep
learning algorithms for identifying speech emotions have lately gained prominence. Several
studies have looked at the use of deep learning models in voice emotion recognition in recent
years.

Stuhlsatz et al. [25] investigated the performance of a Generalized Discriminant Analysis
(GerDA) based on deep neural networks (DNNs) in classifying spoken utterances along
emotional dimensions like arousal and valence [26]. Using a specific preprocessing strategy, the
static acoustic features were captured and supplied into the classifiers as input data. According
to their findings, the DNN outperformed the SVM in recognizing emotional aspects in voice
utterances.

Li et al. [27] compared the performance of the hybrid deep neural network-hidden Markov
model (DNN-HMM) classifier to the hybrid Gaussian mixture model hidden Markov model
(GMM-HMM) classifier in speech emotion classification. The DNN in the DNN-HMM
retrieved the discriminative properties, which were then used by the HMM to identify speech
emotions. According to their findings, the DNN-HMM outperformed the GMM-HMM in
classifying speech emotions.

Mao et al. [28] used convolutional neural networks, a sort of deep learning model, to generate
discriminative features from narrowband spectrograms of audio data. An SVM was used to
classify the properties learned by CNN. The researchers discovered that learning high-level
discriminative qualities from a low-level spectrographic representation resulted in improved
results.

Fayek et al. [29] employed a deep neural network (DNN) to classify vocal emotions. One-
second narrow-band spectrograms of voice samples were fed into the DNN as input. Their
model outperformed traditional machine learning approaches.

Zheng et al. [30] utilized a convolutional neural network to learn discriminative information
from narrow-band log-spectrograms of speech sounds. Similar to previous studies, their
findings showed that this approach outperformed methods that used hand-crafted features to
characterize speech emotions.

Trigeorgis et al. [31] introduced a convolutional neural network and a recurrent neural network-
based end-to-end deep learning model. Recurrent neural networks (RNNs) are sequence models
that deal with sequential data such as audio, text, or video [32].

Trigeorgis et al. [31] employed a combination of CNN spatial resolution and RNN temporal
resolution to extract discriminative emotional traits from raw data without any preprocessing;
that is, the model's inputs were raw audio signals. Their findings also showed that deep learning
models are good at learning prominent traits that may be used to discern human emotional states
from spoken utterances.

Papakostas et al. [33] assessed the capability of deep convolutional neural networks in
recognizing vocal emotions using publicly available materials. In this study, they compared
the performance of deep neural networks with SVM classifiers. According to their findings,
deep neural networks beat SVM classifiers in discriminating emotional states.

Papakostas et al. [33] built convolutional neural networks with four convolutional layers, each
followed by a max-pooling layer, and two fully connected layers. The networks were trained
using the stochastic gradient descent (SGD) approach with 5000 training iterations. The F1-
score was used to compare models within and across languages. The dropout technique and
data augmentation were employed to avoid overfitting the training data. To augment training
data, background noise with three different degrees of signal to noise ratio was applied to voice
signals. The speech sounds were fed into the deep neural networks as 250 x 250 narrow-band
spectrogram images. In our research, we also looked at the usefulness of convolutional neural
networks for decoding emotions in speech inputs.
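As a rough illustration of this style of augmentation (a sketch, not the exact procedure of Papakostas et al. [33] or of this thesis), the following code mixes background noise into a clean signal at a chosen signal-to-noise ratio; the noise source and the three SNR values are assumed for the example.

# Minimal sketch: add background noise to a clean signal at a target SNR (in dB).
# The noise array and the SNR levels are illustrative assumptions.
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that mixing it with `clean` yields the requested SNR."""
    noise = noise[:len(clean)]                       # match lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: three augmentation levels, e.g. 20, 10 and 5 dB SNR (assumed values).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                  # stand-in for a 1 s utterance
babble = rng.standard_normal(16000)                  # stand-in for recorded background noise
augmented = [add_noise_at_snr(speech, babble, snr) for snr in (20, 10, 5)]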

Our approach differs from that of Papakostas et al. [33] in a number of respects. We used wide-
band spectrogram images as the input to the convolutional neural networks instead of narrow-
band spectrogram images; we also used a different learning strategy and somewhat different
data augmentation. We also developed convolutional neural networks with far fewer
hyperparameters and significantly faster training than the models developed by Papakostas et
al. [33]. Above all, our convolutional neural networks outperformed prior models such as those
developed by Papakostas et al. [33]. The next chapter covers some essential machine learning
and deep learning concepts before getting into the core of our study.

Previous studies have focused on extracting information from unimodal systems. Machines
only used facial expressions [1] or speech sounds [2] to determine emotion. Multimodal systems
that predict emotion using many factors have shown to be more effective and accurate over
time. As a result, audio-visual expressions, EEG, and body gestures have been used since then.
Many intelligent machines and neural networks are used to create the emotion recognition
system.

Shiqing et al. [3] discovered that multimodal recognition outperforms unimodal
systems. According to studies, deep neural networks may successfully construct discriminative
features that mirror the complex non-linear relationships between traits in the original
collection. Voice and language processing, as well as emotion recognition, have all been studied
using deep generative models [4-6].

A bidirectional Long Short-Term Memory (BLSTM) network outperforms a typical SVM
approach, according to Martin et al. [7]. In voice processing, Ngiam et al. [8] built and evaluated
deep networks to learn audio-visual features from spoken characters.

Brueckner et al. [9] observed that deploying a Restricted Boltzmann Machine (RBM) before
a two-layer neural network with fine-tuning might considerably improve classification accuracy
in the Interspeech automatic likability classification task [10].

Yelin et al. [12] showed that three-layered Deep Belief Networks (DBNs) outperform two-layered
DBNs using an audio-visual emotion detection approach. Samira et al. [13] used a CNN-RNN
architecture combining a Recurrent Neural Network (RNN) and a Convolutional Neural Network
(CNN) to predict emotion in a video. Several different strategies and processes have been used to
support this line of research. Some are more consistent, exact, and realistic than others; in terms of
performance, accuracy, reasonability, and precision, some strategies are the most successful, others
are more practical, and still others are more precise. To get a more exact result, some take a long time and
need a lot of processing resources, while others trade precision for speed, and
everyone's notion of success differs.

Yelin Kim and Emily Mower Provost studied whether a subset of an utterance may be used to
infer emotion, as well as how the subset varies among emotion classes and modalities. They
propose a windowing approach that detects window configurations, window length, and timing
for gathering segment-level information for utterance-level emotion inference.

The studies employing the IEMOCAP and MSP-IMPROV datasets revealed that the identified
spatiotemporal window setups have consistent patterns among speakers and are unique to
particular emotion classes and modalities. They compare their proposed windowing technique
to a baseline method that picks window configurations at random and a usual all-mean strategy
that considers all of the data in each utterance. While only utilising 40–80 percent of the
information in each utterance, this approach does substantially better in emotion recognition. The
windows reported for each speaker are likewise consistent, indicating how multimodal signals
reflect mood across time. These tendencies are also supported by psychological studies. When
everything is said and done, however, the results do not fully match the promise of the technique [15].

A. Yao, D. Cai, P. Hu, S. Wang, L. Shan, and Y. Chen used a well-designed Convolutional
Neural Network (CNN) architecture for video-based emotion recognition [14]. They presented
the HOLONET strategy, which includes three key network architecture elements.

For instance, Morgan [9] conducted a review in the area of speech recognition assisted with
discriminatively trained feed-forward networks. The main focus of the review was to shed
light on papers that employ multiple layers of processing prior to the hidden Markov model
based decoding of word sequences. Throughout the paper, some of the methods that incorporate
multiple layers of computation for the purpose of either providing large gains for noisy speech
in small vocabulary tasks or significant gains for high Signal-to-Noise Ratio (SNR) speech on
large vocabulary tasks were described. Moreover, a detailed description was provided about the
methods with structures that incorporate a large number of layers (the depth) and multiple
streams using Multilayer Perceptrons (MLPs) with a large number of hidden layers. This review
paper eventually concluded that even though deep processing structures are capable of
providing improvements in this area, the choice of features and the structure with which they are
incorporated, including layer width, can also be significant factors.

Hinton et al. [10] present an overview of the use of deep neural networks that incorporate
many hidden layers and are trained using some of the newer techniques. The overview
summarizes the findings of four different research groups that collaborated to reveal the
advantage of a feed-forward neural network that takes several frames of coefficients as
input and produces posterior probabilities over HMM states as output. This technique was
studied as an alternative to using traditional GMM-HMMs for acoustic modeling in
speech recognition. The collected results show that deep neural networks that incorporate
many hidden layers and are trained with these new techniques outperform GMM-HMMs on a variety
of speech recognition benchmarks, sometimes by a large margin.

Deng et al. [11] presented an overview summary of the papers that were part of the session at
ICASSP 2013, entitled ‘‘New Types of Deep Neural Network Learning for Speech
Recognition and Related Applications,’’ which was organized by the authors. In addition to
that, the paper presented the history of the development of deep neural networks as acoustic
models for speech recognition. The overview summary focused on the different ways that can
be utilized to improve deep learning, which were classified into several categories:
enhanced types of network architecture and activation functions, enhanced optimization
methods, enhanced ways of determining the deep neural network parameters, and finally
enhanced ways of leveraging a number of languages at the same time. The overview revealed
the rapid continuing progress in acoustic models that use deep neural networks, which can be
seen on several fronts when compared to those based on GMMs. The paper also revealed that
these acoustic models can be applied to and enhance performance in other signal
processing applications, not only speech recognition.
Y. Fan, X. Lu, D. Li, and Y. Liu suggested a method for video-based emotion recognition in
the wild. They used CNN-LSTM and C3D networks to jointly model video appearance and
motion [16]. They observed that integrating the two types of networks can yield
outstanding results, proving the method's efficacy. In their proposed approach, they used LSTM
(Long Short-Term Memory), a kind of RNN, C3D (a direct spatio-temporal model), and hybrid
CNN-RNN and C3D networks. This method produces outstanding precision and results. This
technique, on the other hand, is far more complex, time-consuming, and impractical, and as a
result its overall efficiency is not exceptionally high [16].

Zixing Zhang, Fabien Ringeval, Eduardo Coutinho, Erik Marchi, and Björn Schüller proposed
numerous enhancements to the semi-supervised learning (SSL) approach to improve the poor
performance of classifiers on demanding recognition tasks, to reduce the unreliability of
automatically labeled data, and to address the noise accumulation problem, whereby examples
that the system incorrectly classifies are nevertheless used to train it in later rounds [17]. During
the supervised phase, they leveraged the complementarity of audio-visual properties to improve the classifier's
performance. They then re-evaluated the automatically categorized instances iteratively to
correct any mislabeled data, enhancing the system's overall confidence in its predictions. This
strategy uses SSL to give the best possible performance when labeled data is limited and/or
expensive to get; however, there are several inherent limitations that limit its efficacy in real
applications. This strategy has been tested on a database with limited data volume and variety.
The algorithm used is not capable of assessing physiological data alongside other types of data [17].

EEG-based affective models applying transfer learning methodologies (TCA-based Subject
Transfer) [18] were presented by Wei-Long Zheng and Bao-Liang Lu, which are more accurate
in terms of positive emotion recognition than other techniques used earlier.
Their strategy is 85.01 percent accurate. For transfer learning, they used three pillars: TCA-
based Subject Transfer, KPCA-based Subject Transfer, and Transductive Parameter Transfer.
They extracted information from raw EEG signals using a bandpass filter between 1 and 75 Hz
and differential entropy (DE) features for feature extraction. They evaluated using a leave-one-
subject-out cross-validation approach. The transductive parameter transfer technique
outperforms the other options in terms of recognition accuracy, with a 19.58 percent increase.
This accomplishment, however, is largely limited to recognizing positive emotions; the method
is weaker at distinguishing negative and neutral emotions, and there is still a long way to go
in recognizing negative states.
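To make the feature-extraction step above concrete, here is a minimal sketch of differential-entropy (DE) features of the kind described, under the common assumption that each band-filtered EEG segment is approximately Gaussian, so that DE reduces to 0.5·ln(2πeσ²); the sampling rate and band boundaries below are illustrative assumptions, not the authors' exact settings.

# Minimal sketch (not the authors' code): differential-entropy features from EEG.
# Assumes each band-passed segment is roughly Gaussian, so DE = 0.5 * ln(2*pi*e*var).
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200                                        # assumed EEG sampling rate in Hz
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}   # illustrative band edges

def bandpass(x: np.ndarray, lo: float, hi: float, fs: int = FS) -> np.ndarray:
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def de_features(segment: np.ndarray) -> dict:
    """One DE value per frequency band for a single-channel EEG segment."""
    feats = {}
    for name, (lo, hi) in BANDS.items():
        filtered = bandpass(segment, lo, hi)
        feats[name] = 0.5 * np.log(2 * np.pi * np.e * np.var(filtered))
    return feats

# Example on synthetic data: one second of a single EEG channel.
eeg = np.random.default_rng(0).standard_normal(FS)
print(de_features(eeg))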

Deng et al. [12] presented a summary of the work done by Microsoft since 2009 in
the area of speech using deep learning. The paper focused on more recent advances which
helped shed some light on the different capabilities as well as limitations of deep learning in the
area of speech recognition. This was done by providing samples of the recent experiments
carried out by Microsoft for advancing speech-related applications through the use of deep learning
methods. Speech-related applications included feature extraction, language modeling, acoustic
modeling, speech understanding, and dialogue estimation. Experimental results have shown
clearly that speech spectrogram features are superior to MFCCs when used with deep neural
networks, compared to the traditional practice using GMM-HMMs. This paper also shows
that improvements should be made to the architecture of deep neural networks in order to
further improve the modeling of acoustic features.

Li et al. [13] presented the basics of state-of-the-art solutions for automatic spoken language
recognition from both computational and phonological perspectives. Huge progress has been
achieved in recent years in the area of spoken language recognition, mostly driven
by breakthroughs in relevant signal processing areas such as pattern recognition and cognitive
science. Several main aspects relevant to language recognition were discussed, such as language
characterization, modeling methods, as well as system development techniques. Findings
clearly indicate that even though this area has developed hugely in the past years, it is still far
from perfect, especially when it comes to language characterization. In addition, this paper
provides an overview of the current research trends and future directions, which was carried out
using the language recognition evaluation (LRE) developed by the National Institute
of Standards and Technology (NIST).

Li et al. [14] provided an overview of modern noise-robust techniques for automatic speech
recognition developed over the past three decades. More emphasis was given to the techniques
that have proven successful over the years and are likely to maintain and further expand their
applicability in the future. The examined techniques were categorized and evaluated using five
different criteria: using prior knowledge about the acoustic environment distortion,
model-domain processing versus feature-domain processing, using specific environment
distortion models, uncertainty processing versus predetermined processing, and finally using
acoustic models trained by the same model adaptation process used in the testing stage. This
study helps the reader differentiate between the different noise-robust techniques, as well as
provides comprehensive insight into the complex performance tradeoffs that should be taken
into account when selecting between the available techniques.

M. A. Anusuya and S. K. Katti [1, 3] present a brief survey on Automatic Speech Recognition
and discuss the major themes and advances made in the past 60 years of research, so as to
provide a technological perspective and an appreciation of the fundamental progress that has
been accomplished in this important area of speech communication. The design of a speech
recognition system requires careful attention to the following issues: definition of various
types of speech classes, speech representation, feature extraction techniques, speech classifiers,
databases, and performance evaluation. The objective of this review paper is to summarize and
compare some of the well-known methods used in various stages of a speech recognition system
and identify research topics and applications which are at the forefront of this exciting and
challenging field.

Santosh K. Gaikwad, Bharti W. Gawali, and Pravin Yannawar [2] note that speech is the most prominent
and primary mode of communication among human beings. The interaction between humans and
computers through speech is called the human-computer interface. Speech has the potential of being
an important mode of interaction with computers. This paper gives an overview of the major
technological perspectives and appreciation of the fundamental progress of speech recognition,
and also surveys the techniques developed in each stage of speech recognition. This paper
helps in choosing a technique along with its relative merits and demerits. A comparative
study of the different techniques is done stage by stage. The paper concludes with a decision on
the future direction for developing techniques for a human-computer interface system using the Marathi
language.

Shanthi Therese and Chelpa Lingam [4] say that speech has evolved as a primary form of
communication between humans. The advent of digital technology gave us highly versatile
digital processors with high speed, low cost, and high power, which enable researchers to
transform analog speech signals into digital speech signals that can be scientifically studied.
Achieving higher recognition accuracy, a low word error rate, and addressing the issues of sources
of variability are the major considerations for developing an efficient Automatic Speech
Recognition system. In speech recognition, feature extraction requires much attention because
recognition performance depends heavily on this phase. In this paper, an effort has been made
to highlight the progress made so far in the feature extraction phase of speech recognition
systems, and an overview of the technological perspective of an Automatic Speech Recognition
system is discussed.

For further information on diverse modes of learning, we suggest Mitchell et al. [34], Bishop
[35], Sutton and Barto [36], and James et al. [37]. In supervised learning, the expected response, known
as a label, is included in the training data; that is, each observation (training
sample or instance) has a label. The goal of learning is to correctly predict the label of each training/test
instance.

Sanjib Das [5] presents a brief survey noting that speech is the primary and most convenient means
of communication between people. The interaction between humans and computers through speech is
called the human-computer interface. Speech has the potential of being an important mode of interaction
with computers. This paper gives an overview of the major technological perspectives and
appreciation of the fundamental progress of speech recognition, and also surveys the
techniques developed in each stage of speech recognition. The paper helps in choosing a
technique along with its relative merits and demerits. A comparative study of the different
techniques is done stage by stage. The paper concludes with a decision on the future direction for
developing techniques for human-computer interface systems in different mother tongues, and it
also discusses the various techniques used in each step of a speech recognition process and
attempts to analyze an approach for designing an efficient system for speech recognition. The
objective of this review paper is to summarize and compare different speech recognition
systems and identify research topics and applications which are at the forefront of this exciting
and challenging field.

Nidhi Desai, Prof. Kinnal Dhameliya, and Prof. Vijayendra Desai [6, 7] present a survey observing that speech is
the most natural form of human communication and that speech processing has been one of the most
inspiring areas of signal processing. Speech recognition is the process of automatically
recognizing the spoken words of a person based on information in the speech signal. An Automatic
Speech Recognition (ASR) system takes a human speech utterance as input and returns a
string of words as output. This paper introduces a brief survey on Automatic Speech Recognition
and discusses the major subjects and improvements made in the past 60 years of research, providing
a technological outlook and an appreciation of the fundamental achievements that have been
accomplished in this important area of speech communication. Definition of various types of
speech classes, feature extraction techniques, speech classifiers, and performance evaluation are
issues that require attention in the design of a speech recognition system. The objective of this
review paper is to summarize some of the well-known methods used in several stages of a speech
recognition system.

Guillaume Gravier and Ashutosh Garg [8, 11] present a survey showing that visual speech information from the
speaker's mouth region has been successfully shown to improve the noise robustness of automatic
speech recognizers, thus promising to extend their usability into the human computer interface.
In this paper, we review the main components of audio-visual automatic speech recognition and
present novel contributions in two main areas: First, the visual front end design, based on a
cascade of linear image transforms of an appropriate video region-of-interest, and subsequently,
audio-visual speech integration. On the latter topic, we discuss new work on feature and decision
fusion combination, the modeling of audio-visual speech asynchrony, and incorporating
modality reliability estimates to the bimodal recognition process. We also briefly touch upon
the issue of audiovisual speaker adaptation. We apply our algorithms to three multi-subject
bimodal databases, ranging from small- to large vocabulary recognition tasks, recorded at both
visually controlled and challenging environments. Our experiments demonstrate that the visual
modality improves automatic speech recognition over all conditions and data considered,
however less so for visually challenging environments and large vocabulary tasks.

Li Deng and John C. Platt [9] present a survey noting that deep learning systems have dramatically
improved the accuracy of speech recognition, and various deep architectures and learning
methods have been developed with distinct strengths and weaknesses in recent years. How can
ensemble learning be applied to these varying deep learning systems to achieve greater
recognition accuracy is the focus of this paper. We develop and report linear and log-linear
stacking methods for ensemble learning with applications specifically to speech-class posterior
probabilities as computed by the convolutional, recurrent, and fully connected deep neural
networks. Convex optimization problems are formulated and solved, with analytical formulas
derived for training the ensemble-learning parameters. Experimental results demonstrate a
significant increase in phone recognition accuracy after stacking the deep learning subsystems
that use different mechanisms for computing high-level, hierarchical features from the raw
acoustic signals in speech.

Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael L. Seltzer,
Geoff Zweig, Xiaodong He, Jason Williams, Yifan Gong, and Alex Acero [10] describe in their
survey that deep learning is becoming a mainstream technology for speech recognition at
industrial scale. In this paper, we provide an overview of the work by Microsoft speech
researchers since 2009 in this area, focusing on more recent advances which shed light on the
basic capabilities and limitations of the current deep learning technology. We organize this
overview along the feature-domain and model-domain dimensions according to the
conventional approach to analyzing speech systems. Selected experimental results, including
speech recognition and related applications such as spoken dialogue and language modeling,
are presented to demonstrate and analyze the strengths and weaknesses of the techniques
described in the paper. Potential improvement of these techniques and future research directions
are discussed.

CHAPTER 3

Methodology

Machine learning is a subset of artificial intelligence and computer science. It focuses on
developing algorithms that allow a task or ability to be learned automatically from experience,
such as observing training data. Following that, a machine learning system should be able to apply its
newly acquired understanding to new, unseen data. This may be accomplished using a variety
of learning approaches, including supervised learning, unsupervised learning, and
reinforcement learning. The next sections give a short summary of supervised learning and how
it was applied in this project.

In supervised learning, the machine learning algorithm learns to categorize the input data into
two or more categories. The algorithm learns discriminant features or qualities across various
categories or classes based on observed training data. Additional test input data is then classified
using these criteria.

Artificial neural networks (ANNs) are a well-known example of classifiers [38], and certain
ANNs may be trained using machine learning techniques. ANNs were inspired by biological
neural networks, such as the human central nervous system. That is, they are made up of
neurons, which are processing units that are densely interconnected. Artificial neural networks
are also the core components of deep learning, a powerful modern machine learning technique.
In essence, deep learning models are ANNs with a large number of layers and neurons. Deep
learning models have performed well in a wide range of machine learning tasks, including
classification. A fundamental property that makes deep learning models effective is their
ability to acquire complicated attributes from simple data [24]. That is, the simplest and most
fundamental aspects of the training data are represented in the first layers of deep learning
models. The deeper layers employ these low-level attributes to generate a complex
representation. The ability to build high-level features from low-level data has the potential to
reduce the amount of preprocessing required to extract hand-crafted features prior to classifier
development. In this study, we employed a deep learning model called a convolutional neural
network to identify emotional states of speech data.

The rest of this chapter is laid out as follows:

1.2 Multi-layer Perceptron Networks

The network architecture of artificial neural networks (ANNs) is one of their defining properties. It determines
the connections between neurons, which are the basic processing units. The multi-layer
perceptron (MLP) is a well-known ANN architecture that consists of linear threshold units
(LTUs), which are its neurons. Figure 3.1 depicts a linear threshold unit. An LTU takes weighted
inputs from numerous neurons (in this example, three neurons) and computes z = w1x1 + w2x2
+ w3x3 + b, where b is the bias term. The linear combination is then passed through a step function
(such as the Heaviside function H(x) or the sign function sgn(x)) to get the result y = f(z), where
f is a step function. If the weighted total exceeds a threshold value, the LTU will produce an
output (and the threshold is determined by the bias term).
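A minimal sketch of the linear threshold unit just described follows; the weights, bias, and inputs are illustrative values, not parameters taken from the thesis.

# Linear threshold unit: weighted sum plus bias, passed through a step function.
import numpy as np

def heaviside_step(z: float) -> int:
    return 1 if z >= 0 else 0

def ltu(x: np.ndarray, w: np.ndarray, b: float) -> int:
    z = np.dot(w, x) + b          # z = w1*x1 + w2*x2 + w3*x3 + b
    return heaviside_step(z)

# Example with three inputs (illustrative numbers).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, 0.1])
print(ltu(x, w, b=-0.1))          # fires (outputs 1) only if w.x + b >= 0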

Figure 3.1 Linear threshold unit

An MLP is made up of one input layer, one or more layers of LTUs (known as hidden layers), and one
output layer. The information is transferred from the input layer (lower level) to
the output layer (higher level). Because of this, such networks are called feedforward
artificial neural networks.

The input layer represents the values of a single training sample in various dimensions. This
might be the loudness of an audio stream at different sample points or the luminance of an
image at different pixels. The input is usually expressed as an x vector, with the length denoting
the number of dimensions (e.g., the number of sampling points in an audio signal or the number
of pixels in an image).

The output can be a real number, y, or a vector, y, that indicates the input's label. The layers can be
denoted l[0], l[1], l[2], ..., l[n], l[n+1], where l[0] is the input layer and l[n+1] is the output layer.
Every neuron in a layer l[j], 1 ≤ j ≤ (n+1)
(all layers except the input layer), receives weighted input directly from every neuron in the layer one level
down, i.e., l[j−1]. In Figure 3.2, which displays an MLP with one hidden layer, W1 and W2 are
the matrices of values that weight the inputs of the neurons.

Figure 3.2 Multi-layer perceptron

The weight matrices (W1 and W2), as well as the bias vectors (b1 and b2), are the parameters of
interest. The network learns to classify the incoming input by changing the values of these
parameters. To discover the appropriate values for these parameters, a variety of learning
algorithms can be applied. Backpropagation is one of these learning techniques and has been
widely used to train MLP networks. Gradient descent is the basis for the backpropagation
method, which is briefly covered in the next section.
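As a sketch of the forward pass through an MLP like the one in Figure 3.2 (one hidden layer with parameters W1, b1, W2, b2), the following code uses sigmoid activations; the layer sizes and random parameter values are illustrative assumptions.

# Forward pass of a one-hidden-layer MLP with sigmoid activations.
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)      # hidden layer activations
    y = sigmoid(W2 @ h + b2)      # output layer
    return y

# Illustrative sizes: 4 inputs, 3 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)
print(mlp_forward(x, W1, b1, W2, b2))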

1.3 Gradient Descent

It's possible to think of the search for the optimal weight parameter values as an optimization
problem. After getting the error function or loss function, the goal is to discover the parameters
of interest that minimize it. There are several ways to define the error
function. Equation 3.1 shows a well-known error function, the sum of squared errors [34], in which
W, d, and D stand for the weights, one training instance, and the whole training set, respectively:
E(W) = (1/2) Σ_{d∈D} (t_d − y_d)^2.     (3.1)
As shown, the squared error between the actual (t_d) and predicted (y_d) labels is summed
over all training data. The goal is to find weight values that minimize this function in the weight
space.

Gradient descent updates each weight in the direction opposite to the gradient of the error function,
w_i ← w_i − η ∂E/∂w_i,
where η is the learning rate that determines the update step and w_i is the weight associated with
the input's ith dimension. The backpropagation approach employs this rule to find local
minima of the error function of multi-layer perceptron (MLP) networks. The error function of
MLP networks is not convex like that of linear units. As a result, finding the global minimum is far
from certain. Because we do not have access to the hidden neurons' target output, the backpropagation
approach calculates the contribution of hidden neurons to the output error and updates the
hidden layers' weights using the chain rule of calculus [38, 24]. It should be noted that the
step function of MLP networks makes finding the derivative with respect to the weights challenging.
As a consequence, the differentiable sigmoid function σ(x) = 1/(1 + e^(−x)) replaces the step function.

For more details, the reader should see [34, 24, 38]. Gradient descent with momentum [39] and
root mean square propagation (RMSprop) [40] are two gradient descent variants that speed up
the convergence process. Gradient descent with momentum uses an exponentially decaying
weighted average of gradients instead of the raw gradients to update the weights. That is, the gradient
over time may be seen as a time-series signal, ∂E/∂w(t), where t is the iteration number.
The gradients can be replaced with the exponentially weighted average of gradients, v_t = β v_{t−1} +
(1 − β) ∂E/∂w(t), where v_0 = 0 and β ∈ [0, 1], to update the weights. RMSprop optimization
computes the exponentially decaying weighted average of the squared gradient, m_t = β m_{t−1}
+ (1 − β) (∂E/∂w(t))^2, where m_0 = 0 and β ∈ [0, 1], and uses its square root to scale the learning
rate as η' = η / √(m_t). These gradient descent variants can be used to smooth out oscillations
around local optima and speed up convergence. The Adam approach, devised by Kingma and
Ba [41], reduces the loss function by combining the benefits of gradient descent with
momentum and RMSprop optimization [42]. It employs first-order momentum, as in gradient
descent with momentum, and second-order momentum, as in RMSprop optimization, to update
the weights. "Adaptive moment estimation" is the name of this approach. This method is quite
efficient, and it is frequently employed in deep learning models.
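The update rules above can be summarized in a few lines of code. This is a generic sketch of the momentum, RMSprop, and Adam updates for a single scalar weight, not the training code used in this thesis; the hyperparameter values shown are the commonly used defaults.

# Sketch of the optimizer updates described above, for a single weight w with gradient g.
import numpy as np

def momentum_step(w, g, v, eta=0.01, beta=0.9):
    v = beta * v + (1 - beta) * g            # exponentially weighted average of gradients
    return w - eta * v, v

def rmsprop_step(w, g, m, eta=0.01, beta=0.9, eps=1e-8):
    m = beta * m + (1 - beta) * g ** 2       # weighted average of squared gradients
    return w - eta * g / (np.sqrt(m) + eps), m

def adam_step(w, g, v, m, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    v = b1 * v + (1 - b1) * g                # first moment (momentum term)
    m = b2 * m + (1 - b2) * g ** 2           # second moment (RMSprop-style term)
    v_hat, m_hat = v / (1 - b1 ** t), m / (1 - b2 ** t)   # bias correction
    return w - eta * v_hat / (np.sqrt(m_hat) + eps), v, m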

1.4 Convolutional Neural Networks

CNNs are a form of deep learning model that has had a lot of success in sectors including object
recognition [43], face recognition [44], handwriting recognition [45], speech recognition [46],
and natural language processing [47]. The name "convolutional" is derived from the fact that these networks
employ convolution, a mathematical operation. The three main building blocks of CNNs are the
convolutional layer, the pooling layer, and the fully connected layer. The explanations of
numerous architectural pieces, as well as some basic concepts like softmax, rectified linear unit,
and dropout, follow.
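Before the individual building blocks are described, the sketch below shows how they are typically composed into a complete network. It assumes the TensorFlow/Keras library; the input shape, filter counts, kernel sizes, and number of emotion classes are illustrative assumptions and not the architecture actually used in this thesis.

# Illustrative composition of the three CNN building blocks (assumes TensorFlow/Keras).
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), n_classes=7):
    model = models.Sequential([
        layers.Input(shape=input_shape),                 # e.g. a spectrogram image
        layers.Conv2D(16, (3, 3), activation="relu"),    # convolutional layer
        layers.MaxPooling2D((2, 2)),                     # pooling layer
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),             # fully connected layer
        layers.Dropout(0.5),                             # dropout regularization
        layers.Dense(n_classes, activation="softmax"),   # softmax output
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_cnn().summary()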

1.5 Convolutional Layer

In convolutional layers, the CNN output is computed using convolution rather than full matrix multiplication. As a
result, the convolutional layers' neurons are not fully connected to the layers preceding them. This
architecture was motivated by the fact that neurons in the visual cortex have a local receptive
field [48, 49]. That is, the neurons are trained to respond to stimuli that are specific to a certain
location and structure. Convolutional neural networks contain sparse connections and parameter
sharing as a result, reducing the number of parameters in deep neural networks significantly.
Figure 3.3 depicts the convolution of a kernel, which is a 2 x 2 matrix, with a one-channel 3 x 3
image. The output is 2 x 2 x 1 in size. In general, the output size is (n_h − f + 1) x (n_w − f + 1) x n_f, where n_h represents
the input height, n_w represents the input width, f is the kernel size, and n_f is the number of kernels. The depth of
the kernel is determined by the depth of the input. In the situation presented in Figure 3.3, the
input depth is n_c = 1. The kernel depth as a result is 1. The output depth is also one because
there is just one kernel. As can be seen, each output neuron is the weighted sum of the input
neurons within its receptive field, giving CNNs sparse connectivity. The
kernel is also shared across the layer, allowing CNNs to share parameters. The amount by which
the kernel slides along the input is called the stride. In our scenario (Figure 3.3), the stride is s = 1,
indicating that the kernel moves one step across the image. It is worth noticing that the input
volume decreases with each convolutional layer. To avoid this decrease, we can pad the input's
outer boundary with zeros.

With zero-padding p and stride s, the output's overall height and width are given by
⌊(n_h + 2p − f)/s⌋ + 1 and ⌊(n_w + 2p − f)/s⌋ + 1, respectively.

Figure 3.3 Convolution of a kernel with a 3×3 image

Local filtering in convolutional layers enables the identification of a variety of low-level
features of interest and produces multiple feature maps. The deeper layers use these feature
maps to build a high-level representation of the inputs.
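As a small illustration of the output-size rule above, the following sketch, assuming the tensorflow.keras API and illustrative layer sizes, convolves a single 2×2 kernel with a one-channel 3×3 input and confirms that the output has shape 2×2×1.

import tensorflow as tf

inputs = tf.keras.Input(shape=(3, 3, 1))                     # a one-channel 3x3 "image"
outputs = tf.keras.layers.Conv2D(filters=1, kernel_size=2,   # a single 2x2 kernel
                                 strides=1, padding="valid")(inputs)
tf.keras.Model(inputs, outputs).summary()                    # output shape is (None, 2, 2, 1)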

1.6 Pooling Layer

The second critical component of CNNs is the pooling layer. This layer is intended to make the
outputs less sensitive to small local variations of the input. In certain applications, however, this
invariance to minor local movements could impair spatial resolution and lead to underfitting.
Pooling can help CNNs extract features of interest more rapidly when exact spatial detail is not
required. By lowering the number of dimensions and parameters, pooling can also help to reduce
overfitting [24]. In a sense, pooling subsamples the outputs [38]. Pooling layers, like
convolutional layers, use a kernel (a local receptive field) to summarize the values of the neurons
inside the pooling kernel using an aggregation function such as the maximum, average, L2-norm,
or weighted average. We must first determine the size of the pooling kernels, the stride, and the
amount of padding before we can create a pooling layer in a CNN. Figure 3.4 depicts maximum
pooling over a 3×3 matrix with a 2×2 pooling kernel moving one pixel across the matrix (i.e., a
stride of 1).

Figure 3.4 Max pooling of a 3×3 matrix using a 2×2 kernel
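The following minimal sketch, again assuming the tensorflow.keras API and illustrative values, reproduces the situation of Figure 3.4: 2×2 max pooling with a stride of 1 over a 3×3 single-channel input.

import numpy as np
import tensorflow as tf

x = np.arange(9, dtype=np.float32).reshape(1, 3, 3, 1)       # a single 3x3 one-channel input
pool = tf.keras.layers.MaxPooling2D(pool_size=2, strides=1, padding="valid")
print(pool(x).numpy().squeeze())                             # each output is the maximum of a 2x2 window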

1.7 Fully Connected Layer

A typical CNN consists of several convolutional layers, each followed by a pooling layer. The
last component of a CNN is the fully connected layer, which is effectively a standard MLP. This
component is used either to process the features further, producing a more abstract
representation of the inputs, or to classify the inputs based on the features extracted by the
preceding layers [51]. A softmax unit is usually placed at the output of the fully connected layer.
A softmax unit employs the softmax function (normalized exponential) to describe the posterior
distribution over k classes. The softmax function is in fact an extension of the sigmoid function,
which describes the probability distribution over two possible classes [24]. The softmax function
is shown in Equation 3.5:

softmax(z)_i = e^{z_i} / Σ_{j=1}^{k} e^{z_j},

which for k = 2 reduces to the sigmoid function σ(z) = 1/(1 + e^{−z}) = e^z/(1 + e^z).
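A minimal NumPy sketch of the softmax function is given below; subtracting the maximum before exponentiation is a standard numerical-stability trick and not part of Equation 3.5.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))                   # shift by the maximum for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))       # class probabilities that sum to 1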

1.7.1 Rectified Linear Unit

Activation functions have been the linchpin of artificial neural networks (ANNs): they introduce
nonlinearity into ANNs, which makes ANNs an effective tool for learning complex models. The
sigmoid function g(z) = σ(z) = 1/(1 + e^{−z}) and the hyperbolic tangent function
g(z) = tanh(z) = 2σ(2z) − 1 are two popular activation functions, especially in traditional ANNs
(see Figure 3.5).

Figure 3.5 Sigmoid activation function

These functions have the disadvantage of saturating for z with large absolute value, which
makes gradient-based learning difficult. To overcome this problem, the rectified linear unit
(ReLU), g(z) = max(z, 0), can be used as an activation function [24].

Figure 3.6 depicts the ReLU function. Despite its nonlinearity around z = 0, it behaves linearly
for z > 0 and z < 0. ReLU functions strike a compromise between nonlinearity and linearity,
which is essential for gradient-based learning. As a result, ReLU functions are widely used as
the activation in deep learning models. In CNNs, ReLU functions are applied to the feature maps
of the convolutional layers before subsampling in the pooling layers. The primary problem with
ReLU functions is the zero output for negative values of z. The dying-ReLU problem [38] occurs
when a neuron's output falls to zero during training and stays there, so that the neuron can no
longer contribute. A leaky ReLU function has been devised to overcome this issue,
LeakyReLU(z) = max(αz, z), where α determines the slope of the function for z < 0 [52, 38].
According to earlier research, using leaky ReLU increases the danger of overfitting the training
data when the number of training examples is small [38]. This is because the network's
sensitivity to nuisance fluctuations increases as the number of parameters increases.

Figure 3.6 Rectified linear function
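The ReLU and leaky ReLU activations discussed above can be written directly in NumPy; in the sketch below the slope α = 0.01 is an illustrative choice, not a value used in this thesis.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)          # slope alpha for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))                               # [0.  0.  0.  1.5]
print(leaky_relu(z))                         # [-0.02  -0.005  0.     1.5  ]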

1.7.2 Mini-batch Learning

Deep learning models are trained on large numbers of training examples, which slows learning
and taxes computational resources. The idea of using small batches of examples from the
training set seeks to solve this data-size problem [24]. Instead of training the network on the
whole training set at each iteration, the network is trained on a sequence of small batches, so
the optimization is performed over a number of error sub-functions rather than a single error
function. Mini-batch learning is a hybrid of batch and stochastic learning. As previously stated,
in batch learning all training cases are used to train the model at each iteration. In stochastic
learning, on the other hand, an error function is computed from each instance individually. The
primary benefit of stochastic and mini-batch learning is the time and resources saved. On the
other hand, stochastic and mini-batch algorithms never settle exactly at the global optimum;
instead, they approach it or fluctuate around it [42, 38]. Mini-batch learning is more accurate
than stochastic learning because it includes a larger portion of the training data when optimizing
the error functions. However, stochastic learning is more likely to escape local optima than
mini-batch learning [38]. Mini-batch sizes are often set to 64, 128, 256, 512, or 1024 in deep
learning models such as convolutional neural networks (CNNs).
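A minimal NumPy sketch of mini-batch iteration is given below: the training indices are shuffled once per epoch and batches of a fixed size are yielded one at a time; the array names and batch size are illustrative.

import numpy as np

def minibatches(X, y, batch_size=128, rng=np.random.default_rng(0)):
    idx = rng.permutation(len(X))                      # shuffle the examples once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# for X_batch, y_batch in minibatches(X_train, y_train):
#     ...perform one gradient update per mini-batch...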

1.7.3 Dropout

Variance and bias are two key characteristics of gradient-based learning [37]. The estimated
underlying model of a system may change depending on the particular training set used for
learning; the word "variance" is used to characterize these shifts. The ideal situation is to create
a model that represents system behavior broadly and is therefore less sensitive to the specific
training set. By following noise-induced changes in the training set, high-variance models
overfit the training data. In general, complex models have high variance. Models built on
assumptions that simplify the system's true behavior, on the other hand, have a strong bias.
These models underfit the training data and fail to capture the underlying variability of the
system.
There is a significant risk of overfitting the training set (high variance) in deep learning because
of the vast number of training examples and model parameters in deep neural networks. In
traditional machine learning there is a trade-off between variance and bias; that is, reducing the
variance comes at the cost of more bias, and vice versa. This long-standing variance-bias
problem has largely been overcome using deep architectures and big training data: as long as
the deep networks are suitably regularized, the variance may be reduced without compromising
the bias [42]. Models with high variance perform well on the training set but do not generalize
well on the test set. L1-norm regularization and L2-norm regularization are two traditional
methods for dealing with overfitting; the objective function then incorporates a weight-decay
term into the loss, namely the sum of the absolute weights for L1-norm regularization and the
sum of the squared weights for L2-norm regularization [42, 24].

Dropout is a strong and effective regularization method for reducing overfitting in deep neural
networks. It involves randomly eliminating neurons from the hidden layers during training; each
neuron is dropped with a probability p [55]. The outputs and the error function of the network
can therefore be treated as random variables. Dropout is comparable in some respects to bagged
ensemble training [24], which uses resampled training data to train several networks.
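A minimal sketch of dropout applied to a fully connected layer, assuming the tensorflow.keras API, is given below; the layer sizes and the drop probability of 0.5 are illustrative.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(256,)),
    tf.keras.layers.Dropout(0.5),            # each hidden neuron is dropped with probability 0.5 during training
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")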

Because of this random elimination, the output of a network trained with dropout may be seen
as an average ensemble of multiple networks of different architectures [38]. According to Baldi
and Sadowski [56], the expected value of the network's error function under dropout contains a
regularization component equivalent to the weight decay of L1-norm and L2-norm
regularization. They also claim that the optimum degree of regularization is achieved when the
hidden neurons are dropped with a probability of p = 0.5.

1.7.4 Data Augmentation

Deep learning algorithms are hungry for data, since training on huge data sets is what unlocks
their real potential [57].

Indeed, it has been demonstrated that increasing the size of the training set reduces overfitting
and improves the generalizability of deep learning models [24]. Obtaining new data, on the
other hand, is a time-consuming and expensive procedure. To tackle the data-collection
challenge, data augmentation, a regularization strategy, is used to artificially synthesize new
training data and enlarge the size of the training sets [38]. Data augmentation has had a lot of
success in recent years on a number of machine learning tasks, especially classification [24, 58,
59, 60, 43, 61]. In supervised learning, the transformations applied must be invariant with
respect to the classification of interest; that is, data augmentation must not change the labels of
the data. For example, rotating images for object recognition or embedding voice signals in
background noise for speech emotion detection increases the amount of training data without
impairing the instance labels.

1.7.5 Cross-Validation

Cross-validation is a technique for evaluating the performance of machine learning models.
Using this method, the data is divided into k distinct, non-overlapping subsets, or folds, of
approximately equal size. The first fold is used as a validation set for evaluating the model's
performance, while the remaining k − 1 folds are used to train the model. This procedure is
repeated k times, with the kth fold acting as the validation set and the remaining folds as the
training set on the kth trial. Finally, the model's performance is calculated by averaging the
accuracy over these k validation sets [37, 38].
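A minimal sketch of k-fold cross-validation, assuming scikit-learn and an illustrative toy classifier and data set, is given below.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(100, 20), np.random.randint(0, 2, 100)     # toy data
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[val_idx], y[val_idx]))              # accuracy on the held-out fold
print(np.mean(scores))                                            # average over the k validation folds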

CHAPTER 4

System Setup

In the current study, we employed convolutional neural networks (CNNs) to classify spoken
utterances based on their content. We employed a proprietary database as well as three widely
recognized benchmarks for spoken utterance recognition to train and assess our models [62, 63,
64]. TensorFlow, an open-source toolkit written in Python and C++ [65], served as the
programming foundation for our CNN models. This chapter describes the experimental setup
used in the current study.

1.8 Preprocessing

To maintain uniformity across all databases, all utterances were resampled to 16 kHz after
anti-aliasing with an FIR low-pass filter prior to any processing. Spectrograms were then
produced for all auditory utterances. A spectrogram is a graphic depiction of the variation of
energy over time at various frequencies: the abscissa (horizontal axis) denotes time, the ordinate
(vertical axis) represents frequency, and the amount of darkness or colour encodes the energy,
or intensity.

There are two general types of spectrograms:

• wide-band spectrograms

• narrow-band spectrograms

Wide-band spectrograms have better temporal resolution than narrow-band spectrograms;
because of this property, individual glottal pulses may be observed in wide-band spectrograms.
Narrow-band spectrograms, in contrast, provide superior frequency resolution; because of this
property, individual harmonics can be resolved in narrow-band spectrograms [66, 67].

Figure 4.1 depicts the wide-band and narrow-band spectrograms of a speech utterance. Because
of the importance of vocal fold vibration, and because a glottal pulse is associated with one
period of vocal fold vibration [66], we decided to convert all utterances into wide-band
spectrograms. The length of the Hamming windows was set to 5 ms, with a 4.4 ms overlap, and
512 DFT points were used.

In many circumstances, frequencies below 4000 Hz are sufficient for speech comprehension
[68], hence we excluded frequencies above 4 kHz from the spectrograms; earlier research found
that eliminating energy above 4000 Hz improved the algorithms' performance. This resulted in
129 frequency bins. All spectrogram images were scaled to 129 × 129 pixels and then
z-normalized to have zero mean and a standard deviation close to one.

Figure 4.1 Wide-band spectrogram
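A minimal sketch of the wide-band spectrogram computation described above, assuming NumPy and SciPy and an illustrative file name, is given below; the mapping of the 5 ms Hamming window, 4.4 ms overlap and 512 DFT points onto the scipy.signal.spectrogram parameters is our own reading of the text.

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("utterance.wav")              # assumed to be 16 kHz after resampling
x = x.astype(np.float32)
nperseg = int(0.005 * fs)                          # 5 ms Hamming window (80 samples at 16 kHz)
noverlap = int(0.0044 * fs)                        # 4.4 ms overlap (70 samples)
f, t, S = spectrogram(x, fs=fs, window="hamming",
                      nperseg=nperseg, noverlap=noverlap, nfft=512)
S = S[f <= 4000, :]                                # keep the 129 frequency bins up to 4 kHz
log_S = 10 * np.log10(S + 1e-10)                   # energy in decibels
log_S = (log_S - log_S.mean()) / log_S.std()       # z-normalization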

1.8.1 Training and Test Sets

Our models were trained and tested using 5-fold cross-validation. The data was divided into
five folds; our models were tested on the first fold while the remaining folds were used for
training, then tested on the second fold while the remaining folds were used for training, and so
on. To reduce overfitting and the unfavorable effect of the limited database sizes, the data sets
were augmented by adding white Gaussian noise at a +15 dB signal-to-noise ratio (SNR) to each
audio signal, either 10 times or 20 times. The SNR is defined as 10 log10(P_speech / P_noise),
where P denotes the average power of a signal. As a result of the augmentation, we trained our
models on data sets with 10 times augmentation (10x) and data sets with 20 times augmentation
(20x). The original, noise-free data was used to test our models; the additional data was used
for training only. Finally, one-hot vectors were used to encode the training and test labels.

The number of training epochs considered varied between 100 and 4000; it was set to 100
because of computation and time constraints.
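A minimal NumPy sketch of the noise augmentation described above, adding white Gaussian noise at a target SNR of +15 dB according to SNR = 10 log10(P_speech / P_noise), is given below; the helper name and the way the augmentation is repeated are illustrative.

import numpy as np

def add_noise(signal, snr_db=15.0, rng=np.random.default_rng(0)):
    p_signal = np.mean(signal ** 2)                      # average power of the speech signal
    p_noise = p_signal / (10 ** (snr_db / 10))           # noise power that yields the target SNR
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

# augmented = [add_noise(x) for x in waveforms for _ in range(10)]   # 10x augmentation (illustrative)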

1.8.2 Architecture

A convolutional neural network with two convolutional layers and one fully connected layer
with 1024 hidden neurons was used as the baseline configuration in the current investigation.
Depending on the number of classes, a 5-way or 7-way softmax unit was used to estimate the
probability distribution over the classes. After each convolutional layer, a max-pooling or
average-pooling layer was added. To give the model nonlinearity, rectified linear units (ReLU)
were used as activation functions in the convolutional and fully connected layers. The initial
kernel size for the convolutional layers was set to 5×5 with a stride of 1. The number of kernels
in the first and second convolutional layers was set to 8 and 16, respectively. The kernel size of
the pooling layers was set to 2×2 with a stride of 2. The Adam optimizer was used to minimize
the cross-entropy loss function over mini-batches of the training data. The mini-batch size was
set to 512, and training ran for 100 epochs. The networks built in this study took between 30
minutes and two days to train using graphics processing units (GPUs). GPUs are used instead
of CPUs to improve processing performance because they have many cores and can handle a
large number of concurrent jobs.

Figure 4.2 The baseline CNN architecture
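A minimal sketch of the baseline architecture described above, assuming the tensorflow.keras API, is given below; the 5-way softmax output and the placement of dropout are illustrative choices within the configuration described in this chapter.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, kernel_size=5, strides=1, activation="relu",
                           input_shape=(129, 129, 1)),                         # 8 kernels of size 5x5
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, strides=1, activation="relu"),   # 16 kernels of size 5x5
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dropout(0.5),                        # applied when overfitting symptoms were observed
    tf.keras.layers.Dense(5, activation="softmax"),      # 5-way softmax (7-way for seven classes)
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy", metrics=["accuracy"])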

Our experiments were carried out on the Crane cluster at the University of Nebraska–Lincoln's
Holland Computing Center. Dropout was also applied to the fully connected layer to improve
network performance when symptoms of overfitting were found.

CHAPTER 5

Results and Discussion

Import the required libraries

Check the sample rate of recordings

Change the sample rate to 8000 and check whether the recording is audible.
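A minimal sketch of these steps, assuming the librosa and IPython libraries and an illustrative file path, is given below.

import librosa
import IPython.display as ipd

samples, sample_rate = librosa.load("train/audio/yes/example.wav", sr=None)   # illustrative path
print(sample_rate)                                       # original sample rate of the recording

samples_8k = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
ipd.Audio(samples_8k, rate=8000)                         # check that the recording is still audible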

These are the labels of the recordings

Change the sample rate to 8000 and filter the recordings with the .wav extension. After that,
append all recordings to the all_wave list and all labels to the all_label list.
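A minimal sketch of this step is given below; the directory layout, the list of labels and the one-second length filter are illustrative assumptions.

import os
import librosa

train_audio_path = "train/audio/"                        # illustrative data set location
labels = ["yes", "no", "up", "down", "left", "right"]    # illustrative label names

all_wave, all_label = [], []
for label in labels:
    for fname in os.listdir(os.path.join(train_audio_path, label)):
        if not fname.endswith(".wav"):
            continue                                     # keep only .wav recordings
        samples, sr = librosa.load(os.path.join(train_audio_path, label, fname), sr=8000)
        if len(samples) == 8000:                         # assumed filter: keep one-second clips
            all_wave.append(samples)
            all_label.append(label)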

Encode all labels and recordings to machine language form and split the data into train and test.
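A minimal sketch of this step, assuming scikit-learn and Keras utilities, is given below; the test-set fraction and random seed are illustrative.

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

le = LabelEncoder()
classes = le.fit_transform(all_label)                    # label strings -> integer codes
y = to_categorical(classes, num_classes=len(labels))     # one-hot encoding
X = np.array(all_wave).reshape(-1, 8000, 1)              # one channel per sample for the 1D CNN

x_tr, x_val, y_tr, y_val = train_test_split(X, y, stratify=classes,
                                            test_size=0.2, random_state=777)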

Design the 1D CNN model. Add max-pooling and dropout layers where needed, and use a
Flatten layer to convert the feature maps into a vector for the dense layers.
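A minimal sketch of such a 1D CNN, assuming the tensorflow.keras API, is given below; the number of layers, kernel sizes and dropout rates are illustrative rather than the final configuration used in this work.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(8, 13, activation="relu", input_shape=(8000, 1)),
    layers.MaxPooling1D(3),
    layers.Dropout(0.3),
    layers.Conv1D(16, 11, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Dropout(0.3),
    layers.Flatten(),                                    # flatten the feature maps for the dense layers
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(len(labels), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()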

Model summary

Start training the model and, after enough epochs and when the validation loss is lowest, save
the model.
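A minimal sketch of this step is given below, using early stopping and a model checkpoint that keeps the weights with the lowest validation loss; the epoch count, batch size and file name are illustrative.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor="val_loss", patience=10, min_delta=0.0001),
    ModelCheckpoint("best_model.hdf5", monitor="val_loss",
                    save_best_only=True, verbose=1),
]
history = model.fit(x_tr, y_tr, epochs=100, batch_size=32,
                    validation_data=(x_val, y_val), callbacks=callbacks)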

Testing

Load the saved model, get a random recording from the test data, and test it. Our results were
accurate up to 95%.
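A minimal sketch of this step is given below; the predict helper and file name are illustrative assumptions.

import numpy as np
from tensorflow.keras.models import load_model

model = load_model("best_model.hdf5")

def predict(audio):
    prob = model.predict(audio.reshape(1, 8000, 1))
    return le.classes_[np.argmax(prob)]                  # map the predicted index back to its label

i = np.random.randint(0, len(x_val))
print("Actual:", le.classes_[np.argmax(y_val[i])], "Predicted:", predict(x_val[i]))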

This research used Python to generate the MFCC features. It also used TensorFlow, an
open-source machine learning library, together with Keras, to build the deep learning model.
Deep learning is used to recognize the sounds obtained from Google's speech data set by
dividing the neural network into three groups, as shown in Table 5.1: convolutional blocks with
3×3 kernels, 2×2 pooling, and dropout layers that reduce the number of active neurons by 25%.

Table 5.1 The three groups of the deep neural network

Group I        Group II       Group III
CNN-32         CNN-64         Fully connected
MaxPooling     MaxPooling     Dropout
Dropout        Dropout        Fully connected

CHAPTER 6

Conclusion and Future Scope


1.9 Conclusion

The major use and value of speech recognition is the level of security it provides. Speech
recognition is transforming the way people conduct business on the internet. Because voice-mail
technology integrates speech recognition and telephony, businesses can design and deploy
voice-enabled online solutions today. In this work, an ANN is used to recognize speech.
Artificial neural networks (ANNs) are one of the most promising technologies today and are a
good starting point for classifying speech, a task that conventional rule-based logic does not
handle as well. We picked the MLP architecture because of its low sensitivity, which proved to
be the best solution. This research applies deep learning to speech recognition using an audio
library from Google and achieves 66.22% accuracy. Based on the experimental results, this
research can be applied to speech recognition; in other words, it will make computers more
intelligent and capable.

1.10 Future Scope

Voice recognition has had a huge impact on society, and scientists are increasingly interested
in learning more about it. This technology has a promising future, and hardware development
is crucial to its success.

REFERENCES
[1] Stuart Jonathan Russell, Peter Norvig, John F Canny, Jitendra M Malik, and Douglas D
Edwards. Artificial intelligence: a modern approach, volume 2. Prentice hall Upper Saddle
River, 2003.

[2] Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick.
Neuroscience-inspired artificial intelligence. Neuron, 95 (2):245–258, 2017.

[3] Marcel Van Gerven. Computational foundations of natural intelligence. Frontiers in


Computational Neuroscience, 11:112, 2017.

[4] Paul R Cohen and Edward A Feigenbaum. The handbook of artificial intelligence, volume
3. Butterworth-Heinemann, 2014.

[5] Ray Kurzweil. The singularity is near. Gerald Duckworth & Co, 2010.

[6] Marvin Minsky. The emotion machine: Commonsense thinking, artificial intelligence, and
the future of the human mind. Simon and Schuster, 2007.

[7] Dong Yu and Li Deng. Automatic Speech Recognition. Springer, 2016.

[8] Lawrence R Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition, volume
14. PTR Prentice Hall Englewood Cliffs, 1993.

[9] Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. A maximum likelihood approach to
continuous speech recognition. In Readings in speech recognition, pages 308–319. Elsevier,
1990.

[10] Stephen E Levinson, Lawrence R Rabiner, and Man Mohan Sondhi. An introduction to the
application of the theory of probabilistic functions of a markov process to automatic speech
recognition. The Bell System Technical Journal, 62(4):1035–1074, 1983.

[11] Su-Lin Wu, ED Kingsbury, Nelson Morgan, and Steven Greenberg. Incorporating
information from syllable-length time scales into automatic speech recognition. In Acoustics,
Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference
on, volume 2, pages 721–724. IEEE, 1998.

[12] Vaibhava Goel and William J Byrne. Minimum bayes-risk automatic speech recognition.
Computer Speech & Language, 14(2):115–135, 2000.

[13] Louis Ten Bosch. Emotions, speech and the asr framework. Speech Communication, 40(1-
2):213–225, 2003.

[14] Monita Chatterjee, Danielle J Zion, Mickael L Deroche, Brooke A Burianek, Charles J
Limb, Alison P Goren, Aditya M Kulkarni, and Julie A Christensen. Voice emotion recognition
by cochlear-implanted children and their normally-hearing peers. Hearing research, 322:151–
162, 2015.

[15] Nancy Eisenberg, Tracy L Spinrad, and Natalie D Eggum. Emotion-related self-regulation
and its relation to children’s maladjustment. Annual review of clinical psychology, 6:495–525,
2010.

[16] Harold Schlosberg. Three dimensions of emotion. Psychological review, 61 (2):81, 1954.

[17] Thomas S Polzin and Alex Waibel. Detecting emotions in speech. In Proceedings of the
CMC, volume 16. Citeseer, 1998.

[18] Chul Min Lee and Shrikanth S Narayanan. Toward detecting emotions in spoken dialogs.
IEEE transactions on speech and audio processing, 13(2): 293–303, 2005.

[19] Dimitrios Ververidis and Constantine Kotropoulos. Emotional speech recognition:


Resources, features, and methods. Speech communication, 48 (9):1162–1181, 2006.

[20] Tim Polzehl, Shiva Sundaram, Hamed Ketabdar, Michael Wagner, and Florian Metze.
Emotion classification in children’s speech using fusion of acoustic and linguistic features. In
Tenth Annual Conference of the International Speech Communication Association, 2009.

[21] Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. Survey on speech emotion
recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572–
587, 2011.

[22] Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. Recognising realistic
emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech
Communication, 53(9-10):1062–1087, 2011.

[23] Albert Mehrabian et al. Silent messages, volume 8. Wadsworth Belmont, CA, 1971.

[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[25] André Stuhlsatz, Christine Meyer, Florian Eyben, Thomas Zielke, Günter Meier, and
Björn Schuller. Deep neural networks for acoustic emotion recognition: raising the
benchmarks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International
Conference on, pages 5688–5691. IEEE, 2011.

[26] Arianna Mencattini, Eugenio Martinelli, Giovanni Costantini, Massimiliano Todisco,


Barbara Basile, Marco Bozzali, and Corrado Di Natale. Speech emotion recognition using
amplitude modulation parameters and a combined feature selection procedure. Knowledge-
Based Systems, 63:68–81, 2014.

[27] Longfei Li, Yong Zhao, Dongmei Jiang, Yanning Zhang, Fengna Wang, Isabel Gonzalez,
Enescu Valentin, and Hichem Sahli. Hybrid deep neural network–hidden markov model (dnn-
hmm) based speech emotion recognition. In Affective Computing and Intelligent Interaction
(ACII), 2013 Humaine Association Conference on, pages 312–317. IEEE, 2013.

[28] Qirong Mao, Ming Dong, Zhengwei Huang, and Yongzhao Zhan. Learning salient features
for speech emotion recognition using convolutional neural networks. IEEE Transactions on
Multimedia, 16(8):2203–2213, 2014.

[29] Haytham M Fayek, Margaret Lech, and Lawrence Cavedon. Towards real-time speech
emotion recognition using deep neural networks. In Signal Processing and Communication
Systems (ICSPCS), 2015 9th International Conference on, pages 1–5. IEEE, 2015.

[30] WQ Zheng, JS Yu, and YX Zou. An experimental study of speech emotion recognition
based on deep convolutional neural networks. In Affective Computing and Intelligent
Interaction (ACII), 2015 International Conference on, pages 827–831. IEEE, 2015.

[31] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A
Nicolaou, Björn Schuller, and Stefanos Zafeiriou. Adieu features? end-to-end speech emotion

recognition using a deep convolutional recurrent network. In Acoustics, Speech and Signal
Processing (ICASSP), 2016 IEEE International Conference on, pages 5200–5204. IEEE, 2016.

[32] Andrew Ng. Sequence models. Deeplearning.ai on Coursera, February 2018.

[33] Michalis Papakostas, Evaggelos Spyrou, Theodoros Giannakopoulos, Giorgos Siantikos,


Dimitrios Sgouropoulos, Phivos Mylonas, and Fillia Makedon. Deep visual attributes vs. hand-
crafted audio features on multidomain speech emotion recognition. Computation, 5(2):26,
2017.

[34] Tom M Mitchell. Machine learning. WCB/McGraw-Hill, 1997.

[35] C Bishop. Pattern recognition and machine learning (Information Science and Statistics),
1st edn 2006, corrected 2nd printing. Springer, New York, 2007.

[36] R S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Adaptive
Computation and Machine Learning, 2002.

[37] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to
statistical learning, volume 112. Springer, 2013.

[38] Aurélien Géron. Hands-on machine learning with Scikit-Learn and TensorFlow. O'Reilly Media, 2017.

[39] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural
networks, 12(1):145–151, 1999.

[40] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a
running average of its recent magnitude. COURSERA: Neural networks for machine learning,
4(2):26–31, 2012.

[41] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.

[42] Andrew Ng. Improving deep neural networks: Hyperparameter tuning, regularization and
optimization. Deeplearning.ai on Coursera, October 2017.

[43] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.

[44] Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. Face recognition: A
convolutional neural-network approach. IEEE transactions on neural networks, 8(1):98–113,
1997.

[45] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard,
Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-
propagation network. In Advances in neural information processing systems, pages 396–404,
1990.

[46] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas
Stolcke, Dong Yu, and Geoffrey Zweig. The microsoft 2016 conversational speech recognition
system. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International
Conference on, pages 5255–5259. IEEE, 2017.

[47] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for
text classification. In Advances in neural information processing systems, pages 649–657, 2015.

[48] David H Hubel. Single unit activity in striate cortex of unrestrained cats. The Journal of
physiology, 147(2):226–238, 1959.

[49] David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat’s
striate cortex. The Journal of physiology, 148(3):574–591, 1959.

[50] Nikhil Buduma and Nicholas Locascio. Fundamentals of deep learning: designing next-
generation machine intelligence algorithms. O'Reilly Media, Inc., 2017.

[51] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time
series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

[52] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified
activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

[53] Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE


Mobile Computing and Communications Review, 5(1):3–55, 2001.

[54] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume
10. John Wiley & Sons, 2016.

[55] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors.
arXiv preprint arXiv:1207.0580, 2012.

[56] Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in neural
information processing systems, pages 2814–2822, 2013.

[57] Andrew Ng. Neural networks and deep learning. Deeplearning.ai on Coursera, September
2017.

[58] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.

[59] Justin Salamon and Juan Pablo Bello. Deep convolutional neural networks and data
augmentation for environmental sound classification. IEEE Signal Processing Letters,
24(3):279–283, 2017.

[60] Navdeep Jaitly and Geoffrey E Hinton. Vocal tract length perturbation (vtlp) improves
speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and
Language, volume 117, 2013.

[61] Martin A Tanner and Wing Hung Wong. The calculation of posterior distributions by data
augmentation. Journal of the American statistical Association, 82(398):528–540, 1987.

[62] Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter F Sendlmeier, and Benjamin
Weiss. A database of German emotional speech. In Interspeech, volume 5, pages 1517–1520,
2005.

[63] Sanaul Haq, Philip JB Jackson, and J Edge. Speaker-dependent audio-visual emotion
recognition. In AVSP, pages 53–58, 2009.

[64] Giovanni Costantini, Iacopo Iaderola, Andrea Paoloni, and Massimiliano Todisco. Emovo
corpus: an italian emotional speech database. In LREC, pages 3501–3504, 2014.

[65] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian
Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz,
Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore,
Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever,
Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol
Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL
https://www.tensorflow.org/. Software available from tensorflow.org.

[66] Brian CJ Moore. An introduction to the psychology of hearing. Brill, 2012.

[67] David Pisoni and Robert Remez. The handbook of speech perception. John Wiley & Sons,
2008.

[68] Adam K Bosen and Monita Chatterjee. Band importance functions of listeners with
cochlear implants using clinical maps. The Journal of the Acoustical Society of America,
140(5):3718–3727, 2016.

[69] Andrew Ng. Convolutional neural networks. Deeplearning.ai on Coursera, October 2017.

[70] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks,"


arXiv preprint arXiv:1601.06759, 2016.

[71] R. Prenger, R. Valle, and B. Catanzaro, "Waveglow: A flow-based generative network for
speech synthesis," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2019: IEEE, pp. 3617-3621.

[72] S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech enhancement generative


adversarial network," arXiv preprint arXiv:1703.09452, 2017.

[73] N. Adiga, Y. Pantazis, V. Tsiaras, and Y. Stylianou, "Speech Enhancement for Noise-
Robust Speech Synthesis Using Wasserstein GAN," Proc. Interspeech 2019, pp. 1821-1825,
2019.

[74] X. Ma and E. Hovy, "End-to-end sequence labeling via bi-directional lstm-cnns-crf," arXiv
preprint arXiv:1603.01354, 2016.

[75] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural
machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.

PLAGIARISM REPORT

PUBLICATION
Tahira Rashid and Dr. Gurinder Kaur Sodhi, "Automatic Speech Recognition using Deep
Learning Algorithm." Journal of Fundamental & Comparative Research; IF = 7.268; Vol. VIII,
No. 1 (XXXVIII), 2022. ISSN: 2277-7067.

