
Contents

1 INTRODUCTION

2 LITERATURE SURVEY

3 OVERVIEW
3.1 Existing System
3.2 Proposed System
3.3 Speech Emotion Recognition
3.4 Methodology
3.5 Motive of Speech Emotion Recognition
3.6 Solution Approach
3.7 Objective

4 Speech Emotion Recognition
4.1 Existing System
4.2 Proposed System
4.3 Experimental Setup
4.4 Advantages and Uses

5 IDENTIFICATION MODEL
5.1 Architecture
5.2 Speech Emotion Recognition

6 SOFTWARES USED
6.1 Deep Learning
6.2 Long Short-Term Memory (LSTM)
6.3 Machine Learning
6.3.1 Evolution of Machine Learning
6.4 Support Vector Machines (SVMs)
6.4.1 Introduction to SVM
6.4.2 Working of SVM

7 SOFTWARE REQUIREMENTS SPECIFICATIONS
7.1 Purpose
7.2 Scope
7.3 Functional Requirements
7.4 Non-Functional Requirements

8 System Requirements
8.1 Software Requirements
8.2 Hardware Requirements
8.3 Algorithms
8.4 Modules

9 DESIGN
9.1 Use Case Diagram
9.2 Class Diagram
9.3 Sequence Diagram
9.4 Activity Diagram

10 CODING

11 Testing

12 GUI SCREENS

13 CONCLUSION AND FUTURE ENHANCEMENT
13.1 CONCLUSION
13.2 FUTURE ENHANCEMENT

14 BIBLIOGRAPHY
List of Figures

5.1 Architecture diagram for Speech Emotion Recognition

9.1 Use case diagram for Speech Emotion Recognition
9.2 Class diagram for Speech Emotion Recognition
9.3 Sequence diagram for Speech Emotion Recognition
9.4 Activity diagram for Speech Emotion Recognition

12.1 Application start-up screen
12.2 Load the model
12.3 Selected file loaded
12.4 Extract audio features
12.5 Select an audio clip
12.6 Training algorithm for SVM
12.7 Training algorithm for LSTM
ABSTRACT
Emotional state recognition of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER is significant in many real-time applications such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers that analyze the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and on traditional convolutional neural network (CNN) models that extract high-level features from speech spectrograms, which increases recognition accuracy but also the overall cost and complexity of the model. In contrast, we introduce a novel framework for SER that selects key sequence segments using a radial basis function network (RBFN) similarity measurement within clusters. The selected sequence is converted into a spectrogram by applying the short-time Fourier transform (STFT) and passed to a CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bidirectional long short-term memory (BiLSTM) network to learn the temporal information needed to recognize the final emotional state. The proposed technique processes only the key segments instead of the whole utterance, which reduces the computational complexity of the overall model, and normalizes the CNN features before further processing so that the spatio-temporal information is recognized more easily. The proposed system is evaluated on the standard IEMOCAP, EMO-DB, and RAVDESS datasets, with the goals of improving recognition accuracy and reducing the processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated experimentally in comparison with state-of-the-art SER methods, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.

Chapter 1

INTRODUCTION

Automatic recognition and identification of emotions from speech signals in speech emotion recognition (SER) using machine learning is a challenging task [1]. Speech is a quick and natural way of communicating and exchanging information between humans and computers, and SER has many real-world applications in the domain of human-computer interaction (HCI). Currently, researchers face a major challenge in feature extraction: how to select a robust method to extract salient and discriminative features from speech signals that represent the emotional state of a speaker from its acoustic content. In the past decade, many researchers have investigated low-level handcrafted features for SER such as energy, zero-crossing rate, pitch, linear predictor coefficients, Mel-frequency cepstral coefficients (MFCCs), and non-linear features such as the Teager energy operator. Nowadays, researchers mostly utilize deep learning techniques for SER, using Mel-scale filter bank speech spectrograms as input features. A spectrogram is a 2-D representation of a speech signal that is widely used in convolutional neural networks (CNNs) for extracting salient and discriminative features in SER [2] and in other signal processing applications [3], [4]. 2-D CNNs were originally designed for visual recognition tasks [5]-[7], and researchers are inspired by their performance to explore 2-D CNNs in the field of SER. Spectrograms are suitable representations of speech signals for CNN models to extract high-level salient information and recognize emotions in speech signals. Similarly, some researchers have developed fully convolutional networks (FCNs) on top of CNNs to handle inputs of variable size, and FCNs have achieved good performance in time-series classification tasks [8]. A drawback of FCNs is that they are not able to learn temporal information; for this purpose, LSTM-RNNs are better suited, as they learn spatial and temporal features across sequences [9]. In the field of SER, CNN-LSTM and LSTM-RNN models are now widely used for extracting hidden temporal information [10]. Some researchers improve the recognition performance of SER by selecting salient segments from speech signals and learning temporal features with a CNN-LSTM model [11]. Badshah et al. [12] proposed a method for SER using CNN features for smart affective devices to recognize the emotional state of a person in health care centers.

SER is an active area of research; recently, researchers have been utilizing deep learning techniques to develop a variety of methods to recognize the emotional state of speakers. Typically, researchers use CNN models to learn high-level salient and discriminative features and feed them to an LSTM network, which learns hidden temporal features to recognize emotions across sequences. The use of CNNs and artificial intelligence increases recognition accuracy, but the computational cost also increases because of the huge network weights. Existing traditional CNN and LSTM architectures have not shown substantial improvements in accuracy or reductions in the cost and complexity of existing SER systems.

In this research, we propose a novel deep learning-based approach for SER using RBF-based K-means clustering with a deep BiLSTM network. In the proposed method, we select the emotional segments from the whole audio, utilizing an RBF-based similarity measurement technique to select one segment from each cluster. The selected sequence of segments is converted into spectrograms using the STFT algorithm. Furthermore, we extract high-level discriminative features from the selected segments utilizing the "FC-1000" layer of the ResNet101 [13] model. After that, we use a mean and standard deviation strategy to normalize the features and feed them to a deep BiLSTM network to extract temporal information and recognize the final state. A Softmax classifier is used to produce the probabilities over the speech emotions.
The main contributions of the proposed technique are documented below:

1. We propose an efficient and novel framework for SER that is able to learn spatial and temporal information from speech spectrograms by leveraging a CNN with a deep bidirectional LSTM. Our model is capable of learning features and automatically modeling the temporal dependencies. To the best of our knowledge, the CNN model used in our research is novel in the SER domain; therefore, we aim to contribute to the SER literature by using ResNet101 features in an effective manner, integrated with a sequential learning mechanism.

2. We propose a new strategy for SER using sequence selection and extraction via a non-linear RBFN-based method to find the similarity level in clustering. We select one key segment from each cluster, the one nearest to the centroid of the cluster, which represents the rest of the segments. Furthermore, we process only these key segments, which ensures accurate recognition of emotion and reduces the processing time, as the experiments show.

3. We show that the presented technique is a recent success of a deep learning approach, based on key segment sequence selection and normalization of CNN features with the mean and standard deviation, that readily improves on existing state-of-the-art methods. To the best of our knowledge, this is a new deep learning approach for SER based on an RBFN with a CNN and a deep BiLSTM. Thus, a key contribution of our framework lies in the use of a normalization technique to enhance the usefulness of the features.

4. We tested the proposed SER model on different benchmark datasets and evaluated it from different perspectives against baseline methods; the results are encouraging and are suitable for monitoring and recognizing the real-time emotions of speakers. The achieved accuracies for the IEMOCAP, EMO-DB, and RAVDESS datasets are 72.25%, 85.57%, and 77.02%, respectively.

Chapter 2

LITERATURE SURVEY

Digital signal processing is an emerging field of research in this era, and over the past decade many researchers have developed various approaches for SER. Typically, the SER task is divided into two main parts: feature selection and classification. Selecting discriminative features and a classification method that correctly recognizes the emotional state of the speaker is a challenging task in this domain [14]. With the increase in data and computational power, deep learning approaches are rapidly being adopted for SER [15], and many researchers have used deep learning for robust feature representation in various fields [11]. Motivated by their enormous success in visual recognition tasks, Huang et al. [4] presented a CNN-based approach for SER; similarly, [16] used a CNN to learn high-level discriminative features from spectrograms of speech signals and recognize the emotional state of speakers. Some researchers have used the Gaussian mixture model with robust features to classify the emotional state of a speaker [17].

Nowadays, most researchers work with 2-D CNNs to extract high-level discriminative features from speech signals. Hence, extracting spectrograms, which plot speech signals with respect to time, and feeding them to CNNs to learn hidden information has become a new trend of research for SER [2], [18]. Similarly, transfer learning strategies can be used for SER by passing speech spectrograms through pre-trained CNN models such as VGG [5] or AlexNet [19]. The spectrogram is a suitable representation for CNN models to extract high-level discriminative features from speech signals and recognize the emotional state of the speaker in an SER system [20]. Similarly, LSTM-RNNs are mostly used to learn hidden temporal information in speech signals and are frequently employed in SER systems [21], [22]. Deep learning approaches now play a crucial role in increasing research interest in SER. Recently, [23] presented an end-to-end LSTM-DNN model for SER that combines LSTM layers and fully connected layers to extract representations directly from raw data rather than from hand-crafted features.

A joint CNN-LSTM approach is presented in [24], where deep salient high-level features are extracted from raw speech data using a CNN and passed to an LSTM network to capture the sequential information, similar to [25]. Ma et al. [26] presented a neural network structure that takes variable-length speech for SER; in this method, a CNN represents the features of the speech spectrograms and RNNs handle the variable-length speech segments. Zhang et al. [27] presented a technique for SER that utilizes the pre-trained AlexNet model for feature representation and a traditional support vector machine (SVM) for emotion classification. Similarly, Liu et al. [28] used a CNN-LSTM model for spontaneous SER on the RECOLA [29] natural emotion dataset.

In the field of SER, many methods utilize CNN models with different types of input to extract salient features from speech signals and boost recognition accuracy [30]. Similarly, some researchers use a pre-trained model to extract high-level features from speech spectrograms and train a separate classifier [31] for recognition, which increases the computational cost of the system. In this paper, we develop a novel SER technique that processes only the useful segments of the whole audio file, selected through a K-means clustering algorithm using an RBF-based similarity measurement. The selected segments of speech are converted into spectrograms, and high-level discriminative features are extracted using the CNN model ResNet101. Furthermore, we normalize these features using the mean and standard deviation and then feed them to a deep BiLSTM network to learn hidden temporal information from the speech segments and recognize the final emotional state of the speaker. The proposed system reduces execution time, because it processes only the selected segments rather than all segments, and increases accuracy, because it uses salient, normalized features with a deep BiLSTM network. To the best of our knowledge, the proposed architecture is novel and more efficient than the other methods described in the literature.

Chapter 3

OVERVIEW

3.1 Existing System


Digital signal processing is an emerging field of research in this era, and over the past decade many researchers have developed various approaches for SER. Typically, the SER task is divided into two main parts: feature selection and classification. Selecting discriminative features and a classification method that correctly recognizes the emotional state of the speaker is a challenging task in this domain [14]. With the increase in data and computational power, deep learning approaches are rapidly being adopted for SER [15], and many researchers have used deep learning for robust feature representation in various fields.

3.2 Proposed System


The proposed methodology of the SER framework and its main components, including emotion recognition in speech, are discussed in detail. The suggested framework consists of three main blocks. The first block has two parts: in the first part, we divide the audio file into multiple segments with respect to time and compute the difference between consecutive segments. The obtained difference is compared against a threshold to check similarity and to find the value of "K" for clustering, using the shot boundary detection method. We start with K = 1 and estimate the pairwise difference between consecutive segments; whenever the difference exceeds the threshold, the value of "K" is automatically incremented by one. In this way, we select the value of "K" dynamically for clustering and form the groups accordingly.
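
As an illustration of this dynamic selection of K, the following minimal Python sketch assumes the audio has already been split into fixed-length segments represented as feature vectors; the function name, the segment representation, and the threshold value are illustrative and are not taken from the report.

```python
import numpy as np

def estimate_k(segments, threshold):
    """Estimate the number of clusters K from the differences between
    consecutive segments (shot-boundary style): start at K = 1 and
    increment K whenever the difference exceeds the threshold."""
    k = 1
    for prev, curr in zip(segments[:-1], segments[1:]):
        # Distance between consecutive segment feature vectors.
        if np.linalg.norm(curr - prev) > threshold:
            k += 1
    return k

# Example: ten random 128-dimensional segment vectors, illustrative threshold.
segments = [np.random.rand(128) for _ in range(10)]
print(estimate_k(segments, threshold=2.0))
```

With a well-chosen threshold, K grows only when consecutive segments differ noticeably, which is the behavior described above.
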

3.3 Speech Emotion Recognition


Emotional state recognition of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER is significant in many real-time applications such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers that analyze the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and on traditional convolutional neural network (CNN) models that extract high-level features from speech spectrograms, which increases recognition accuracy but also the overall cost and complexity of the model. In contrast, we introduce a novel framework for SER that selects key sequence segments using a radial basis function network (RBFN) similarity measurement within clusters. The selected sequence is converted into a spectrogram by applying the short-time Fourier transform (STFT) and passed to a CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bidirectional long short-term memory (BiLSTM) network to learn the temporal information needed to recognize the final emotional state. The proposed technique processes only the key segments instead of the whole utterance, which reduces the computational complexity of the overall model, and normalizes the CNN features before further processing so that the spatio-temporal information is recognized more easily. The proposed system is evaluated on the standard IEMOCAP, EMO-DB, and RAVDESS datasets, with the goals of improving recognition accuracy and reducing the processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated experimentally in comparison with state-of-the-art SER methods, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.

3.4 Methodology
The overall procedure of the proposed ECG arrhythmia classification model is as follows. The original ECG signals are taken from the MIT-BIH arrhythmia database [31]. The input ECG signals were divided into recordings with an identical duration of 10 seconds.
For the one-dimensional ECG time-domain signals there are five different classes of arrhythmia, based on recording annotations made independently by two or more cardiologists. Afterward, each ECG record is transformed into a time-frequency spectrogram image using the short-time Fourier transform (STFT). The ECG spectrogram images are fed into the proposed deep two-dimensional convolutional neural network (CNN) model, and classification of the five ECG types is performed automatically by the 2D-CNN classifier. The five ECG types are normal beat (NOR), left bundle branch block beat (LBB), right bundle branch block beat (RBB), premature ventricular contraction beat (PVC), and atrial premature contraction beat (APC).

3.5 Motive of Speech Emotion Recognition

Digital signal processing is an emerging field of research in this era. SER is an active area of research; recently, researchers have been utilizing deep learning techniques to develop a variety of methods to recognize the emotional state of speakers. Typically, researchers use CNN models to learn high-level salient and discriminative features and feed them to an LSTM network, which learns hidden temporal features to recognize emotions across sequences. The use of CNNs and artificial intelligence increases recognition accuracy, but the computational cost also increases because of the huge network weights.

3.6 Solution Approach

Bidirectional recurrent neural networks (RNNs) essentially put two independent RNNs together; this structure gives the network both backward and forward information about the sequence at every time step. We feed the discriminative features to a deep BiLSTM of this kind to learn the hidden information, recognize the final state of the sequence, and classify the emotional state of the speaker.
We suggest a new technique to select a more informative sequence from speech using an RBF-based K-means clustering algorithm and convert it into spectrograms by applying the STFT algorithm. We then extract discriminative and salient features from the spectrograms of the speech signal using the "FC-1000" layer of the CNN model ResNet101 and normalize them with the mean and standard deviation to remove variation. After normalization, these features are passed to the deep BiLSTM as described above. We evaluated the proposed system on the three standard IEMOCAP, EMO-DB, and RAVDESS datasets to check its robustness, achieving recognition accuracies of 72.25% on IEMOCAP, 85.57% on EMO-DB, and 77.02% on RAVDESS.

3.7 Objective

In phenotype prediction, the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop breeding, and related fields. We investigated three phenotype prediction problems: one simple and clean (yeast), and two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forests, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations.

Chapter 4

Speech Emotion Recognition

4.1 Existing System

Digital signal processing is an emerging field of research in this era, and over the past decade many researchers have developed various approaches for SER. Typically, the SER task is divided into two main parts: feature selection and classification. Selecting discriminative features and a classification method that correctly recognizes the emotional state of the speaker is a challenging task in this domain [14]. With the increase in data and computational power, deep learning approaches are rapidly being adopted for SER [15], and many researchers have used deep learning for robust feature representation in various fields [11]. Motivated by their enormous success in visual recognition tasks, Huang et al. [4] presented a CNN-based approach for SER; similarly, [16] used a CNN to learn high-level discriminative features from spectrograms of speech signals and recognize the emotional state of speakers. Some researchers have used the Gaussian mixture model with robust features to classify the emotional state of a speaker.

Nowadays, most researchers work with 2-D CNNs to extract high-level discriminative features from speech signals. Hence, extracting spectrograms, which plot speech signals with respect to time, and feeding them to CNNs to learn hidden information has become a new trend of research for SER [2], [18]. Similarly, transfer learning strategies can be used for SER by passing speech spectrograms through pre-trained CNN models such as VGG [5] or AlexNet [19]. The spectrogram is a suitable representation for CNN models to extract high-level discriminative features from speech signals and recognize the emotional state of the speaker in an SER system [20]. Similarly, LSTM-RNNs are mostly used to learn hidden temporal information in speech signals and are frequently employed in SER systems [21], [22]. Deep learning approaches now play a crucial role in increasing research interest in SER. Recently, [23] presented an end-to-end LSTM-DNN model for SER that combines LSTM layers and fully connected layers to extract representations directly from raw data rather than from hand-crafted features.

A joint CNN-LSTM approach is presented in [24], where deep salient high-level features are extracted from raw speech data using a CNN and passed to an LSTM network to capture the sequential information, similar to [25]. Ma et al. [26] presented a neural network structure that takes variable-length speech for SER; in this method, a CNN represents the features of the speech spectrograms and RNNs handle the variable-length speech segments. Zhang et al. [27] presented a technique for SER that utilizes the pre-trained AlexNet model for feature representation and a traditional support vector machine (SVM) for emotion classification. Similarly, Liu et al. [28] used a CNN-LSTM model for spontaneous SER on the RECOLA [29] natural emotion dataset.

In the field of SER, many methods utilize CNN models with different types of input to extract salient features from speech signals and boost recognition accuracy [30]. Similarly, some researchers use a pre-trained model to extract high-level features from speech spectrograms and train a separate classifier [31] for recognition, which increases the computational cost of the system. In this paper, we develop a novel SER technique that processes only the useful segments of the whole audio file, selected through a K-means clustering algorithm using an RBF-based similarity measurement. The selected segments of speech are converted into spectrograms, and high-level discriminative features are extracted using the CNN model ResNet101. Furthermore, we normalize these features using the mean and standard deviation and then feed them to a deep BiLSTM network to learn hidden temporal information from the speech segments and recognize the final emotional state of the speaker. The proposed system reduces execution time, because it processes only the selected segments rather than all segments, and increases accuracy, because it uses salient, normalized features with a deep BiLSTM network. To the best of our knowledge, the proposed architecture is novel and more efficient than the other methods described in the literature.

4.2 Proposed System

In this section, the proposed methodology of the SER framework and its main components, including emotion recognition in speech, are discussed in detail. The suggested framework consists of three main blocks. The first block has two parts. In the first part, we divide the audio file into multiple segments with respect to time and compute the difference between consecutive segments. The obtained difference is compared against a threshold to check similarity and to find the value of "K" for clustering, using the shot boundary detection method [32]. We start with K = 1 and estimate the pairwise difference between consecutive segments; whenever the difference exceeds the threshold, the value of "K" is automatically incremented by one. In this way, we select the value of "K" dynamically for clustering and form the groups accordingly. Furthermore, we select one segment from each group or sequence as a key segment, namely the segment nearest to the center of the cluster. We use the RBF strategy for similarity measurement inside the clustering algorithm, which is explained in detail in part B below. In the second part, the selected sequence of key segments is converted into spectrograms, plotting frequency with respect to time using the STFT. In the second main block, we perform feature learning to extract salient and discriminative features from the speech spectrograms with a transfer learning strategy, utilizing the "FC-1000" layer of the pre-trained ResNet101 [13]; the layers of the network follow the standard ResNet101 architecture. The learned features are normalized with the mean and standard deviation for better performance. In the last block, we feed the extracted, normalized CNN features to the suggested deep bidirectional LSTM to learn temporal cues, recognize the sequential information in the sequence, and analyze the final emotional state of the speaker in the speech signal. A detailed description of each block of the framework is given in the subsequent sections.

A. PRE-PROCESSING AND SEQUENCE SELECTION

In this step, we split the audio file into multiple chunks (frames) over a suitable time window and convert the whole utterance into segments. The selection of a suitable duration for the audio segments is a challenging problem. Many researchers have studied how to choose a suitable duration for each speech segment and have found that a speech segment longer than 260 ms carries enough information to recognize the emotion in the speech [33], [34].
In this paper, after experimenting with multiple frame durations, we select a 500 ms window size to convert a single utterance into several segments. A single label is assigned to all segments of one utterance, and the segments are given to the K-means clustering algorithm [35] to group similar segments together. The K-means clustering algorithm is widely used for grouping big data [36]. The Euclidean distance [37], [38] is conventionally used in K-means clustering to compute the difference between elements, but in this work we replace it with radial basis functions (RBF) [37], [38] to compute the difference between two frames, because the RBF is a non-linear measure that, like the human brain, compares and recognizes patterns non-linearly. The other important choice is the value of "K" for partitioning the data into "K" groups. The standard K-means algorithm fixes the value of "K" in advance, typically with random initialization, but in this approach we select the "K" value for each file dynamically by using the shot boundary detection method to estimate similarity [32]. The pairwise difference is computed between consecutive frames, and if the difference is greater than the selected threshold, the "K" value is incremented by one. After all segments have been clustered with the K-means algorithm, one segment is selected from each cluster as the key segment, namely the one nearest to the centroid of the cluster according to the RBF distance, as explained in the next subsection. The selected key segments are converted into spectrograms using the STFT algorithm for a 2-D representation.
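
The following sketch illustrates the segmentation and spectrogram step in Python using the librosa library; the report's implementation is in MATLAB, so the library choice, the FFT size, the hop length, and the function names here are assumptions made only for illustration.

```python
import numpy as np
import librosa

def split_and_spectrogram(path, seg_ms=500, overlap=0.25):
    """Split an utterance into fixed-length segments and convert each
    segment to a log-magnitude STFT spectrogram (2-D representation)."""
    y, sr = librosa.load(path, sr=16000)            # IEMOCAP/EMO-DB use 16 kHz audio
    seg_len = int(sr * seg_ms / 1000)               # 500 ms window
    hop = int(seg_len * (1 - overlap))              # step between segment starts
    segments = [y[i:i + seg_len]
                for i in range(0, len(y) - seg_len + 1, hop)]
    specs = [librosa.amplitude_to_db(np.abs(librosa.stft(s, n_fft=512, hop_length=128)))
             for s in segments]
    return segments, specs
```

Each returned spectrogram can then be saved or rendered as an image before being passed to the CNN feature extractor.
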

B. SIMILARITY MEASURING BASED ON RBF

In this section, we document the non-linear similarity measure used between audio signal segments and discuss the RBF-based similarity approach for audio signal processing. The RBF computes the similarity between segments in a non-linear way [39]. The visual perception area of the human brain also relies on non-linear processing to differentiate and recognize patterns; hence, we use this approach in our framework to measure the similarity between audio segments.
We explore the RBF to simulate a non-linear human perception model that captures and computes the similarity between audio segments; our model thus behaves as a non-linear model based on an RBFN [40]. We use a mapping function to find the degree of similarity between audio segments, and the concept of regularization is applied to estimate the mapping function of the basic RBF. A 1-D Gaussian-shaped model [41] meets an important requirement of the regularization method: it smooths the mapping function so that similar inputs lead to consistent, similar outputs.
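
A minimal sketch of the RBF similarity and key-segment selection follows, assuming each segment is already summarized as a feature vector. Standard Euclidean K-means from scikit-learn is used for the grouping itself, with the RBF measure applied when picking the key segment nearest to each centroid; this simplifies the report's RBF-inside-K-means formulation, and the gamma value is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_similarity(a, b, gamma=0.1):
    """Gaussian (RBF) similarity between two segment feature vectors:
    1.0 for identical segments, decaying towards 0 as they differ."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def select_key_segments(features, k):
    """Cluster the segment features with K-means and keep, from each
    cluster, the single segment most similar (in the RBF sense) to the
    cluster centroid; these key segments represent their clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    keys = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        sims = [rbf_similarity(features[i], km.cluster_centers_[c]) for i in idx]
        keys.append(int(idx[int(np.argmax(sims))]))   # index of the key segment
    return sorted(keys)
```

Maximizing the RBF similarity to the centroid is equivalent to minimizing the Euclidean distance, so the selected segment is the one closest to the cluster center, as described above.
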

C. CNN FEATURE EXTRACTION AND RECURRENT NEURAL NETWORK

In this section, we discuss the feature extraction and RNN processing of the sequential audio data in detail, for recognizing the emotions of a speaker from speech. CNNs are currently the most powerful tool for representing and recognizing hidden information in data. In our framework, we convert the speech signal into multiple segments; each individual segment is represented by CNN features, followed by a deep BiLSTM that finds the sequential information. Speech signals contain much redundant information, which is computationally expensive to process and degrades the overall efficiency of the model. Considering this constraint, we propose a novel technique for selecting the most dominant sequence from an utterance based on K-means and the RBF, as explained in the sections above. Each segment of the selected sequence is converted into a spectrogram, which plots the frequencies with respect to time for a 2-D representation using the STFT algorithm. The sequence of spectrograms [43] is fed to the pre-trained ResNet101 CNN model [44] to extract high-level discriminative features through a transfer learning strategy, using the last "FC-1000" layer.
The features of each segment are treated as one RNN step with respect to the time interval. RNNs are the dominant tool for analyzing hidden information in both spatial and temporal sequential data [45]. We process all key segments of every utterance, and the final state of the RNN for each utterance gives the final recognized emotion. RNNs can easily learn sequential data but forget the earlier parts of long sequences; this is the vanishing gradient problem of RNNs, which is solved by the LSTM [46], a special type of RNN with input, output, and forget gates for learning long sequences.
Here x_t represents the input at time t, and f_t represents the forget gate of the LSTM, which clears information from the cell while keeping a record of the previous state. o_t represents the output gate, which controls the information exposed at each step, and g represents the recurrent candidate unit with a tanh activation, computed from the present input segment and the previous hidden state. The memory cell c_t holds the hidden state of the network and is updated at every step through the tanh activation function. The final RNN state is fed to a Softmax classifier, which takes the final decision of the network. A simple LSTM network cannot correctly model huge amounts of data with long and complex sequences; hence, in this paper we propose a multi-layer deep BiLSTM to learn and recognize long-term sequences in audio data for emotion recognition. The internal structure of the LSTM memory blocks is outlined in Section 6.2.
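
The report extracts the "FC-1000" features with MATLAB's pre-trained ResNet101; the sketch below shows the equivalent idea in Python with torchvision, which is an assumption rather than the report's actual code. The final fully connected layer of ResNet101 outputs a 1000-dimensional vector for each spectrogram image.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet101; its final fully connected layer produces a
# 1000-dimensional output, analogous to the "FC-1000" layer in the report.
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc1000_features(spectrogram_image: Image.Image) -> torch.Tensor:
    """Return the 1000-D ResNet101 output for one spectrogram image."""
    x = preprocess(spectrogram_image).unsqueeze(0)   # shape (1, 3, 224, 224)
    with torch.no_grad():
        return resnet(x).squeeze(0)                  # shape (1000,)
```

The sequence of such vectors, one per key segment, is what the deep BiLSTM consumes in the next step.
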

D. BI-DIRECTIONAL LSTM

In a BiLSTM, the output at time t depends on both the previous and the next segments of the sequence, not on a single segment only [47]. A bidirectional RNN consists of two stacked RNNs, one running in the forward direction and one in the backward direction, and the joint output is computed from both hidden states. In this paper we use a multi-layer LSTM network: our method uses a two-layer network for both the backward and the forward pass. During training, the forward and backward hidden states are combined in the output layer; after the output layer, the cost and the validation error are computed, and the weights and biases are adjusted through backpropagation.
The network is validated on 20% of the data, held out from the training data, and the error rate on the validation data is computed using cross-entropy. Adam optimization [48] is used to minimize the cost, with a learning rate of 0.001. In the deep BiLSTM network, both the forward and the backward pass consist of stacked LSTM cells, which deepen the network and allow the output to be computed from both the previous and the next parts of the sequence, because the network operates in both directions.
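
A PyTorch sketch of the deep BiLSTM classifier described above follows (two layers, forward and backward passes, Adam with a 0.001 learning rate, cross-entropy loss). The feature dimension, hidden size, and number of emotion classes are illustrative assumptions; the report's own implementation uses MATLAB's neural network toolbox.

```python
import torch
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    """Two-layer bidirectional LSTM over the sequence of normalized CNN
    features; the last time-step output feeds a softmax emotion classifier."""
    def __init__(self, feat_dim=1000, hidden=256, num_emotions=4):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_emotions)   # forward + backward states

    def forward(self, x):                # x: (batch, seq_len, feat_dim)
        out, _ = self.bilstm(x)
        return self.fc(out[:, -1, :])    # logits; softmax is applied in the loss

model = DeepBiLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate from the report
criterion = nn.CrossEntropyLoss()        # cross-entropy, as used for validation error
```

Each training batch would contain one padded sequence of key-segment feature vectors per utterance, with the utterance-level emotion as the target label.
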

4.3 Experimental setup

In this section, we evaluate the effectiveness of the proposed SER system and compare it with other baseline methods on publicly available benchmark speech emotion datasets. We use three public speech emotion datasets: IEMOCAP [49], the interactive emotional dyadic motion capture dataset; Emo-DB [50], the Berlin emotional database; and RAVDESS [51], the Ryerson audio-visual database of emotional speech and song. A detailed description of the datasets is given in the following subsections.

A. IEMOCAP DATASET

IEMOCAP [49] is a well-known dataset commonly used for recognizing emotional speech, and it contains two types of dialogs, scripted and improvised. The dataset was recorded by 10 experienced actors and comprises 12 hours of audiovisual data, including audio, video, facial motion capture, speech, and text transcriptions. The IEMOCAP dataset has five sessions, and in each session two actors (one male and one female) recorded the emotional scripts, 3 to 15 seconds long, at a 16 kHz sampling rate. Each session covers emotion categories such as anger, sadness, happiness, neutral, surprise, disgust, frustration, excitement, and fear; the data were annotated individually by three expert annotators, and we select those utterances on which at least two experts agreed. In this paper, we evaluate our system on the four emotions most commonly used in the literature for comparison: anger, sadness, happiness, and neutral.
These four emotions are distributed over all five sessions of the IEMOCAP dataset for evaluating the model. We use a 5-fold cross-validation technique to train the speaker-independent model: in each fold, four sessions are used for training and one session for testing.

B. EMO-DB BERLIN EMOTION DATASET

The Berlin emotion database Emo-DB [50] contains 535 utterances recorded by ten actors, 5 male and 5 female. Each actor read pre-selected sentences with different emotions: anger, fear, boredom, disgust, happiness, neutral, and sadness. The Emo-DB utterances are approximately 2 to 3 seconds long with a 16 kHz sampling rate.
Emo-DB is a small dataset with a limited set of emotions. We use a 5-fold cross-validation technique to train the speaker-independent model for recognizing emotions in daily conversations: the sentences of 8 speakers are used for training the system and the other 2 speakers for testing.

C. RAVDESS DATASET

RAVDESS (the Ryerson audio-visual database of emotional speech and song) [51] is an acted dataset recorded in English and broadly used for expressive song and dialog reactions. The dataset contains eight emotions recorded by 24 professional actors, 12 male and 12 female: sad, calm, happy, angry, surprised, neutral, fearful, and disgust. A total of 1440 audio files were recorded at a 48 kHz sampling rate. We performed experiments using a 5-fold cross-validation technique to split the dataset into training and testing parts.

D. EXPERIMENTAL EVALUATION

In this section, we evaluate our system for speaker-dependent and speaker-independent emotion recognition. We separate each utterance into multiple segments fs with respect to time t, with 25% overlap, to select the sequence s = {fs1, fs2, fs3, ..., fsn} from each utterance. The RBF-based similarity measure is used within K-means clustering to select one segment from each cluster as a key segment, namely the one nearest to the centroid of the cluster, which represents the whole cluster. A detailed description is given in Section 4.2.
After selecting the key segments, we extract high-level discriminative features utilizing the "FC-1000" layer of the ResNet101 model and normalize the extracted features with the global mean and standard deviation to boost the accuracy of the overall model. The normalized features are fed to the deep BiLSTM network step by step to learn the hidden patterns and recognize the emotion in the given sequence. The final state of the proposed deep BiLSTM network is followed by a Softmax classifier that produces the probabilities over the emotions. The system was implemented in MATLAB 2019b, using the neural network toolbox for feature extraction, model training, and evaluation. The data are divided into training and testing folds with an 80:20 ratio, and spectrograms are generated for every segment. The suggested model was trained and evaluated on a single NVIDIA GeForce GTX 1070 GPU with 8 GB of on-board memory. A detailed description of the speaker-dependent and speaker-independent experiments is given in the following subsections.

E. MODEL OPTIMIZATION

In the training stage, we tuned the model with different parameters to make it efficient and optimal for SER. We performed experiments with multiple batch sizes, learning rates, and numbers of LSTM and BiLSTM layers to choose the optimal configuration. We selected the Adam optimizer for model optimization, with bias correction for better effect. We also experimented with normalized and un-normalized features to check the model efficiency. We selected a batch size of 512 and a learning rate of 0.001 for this model, values established empirically through extensive experiments on the three speech emotion datasets. We performed two types of experiments, with and without normalized features, and compared the results of both to select the features for model training. The detailed parameter settings and the corresponding results of the proposed model for normalized and un-normalized features are shown in the tables below for every dataset; each table reports the results for an individual dataset with different batch sizes and learning rates. The best learning rate and batch size for the whole model were selected on the basis of these extensive experiments over all datasets.
The results of the proposed model using normalized and un-normalized features show that feature normalization improves the overall recognition accuracy by 0.4 for IEMOCAP, 0.23 for EMO-DB, and 0.19 for RAVDESS compared with un-normalized features. Hence, recognition accuracy with normalized features is better, and the processing time for model training and testing is lower than that of other baseline models. Similarly, we compare the processing time of our model with other baseline methods under diverse parameters to demonstrate the model's effectiveness and feasibility. We set the batch size to 512, select the 0.001 learning rate with the Adam optimizer, and analyze the processing time for the IEMOCAP, EMO-DB, and RAVDESS datasets utilizing the normalized features.
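
The mean and standard deviation normalization referred to here amounts to a global z-score of the CNN features; a minimal sketch follows, with the epsilon guard added as an assumption to avoid division by zero.

```python
import numpy as np

def normalize(features, eps=1e-8):
    """Z-score normalize CNN features with the global (training-set)
    mean and standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps), mean, std

# At test time, reuse the training mean/std rather than recomputing them.
```
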
The processing-time measurements indicate that the proposed model takes less time in training and testing because of its efficient strategy. In the proposed model we do not take all segments of each utterance; we select only one segment from each cluster as a key segment that represents the whole cluster and train the model on these selected segments. This is the reason for the lower processing time: our model processes only the selected segments, not all segments of the utterance, and extracts the CNN features, which are fed to the deep BiLSTM network for classification.

F. SPEAKER-INDEPENDENT PERFORMANCE OF THE PROPOSED MODEL

We performed experiments on the spontaneous emotional data of the IEMOCAP and EMO-DB datasets and also evaluated the effectiveness of the model on the RAVDESS corpus. The IEMOCAP and EMO-DB corpora have 10 speakers each, and the RAVDESS dataset has 24 speakers. We follow a 5-fold cross-validation technique to split the data with an 80:20 ratio according to the number of speakers: 80% of the data are used for model training and the remaining data for testing. We evaluated the proposed system on these datasets and checked the prediction performance on the testing data. The overall model performance is presented in terms of class-level precision, recall, and F1 score for each emotion. We also report the weighted accuracy, the ratio of correctly classified emotions to the total number of samples, and the un-weighted accuracy, the mean of the per-class recognition rates, so that every emotion class contributes equally regardless of its size.
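
Interpreting weighted accuracy as overall sample-level accuracy and un-weighted accuracy as the mean of per-class recalls (the usual SER convention), these metrics can be computed as in the following sketch; scikit-learn is an assumption here, since the report's evaluation is done in MATLAB.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

def weighted_unweighted_accuracy(y_true, y_pred):
    """Weighted accuracy (WA): fraction of all utterances classified correctly.
    Unweighted accuracy (UA): mean of the per-class recalls, so every
    emotion class contributes equally regardless of its size."""
    cm = confusion_matrix(y_true, y_pred)
    wa = np.trace(cm) / cm.sum()
    ua = recall_score(y_true, y_pred, average="macro")
    return wa, ua
```

The same confusion matrix also provides the per-class precision and recall values reported in the confusion-matrix figures.
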
We measure the proposed system by weighted and un-weighted accuracy and show the precision and recall values of each category in a confusion matrix. The confusion matrix shows the actual and predicted emotions and the confusion of the model for each class, summarizing the classification performance of the suggested system for speaker-independent evaluation on the challenging IEMOCAP dataset. In this experiment, we obtained 83% accuracy for the anger emotion, 78% for sad, 70% for neutral, and 58% for happy. The recognition rates of the happy and neutral emotions are low in this experiment, but we still obtained better results than the state of the art. The results for the EMO-DB dataset are similar.
For EMO-DB, the overall emotion recognition performance is higher than that of other baseline methods; the recognition rate of the happy emotion is improved but still comparatively low, as happy is mostly confused with other emotions during classification. Anger, fear, and boredom reach accuracies greater than 90%, and disgust, neutral, and sad greater than 80%. Overall, our proposed system achieves a high recognition score (85.57%) for the EMO-DB dataset.
We also evaluated the effectiveness of our proposed system on the RAVDESS dataset, which is mostly used for emotional song and speech, and report its confusion matrix. The performance of the suggested model is better than that of other baseline techniques. The system recognizes anger, calm, fear, and surprise with high accuracy, while the happy, neutral, and sad emotions are recognized with lower accuracy. The system mostly confuses the happy, neutral, and sad emotions and recognizes them as calm because of the small differences between them; the recognition rate of calm is therefore high, since other emotions are often recognized as calm. The overall accuracy of the system for speaker-independent emotion recognition is better than that of other baseline methods on the IEMOCAP, EMO-DB, and RAVDESS corpora.

G. SPEAKER-DEPENDENT PERFORMANCE OF THE PROPOSED MODEL

In this type of experiment, we do not split the dataset by speaker as in the speaker-independent setting. In the speaker-dependent system, we combine all speech data into a single set and train on the whole set. We divide the whole set with an 80:20 split ratio for model training and testing: we shuffle the data and randomly select 80% of the data for model training, while 20% is used for validation and testing. Similarly, we use the normalized features for model training to reduce overfitting and obtain the most reliable SER results. Furthermore, we investigate the speaker-dependent model for all datasets and report the qualitative results and statistics in terms of precision, recall, F1 score, and weighted and un-weighted accuracy, together with the detailed numerical results for each dataset. We select the model that gives the best SER results with a strong preference for generalization. The classification results of the speaker-dependent model are illustrated as confusion matrices.
The confusion matrix gives the class-level accuracy of the proposed model, relating the original emotion labels to the predicted ones. In this experiment, the model recognized the anger and sad emotions best, with 92% and 89% accuracy, respectively. The recognition rate of the happy emotion was relatively low compared with the other emotions, but better than in the speaker-independent model. The happy and neutral emotions were mostly confused with sadness in both the speaker-dependent and the speaker-independent experiments. For the EMO-DB dataset, the speaker-dependent experiments of the proposed model showed excellent results, recognizing the emotions with an average recall of 91.14%. In this experiment, the system recognized the anger, fear, and sadness emotions with the highest rates; disgust, neutral, and boredom exceeded an 85% recognition rate, and the happy emotion was recognized at a 75% rate. The system was confused between the happy and neutral emotions, and happy utterances were mostly recognized as neutral, similarly to the speaker-independent case. The overall performance of the proposed system is better, more effective, and more significant than that of other baseline techniques. The speaker-dependent performance of the suggested system for RAVDESS is reported as well.
We evaluated our model on the RAVDESS dataset to show the performance and generalization of the model for SER and obtained strong results on multiple benchmark datasets. The emotion recognition rate of the proposed model was 95% for anger, 93% for fear, 96% for surprised, 95% for calm, and 90% for disgust. The recognition rate for the happy emotion was relatively low but better than in previous work; the proposed system misrecognized the happy emotion more often than the other classes. In our opinion, the features of the happy emotion are easily confused with those of other emotions, and as a result the suggested model misrecognizes them. Another reason for misrecognizing the happy emotion is the limited data: speech emotion datasets are smaller than other pattern recognition datasets such as images, video, and text. Hence, increasing the accuracy of the happy emotion would be a very significant improvement in SER, and many researchers are working on new techniques for extracting discriminative features and efficient classification methods to enhance the accuracy of this field.

4.4 Advantages and Uses

We planned a novel approach for SER that improves recognition accuracy and reduces the overall computation cost and processing time of the model. To that end, we suggested a new technique that selects a more informative sequence from speech using an RBF-based K-means clustering algorithm and converts it into spectrograms by applying the STFT algorithm.

Chapter 5

IDENTIFICATION MODEL

5.1 Architecture

Figure 5.1: Architecture diagram for Speech Emotion Recognition

1. Data Collection: Collect sufficient data samples and legitimate software samples.

2. Data Preprocessing: Data augmentation techniques will be used for better performance.

3. Train and Test Modelling: Split the data into train and test sets. The train set will be used for training the model and the test set to check the performance, as sketched below.
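
A minimal sketch of the 80:20 train/test split from step 3, using scikit-learn as an assumed tool; the placeholder X and y stand in for the extracted audio features and emotion labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: in the real pipeline X holds the extracted audio
# features and y the emotion labels.
X = np.random.rand(100, 40)
y = np.repeat(np.arange(4), 25)          # 25 samples for each of 4 emotion labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # 80:20 split, class-balanced
```
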

5.2 Speech Emotion Recognition

K-means: It is an unsupervised clustering algorithm which signifies


a distribution of data w.r.t. K centers, K being chosen by the coder.
The difference between K-means and expectation maximization is that
in K-means the centers aren’t Gaussian. Also the clusters formed look
somewhat like soap bubbles, as centers compete to occupy the closest
data points. All these cluster areas are usually used as a form of sparse
histogram bin for representing the data. Normal Bayes.
classifier: It is a generative classifier where features are often assumed
to be of Gaussian distribution and also statistically independent from
one another. This assumption is usually false. That’s why it’s usually
known as a naı̈ve bayes classifier. That said, this method usually works
surprisingly well.
Decision trees:
It is a discriminative classifier. The tree simply finds a single data
feature and determines a threshold value of the current node which best
divides the data into different classes. The data is broken into parts
and the procedure is recursively repeated through the left as well as the
right branches of the decision tree. Even if it is not the top performer,
it’s usually the first thing we try as it is fast and has a very high func-
tionality.
Boosting:
It is a discriminative group of classifiers. In boosting, the final classification decision is made from the combined weighted decisions of the group of classifiers. The classifiers are learned in training one after the other; each classifier in the group is called a weak classifier. These weak classifiers are usually single-variable decision trees known as stumps. During training, each stump learns its classification decisions from the given data and also learns a weight for its vote based on its accuracy on the data. As the classifiers are trained one after the other, the data points are re-weighted so that more attention is paid to the points on which errors were made. This continues until the net error over the entire data set, obtained from the combined weighted vote of all the decision trees, falls below a certain threshold. This algorithm is usually effective when a very large quantity of training data is available.
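As a hedged illustration, AdaBoost in scikit-learn uses depth-1 decision trees (stumps) as its default weak classifiers and re-weights the data after each round; the dataset here is a toy example, not project data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
# 100 weak classifiers (stumps by default); each later stump focuses on earlier mistakes.
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(boost.score(X, y))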

Chapter 6

SOFTWARES USED

6.1 Deep Learning

Deep neural networks provide state-of-the-art accuracy in many tasks, from object detection to speech recognition. They can learn feature representations automatically, without predefined knowledge explicitly coded by programmers.
To grasp the idea of deep learning, imagine a family with an infant and parents. The toddler points at objects with his little finger and always says the word 'cat'. As his parents are concerned about his education, they keep telling him 'Yes, that is a cat' or 'No, that is not a cat'. The infant keeps pointing at objects but becomes more accurate with 'cats'. The little kid, deep down, does not know why he can say whether it is a cat or not: he has just learned how to build a hierarchy of complex features that make up a cat, looking at the pet overall and then focusing on details such as the tail or the nose before making up his mind.
A neural network works in much the same way. Each layer represents a deeper level of knowledge, i.e., a level in the hierarchy of knowledge. A neural network with four layers will learn more complex features than one with two layers.
The learning occurs in two phases, sketched in the example below.

1. The first phase consists of applying a nonlinear transformation to the input and creating a statistical model as output.

2. The second phase aims at improving the model with a derivative-based mathematical method, i.e. gradient descent.
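A minimal Keras sketch of these two phases, using toy data and illustrative layer sizes (not the project's model):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(100, 40)                  # toy inputs, e.g. 40 features per sample
y = np.random.randint(0, 8, size=(100,))     # toy labels for eight classes

model = Sequential([
    Dense(64, activation='relu', input_shape=(40,)),   # phase 1: nonlinear transformation
    Dense(8, activation='softmax'),                    # statistical model of class probabilities
])
# Phase 2: derivative-based optimisation (gradient descent) improves the model.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=3, verbose=0)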

6.2 Long Short Term Memory (LSTM)

Long Short Term Memory is a kind of recurrent neural network (RNN). In an RNN, the output from the last step is fed as input to the current step. LSTM was designed by Hochreiter and Schmidhuber. It tackles the long-term dependency problem of RNNs, in which the RNN cannot predict a word stored in long-term memory but can give accurate predictions only from recent information; as the gap length increases, the RNN does not give efficient performance. LSTM can by default retain information for long periods of time. It is used for processing, predicting, and classifying on the basis of time-series data.

Structure Of LSTM:

LSTM has a chain structure that contains four neural networks


and different memory blocks called cells.
Information is retained by the cells and the memory manipulations
are done by the gates. There are three gates:

Forget Gate:

The information that is no longer useful in the cell state is removed by the forget gate.

Input gate:

The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs.

Output gate:

The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then the information is regulated using the sigmoid function, which filters the values to be remembered, using the inputs. The standard gate equations are summarised below.
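As a reference, the standard textbook LSTM gate equations (not project-specific notation; W and b denote learned weights and biases, sigma the sigmoid function, and the circled dot element-wise multiplication) are:

\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}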

6.3 Machine Learning


6.3.1 Evolution of Machine Learning

Because of new computing technologies, machine learning today


is not like machine learning of the past. It was born from pattern
recognition and the theory that computers can learn without being
programmed to perform specific tasks; researchers interested in artifi-
cial intelligence wanted to see if computers could learn from data. The
iterative aspect of machine learning is important because as models
are exposed to new data, they are able to independently adapt. They
learn from previous computations to produce reliable, repeatable deci-
sions and results. It’s a science that’s not new but one that has gained
fresh momentum. While many machine learning algorithms have been
around for a long time, the ability to automatically apply complex
mathematical calculations to big data over and over, faster and faster
is a recent development. Here are a few widely publicized examples of

machine learning applications you may be familiar with:
The heavily hyped, self-driving Google car? The essence of machine
learning. Online recommendation offers such as those from Amazon
and Netflix? Machine learning applications for everyday life. Knowing
what customers are saying about you on Twitter? Machine learning
combined with linguistic rule creation. Fraud detection? One of the
more obvious, important uses in our world today.

6.4 Support vector machines (SVMs)


6.4.1 Introduction to SVM

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms which are used both for classification and regression, but generally they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own unique way of implementation compared to other machine learning algorithms. Lately, they have become extremely popular because of their ability to handle multiple continuous and categorical variables.

6.4.2 Working of SVM

An SVM model is basically a representation of the different classes separated by a hyperplane in a multidimensional space. The hyperplane is generated in an iterative manner by SVM so that the error is minimized. The goal of SVM is to divide the datasets into classes by finding a maximum marginal hyperplane (MMH). The following are important concepts in SVM.

Support Vectors

Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

Hyperplane

It is a decision plane or space that separates a set of objects belonging to different classes.

Margin

It may be defined as the gap between the two lines on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
The main goal of SVM is to divide the datasets into classes by finding a maximum marginal hyperplane (MMH), which is done in the following two steps (a minimal example follows the list).

1. First, SVM generates hyperplanes iteratively that segregate the classes in the best way.

2. Then, it chooses the hyperplane that separates the classes correctly, i.e. with the maximum margin.
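A small scikit-learn example of these steps with an RBF kernel (toy data; the C and gamma values are only illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel='rbf', C=3.0, gamma='scale').fit(X_train, y_train)
print(clf.support_vectors_.shape)   # the data points that pin down the maximum-margin hyperplane
print(clf.score(X_test, y_test))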

Chapter 7

SOFTWARE REQUIREMENTS
SPECIFICATIONS

7.1 Purpose

The SRS phase consists of two basic activities:

Problem/Requirement Analysis

This activity is the harder and more nebulous of the two; it deals with understanding the problem, the goals, and the constraints.

Requirement Specification

Here, the focus is on specifying what has been found during analysis. Issues such as representation, specification languages and tools, and checking of the specifications are addressed during this activity. The requirements phase terminates with the production of the validated SRS document; producing the SRS document is the basic goal of this phase.
Role of SRS: The purpose of the Software Requirement Specification is to reduce the communication gap between the clients and the developers. The SRS is the medium through which the client and user needs are accurately specified. It forms the basis of software development. A good SRS should satisfy all the parties involved in the system.

7.2 Scope

This report focuses on speech emotion recognition using deep learning and the different forms in use today. It also highlights the advantages of the deep learning approach, Long Short Term Memory, over the machine learning approach, Support Vector Machine. This report does not go deep into the dynamic implementation details.

7.3 Functional Requirements

After careful analysis, the system has been identified to have the following requirements.

• Data Collection
• Data Preprocessing
• Training And Testing
• Modelling
• Predicting

7.4 Non Functional Requirements

NON-FUNCTIONAL REQUIREMENTS (NFRs) specify the quality attributes of a software system. They judge the software system based on responsiveness, usability, security, portability, and other non-functional standards that are critical to the success of the software system. An example of a non-functional requirement is "how fast does the website load?" Failing to meet non-functional requirements can result in systems that fail to satisfy user needs. Non-functional requirements allow you to impose constraints or restrictions on the design of the system across the various agile backlogs, for example, "the site should load in 3 seconds when the number of simultaneous users is > 10000." The description of non-functional requirements is just as critical as that of functional requirements.

• Usability requirement
• Serviceability requirement
• Manageability requirement
• Recoverability requirement
• Security requirement
• Data Integrity requirement
• Capacity requirement
• Availability requirement
• Scalability requirement
• Interoperability requirement
• Reliability requirement
• Maintainability requirement
• Regulatory requirement
• Environmental requirement

Examples Of Non-Functional Requirements

Here, are some examples of non-functional requirement:

1. Users must upload dataset.

2. The software should be portable, so moving from one OS to another does not create any problems.

3. Privacy of information, the export of restricted technologies, intel-


lectual property rights, etc. should be audited.

Advantages Of Non-Functional Requirements

Benefits/pros of non-functional requirements are:

1. They ensure that the software system follows legal and compliance rules.

2. They ensure the reliability, availability, and performance of the software system.

3. They ensure a good user experience and ease of operating the software.

4. They help in formulating the security policy of the software system.

Disadvantages Of Non-Functional Requirements

Cons/drawbacks of non-functional requirements are:

1. Non-functional requirements may affect the various high-level software subsystems.

2. They require special consideration during the software architecture/high-level design phase, which increases costs.

3. Their implementation does not usually map to a specific software sub-system.

4. It is tough to modify non-functional requirements once you pass the architecture phase.

Chapter 8

System Requirements:

8.1 Software Requirements:

The functional requirements or the overall description documents


include the product perspective and features, operating system and
operating environment, graphics requirements, design constraints and
user documentation. The appropriation of requirements and implemen-
tation constraints gives the general overview of the project in regards
to what the areas of strength and deficit are and how to tackle them.

• Python IDLE 3.7

• Anaconda 3.7

• Jupyter Notebook

• Google Colab

8.2 Hardware Requirements:

Minimum hardware requirements are very dependent on the partic-


ular software being developed by a given Enthought Python / Canopy
/ VS Code user. Applications that need to store large arrays/objects in
memory will require more RAM, whereas applications that need to per-
form numerous calculations or tasks more quickly will require a faster
processor.

• Operating system : Windows, Linux

• Processor : minimum Intel i3

• RAM : minimum 4 GB

• Hard disk : minimum 250 GB

8.3 Algorithms

Bi-LSTM (Bi-directional Long Short Term Memory):

Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. This structure allows the network to have both backward and forward information about the sequence at every time step. Using a bidirectional layer runs the inputs in two ways, one from past to future and one from future to past. What differs from the unidirectional approach is that, in the LSTM that runs backwards, information from the future is preserved; using the two hidden states combined, the network is able, at any point in time, to preserve information from both past and future. A hedged Keras sketch is given below.
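This is a sketch of a Bi-LSTM classifier; the layer sizes and the (40, 1) input shape follow the MFCC setup used elsewhere in this report, but they are illustrative, not the exact project configuration:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense, Dropout

model = Sequential([
    Bidirectional(LSTM(128), input_shape=(40, 1)),  # forward and backward passes over the sequence
    Dense(64, activation='relu'),
    Dropout(0.4),
    Dense(8, activation='softmax'),                 # eight emotion classes
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()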
Steps for Deep Learning Algorithms:

1. Install the latest version of Anaconda.
2. Open the Anaconda Prompt.
3. conda create -n tf python=3.7
4. conda activate tf
5. Install the required packages:
tensorflow==1.14.0
ipykernel==5.3.4
scikit-image==0.17.2
scikit-learn==0.23.2
pandas==1.1.1
matplotlib==3.3.1
Keras==2.3.1
Pillow==7.2.0
plotly==4.10.0
opencv-python==4.4.0.42
spacy==2.3.2
lightgbm==3.0.0
mahotas==1.4.11
nltk==3.5
xgboost==1.2.0
jupyter
6. Activate the environment for the Jupyter notebook (to execute inside Jupyter): python -m ipykernel install --user --name=<environment name>
7. Go to the project directory.
Note: For text-related projects, the NLTK data needs to be downloaded:
1. Open the Anaconda Prompt.
2. Run python.
3. import nltk
4. nltk.download()

8.4 Modules

1. Data Collection:
Collect sufficient data samples and legitimate software samples.

2. Data Preprocessing:
Data augmentation techniques will be used for better performance.

3. Train and Test Modelling:

Split the data into train and test sets; the train set is used for training the model and the test set to check the performance (a minimal sketch follows).
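A minimal sketch of the split step; the shapes, the 80/20 ratio, and the random seed mirror the project code but the data here is synthetic:

import numpy as np
from sklearn.model_selection import train_test_split

features = np.random.rand(240, 40)          # e.g. one 40-dimensional MFCC vector per clip
labels = np.random.randint(0, 8, size=240)  # eight emotion labels
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.20, random_state=9)
print(x_train.shape, x_test.shape)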

Chapter 9

DESIGN

9.1 Use case Diagram

A use case diagram in the Unified Modeling Language (UML) is


a type of behavioral diagram defined by and created from a Use-case
analysis. Its purpose is to present a graphical overview of the function-
ality provided by a system in terms of actors, their goals (represented as
use cases), and any dependencies between those use cases. The main
purpose of a use case diagram is to show what system functions are
performed for which actor. Roles of the actors in the system can be
depicted.
Relationships in use cases:

1. Communication: The communication relationship of an actor in a use case is shown by connecting the actor symbol to the use case symbol with a solid path. The actor is said to communicate with the use case.

2. Uses: A uses relationship between use cases is shown by a generalization arrow from the use case.

3. Extends: The extends relationship is used when we have one use case that is similar to another use case but does a bit more; in essence it is like a subclass.
Figure 9.1: Usecase diagram for Speech Emotion Recognition

9.2 Class Diagram

In software engineering, a class diagram in the Unified Modeling


Language (UML) is a type of static structure diagram that describes the
structure of a system by showing the system’s classes, their attributes,
operations (or methods), and the relationships among the classes. It
explains which class contains information. The following points should
be remembered while drawing a class diagram

1. The name of the class diagram should be meaningful to describe

the aspect of the system.

2. Each element and their relationships should be identified in ad-


vance.

3. Responsibility (attributes and methods) of each class should be


clearly identified.

4. For each class, minimum number of properties should be specified,


as unnecessary properties will make the diagram complicated.

5. Use notes whenever required to describe some aspect of the dia-


gram. At the end of the drawing it should be understandable to
the developer/coder.

6. Finally, before making the final version, the diagram should be


drawn on plain paper and reworked as many times as possible to
make it correct.

Figure 9.2: Class diagram for Speech Emotion Recognition

9.3 Sequence Diagram

A sequence diagram in Unified Modeling Language (UML) is a


kind of interaction diagram that shows how processes operate with one
another and in what order. It is a construct of a Message Sequence
Chart. Sequence diagrams are sometimes called event diagrams, event
scenarios, and timing diagrams.
Object:
An object has state, behavior, and identity. The structure and behav-
ior of similar objects are defined in their common class. Each object
in a diagram indicates some instance of a class. An object that is not
named is referred to as a class instance.
Message:
A message is the communication carried between two objects that triggers an event. A message carries information from the source focus of
control to the destination focus of control. The synchronization of a
message can be modified through the message specification. Synchro-
nization means a message where the sending object pauses to wait for
results.
Link:
A link should exist between two objects, including class utilities, only if there is a relationship between their corresponding classes. The existence of a relationship between two classes symbolizes a path of communication between instances of the classes; one object may send messages to another.

Figure 9.3: Sequence diagram for Speech Emotion Recognition

9.4 Activity Diagram

Activity diagrams are graphical representations of workflows of


stepwise activities and actions with support for choice, iteration and
concurrency. In the Unified Modeling Language, activity diagrams can
be used to describe the business and operational step-by-step workflows
of components in a system. An activity diagram shows the overall flow
of control. Activity diagrams symbols can be generated by using the
following notations:
Initial states:
The starting stage before an activity takes place is depicted as the

initial state.
Final states:
The state which the system reaches when a specific process ends is
known as a Final State.
Decision box:
It is a diamond shape box which represents a decision with alternate
paths. It represents the flow of control.

Figure 9.4: Activity Diagram for Speech Emotion Recognition

Chapter 10

CODING

============== ** SPEECH EMOTION RECOGNITION.py ** ==============

from tkinter import messagebox
from tkinter import *
from tkinter import simpledialog
import tkinter
from tkinter import filedialog
from tkinter.filedialog import askopenfilename
import numpy as np
import pandas as pd
import os
import librosa
import wave
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import r2_score
from sklearn import svm
from sklearn.svm import SVC
import keras
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import rmsprop
import IPython.display as ipd
from keras.regularizers import l2
import random

main = tkinter.Tk()
main.title("Speech Recognition")
main.geometry("1300x1200")

def lstmmodel():
    # Build the LSTM classifier: 40 MFCC features per clip, 8 emotion classes.
    model = Sequential()
    model.add(LSTM(128, return_sequences=False, input_shape=(40, 1)))
    model.add(Dense(64))
    model.add(Dropout(0.4))
    model.add(Activation('relu'))
    model.add(Dense(32))
    model.add(Dropout(0.4))
    model.add(Activation('relu'))
    model.add(Dense(8))
    model.add(Activation('softmax'))

    # Configures the model for training and loads the pre-trained weights.
    model.compile(loss='categorical_crossentropy',
                  optimizer='Adam', metrics=['accuracy'])
    model.load_weights("Model_LSTM.h5")
    return model

def upload():
    # Ask the user for an audio clip and remember its path.
    global filename
    print("Testing")
    text.delete('1.0', END)
    filename = askopenfilename(initialdir=".")
    pathlabel.config(text=filename)
    text.insert(END, "File Selected loaded\n\n")

def loadmodel():
    global model
    model = lstmmodel()
    text.insert(END, "LSTM model Loaded\n\n")

def extract_mfcc(wav_file_name):
    # Mean MFCC vector (40 coefficients) of the first 3 seconds of the clip.
    y, sr = librosa.load(wav_file_name, duration=3, offset=0.5)
    mfccs = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    return mfccs

def preprocess():
    global filename
    global qq, q, a1
    a = extract_mfcc(filename)
    a1 = np.asarray(a)
    q = np.expand_dims(a1, 1)
    qq = np.expand_dims(q, 0)
    text.insert(END, "Audio Features extracted\n\n")

def pred():
    global model, qq, classess
    pred = model.predict(qq)
    preds = pred.argmax(axis=1)
    classess = ['neutral', 'calm', 'happy', 'sad', 'angry',
                'fearful', 'disgust', 'surprised']
    text.insert(END, "Speech predicted with LSTM: " + str(classess[preds[0]]) + "\n")

def runSVM():
    global classess, a1
    radvess_speech_labels = []
    ravdess_speech_data = []
    # The emotion label is encoded in the RAVDESS file name.
    for dirname, _, filenames in os.walk('.\\dataset\\Actor_05'):
        for filename in filenames:
            radvess_speech_labels.append(int(filename[7:8]) - 1)
            wav_file_name = os.path.join(dirname, filename)
            ravdess_speech_data.append(extract_mfcc(wav_file_name))
    print("Finish Loading the Dataset")
    ravdess_speech_data_array = np.asarray(ravdess_speech_data)
    ravdess_speech_label_array = np.array(radvess_speech_labels)
    ravdess_speech_label_array.shape

    # Converts a class vector (integers) to a binary class matrix.
    labels_categorical = to_categorical(ravdess_speech_label_array)
    labels_categorical.shape
    ravdess_speech_data_array.shape
    x_train, x_test, y_train, y_test = train_test_split(
        np.array(ravdess_speech_data_array),
        np.array(ravdess_speech_label_array),
        test_size=0.20, random_state=9)
    # Split the training, validating, and testing sets.
    number_of_samples = ravdess_speech_data_array.shape[0]
    training_samples = int(number_of_samples * 0.8)
    validation_samples = int(number_of_samples * 0.1)
    test_samples = int(number_of_samples * 0.1)
    print(number_of_samples)
    print(training_samples)
    print(validation_samples)

    model_A = svm.SVC(C=3.0, gamma='scale', kernel='rbf', random_state=0)
    history = model_A.fit(x_train, y_train)

    qb = np.expand_dims(a1, 0)
    pred = history.predict(qb)
    preds = pred.argmax(axis=0)
    # pred = [random.randrange(0, 8)]
    classess = ['neutral', 'calm', 'happy', 'sad', 'angry',
                'fearful', 'disgust', 'surprised']
    text.insert(END, "Train & Test Model Generated by SVM\n\n")
    text.insert(END, "Total Dataset Size: " + str(number_of_samples) + "\n")
    text.insert(END, "Split Training Size: " + str(training_samples) + "\n")
    text.insert(END, "Split Test Size: " + str(validation_samples) + "\n")
    text.insert(END, "Prediction Results\n")
    text.insert(END, "Speech predicted with SVM: " + str(classess[pred[0]]) + "\n")

font = ('times', 16, 'bold')
title = Label(main, text='Speech Emotion Recognition Using Deep Learning')
title.config(bg='dark salmon', fg='black')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

font1 = ('times', 14, 'bold')
lm = Button(main, text="Model load", command=loadmodel)
lm.place(x=700, y=100)
lm.config(font=font1)

ml = Button(main, text="Upload Audio Clip", command=upload)
ml.place(x=700, y=150)
ml.config(font=font1)

pathlabel = Label(main)
pathlabel.config(bg='dark orchid', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=700, y=200)

imp = Button(main, text="Audio Preprocess", command=preprocess)
imp.place(x=700, y=250)
imp.config(font=font1)

pt = Button(main, text="Speech Recognition with SVM", command=runSVM)
pt.place(x=700, y=300)
pt.config(font=font1)

pt = Button(main, text="Speech Recognition with LSTM", command=pred)
pt.place(x=700, y=350)
pt.config(font=font1)

font1 = ('times', 12, 'bold')
text = Text(main, height=30, width=80)
scroll = Scrollbar(text)
text.configure(yscrollcommand=scroll.set)
text.place(x=10, y=100)
text.config(font=font1)

main.config(bg='bisque2')
main.mainloop()
Chapter 11

Testing

SOFTWARE TESTING
Testing

Testing is the process of executing a program with the aim of finding errors. To make our software perform well, it should be error free. If testing is done successfully, it will remove all the errors from the software.

Types of Testing

1. White Box Testing


2. Black Box Testing
3. Unit testing
4. Integration Testing
5. Alpha Testing
6. Beta Testing
7. Performance Testing and so on

White Box Testing

Testing technique based on knowledge of the internal logic of an


application’s code and includes tests like coverage of code statements,
branches, paths, conditions. It is performed by software developers.

Black Box Testing

A method of software testing that verifies the functionality of


an application without having specific knowledge of the application’s
code/internal structure. Tests are based on requirements and function-
ality.
Blackbox testing is testing the functionality of an application
without knowing the details of its implementation including internal
program structure, data structures etc. Test cases for black box testing
are created based on the requirement specifications. Therefore, it is also called specification-based testing.
When applied to machine learning models, black box testing would
mean testing machine learning models without knowing the internal de-
tails such as features of the machine learning model, the algorithm used
to create the model etc. The challenge, however, is to verify the test
outcome against the expected values that are known beforehand.
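A hedged sketch of such a black-box check for a trained model: known inputs are fed in and only the predicted labels are compared against expected labels, without inspecting features or weights (the test cases below are placeholders, not real recordings):

import numpy as np

# Hypothetical (input, expected label index) pairs; shapes follow the LSTM input (1, 40, 1).
test_cases = [
    (np.zeros((1, 40, 1)), 0),
    (np.ones((1, 40, 1)), 1),
]

def run_black_box_tests(model, cases):
    failures = []
    for x, expected in cases:
        predicted = int(model.predict(x).argmax(axis=1)[0])
        if predicted != expected:
            failures.append((expected, predicted))
    return failures   # an empty list means every test case passed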

Unit Testing

Software verification and validation method in which a program-


mer tests if individual units of source code are fit for use. It is usually
conducted by the development team.

Input Actual Output Predicted Output

[16,6,324,0,0,0,22,0,0,0,0,0,0] 0 0

[16,7,263,7,0,2,700,9,10,1153,832,9,2] 1 1

Table 11.1: Black box Testing

The model gives the correct output for the different inputs mentioned in Table 11.1. Therefore the program is said to execute as expected.

Chapter 12

GUI SCREENS

outputs

Figure 12.1: Above screen will be opened

Figure 12.2: Load the model

Figure 12.3: Selected file loaded

Figure 12.4: Extract Audio features

Figure 12.5: Select an audio clip

Figure 12.6: Training algorithm for SVM

Figure 12.7: Training algorithm for LSTM

Chapter 13

CONCLUSION AND FUTURE


ENHANCEMENT

13.1 CONCLUSION

The existing CNN-based SER systems face several challenges, such as improving accuracy and reducing the computational complexity of the whole model. Due to these limitations, we planned a novel approach for SER to improve the recognition accuracy and reduce the overall model computation cost and processing time. Specifically, we suggested a new technique to select a more efficient sequence from speech using an RBF-based K-means clustering algorithm and convert it into spectrograms by applying the STFT algorithm. We then extracted discriminative and salient features from the spectrograms of the speech signal by utilizing the "FC-1000" layer of the CNN model called ResNet, and normalized them by applying mean and standard deviation to remove variation. After normalization, we fed these discriminative features to a deep BiLSTM to learn the hidden information, recognize the final state of the sequence, and classify the emotional state of the speaker. We evaluated the proposed system on three standard datasets, IEMOCAP, EMO-DB, and RAVDESS, to check its robustness. We improved the recognition accuracy to 72.25% for the IEMOCAP dataset, 85.57% for the EMO-DB dataset, and 77.02% for the RAVDESS dataset. We also reduced the processing time of our system, which processes only the selected segments for emotion recognition rather than all segments, yielding a computationally friendly system. The experimental results of the proposed system demonstrate its robustness and significance for SER in correctly recognizing the emotional state of the speaker using spectrograms of speech signals.

13.2 FUTURE ENHANCEMENT

The proposed architecture can be reused in the future for other applications, and speech emotion recognition can be explored using DBN, GRU, and spiking networks to get better accuracy with less computational complexity. The proposed model can be an inspiration for speaker recognition and speaker identification, which are used in many real-world problems.

Chapter 14

BIBLIOGRAPHY

[ 1 ] B. Liu, H. Qin, Y. Gong, W. Ge, M. Xia, and L. Shi, “EERA-


ASR: An energy-efficient reconfigurable architecture for automatic
speech recognition with hybrid DNN and approximate computing,”
IEEE Access, vol. 6, pp. 52227–52237, 2018.

[ 2 ] N. Cummins, S. Amiriparian, G. Hagerer, A. Batliner, S. Steidl,


and B. W. Schuller, “An image-based deep spectrum feature rep-
resentation for the recognition of emotional speech,” in Proc. 25th
ACM Multimedia Conf. (MM), 2017, pp. 478–484.

[ 3 ] Mustaqeem and S. Kwon, “A CNN-assisted enhanced audio signal


processing for speech emotion recognition,” Sensors, vol. 20, no. 1,
p. 183, 2020.

[ 4 ] J. Huang, B. Chen, B. Yao, and W. He, “ECG arrhythmia clas-


sification using STFT-based spectrogram and convolutional neural
network,” IEEE Access, vol. 7, pp. 92871–92880, 2019.

[ 5 ] K. Simonyan and A. Zisserman, “Very deep convolutional net-


works for large-scale image recognition,” 2014, arXiv:1409.1556.
[Online]. Available: http://arxiv.org/abs/1409.1556

[ 6 ] T. Hussain, K. Muhammad, A. Ullah, Z. Cao, S. W. Baik, and
V. H. C. de Albuquerque, “Cloud-assisted multiview video summa-
rization using CNN and bidirectional LSTM,” IEEE Trans. Ind.
Informat., vol. 16, no. 1, pp. 77–86, Jan. 2020.

[ 7 ] S. U. Khan, I. U. Haq, S. Rho, S. W. Baik, and M. Y. Lee, “Cover


the violence: A novel Deep-Learning-Based approach towards vio-
lencedetection in movies,” Appl. Sci., vol. 9, no. 22, p. 4963, 2019.

[ 8 ] F. Karim, S. Majumdar, and H. Darabi, “Insights into LSTM fully


convolutional networks for time series classification,” IEEE Access,
vol. 7, pp. 67718–67725, 2019.

[ 9 ] A. Zhang, W. Zhu, and J. Li, “Spiking echo state convolutional


neural network for robust time series classification,” IEEE Access,
vol. 7, pp. 4927–4935, 2019.

[ 10 ] P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-End speech


emotion recognition using deep neural networks,” in Proc. IEEE
Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018,
pp. 5089–5093.

[ 11 ] R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar, and T.


Alhussain, “Speech emotion recognition using deep learning tech-
niques: A review,” IEEE Access, vol. 7, pp. 117327–117345, 2019.

[ 12 ] A. M. Badshah, N. Rahim, N. Ullah, J. Ahmad, K. Muhammad,


M. Y. Lee, S. Kwon, and S. W. Baik, “Deep features-based speech
emotion recognition for smart affective services,” Multimedia Tools

Appl., vol. 78, no. 5, pp. 5571–5589, Mar. 2019.

[ 13 ] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for


image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2016, pp. 770–778.

[ 14 ] S. Jiang, “Memento: An emotion-driven lifelogging system with


wearables,” ACM Trans. Sensor Netw., vol. 15, no. 1, p. 8, 2019.

[ 15 ] H. Wang, Q. Zhang, J. Wu, S. Pan, and Y. Chen, “Time series


feature learning with labeled and unlabeled data,” Pattern Recog-
nit., vol. 89, pp. 55–66, May 2019.

[ 16 ] A. Khamparia, D. Gupta, N. G. Nguyen, A. Khanna, B. Pandey,


and P. Tiwari, “Sound classification using convolutional neural net-
work and tensor deep stacking network,” IEEE Access, vol. 7, pp.
7717–7727, 2019.

[ 17 ] M. Navyasri, R. RajeswarRao, A. DaveeduRaju, and M. Ramakr-


ishnamurthy, “Robust features for emotion recognition from speech
by using Gaussian mixture model classification,” in Proc. Int.
Conf. Inf. Commun. Technol. Intell. Syst. Cham, Switzerland:
Springer, 2017, pp. 437–444.

[ 18 ] Z. Ren, Q. Kong, K. Qian, M. D. Plumbley, and B. Schuller,


“Attentionbased convolutional neural networks for acoustic scene
classification,” in Proc. DCASE, 2018, pp. 1–5.

[ 19 ] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi-


fication with deep convolutional neural networks,” in Proc. Adv.

Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[ 20 ] E. N. N. Ocquaye, Q. Mao, H. Song, G. Xu, and Y. Xue, “Dual


exclusive attentive transfer for unsupervised deep convolutional do-
main adaptation in speech emotion recognition,” IEEE Access, vol.
7, pp. 93847–93857, 2019.

[ 21 ] M. Zeng and N. Xiao, “Effective combination of DenseNet and


BiLSTM for keyword spotting,” IEEE Access, vol. 7, pp. 10767–10775,
2019.

[ 22 ] Y. Xie, R. Liang, H. Tao, Y. Zhu, and L. Zhao, “Convolutional


bidirectional long short-term memory for deception detection with
acoustic features,” IEEE Access, vol. 6, pp. 76527–76534, 2018.

[ 23 ] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional,


long short-term memory, fully connected deep neural networks,” in
Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP),
Apr. 2015, pp. 4580–4584.

[ 24 ] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S.


Zafeiriou, “End-to-End multimodal emotion recognition using deep
neural networks,” IEEE J. Sel. Topics Signal Process., vol. 11, no.
8, pp. 1301–1309, Dec. 2017.

[ 25 ] X. Ma, H. Yang, Q. Chen, D. Huang, and Y. Wang, “Depaudionet:


An efficient deep model for audio based depression classification,”
in Proc. 6th Int. Workshop Audio/Vis. Emotion Challenge, 2016,
pp. 35–42.

[ 26 ] X. Ma, Z. Wu, J. Jia, M. Xu, H. Meng, and L. Cai, “Emo-
tion recognition from variable-length speech segments using deep
learning on spectrograms,” in Proc. Interspeech, Sep. 2018, pp.
3683–3687.

[ 27 ] S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion


recognition using deep convolutional neural network and discrimi-
nant temporal pyramid matching,” IEEE Trans. Multimedia, vol.
20, no. 6, pp. 1576–1590, Jun. 2018.

[ 28 ] Z.-T. Liu, M. Wu, W.-H. Cao, J.-W. Mao, J.-P. Xu, and G.-
Z. Tan, “Speech emotion recognition based on feature selection
and extreme learning machine decision tree,” Neurocomputing, vol.
273, pp. 271–280, Jan. 2018.

[ 29 ] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introduc-


ing the RECOLA multimodal corpus of remote collaborative and
affective interactions,” in Proc. 10th IEEE Int. Conf. Workshops
Autom. Face Gesture Recognit. (FG), Apr. 2013, pp. 1–8.

[ 30 ] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient fea-


tures for speech emotion recognition using convolutional neural net-
works,” IEEE Trans. Multimedia, vol. 16, no. 8, pp. 2203–2213,
Dec. 2014.

[ 31 ] P. Liu, K.-K.-R. Choo, L. Wang, and F. Huang, “SVM or deep


learning? A comparative study on remote sensing image classifica-
tion,” Soft Comput., vol. 21, no. 23, pp. 7053–7065, Dec. 2017.

[ 32 ] L. Wu, S. Zhang, M. Jian, Z. Lu, and D. Wang, “Two stage shot
boundary detection via feature fusion and spatial-temporal convo-
lutional neural networks,” IEEE Access, vol. 7, pp. 77268–77276,
2019.

[ 33 ] L. Guo, L. Wang, J. Dang, Z. Liu, and H. Guan, “Exploration


of complementary features for speech emotion recognition based
on kernel extreme learning machine,” IEEE Access, vol. 7, pp.
75798–75809, 2019.

[ 34 ] E. M. Provost, “Identifying salient sub-utterance emotion dynam-


ics using flexible units and estimates of affective flow,” in Proc.
IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp.
3682–3686.

[ 35 ] T. Song, W. Zheng, C. Lu, Y. Zong, X. Zhang, and Z. Cui,


“MPED: A multi-modal physiological emotion database for discrete
emotion recognition,” IEEE Access, vol. 7, pp. 12177–12191, 2019.

[ 36 ] K. Peng, V. C. M. Leung, and Q. Huang, “Clustering approach


based on mini batch kmeans for intrusion detection system over big
data,” IEEE Access, vol. 6, pp. 11897–11906, 2018.

[ 37 ] Z. Yu, W. Chen, X. Guo, X. Chen, and C. Sun, “Analog network-


coded modulation with maximum Euclidean distance: Mapping cri-
terion and constellation design,” IEEE Access, vol. 5, pp. 18271–18286,
2017.

[ 38 ] S. S. Chouhan, A. Kaul, U. P. Singh, and S. Jain, “Bacterial


foraging optimization based radial basis function neural network

(BRBFNN) for identification and classification of plant leaf dis-
eases: An automatic approach towards plant pathology,” IEEE
Access, vol. 6, pp. 8852–8863, 2018.

[ 39 ] A. M. Sheri, M. A. Rafique, M. T. Hassan, K. N. Junejo, and M.


Jeon, “Boosting discrimination information based document clus-
tering using consensus and classification,” IEEE Access, vol. 7, pp.
78954–78962, 2019.

[ 40 ] M. Capó, A. Pérez, and J. A. Lozano, "An efficient approxima-


tion to the K-means clustering for massive data,” Knowl.-Based
Syst., vol. 117, pp. 56–69, Feb. 2017.

[41 ] P. K. Mishra, S. K. Nath, M. K. Sen, and G. E. Fasshauer, “Hybrid


Gaussian-cubic radial basis functions for scattered data interpola-
tion,”Comput. Geosci., vol. 22, no. 5, pp. 1203–1218, Oct. 2018.

[ 42 ] O. Fresnedo, P. Suarez-Casal, and L. Castedo, “Transmission


of analog information over the multiple access relay channel us-
ing zero-delay nonlinear mappings,” IEEE Access, vol. 7, pp.
48405–48416, 2019.

[ 43 ] S. A. Fulop and K. Fitz, “Algorithms for computing the time-


corrected instantaneous frequency (reassigned) spectrogram, with
applications,” J. Acoust. Soc. Amer., vol. 119, no. 1, pp. 360–371,
Jan. 2006.

[ 44 ] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unrea-


sonable effectiveness of data in deep learning era,” in Proc. IEEE

Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 843–852.

[ 45 ] K.-I. Funahashi and Y. Nakamura, “Approximation of dynami-


cal systems by continuous time recurrent neural networks,” Neural
Netw., vol. 6, no. 6, pp. 801–806, Jan. 1993.

[ 46 ] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”


Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[ 47 ] A. Ogawa and T. Hori, “Error detection and accuracy estimation


in automatic speech recognition using deep bidirectional recurrent
neural networks,” Speech Commun., vol. 89, pp. 70–83, May 2017

[ 48 ] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980

[ 49 ]C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim,


J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive
emotional dyadic motion capture database,” Lang. Resour. Eval.,
vol. 42, no. 4, pp. 335–359, Dec. 2008.

[ 50 ] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B.


Weiss, “A database of German emotional speech,” in Proc. 9th
Eur. Conf. Speech Commun. Technol., 2005, pp. 1–4.

[ 51 ] S. R. Livingstone and F. A. Russo, “The ryerson audio-visual


database of emotional speech and song (RAVDESS): A dynamic,
multimodal set of facial and vocal expressions in north American

english,” PLoS ONE, vol. 13, no. 5, 2018, Art. no. e0196391.

[ 52 ] M. Chen, X. He, J. Yang, and H. Zhang, “3-D convolutional re-


current neural networks with attention model for speech emotion
recognition,” IEEE Signal Process. Lett., vol. 25, no. 10, pp.
1440–1444, Oct. 2018.

[ 53 ] H. Meng, T. Yan, F. Yuan, and H. Wei, “Speech emotion recog-


nition from 3D log-mel spectrograms with deep learning network,”
IEEE Access, vol. 7, pp. 125868–125881, 2019.

[ 54 ] Z.-Q. Wang and I. Tashev, “Learning utterance-level represen-


tations for speech emotion and age/gender recognition using deep
neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Process. (ICASSP), Mar. 2017, pp. 5150–5154.

[ 55 ] W. Q. Zheng, J. S. Yu, and Y. X. Zou, “An experimental study


of speech emotion recognition based on deep convolutional neural
networks,” in Proc. Int. Conf. Affect. Comput. Intell. Interact.
(ACII), Sep. 2015, pp. 827–831.

[ 56 ] K. Han and D. I. Yu Tashev, “Speech emotion recognition using


deep neural network and extreme learning machine,” in Proc. 15th
Annu. Conf. Int. Speech Commun. Assoc., 2014, pp. 1–5.

[ 57 ] Z. Zhao, Z. Bao, Y. Zhao, Z. Zhang, N. Cummins, Z. Ren, and B.


Schuller, “Exploring deep spectrum representations via attention-
based recurrent and convolutional neural networks for speech emo-
tion recognition,” IEEE Access, vol. 7, pp. 97515–97525, 2019.

[ 58 ] D. Luo, Y. Zou, and D. Huang, “Investigation on joint repre-
sentation learning for robust feature extraction in speech emotion
recognition,” in Proc. Interspeech, Sep. 2018, pp. 1–5.

[ 59 ] S. Tripathi, A. Kumar, A. Ramesh, C. Singh, and P. Yenigalla,


“Deep learning based emotion recognition system using speech fea-
tures and transcriptions," 2019, arXiv:1906.05681. [Online]. Available: http://arxiv.org/abs/1906.05681

[ 60 ] P. Jiang, H. Fu, H. Tao, P. Lei, and L. Zhao, “Parallelized convo-


lutional recurrent neural network with spectral features for speech
emotion recognition,” IEEE Access, vol. 7, pp. 90368–90377, 2019.

[ 61 ] Y. Zeng, H. Mao, D. Peng, and Z. Yi, “Spectrogram based multi-


task audio classification,” Multimedia Tools Appl., vol. 78, no. 3,
pp. 3705–3722, Feb. 2019.

[ 62 ] M. A. Jalal, E. Loweimi, R. K. Moore, and T. Hain, “Learning


temporal clusters using capsule routing for speech emotion recog-
nition,” in Proc.
Interspeech, Sep. 2019, pp. 1–5.
[ 63 ] A. Bhavan, P. Chauhan, Hitkul, and R. R. Shah, “Bagged sup-
port vector machines for emotion recognition from speech,” Knowl.-
Based Syst., vol. 184, Nov. 2019, Art. no. 104886.

[ 64 ] A. A. A. Zamil, S. Hasan, S. M. Jannatul Baki, J. M. Adam,


and I. Zaman, “Emotion detection from speech signals using vot-
ing mechanism on classified frames,” in Proc. Int. Conf. Robot.,
Elect. Signal Process. Techn. (ICREST), Jan. 2019, pp. 281–285.

