Speech Emotion Recognition Guide
1 INTRODUCTION
2 LITERATURE SURVEY
3 OVERVIEW
3.1 Existing System
3.2 Proposed System
3.3 Speech Emotion Recognition
3.4 Methodology
3.5 Motive of Speech Emotion Recognition
3.6 Solution Approach
3.7 Objective
5 IDENTIFICATION MODEL
5.1 Architecture
5.2 Speech Emotion Recognition
6 SOFTWARE USED
6.1 Deep Learning
6.2 Long Short Term Memory (LSTM)
6.3 Machine Learning
6.3.1 Evolution of Machine Learning
6.4 Support Vector Machines (SVMs)
6.4.1 Introduction to SVM
6.4.2 Working of SVM
7 SOFTWARE REQUIREMENTS SPECIFICATIONS
7.1 Purpose
7.2 Scope
8 System Requirements
8.1 Software Requirements
8.2 Hardware Requirements
8.3 Algorithms
8.4 Modules
9 DESIGN
9.1 Use Case Diagram
9.2 Class Diagram
9.3 Sequence Diagram
9.4 Activity Diagram
10 CODING
11 Testing
12 GUI SCREENS
13 CONCLUSION
14 BIBLIOGRAPHY
ABSTRACT
Recognizing the emotional state of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER is significant in many real-time applications such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers that analyze the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models that extract high-level features from speech spectrograms, which increases recognition accuracy but also the overall cost complexity of the model. In contrast, we introduce a novel framework for SER that selects key sequence segments using a radial basis function network (RBFN) similarity measurement within clusters. The selected sequence is converted into a spectrogram by applying the STFT algorithm and passed to a CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bi-directional long short-term memory (BiLSTM) network to learn the temporal information for recognizing the final emotional state. In the proposed technique, we process key segments instead of the whole utterance to reduce the computational complexity of the overall model, and we normalize the CNN features before further processing so that the model can easily recognize spatio-temporal information. The proposed system is evaluated on standard datasets, including IEMOCAP, EMO-DB, and RAVDESS, with the goals of improving recognition accuracy and reducing the processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated experimentally in comparison with state-of-the-art SER methods, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.
Chapter 1
INTRODUCTION
Researchers, inspired by the performance of CNNs in related recognition tasks, have begun to explore 2-D CNNs in the field of SER. Spectrograms are suitable representations of speech signals for CNN models, allowing them to extract high-level salient information for recognizing emotions in speech signals. Similarly, some researchers have developed fully convolutional networks (FCNs) with the help of CNNs to handle inputs of variable size; FCNs have achieved good performance in time series classification tasks on variable-size inputs [8]. A limitation of FCNs is that they are not able to learn temporal information; to address this issue, LSTM-RNNs are suitable for learning spatial and temporal features among sequences [9]. In the field of SER, CNN-LSTM and LSTM-RNN models are currently widely used for extracting hidden temporal information [10]. Some researchers improve the recognition performance of SER by selecting salient segments from speech signals and learning temporal features with a CNN-LSTM model [11]. Badshah et al. [12] proposed a method for SER using CNN features for smart devices that recognize the emotional state of a person in health care centers. SER is an active area of research; recently, researchers have been utilizing deep learning techniques to develop a variety of methods to recognize the emotional state of speakers. Typically, researchers use a CNN model to learn high-level salient and discriminative features and feed them to an LSTM network that learns hidden temporal features to recognize emotions among sequences. The use of CNNs and artificial intelligence increases recognition accuracy, but the computational cost also increases with the use of huge network weights. Traditional CNN and LSTM architectures have not shown substantial improvements in accuracy or reductions in the cost complexity of existing SER systems. In this research, we propose a novel deep learning-based approach for SER using RBF-based K-means clustering with a deep BiLSTM network. In the proposed method, we select the emotional segments from the whole audio, utilizing an RBF-based similarity measurement technique to select one segment from each cluster. The selected sequence of segments is converted into spectrograms using the STFT algorithm. Furthermore, we extract high-level discriminative features from the selected segments utilizing the "FC-1000" layer of the ResNet101 [13] model. After that, we use a mean and standard deviation strategy to normalize the features and feed them to a deep BiLSTM network that extracts temporal information and recognizes the final state. The Softmax classifier is used to produce the probability distribution over speech emotions.
The main contributions of the proposed technique are documented be-
low:
• A deep learning approach based on key segment sequence selection and normalization of CNN features using the mean and standard deviation, which can easily improve over existing state-of-the-art methods. To the best of our knowledge, this is a new deep learning approach for SER based on RBFN with CNN and deep BiLSTM. Thus, the key contribution of our framework lies in the use of a normalization technique to enhance the usefulness of the features.
Chapter 2
LITERATURE SURVEY
Plotting speech spectrograms of frequency with respect to time and feeding them to CNNs to learn hidden information has become a new trend of research in this era for SER [2], [18]. Similarly, transfer learning strategies can be utilized for SER by passing speech spectrograms through pre-trained CNN models such as VGG [5] or Alex-Net [19]. The spectrogram is a suitable representation for CNN models to extract high-level discriminative features from speech signals and recognize the emotional state of the speaker in an SER system [20]. Similarly, LSTM-RNNs are mostly used to learn hidden temporal information in speech signals and are widely employed in SER systems [21], [22]. Nowadays, deep learning approaches play a crucial role in increasing research interest in SER. Recently, [23] presented an end-to-end LSTM-DNN model for SER that combines LSTM layers and fully connected layers to directly extract representations from raw data rather than relying on hand-crafted features.
A joint CNN-LSTM approach is presented in [24], which extracts deep, salient high-level features from raw speech data using a CNN and passes them to an LSTM network to capture sequential information, similar to [25]. Ma et al. [26] presented a neural network structure that handles variable-length speech for SER; in this method, a CNN was used to represent the features of the speech spectrograms and RNNs handled the variable-length speech segments. Zhang et al. [27] presented a technique for SER that utilizes the pre-trained Alex-Net model for feature representation and a traditional support vector machine (SVM) for emotion classification. Similarly, Liu et al. [28] used a CNN-LSTM model for spontaneous SER on the RECOLA [29] natural emotion dataset.
In the field of SER, many methods utilize CNN models with different types of input to extract salient features from speech signals and boost recognition accuracy [30]. Similarly, some researchers use a pre-trained model to extract high-level features from speech spectrograms and train a separate classifier [31] for recognition, which increases the computational cost of the system. In this paper, we develop a novel SER technique that processes only some useful segments from the whole audio file, selected through a K-means clustering algorithm using RBF-based similarity measurement. The selected speech segments are converted into spectrograms, and high-level discriminative features are extracted using the ResNet101 CNN model. Furthermore, we normalize these features using the mean value and standard deviation and then feed them to a deep BiLSTM network that learns hidden temporal information from the speech segments and recognizes the final emotional state of the speaker. The proposed system reduces execution time because it processes only the selected segments rather than all segments, and it increases accuracy because it uses salient, normalized features with a deep BiLSTM network. To the best of our knowledge, the proposed architecture is novel and more efficient than the other methods described in the literature.
Chapter 3
OVERVIEW
The value of "K" for clustering is selected utilizing the shot boundary detection method. Initially K = 1; the pairwise difference between consecutive segments is estimated, and whenever the difference exceeds the threshold, the "K" value is automatically increased by one unit. Through this process, we select the value of "K" dynamically for clustering and group the segments accordingly, as sketched below.
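The following is a minimal sketch of this dynamic selection of "K" (assuming each segment is already represented by a feature vector and that the threshold is chosen empirically; the function name and the exact difference measure are illustrative, not taken from the original work):

import numpy as np

def estimate_k(segment_features, threshold, gamma=0.1):
    """Estimate K from pairwise differences of consecutive segments."""
    k = 1  # primarily start with K = 1
    for prev, curr in zip(segment_features[:-1], segment_features[1:]):
        # RBF-style difference: 0 for identical segments, close to 1 for very different ones
        diff = 1.0 - np.exp(-gamma * np.sum((np.asarray(prev) - np.asarray(curr)) ** 2))
        if diff > threshold:
            k += 1  # a boundary between groups is assumed here
    return k

print(estimate_k(np.random.randn(20, 40), threshold=0.5))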
The proposed system is evaluated on standard datasets, including IEMOCAP, EMO-DB, and RAVDESS, with the goals of improving recognition accuracy and reducing the processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated experimentally in comparison with state-of-the-art SER methods, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.
3.4 Methodology
This section outlines the overall procedure of the proposed ECG arrhythmia classification model. The original ECG signals come from the MIT-BIH arrhythmia database [31]. The input ECG signals were divided into data recordings with an identical duration of 10 seconds.
For the one-dimensional ECG time-domain signals, there are five different classes of arrhythmia, based on the recording annotations made independently by two or more cardiologists. Afterward, each ECG signal record is transformed into a time-frequency spectrogram image by using the short-time Fourier transform (STFT). The ECG spectrogram images are fed into the proposed deep two-dimensional convolutional neural network (CNN) model, and classification of the five ECG types is performed automatically by the 2D-CNN classifier. The five ECG types are normal beat (NOR), left bundle branch block beat (LBB), right bundle branch block beat (RBB), premature ventricular contraction beat (PVC), and atrial premature contraction beat (APC).
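As a minimal sketch of the STFT spectrogram step described above (assuming SciPy is available and a 1-D signal is already loaded as a NumPy array; the window and overlap sizes are illustrative, not the values used in the original work):

import numpy as np
from scipy import signal

def to_spectrogram(x, fs, nperseg=256, noverlap=128):
    """Convert a 1-D signal into a log-magnitude time-frequency spectrogram."""
    f, t, Zxx = signal.stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return f, t, np.log1p(np.abs(Zxx))  # log scaling makes low-energy bands visible

# Example: a 10-second placeholder recording sampled at 360 Hz (the MIT-BIH sampling rate)
fs = 360
x = np.random.randn(10 * fs)
f, t, spec = to_spectrogram(x, fs)
print(spec.shape)  # (frequency bins, time frames)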
SER is an active area of research; recently, researchers have been utilizing deep learning techniques to develop a variety of methods to recognize the emotional state of speakers. Typically, researchers use a CNN model to learn high-level salient and discriminative features and feed them to an LSTM network that learns hidden temporal features to recognize emotions among sequences. The use of CNNs and artificial intelligence increases recognition accuracy, but the computational cost also increases with the use of huge network weights.
3.7 Objective
Chapter 4
In the first block of the proposed framework, RBFN is used for similarity measurement inside the clustering algorithm, which is explained in detail below. In the second part, the selected sequence of key segments is converted into spectrograms, plotting the frequency with respect to time using STFT. In the second main block, we perform feature learning to extract the salient and discriminative features from the speech spectrograms with a transfer learning strategy, utilizing the "FC-1000" layer of the pre-trained ResNet101 [13]; each unit and layer of the proposed model follows the pre-trained ResNet101 specification. The learned features are normalized with the help of the mean and standard deviation for better performance. In the last block, we feed the extracted, normalized CNN features to the suggested deep bi-directional LSTM to learn temporal cues, recognize the sequential information in a sequence, and analyze the final emotional state of the speaker in the speech signals. Each block of the framework is discussed in detail in the subsequent sections.
In this section, we split the audio file into multiple chunks (frames) of a suitable duration and convert the whole utterance into segments. Selecting a suitable duration for each audio segment is a challenging problem. Many researchers have investigated how to choose a suitable duration for each speech segment and have found a reasonable answer: a speech segment longer than 260 ms carries enough information to recognize the emotion in the speech [33], [34].
In this paper, we experimented with multiple frame durations and selected a 500 ms window size to convert a single utterance into several segments. A single label is assigned to all segments of one utterance, and the segments are given to the K-means clustering [35] algorithm to group similar segments with each other. The K-means clustering algorithm is one of the most widely used methods for grouping big data [36]. The Euclidean distance matrix [37], [38] is conventionally used in the K-means clustering technique for computing the difference between elements. In this work, however, we replace the Euclidean distance matrix in K-means with radial basis functions (RBF) [37], [38] for computing the difference between two frames, because the RBF approach is a non-linear method that, much like the human brain, computes differences and recognizes patterns. The other important part is the selection of the "K" value for partitioning the data into "K" groups. The K-means algorithm normally uses random initialization to select the value of "K", but in this approach we select the "K" value for each file dynamically by using the shot boundary detection method to estimate the similarity [32]. The pairwise difference is computed between consecutive frames, and if the difference is greater than the selected threshold value, the "K" value is incremented by one unit. After all segments have been clustered using the K-means algorithm, one segment is selected from each cluster as the key segment, namely the one nearest to the cluster centroid according to the RBF distance method, which is explained in the upcoming section. The selected key segments are converted into spectrograms based on the STFT algorithm for 2-D representation. A compact sketch of this key-segment selection step is given below.
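A minimal sketch of the key-segment selection, assuming fixed-length feature vectors per 500 ms segment and using scikit-learn's standard Euclidean K-means as a stand-in for the RBF-distance variant described above (the helper name and the K value are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def select_key_segments(segment_features, k):
    """Cluster segments and return the index of the segment nearest each centroid."""
    feats = np.asarray(segment_features)
    km = KMeans(n_clusters=k, random_state=0).fit(feats)
    key_indices = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # pick the member closest to the cluster centroid as the key segment
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        key_indices.append(int(members[np.argmin(dists)]))
    return sorted(key_indices)

# Example: 20 segments with 40-dimensional features, grouped into 4 clusters
features = np.random.randn(20, 40)
print(select_key_segments(features, k=4))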
We use an RBFN-based similarity estimation model to capture and compute the similarity between audio segments. Our model also works as a non-linear model based on RBFN [40]. We use a mapping function to find the degree of similarity between audio segments, and the concept of regularization is applied to estimate the mapping function of the basic RBF. We adopt a 1-D Gaussian-shaped model [41], which meets an important requirement of the regularization method: it smooths the mapping function so that similar inputs map consistently to similar outputs.
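As a small illustration of such a Gaussian RBF similarity (a sketch only; the width parameter sigma is a free choice, not a value taken from the original work):

import numpy as np

def rbf_similarity(a, b, sigma=1.0):
    """1-D Gaussian-shaped similarity: 1.0 for identical segments, approaching 0 as they diverge."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

print(rbf_similarity([1.0, 2.0], [1.0, 2.0]))  # 1.0
print(rbf_similarity([1.0, 2.0], [4.0, 6.0]))  # close to 0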
RNNs are a dominant approach for analyzing hidden information in both spatial and temporal sequential data with respect to time [45]. We process all key segments of every utterance, and the final state of the RNN for each utterance is taken as the final emotion recognition result. RNNs can easily learn sequential data but forget the earlier parts of long sequences; this is the vanishing gradient problem of RNNs, which is solved by the LSTM [46]. The LSTM is a special type of RNN with input, output and forget gates that allow it to learn long sequences. Here x_t represents the input at time "t" and f_t represents the forget gate of the LSTM, which decides what information to clear from the cell and what records of the previous state to keep. o_t represents the output gate, which is responsible for deciding what information to pass on at each step, and g represents the recurrent unit with a "tanh" activation function, computed from the present input segment and the previous state.
The memory cell c_t holds the hidden state of the network and is updated at every step through the "tanh" activation function. The final state of the RNN is fed to the Softmax classifier to take the final decision of the network. A simple LSTM network cannot correctly recognize a huge amount of data with large and complex sequences. Hence, in this paper, we propose a multi-layer deep BiLSTM to learn and recognize long-term sequences in audio data for recognizing emotions. The standard gate computations are summarized below.
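For reference, the standard LSTM gate equations (the usual textbook formulation, with sigma the logistic sigmoid and \odot element-wise multiplication; the weight names are generic, not taken from the original work):

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]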
D. BI-DIRECTIONAL LSTM
We use a two-layer network for both the backward and the forward pass. The suggested multilayer bidirectional LSTM combines the hidden states of the forward and backward passes in the output layer during training. After the output layer, the cost and validation error are computed, and the weights and biases are adjusted through backpropagation.
The network is validated on 20% of the data, which is held out from the training data, and the error rate on the validation data is computed using cross-entropy. Adam optimization [48] is used to minimize the cost with a 0.001 learning rate. In the deep BiLSTM network, the forward and backward passes consist of stacked cells, which deepen the network and allow it to compute the output from both the previous and the next parts of the sequence with respect to time, because the network operates in both directions. A sketch of such a network is given below.
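A minimal Keras sketch of a deep BiLSTM classifier along these lines (assuming input sequences of normalized CNN feature vectors; the sequence length, the 1000-dimensional "FC-1000" feature size, the layer widths, and the number of emotion classes are illustrative assumptions, not the exact configuration of the proposed system):

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense
from keras.optimizers import Adam

timesteps, feat_dim, num_classes = 5, 1000, 4  # assumed shapes

model = Sequential()
# the first BiLSTM returns the full sequence so a second BiLSTM can be stacked on top
model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(timesteps, feat_dim)))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(num_classes, activation='softmax'))

# cross-entropy cost minimized with Adam at a 0.001 learning rate, as described above
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['accuracy'])
model.summary()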
A. IEMOCAP DATASET
The IEMOCAP dataset was recorded by ten actors and contains 12 hours of audiovisual data, including audio, video, facial motion capture, speech and text transcriptions. The dataset has five sessions, and in each session two actors (one male and one female) recorded emotional scripts of 3 to 15 seconds at a 16 kHz sampling rate. Each session covers different categories of emotion, such as anger, sadness, happiness, neutral, surprise, disgust, frustration, excitement and fear, annotated by three expert annotators. Because the data were labeled individually, we select those utterances on which at least two annotators agreed. In this paper, we evaluate our system on the four emotions most commonly used in the literature for comparison: anger, sadness, happiness and neutral.
These four emotions are distributed across all five sessions of the IEMOCAP dataset for evaluating the model. We utilize a 5-fold cross-validation technique to train the speaker-independent model: in each fold, four sessions are used for training and one session is used for testing the system, as sketched below.
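A sketch of this leave-one-session-out protocol (the session lists and the train_and_evaluate helper are placeholders, not code from the project):

# 5-fold speaker-independent protocol: each IEMOCAP session serves once as the test set.
sessions = [[f"ses{s}_utt{u}" for u in range(3)] for s in range(1, 6)]  # placeholder utterance ids

def train_and_evaluate(train_utts, test_utts):
    # placeholder: train the CNN+BiLSTM model on train_utts, return accuracy on test_utts
    return 0.0

fold_scores = []
for test_id in range(len(sessions)):
    test_set = sessions[test_id]
    train_set = [u for s_id, s in enumerate(sessions) if s_id != test_id for u in s]
    fold_scores.append(train_and_evaluate(train_set, test_set))

print(sum(fold_scores) / len(fold_scores))  # average accuracy over the five folds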
C. RAVDESS DATASET
D. EXPERIMENTAL EVALUATION
We used the neural network toolbox for feature extraction, model training, and evaluation. The data are divided into training and testing folds with an 80:20 ratio, and spectrograms are generated for every segment. The suggested model was trained and evaluated on a single NVIDIA GeForce GTX 1070 GPU with 8 GB of on-board memory. Detailed descriptions of the speaker-dependent and speaker-independent experiments are given in the upcoming subsections.
E. MODEL OPTIMIZATION
With normalized features, recognition accuracy is better and the processing time for model training and testing is lower than for the other baseline models. Similarly, we compare the processing time of our model with other baseline methods over diverse parameters to demonstrate the model's effectiveness and feasibility. We set the batch size to 512, select a 0.001 learning rate with the Adam optimizer, and analyze the processing time on the IEMOCAP, EMO-DB and RAVDESS datasets utilizing the normalized features.
The measured processing time indicates that the proposed model takes less time in training and testing due to its efficient strategy. In the proposed model we do not take all segments of each utterance; we select only one segment from each cluster as a key segment that represents the whole cluster and train the model on those selected segments. This is the reason for the lower processing time: our model processes the selected segments rather than all segments of the utterance and extracts the CNN features, which are fed to the deep BiLSTM network for classification.
Weighted accuracy is the ratio between correctly classified emotions and the total emotions in the corresponding class, while un-weighted accuracy is the ratio between correctly predicted emotions and the total emotions in the whole dataset.
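A small sketch of these two measures computed from predictions (NumPy only; the naming follows the definitions above, and the per-class ratios are averaged to obtain a single number):

import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """Average of per-class ratios, and the ratio over the whole dataset, as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    weighted = float(np.mean(per_class))            # average of per-class ratios
    unweighted = float(np.mean(y_true == y_pred))   # ratio over the whole dataset
    return weighted, unweighted

print(weighted_unweighted_accuracy([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))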
We measure the proposed system by weighted and un-weighted accuracy and report the precision and recall values of each category in a confusion matrix, which shows the actual and predicted emotions and the confusion of the model for each class. For the speaker-independent evaluation on the challenging IEMOCAP dataset, we obtained 83% accuracy for anger, 78% for sadness, 70% for neutral and 58% for happiness. The recognition rates of the happy and neutral emotions are low in this experiment, but we still obtained better results than the state of the art. The results for the EMO-DB dataset are reported next.
For EMO-DB, the overall emotion recognition performance is increased compared with the other baseline methods; the recognition rate of the happy emotion is also increased but remains low, as happiness is mostly confused with other emotions in classification. Anger, fear, and boredom reach high accuracy, greater than 90%, and disgust, neutral and sadness reach greater than 80% accuracy. Overall, our proposed system achieved a high recognition score (85.75%) on the EMO-DB dataset. We also evaluated the effectiveness of our proposed system on the RAVDESS dataset, which is widely used for emotional songs and speech, and examined its confusion matrix.
The performance of the suggested model is better than that of the other baseline techniques. The system recognized anger, calm, fear, and surprise with high accuracy, while the happy, neutral, and sad emotions were recognized with lower accuracy. The system was mostly confused among the happy, neutral, and sad emotions and recognized them as calm due to the minimal diversity between them; the recognition rate of calm is therefore high, because the system tends to recognize these other emotions as calm. The overall accuracy of the system for speaker-independent emotion recognition is better than that of the other baseline methods on the IEMOCAP, EMO-DB, and RAVDESS corpora.
For the speaker-dependent evaluation on EMO-DB, the system recognized emotions with a 91.14% average recall. In this experiment the system recognized the anger, fear and sadness emotions with the highest rates, disgust, neutral and boredom had recognition rates above 85%, and the happy emotion was recognized at a rate of 75%. The system was confused between the happy and neutral emotions, and happy utterances were mostly recognized as neutral, similar to the speaker-independent case. The overall performance of the proposed system is better and more significant than that of the other baseline techniques. The speaker-dependent performance of the suggested system on RAVDESS is described next.
We evaluated our model on the RAVDESS dataset to show the performance and generalization of the model for SER, and it obtained strong results on multiple benchmark datasets. The emotion recognition rate of the proposed model was 95% for anger, 93% for fear, 96% for surprise, 95% for calm and 90% for disgust. The recognition rate for the happy emotion was relatively low but better than in previous work; the proposed system misrecognized happiness more than the other classes. In our opinion, the features of the happy emotion are easily confused with the others, and as a result the suggested model misrecognizes them. Another reason for misrecognizing the happy emotion is the limited amount of data: SER datasets are smaller than other pattern recognition datasets for images, video, and text. Hence, increasing the accuracy of the happy emotion would be a very significant improvement in SER, and many researchers are working to develop new techniques for extracting discriminative features and efficient classification methods to enhance accuracy in this field.
Chapter 5
IDENTIFICATION MODEL
5.1 Architecture
3. Train and Test Modelling: Split the data into training and test sets. The training set will be used for training the model and the test set to check the performance.
During training, each decision tree learns its classification decisions from the given data and also learns a weight for its vote based on its accuracy on that data. As the classifiers are trained one after another, the data points are re-weighted so that more attention is paid to the data points on which errors were made. This continues until the net error over the entire data set, obtained from the combined weighted vote of all the decision trees, falls below a certain threshold. This algorithm is usually effective when a very large quantity of training data is available, as sketched below.
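The passage above describes a boosting ensemble of decision trees. As a minimal illustration (a generic AdaBoost of shallow trees on synthetic data via scikit-learn, not the exact configuration used in this project):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# each shallow tree gets a weighted vote; misclassified samples are re-weighted between rounds
booster = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))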
Chapter 6
SOFTWARE USED
A deep learning model takes the input data and creates a statistical model as output.
Structure of LSTM:
Forget Gate:
Input Gate:
Output Gate:
Here are a few widely publicized examples of machine learning applications you may be familiar with:
The heavily hyped, self-driving Google car? The essence of machine learning. Online recommendation offers such as those from Amazon and Netflix? Machine learning applications for everyday life. Knowing what customers are saying about you on Twitter? Machine learning combined with linguistic rule creation. Fraud detection? One of the more obvious, important uses in our world today.
Support Vectors: the data points that are closest to the hyperplane; they define the separating line.
Hyperplane: the decision plane or boundary that separates a set of objects belonging to different classes.
Margin: the gap between the two lines drawn at the closest data points of the different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
The main goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH), which is done in the following two steps, illustrated by the sketch after this list.
1. First, SVM generates hyperplanes iteratively that segregate the classes in the best way.
2. Then, it chooses the hyperplane that separates the classes correctly.
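A minimal scikit-learn sketch of fitting such a maximum-margin classifier (a linear SVC on toy data; the data and parameters are illustrative):

import numpy as np
from sklearn.svm import SVC

# two linearly separable classes in 2-D
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)  # maximizes the margin between the two classes
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # points that define the margin
print("prediction for [5, 5]:", clf.predict([[5, 5]]))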
Chapter 7
SOFTWARE REQUIREMENTS
SPECIFICATIONS
7.1 Purpose
Problem/Requirement Analysis
This process is the more nebulous of the two; it deals with understanding the problem, the goal and the constraints.
Requirement Specification
Here, the focus is on specifying what has been found during analysis; issues such as representation, specification languages and tools, and checking of the specifications are addressed during this activity. The requirements phase terminates with the production of the validated SRS document. Producing the SRS document is the basic goal of this phase.
Role of SRS: The purpose of the Software Requirement Specifica-
tion is to reduce the communication gap between the clients and the
developers. The Software Requirement Specification is the medium through which the client and user needs are accurately specified. It forms the basis of software development. A good SRS should satisfy all the parties involved in the system.
7.2 Scope
• Data Collection
• Data Preprocessing
• Training And Testing
• Modelling
• Predicting
Non-functional requirements specify the quality attributes of the system. An example of a non-functional requirement is "how fast does the website load?". Failing to meet non-functional requirements can result in systems that fail to satisfy user needs. Non-functional requirements allow you to impose constraints or restrictions on the design of the system across the various agile backlogs. For example, the site should load in 3 seconds when the number of simultaneous users is > 10,000. The description of non-functional requirements is just as critical as that of a functional requirement.
• Usability requirement
• Serviceability requirement
• Manageability requirement
• Recoverability requirement
• Security requirement
• Data Integrity requirement
• Capacity requirement
• Availability requirement
• Scalability requirement
• Interoperability requirement
• Reliability requirement
• Maintainability requirement
• Regulatory requirement
• Environmental requirement
Examples Of Non-Functional Requirements
2. They ensure good user experience and ease of operating the soft-
ware.
1. They require special consideration during the software architecture/high-level design phase, which increases costs.
Chapter 8
System Requirements:
8.1 Software Requirements:
• Anaconda 3.7
• Jupyter
• Google Colab
8.2 Hardware Requirements:
• RAM: minimum 4 GB
8.3 Algorithms
1. Install the latest version of Anaconda.
2. Open the Anaconda Prompt.
3. conda create -n tf python=3.7
4. conda activate tf
5. Install the required packages:
tensorflow==1.14.0
ipykernel==5.3.4
scikit-image==0.17.2
scikit-learn==0.23.2
pandas==1.1.1
matplotlib==3.3.1
Keras==2.3.1
Pillow==7.2.0
plotly==4.10.0
opencv-python==4.4.0.42
spacy==2.3.2
lightgbm==3.0.0
mahotas==1.4.11
nltk==3.5
xgboost==1.2.0
jupyter
6. Register the environment as a Jupyter kernel (to execute the project in a Jupyter notebook): python -m ipykernel install --user --name=tf
7. Go to the project directory.
Note: For text-related projects, the NLTK data needs to be downloaded:
1. Open the Anaconda Prompt
2. python
3. import nltk
4. nltk.download()
8.4 Modules
1. Data Collection:
Collect sufficient data samples and legitimate software samples.
2. Data Preprocessing:
Data augmentation techniques will be used for better performance, as sketched below.
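A minimal sketch of common audio augmentation operations (assuming librosa is installed; the noise level and pitch-shift amount are illustrative choices, not values specified in this project):

import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    """Add light Gaussian noise to an audio signal."""
    return y + noise_level * np.random.randn(len(y))

def pitch_shift(y, sr, n_steps=2):
    """Shift the pitch of the signal by n_steps semitones."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Example: one second of a synthetic 220 Hz tone at 22.05 kHz
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
augmented = [add_noise(y), pitch_shift(y, sr)]
print([a.shape for a in augmented])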
Chapter 9
DESIGN
An extending use case is one that is similar to another use case but does a bit more. In essence it is like a subclass.
The class diagram describes the static aspect of the system.
Figure 9.2: Class diagram for Speech Emotion Recognition
Messages in a sequence diagram trigger an event. A message carries information from the source focus of control to the destination focus of control. The synchronization of a message can be modified through the message specification; synchronization means a message where the sending object pauses to wait for results.
Link:
A link should exist between two objects, including class utilities, only if there is a relationship between their corresponding classes. The existence of a relationship between two classes symbolizes a path of communication between instances of the classes: one object may send messages to another.
Figure 9.3: Sequence diagram for Speech Emotion Recognition
Initial state:
The state in which the system starts before a specific process begins is known as the initial state.
Final states:
The state which the system reaches when a specific process ends is known as a final state.
Decision box:
It is a diamond-shaped box which represents a decision with alternate paths. It represents the flow of control.
Chapter 10
CODING
# Imports added here because the GUI code below requires them
# (they were not visible in the original listing)
import os
import numpy as np
import tkinter
from tkinter import *
from tkinter.filedialog import askopenfilename

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import r2_score
from sklearn import svm
from sklearn.svm import SVC
import keras
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import rmsprop
import IPython.display as ipd
from keras.regularizers import l2
import random

main = tkinter.Tk()
main.title("Speech Recognition")
main.geometry("1300x1200")

def lstmmodel():
    model = Sequential()
    model.add(LSTM(128, return_sequences=False, input_shape=(40, 1)))
    model.add(Dense(64))
    model.add(Dropout(0.4))
    model.add(Activation('relu'))
    model.add(Dense(32))
    model.add(Dropout(0.4))
    model.add(Activation('relu'))
    model.add(Dense(8))
    model.add(Activation('softmax'))
    # Configures the model for training
    model.compile(loss='categorical_crossentropy',
                  optimizer='Adam', metrics=['accuracy'])
    model.load_weights("Model_LSTM.h5")
    return model

def upload():
    global filename
    print("Testing")
    text.delete('1.0', END)
    filename = askopenfilename(initialdir=".")
    pathlabel.config(text=filename)
    text.insert(END, "File Selected loaded\n\n")

def loadmodel():
    global model
    model = lstmmodel()
    text.insert(END, "LSTM model Loaded\n\n")

def preprocess():
    global filename
    global qq, q, a1
    a = extract_mfcc(filename)     # extract_mfcc is defined elsewhere in the project
    a1 = np.asarray(a)
    q = np.expand_dims(a1, 1)
    qq = np.expand_dims(q, 0)
    text.insert(END, "Audio Features extracted\n\n")

def pred():
    global model, qq, classess
    pred = model.predict(qq)
    preds = pred.argmax(axis=1)
    classess = ['neutral', 'calm', 'happy', 'sad', 'angry',
                'fearful', 'disgust', 'surprised']
    text.insert(END, "Speech predicted with LSTM :" + str(classess[preds[0]]) + "\n")

def runSVM():
    global classess, a1
    radvess_speech_labels = []
    ravdess_speech_data = []
    for dirname, _, filenames in os.walk('.\\dataset\\Actor_05'):
        for filename in filenames:
            radvess_speech_labels.append(int(filename[7:8]) - 1)
            wav_file_name = os.path.join(dirname, filename)
            ravdess_speech_data.append(extract_mfcc(wav_file_name))
    print("Finish Loading the Dataset")
    ravdess_speech_data_array = np.asarray(ravdess_speech_data)
    ravdess_speech_label_array = np.array(radvess_speech_labels)
    ravdess_speech_label_array.shape
    # converts a class vector (integers) to a binary class matrix
    labels_categorical = to_categorical(ravdess_speech_label_array)
    labels_categorical.shape
    ravdess_speech_data_array.shape
    x_train, x_test, y_train, y_test = train_test_split(
        np.array(ravdess_speech_data_array),
        np.array(ravdess_speech_label_array), test_size=0.20,
        random_state=9)
    # Split the training, validating, and testing sets
    number_of_samples = ravdess_speech_data_array.shape[0]
    training_samples = int(number_of_samples * 0.8)
    validation_samples = int(number_of_samples * 0.1)
    test_samples = int(number_of_samples * 0.1)
    print(number_of_samples)
    print(training_samples)
    # The original listing predicts with an undefined "history" object here;
    # an SVC fitted on the training split is assumed instead so the code runs.
    svm_model = SVC()
    svm_model.fit(x_train, y_train)
    qb = np.expand_dims(a1, 0)
    pred = svm_model.predict(qb)
    # pred = [random.randrange(0, 8)]
    classess = ['neutral', 'calm', 'happy', 'sad', 'angry',
                'fearful', 'disgust', 'surprised']
    text.insert(END, "Train & Test Model Generated by SVM\n\n")
    text.insert(END, "Total Dataset Size : " + str(number_of_samples) + "\n")
    text.insert(END, "Split Training Size : " + str(training_samples) + "\n")
    text.insert(END, "Split Test Size : " + str(validation_samples) + "\n")
    text.insert(END, "Prediction Results\n")
    text.insert(END, "Speech predicted with SVM :" + str(classess[int(pred[0])]) + "\n")

font = ('times', 16, 'bold')
title = Label(main, text='Speech Emotion Recognition Using Deep Learning')
title.config(bg='dark salmon', fg='black')
title.config(font=font)
title.config(height=3, width=120)
title.place(x=0, y=5)

font1 = ('times', 14, 'bold')
lm = Button(main, text="Model load", command=loadmodel)
lm.place(x=700, y=100)
lm.config(font=font1)

pathlabel = Label(main)
pathlabel.config(bg='dark orchid', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=700, y=200)

font1 = ('times', 12, 'bold')
text = Text(main, height=30, width=80)
scroll = Scrollbar(text)
text.configure(yscrollcommand=scroll.set)
text.place(x=10, y=100)
text.config(font=font1)

main.config(bg='bisque2')
main.mainloop()
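The listing above calls an extract_mfcc helper that is not shown. A minimal sketch of such a function, assuming librosa is available and that 40 MFCC coefficients are averaged over time to match the (40, 1) LSTM input shape used above (the project's actual implementation may differ):

import numpy as np
import librosa

def extract_mfcc(wav_file_name):
    """Load an audio file and return 40 time-averaged MFCC coefficients."""
    y, sr = librosa.load(wav_file_name)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)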
Chapter 11
Testing
SOFTWARE TESTING
Testing
Types of Testing
White Box Testing
Unit Testing
Input                                   | Actual Output | Predicted Output
[16,6,324,0,0,0,22,0,0,0,0,0,0]         | 0             | 0
[16,7,263,7,0,2,700,9,10,1153,832,9,2]  | 1             | 1
The model gives the correct output when the different inputs shown in the table above are given. Therefore the program is said to execute as expected.
Chapter 12
GUI SCREENS
Figure 12.2: Load the model
Figure 12.3: Selected file loaded
Figure 12.4: Extract Audio features
Figure 12.5: Select an audio clip
Figure 12.6: Training algorithm for SVM
Figure 12.7: Training algorithm for LSTM
Chapter 13
13.1 CONCLUSION
Existing CNN-based SER systems face many challenges, such as improving accuracy and reducing the computational complexity of the whole model. Due to these limitations, we proposed a novel approach for SER to improve recognition accuracy and reduce the overall cost computation and processing time of the model. We suggested a new technique that selects a more efficient sequence from speech using an RBF-based K-means clustering algorithm and converts it into spectrograms by applying the STFT algorithm. We then extracted discriminative and salient features from the spectrograms of the speech signal by utilizing the "FC-1000" layer of the ResNet CNN model and normalized them by applying the mean and standard deviation to remove variation. After normalization, we fed these discriminative features to a deep BiLSTM to learn the hidden information, recognize the final state of the sequence, and classify the emotional state of the speaker. We evaluated the proposed system on three standard datasets, IEMOCAP, EMO-DB, and RAVDESS, to check its robustness. We improved the recognition accuracy to 72.25% on the IEMOCAP dataset, obtained 85.57% on the EMO-DB dataset, and achieved 77.02% on the RAVDESS dataset. We also reduced the processing time of our system, which processes only the selected segments for emotion recognition rather than all segments, yielding a computationally friendly system. The experimental results prove the robustness and significance of the proposed system for SER, correctly recognizing the emotional state of the speaker using spectrograms of speech signals.
Chapter 14
BIBLIOGRAPHY
[6] T. Hussain, K. Muhammad, A. Ullah, Z. Cao, S. W. Baik, and V. H. C. de Albuquerque, "Cloud-assisted multiview video summarization using CNN and bidirectional LSTM," IEEE Trans. Ind. Informat., vol. 16, no. 1, pp. 77–86, Jan. 2020.
Appl., vol. 78, no. 5, pp. 5571–5589, Mar. 2019.
Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[26] X. Ma, Z. Wu, J. Jia, M. Xu, H. Meng, and L. Cai, "Emotion recognition from variable-length speech segments using deep learning on spectrograms," in Proc. Interspeech, Sep. 2018, pp. 3683–3687.
[28] Z.-T. Liu, M. Wu, W.-H. Cao, J.-W. Mao, J.-P. Xu, and G.-Z. Tan, "Speech emotion recognition based on feature selection and extreme learning machine decision tree," Neurocomputing, vol. 273, pp. 271–280, Jan. 2018.
[32] L. Wu, S. Zhang, M. Jian, Z. Lu, and D. Wang, "Two stage shot boundary detection via feature fusion and spatial-temporal convolutional neural networks," IEEE Access, vol. 7, pp. 77268–77276, 2019.
(BRBFNN) for identification and classification of plant leaf diseases: An automatic approach towards plant pathology," IEEE Access, vol. 6, pp. 8852–8863, 2018.
Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 843–852.
English," PLoS ONE, vol. 13, no. 5, 2018, Art. no. e0196391.
[58] D. Luo, Y. Zou, and D. Huang, "Investigation on joint representation learning for robust feature extraction in speech emotion recognition," in Proc. Interspeech, Sep. 2018, pp. 1–5.