
International Journal of Speech Technology (2023) 26:721–733
https://doi.org/10.1007/s10772-023-10047-8

Deep learning structure for emotion prediction using MFCC from native languages

A. Suresh Rao¹ · A. Pramod Reddy¹ · Pragathi Vulpala¹ · K. Shwetha Rani¹ · P. Hemalatha²

¹ Department of CSE, TKR College of Engineering & Technology, Saroornagar, Hyderabad, Telangana 500097, India
² Vignan's Institute of Management & Technology for Women, Ghatkesar, Hyderabad, Telangana 501301, India
Corresponding author: A. Pramod Reddy (pramodaeluri@gmail.com)

Received: 23 April 2023 / Accepted: 7 September 2023 / Published online: 5 October 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023

Abstract
AI for speech has been extended to recognize and categorize emotions conveyed through speech. The research employed audio recordings from several datasets, including the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin emotional speech database (EMO-DB), and a self-developed Telugu dataset. The main contribution is the use of deep neural network-based models to categorize emotional reactions elicited by spoken monologues in various situations. The goal is to recognize eight distinct emotions: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. The model's performance was evaluated using the F1 score, a measure that combines precision and recall. The model achieved a weighted average F1 score of 0.91 on the test set and performed well in the "Angry" class with a score of 0.95. The model's performance in the "Sad" class was lower, at 0.87, which is still better than state-of-the-art results. The contribution is an effective model for recognizing emotional reactions conveyed through spoken language, using neural networks and a combination of datasets to improve the understanding of emotions in speech.

Keywords Emotion recognition · Deep learning · CNN · MFCC · RAVDESS

1 Introduction

Emotions play a vital role in communication, affecting how messages are conveyed and received. Integrating speech emotion recognition (SER) (Bediou et al., 2005) into human–machine interaction allows systems to better understand and respond to users' emotional states, creating a more natural and effective interaction.

Challenges in SER. Recognizing emotions from speech is challenging due to the variability in how people express emotions. Factors like accent, tone, pitch, and cultural differences influence how emotions are conveyed in speech. Despite advancements, accurately interpreting emotions from speech remains a complex problem.

Potential applications. Multimedia messaging: SER can enhance messaging platforms by enabling systems to interpret and respond to users' emotional cues, leading to more personalized and contextually relevant interactions. Video games: incorporating SER into gaming environments can create immersive experiences, since games can adapt based on players' emotional states, making gameplay more engaging and dynamic. Consumer evaluations: in retail and product evaluation contexts, SER can provide real-time insights into customers' emotions, helping companies tailor their products and services to meet customer needs. Healthcare: SER can be utilized in healthcare settings to monitor patients' emotional states, potentially aiding in diagnosing and treating mental health conditions; automated emotional analysis can complement traditional assessments. Benefits include enhanced user engagement, real-time feedback, and data-driven insights. While emotion recognition offers numerous benefits, it also raises ethical concerns related to privacy, consent, and potential misuse of emotional data, so careful consideration and regulation are necessary to ensure responsible implementation. The integration of speech emotion recognition into various intelligent systems holds substantial potential to revolutionize the way humans interact with technology, enhancing experiences in domains ranging from entertainment to healthcare (Chen et al., 2018) and beyond.

The traditional approach to SER involved extracting para-linguistic features and employing machine learning algorithms for emotion recognition. While this approach (Reddy & Vijayarajan, 2017; Kwon et al., 2003; El Ayadi et al., 2011) had limitations in terms of generalization and feature engineering, advancements in deep learning have provided opportunities for more automated and effective emotion recognition by learning directly from raw data. However, challenges related to emotionally irrelevant variables, data quality, and ethical considerations still need to be addressed in the development of robust emotion recognition systems.

These characteristics do not rely on the lexical or speaker context. Energy-related general features include pitch, formants, zero-crossing rate (ZCR), intonation, Mel-frequency cepstral coefficients (MFCC), and linear prediction cepstral coefficients (LPCC). Furthermore, several classification methods have been implemented, including the hidden Markov model (HMM), support vector machine (SVM), and k-nearest neighbors (KNN). A system with many speech characteristics, including MFCC, MFCC delta, and auto-correlation function coefficients (ACFC), was used in this study to assess emotions. These features and classification methods demonstrate the complexity and multi-dimensional nature of analyzing speech data for emotion recognition. While traditional methods have provided insights into emotion recognition, deep learning approaches have shown substantial promise in automatically learning relevant features from raw data and improving classification performance. These advancements, coupled with large and diverse datasets, have the potential to enhance the accuracy and generalization capabilities of emotion recognition systems (Zhang et al., 2013).

The mentioned findings and approaches reflect a range of strategies employed in speech emotion recognition and related fields. These include the use of GMMs, unsupervised approaches with PCA, innovative feature extraction techniques like DWT-based MFCC and hyperprosodic features, and the application of deep networks for classification. These methods collectively contribute to advancing emotion recognition and speech analysis by addressing challenges related to noisy environments, varying conditions, and the nuanced nature of emotional expression in speech. By combining techniques from different domains, researchers aim to improve the accuracy and robustness of systems that can automatically detect and understand human emotions from speech data. One experiment yielded an emotional classification accuracy of 74.45% using Gaussian mixture models (GMM). In another work, Wang et al. (2019) proposed an unsupervised approach to modeling multiple speech emotions; it integrates the classifier with principal component analysis (PCA) to obtain a precise interpretation that can be preserved invariably in the organization. Al-Ali et al. (2017) evaluated the use of DWT-based MFCC features for forensic speaker authentication against standard MFCC features; the performance of each feature is evaluated either with or without warping. Experimental findings indicate that the fused DWT-MFCC warping feature and the standard MFCC approach have improved performance under most environmental noise, reverberation, and rushed-speech conditions. In Jin and Liu (2017), hyperprosodic features were extracted and a deep network was used for classification. Vasquez-Correa et al. (2016) extracted features using the bionic wavelet transform and classified them with Gaussian mixture models (GMM). While a number of prior studies on the recognition of audiovisual emotions are included in the literature, they are all of limited reliability. One of the main considerations is how these two signals can be derived and combined (Khaleghi et al., 2013). In certain cases, hand-crafted features are derived and the two signals have their properties combined with one weight. This article proposes using a deep network to extract and integrate features; these networks promise to be fully non-linear. The classification is done using a CNN with TensorFlow. Deep learning is now widely used in numerous applications, including image processing, voice processing, emotion recognition, and speech analysis. The exact nature of the deep model and the availability of massive data vary across the different applications using the deep-learning method (Chen et al., 2017). The contributions of the paper are: (i) the proposed system is trained on RAVDESS (Livingstone & Russo, 2018) and EMO-DB using deep networks in noisy environments for robust recognition of emotion; (ii) a 1D convolutional neural network (1D-CNN) has been implemented over compressed speech signals.

A few relevant research challenges in speech emotion recognition are briefly described here. (1) Emotion is a tough and confusing word to define accurately. The term emotion has been interpreted in various contexts by various persons. Emotion is a distinct mental state that arises intuitively rather than through deliberate effort, making it difficult to characterize scientifically. As a result, no universally acknowledged objective definition of the concept of emotion exists. This is the primary factor inhibiting progress with a scientific research approach (Schroder et al., 2011). (2) There are no standardized speech datasets that can be used to test the efficiency of recognition system methodologies. The overwhelming majority of automatic emotion systems depend on comprehensive emotions, despite the fact that real-life emotional responses are pervasive and foundational. Many databases are made by professional artists, whereas others are made by moderately experienced or inexperienced people. Because most datasets do not contain a wide range of emotions, emotion recognition research is limited to 5–6 emotions. (3) Speech emotion identification systems are often unable to recognize emotions in both natural and noisy environments. The rest of this paper is organized as follows: Section 2 presents related work; Section 3 describes the working model, with a brief introduction to the speech datasets and feature extraction; Section 4 presents the experimental setup; Section 5 presents the results and discussion; and the last section concludes this paper with future scope.
2 Related work

The main objective is to develop a model for emotional speech recognition using libraries such as librosa, SoundFile, and sklearn to construct a CNN (convolutional neural network) with TensorFlow for recognizing emotion from a speech file. The data are loaded, features are extracted, and the data collection is separated into training and test sets; then the model is built and trained, and finally its precision can be measured. Many classification methods have been proposed in recent years for emotion recognition from speech. In one method, introduced by Pinto et al. (2020), a 1D CNN with dense layers was used on extracted MFCC features, while decision trees and a random forest with 1000 trees achieved 75% and 78% average accuracy; the proposed method's experiments were driven through the RAVDESS dataset. A new set of features was extracted with the FFMPEG library and tuned to a deep learning model, achieving 91% accuracy. A method introduced by Iqbal and Barua (2019) was used to classify gender-based distinctions, with an overall accuracy of approximately 40 to 80% depending on the particular task, using gradient boosting, KNN, and SVM on the RAVDESS dataset; linear classification was used in these works. Different datasets were experimented with for the proposed classifiers: a 100% average was achieved with SVM and KNN on the RAVDESS dataset for male recordings of the emotion anger, where gradient boosting performed poorly. In the approach of Gao et al. (2017), depth-first search is applied to extract the best feature, and all prosodic and spectral features are extracted, including pitch, ZCR, MFCC, and combinations of features. In this method, EMO-DB and RAVDESS are used to recognize emotions using SVM, and sequential forward floating selection (SFFS) is used for feature selection. Trigeorgis et al. (2016) put forward an end-to-end SER structure with a long short-term memory (LSTM) network on top of convolution operations: every 6 s segment of the raw signal is passed through an FIR filter to minimize noise and then through multiple 1D convolution layers to extract high-level features for the LSTM. In terms of state-of-the-art algorithms (Huang et al., 2014), this system has done very well on some datasets. The short-time Fourier transform (STFT) is a well-known way to transform an input signal into picture data called a spectrogram, and STFT is very common in this field. For high recognition accuracy, Yenigalla et al. (2018) used a 1D fully connected CNN with an input layer tuned with an SVM classifier, where a 15x60 spectrogram image is used. Jannat et al. (2018) and others trained three deep networks separately on the pre-processed image data and audio waveforms: one on image data alone, another on audio waveforms alone, and one on image and waveform data together. Zhang et al. (2015) published one of several models using the RAVDESS data collection but classifying only certain emotions, obtaining total accuracy better than the model suggested in this paper but less precise. Four consecutive models of emotional recognition have been proposed for speech and song: a straightforward model, the hierarchical single-task model, and the hierarchical multi-task model; the basic model provides a single, domain-independent classification system, and the framework has been used in testing both models. With a sampling rate of 44.1 kHz and a mono channel, the KES (Geethashree & Ravi, 2018) database was recorded in an acoustic environment to reduce HNR and SNR. Before recording, all of the speakers were given sentences and time to prepare, and to avoid influencing each other's speaking style, each speaker's speech was recorded separately in separate sessions. Linear prediction coefficients (LPC) and LFCC are the features obtained for the purpose of emotion classification, and the Praat tool was used to obtain these spectral coefficients. LPC is a popular speech signal analysis technique: it minimizes the prediction error in a regression sense to find the coefficients of a forward predictor applied as a function of time, so the coefficients of a p-th order model are determined, predicting the current sample from the previous p samples (a brief sketch of this prediction is given after this section). The authors concluded from their findings that humans recognize emotions better than classifiers, that the KNN technique is ineffective for emotion classification, so pattern recognition and weight adaptation algorithms are more efficient, and that cepstrum coefficients outperform LPC in emotion recognition systems. Fear and grief are the most puzzling emotions, according to the KES database. In another study, Rajisha et al. (2016) used Mel-frequency cepstral coefficients (MFCCs), pitch, and short-time energy to perform automated emotion recognition of four different emotions: anger, joy, sadness, and the neutral state. As part of the project, characteristics were extracted and analyzed from a Malayalam language database. The classification was done using ANN and SVM to compare classifier outputs.
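The p-th order linear prediction described above (predicting each sample from the previous p samples and minimizing the residual error) can be illustrated with librosa's LPC routine. This is a generic, self-contained illustration on a synthetic tone, not code from the KES study:

import numpy as np
import scipy.signal
import librosa

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 220 * t).astype(np.float32)   # synthetic 220 Hz tone as a stand-in signal

a = librosa.lpc(y, order=12)                          # a[0] == 1, followed by 12 predictor coefficients
b = np.hstack([[0.0], -a[1:]])                        # predictor: y_hat[n] = -sum_k a[k] * y[n - k]
y_hat = scipy.signal.lfilter(b, [1.0], y)             # prediction from the previous 12 samples
print("mean squared prediction error:", float(np.mean((y - y_hat) ** 2)))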
724 International Journal of Speech Technology (2023) 26:721–733

Table 1  Table shows the Previous Works and comparisons. In the above table H->Happiness, A-> Angry,S->Sadness, N->Neutral,F->Fear,
D-> Disgust,SARC-> Sarcastic, SUR-> Surprise,AIR-> All India Radio, M->Male,F-> Female,Desc->Description
Previous Work Database1 Size Desc Emotions Evolution Features

Geethashree and Kannada speech 660 sentences 4-Speakers 28y(M), H,S,A,F Acted
Ravi (2018) corpus 23Y(F),9Y(Boy,girl)
Deshmukh et al. Hindi, Marathi A,H,S
(2019)
Rajisha et al. Malayalam lan- 18–50Years(M,F) 20 Sentences each in 4 N,S,H,A Acted
(2016) guage emotions
(Koolagudi et al., Telugu language 25–40yrs 12,000 Utterances A,C,D,F,H,N,SC AIR, Acted Prosodic Study
2009)
Syed et al. (2020) Urdu–Sindhi 734-Urdu, 701- 1435 Audio recordings A,D,H,N,SARC,SAD,SUR Prosody
Sindhi

using ANN and SVM to compare classifier outputs. The the accuracy and efficiency of emotion recognition in both
dataset investigations show that speech emotion recogni- Urdu and Sindhi.
tion with an ANN classifier has a recognition accuracy of
88.4% higher than SVM, which has a recognition accuracy
of 78.2%. Anger, Compassion, Disgust, Fear, Happy, Neu- 3 Working model
tral, Sarcasm, and Surprise are the fundamental emotions
addressed in the development of the IITKGP-SESC (Kool- The section consists of Working Model 3, Speech Corpus
agudi et al., 2009) in Telugu (a South Indian language). 3.1, Feature Extraction 3.2, and Deep Learning Structure
The previous works and their comparisons are shown in 3.5 as sub-sections. The emotion classification model pre-
Table 1. Prosodic metrics from the produced speech cor- sented here is built on a detailed learning approach focused
pus were used to investigate the basic emotions. Using on CNN and dense layers. (LeCun & Bengio, 1995). The
basic statistical parametric models to identify emotions, main assumption is that the Mel frequency cepstral coeffi-
the importance of prosodic properties for emotion discrim- cients (MFCC), chroma, and mel function to train the model,
ination was proven. The proposed statistical models may which is Efficient, and seamless and standardized operations
be incapable of capturing the complex nonlinear correla- are then required in order to minimize the noises. MFCC is
tions present in prosodic data derived from longer speech another Mel-Frequency Cepstrum (MFC) classification and
segments. As a result, nonlinear models may be researched advanced audio formalization of automatic speech recogni-
further to improve recognition performance. Subjective tion was presented. The ability to demonstrate the sound
listening tests are used to evaluate the quality of the emo- wave amplitude of a discrete function vector is helpful for
tions contained in the emotional speech corpus that has complex MFC coefficients. The proposed SER approach is
been developed. The suggested emotional speech database presented in this section. The audio transmission is divided
can be utilized to explain feelings by utilizing emotion- into fragments of the same length (2.5 s). The model com-
specific information derived from the vocal tract and exci- prises multiple 1D convolution layers and a dense layer, fol-
tation source. Prosodic Features like duration, Average lowed by a completely linked layer and a softmax layer. The
pitch, Std Dev, and average energy are considered for 15 specifications of the proposed model are discussed below.
Telugu sentences. Another work proposed by Syed et al. DFT is applied on derived frames along with the Mel scale
(2020) the Urdu–Sindhi Speech Emotion Corpus in their amplitude spectrum which is trimmed and normalized. The
study, is a new dataset that can be used to train machine deep neural network used is seen in Fig. 3. 40 MFCC fea-
learning methods for speech emotion recognition in two tures were extracted for each audio recording which is con-
low-resource languages. We’ve made the dataset available verted single precision time sequence and later transposed
for academic research on the Zenodo framework. They by obtaining means. In the article by Davis S. et al. Davis
also conducted experiments to generate and identify the and Mermelstein (1980) and Huang et al. (2001) the cal-
potential overall performance of UAR utilizing selected culations of the MFCC are detailed. Therefore we have a
features first from the Open-Smile framework - a library scale le − 3 training set 1 x 40 x 1 with the 1D CNN method
used by work on improving development capacities clas- activation. dropout 0.2, 2x2 max-pooling function, dense
sification techniques boundaries. Their research shows that layer-two fully connected networks, and a softmax layer
multilingual logistic algorithms work on the Comparison resampled form-4224 to 8 elements. Relu- rectified linear
set of features and perform much better in order to increase unit helps the model to achieve a significant benefit by using

13
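The 40-coefficient MFCC representation described above can be reproduced with librosa. The sketch below is a minimal illustration under the stated settings (40 MFCCs, 2.5 s clips, mean over time), not the authors' exact code; the file path is a placeholder:

import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40, duration=2.5, sr=22050):
    """Load a fixed-length clip and return a 40-dimensional MFCC feature vector."""
    y, sr = librosa.load(path, sr=sr, duration=duration)      # mono, resampled
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (40, frames)
    return np.mean(mfcc.T, axis=0)                            # mean over time -> (40,)

# Example: reshape to the 1 x 40 x 1 input expected by the 1D CNN
features = extract_mfcc("example.wav")                        # hypothetical file
x = features.reshape(1, 40, 1)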
In this case, max-pooling helps the model focus on the main features of each section of the input and boost them by category. The entire process was compiled numerous times by changing the kernel values. The structure of the test set is 432 x 40, with no cross-validation. Each class of emotion (neutral, calm, happy, sad, angry, fearful, disgusted, and surprised) is encoded in the range 0–7. The kernel in a 1D CNN shifts in one direction, and 1D CNN input and output data are two-dimensional; it is used mostly for time-series data. As a versatile measure of evaluation, we used the F1 score for classification model quality and as a benchmark to compare our results with the state of the art. The CNN was evaluated using the sparse categorical cross-entropy loss (Goodfellow et al., 2016) and an optimizer that implements the RMSprop algorithm, trained for 1000 epochs with a batch size of 32. We validate the model against the accuracy of the deep learning architectures during the course of our training.

3.1 Speech corpus

Speech enhancement tries to improve the audio signal's integrity by decreasing ambient noise. The cleanliness, coherence, and comfort of a voice signal are used to determine its quality. Voice enhancement is a step before speech recognition, synthesis, interpretation, and encoding in speech perception. Short-duration sounds, such as impulsive noise, can degrade speech signals in communication devices. Such interruptions are particularly irritating to receivers and must be eliminated in order to enhance the quality and legibility of voice signals. The majority of utterance processing techniques assume that distortion resembles a Gaussian distribution and is additive in nature; non-Gaussian distributions, on the other hand, are characteristic of impulsive noise. The existence of such noise sources significantly decreases the efficiency of speech recognition systems.

3.1.1 Self-developed Telugu dataset

For our experiments, we created a Telugu language database (Telugu is among the most widely spoken native Indian languages and a well-known southern Indian language) as a primary requirement, in order to ensure that quality is managed as an important component for correct emotion recognition. The following are the standard aspects considered when developing such a resource: (1) there are almost no standard local Telugu databases; (2) within the specific Telugu-speaking areas the peculiarities are largely the same, so to cover different vocal patterns we developed a dataset using post-graduate and primary learners. Five men and three women voluntarily participated, ranging in age from seventeen to twenty-one years. As is customary, the reason for choosing this age group (statistics taken from the WHO depression fact sheet, https://www.who.int/news-room/fact-sheets/detail/depression) is that contemporary studies show that these demographic groups are dwindling (Global Health Data Exchange, 2019). All audio speakers are well trained and capable of communicating 15 phrases for each of 5 different emotions while taking the recording conditions into account.

3.1.2 RAVDESS

There are 7356 files in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone & Russo, 2018) (total size: 24.8 GB). Two lexically matched statements are vocalized in a neutral North American accent by 24 professional actors (12 female, 12 male) in the database. Calm, happy, sad, angry, fearful, surprised, and disgusted expressions can be found in speech, whereas calm, happy, sad, angry, and fearful emotions can be found in song. Each expression has two emotional intensity levels (normal and strong), as well as a neutral expression. All three modalities are available: audio-only (16-bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz, .mp4), and video-only (720p H.264, AAC 48 kHz, .mp4, no sound). Actor 18 does not have any song files. Video files (audio-visual and video-only) are provided as separate zip downloads for each actor (01-24, 500 MB each) and are split into separate speech and song downloads. In total, the downloadable RAVDESS collection, available at https://smartlaboratory.org/ravdess/, includes 7356 files (2880 + 2024 + 1440 + 1012 files).

File naming convention. Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: Modality (01 = full-AV, 02 = video-only, 03 = audio-only); Vocal channel (01 = speech, 02 = song); Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised); Emotional intensity (01 = normal, 02 = strong; there is no strong intensity for the 'neutral' emotion); Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"); Repetition (01 = 1st repetition, 02 = 2nd repetition); Actor (01 to 24; odd-numbered actors are male, even-numbered actors are female).
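The 7-part naming convention above maps each file directly to labels. The following sketch parses a RAVDESS filename into its emotion label and actor metadata; it is an illustration of the stated convention, not code from the paper:

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(filename):
    """Split a 7-part RAVDESS identifier such as 03-01-06-01-02-01-12.wav."""
    parts = filename.split(".")[0].split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "actor": int(actor),
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
# {'emotion': 'fearful', 'intensity': 'normal', 'actor': 12, 'gender': 'female'}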
Fig. 1  Wave plot of a male actor's neutral emotion

Fig. 2  Predictions on the EMO-DB and Telugu datasets: (a) clean spectrogram image of a male actor's anxiety/fear emotion; (b) predicted emotional classes (F = Fear, A = Angry, N = Neutral) and accuracy for the EMO-DB dataset; (c) predicted accuracy of emotional classes for the self-developed Telugu dataset

3.2 Feature extraction

All available senses can sense the emotional state of a communication partner. This emotional identification comes naturally to humans, but it is a very tough task for computers; while they can quickly process knowledge-based information, it is difficult for them to reach the depth behind the content, and that is what speech emotion recognition (SER) is all about. It is an audio classification system that assigns different audio speech files to emotions like dissatisfaction, sadness, anger, and neutrality. Such emotion recognition can be applied in fields such as healthcare or customer service centers. Using librosa (a Python library) we loaded the dataset; a sample recording of a male actor's neutral tone is shown in Fig. 1. Librosa is a Python audio and music analysis library with a simple structure that consolidates interfaces and names and comes with readable code. This procedure is done to foreground the frequencies more accurately, because the SER system will interpret an essential reconstruction of the wave.

Wave plots record the waveform of the signals, and seeing the overall shape of the emotions helps determine which type of extraction to use (MFCC, STFT, log-Mel spectrograms, spectral centroid, etc.). Feature extraction is important for modeling, since audio files must be converted into a model-comprehensible format. After careful observation of the wave plots, we opted to use MFCC as the extraction tool for each emotion. The MFCC and log-mel spectrogram can be viewed with 'specshow', as shown in Fig. 2a. The librosa.feature module sets up the process for extracting the MFCC, chroma, and mel features: a function takes four parameters (the file name and three Boolean flags for the three feature types) and collects the extracted features into a data frame. The data is split 70:30 for training and testing.
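A minimal sketch of the feature-extraction routine just described, with the file name and three Boolean flags as parameters; the helper name, the pandas packaging, and the example file list are assumptions, not the authors' code:

import librosa
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def extract_features(file_name, mfcc=True, chroma=True, mel=True):
    """Return a single feature vector built from the selected feature types."""
    y, sr = librosa.load(file_name)
    stft = np.abs(librosa.stft(y))
    feats = []
    if mfcc:
        feats.append(np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0))
    if chroma:
        feats.append(np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0))
    if mel:
        feats.append(np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0))
    return np.hstack(feats)

# Hypothetical file list and labels; pack features into a data frame and split 70:30
files, labels = ["a.wav", "b.wav"], ["angry", "sad"]
df = pd.DataFrame([extract_features(f) for f in files])
df["label"] = labels
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="label"), df["label"], test_size=0.3, random_state=42)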
Fig. 3  Description of the classifier for the proposed architecture

A collection of NumPy arrays is used to store both the input and output data. Data normalization is done, followed by one-hot encoding. The extracted features are trained with the proposed model (Fig. 3) by creating a checkpoint that best fits the model.

3.3 Spectrogram generation

The STFT of a windowed audio or spoken stream produces a spectrogram. The audio was sampled at a rate of 22050 Hz. Each audio frame is windowed using a 2048-sample Hamming window; on the windowed audio samples, we chose 2048-length FFT windows and 512 as the hop length for the STFT. To obtain the mel-spectrogram, the magnitude spectrogram is mapped to the mel scale. The mel-frequency scale emphasizes low frequencies over high frequencies, analogous to the perceptive capability of the human ear. We computed the mel-spectrogram using the "librosa" Python package and the above-mentioned parameters. We cropped the extended sound sentences to a length that spans the 75th percentile of all audio signal records in the system, assuming that the smoother transitions carrying the emotional information would be present throughout the interaction and thus would not be compromised by trimming the end of long recordings. As a result, the longest duration taken into account is 5 s.
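The spectrogram settings above (22050 Hz sampling, 2048-sample Hamming window, 2048-point FFT, hop length 512, clips capped at 5 s) can be reproduced with librosa; this sketch only illustrates those parameters and is not the authors' script (the input file is a placeholder):

import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=22050)      # hypothetical input file
y = y[: 5 * sr]                                    # keep at most 5 s, as in the text

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, window="hamming")
mel_db = librosa.power_to_db(mel, ref=np.max)      # log-mel spectrogram in dB
print(mel_db.shape)                                # (n_mels, frames)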
3.4 Noise removal

Our approach is focused on the narrow-harmonics nature of the spoken statement. We assumed that the vocal portions of the utterance held the majority of the emotional cues. This is an approximation, but as illustrated below, it is a valuable notion. To begin, we analyze the overall pitch frequencies of an utterance in every frame using a typical standard pitch locator in Praat (Boersma, 2011). After that, we construct a customized log-spectrum estimation for every audible frame that statistically resembles the original speaker. The sinusoidal components of the Fourier series will all be 0 because the auto-correlation function has even symmetry, so Equation (1) reduces to only legitimate cosine factors:

S(f) = (1/T) ∫₀ᵀ r_auto(t) cos(2πmt/F₀) dt        (1)

where r_auto(t) represents the auto-correlation function, S(f) is the modified log-spectrum, F₀ is the pitch frequency, and m = 0, 1, 2, ..., N.

A fast Fourier transform (FFT) is applied to the variation of a particular signal to compute its frequency spectrum, a mechanism often employed by parameter optimization tools. The outcome is shown as a frequency distribution, which is a representation of signal power against frequency. The relative magnitudes of the frequency components that make up a signal are represented by the signal's power spectrum. To determine the power spectra, all of the data collected should reflect enough source excitation. The signal should preferably be subjected to some form of stochastic stimulation in order to obtain suitable data (for example, by applying a pseudo-random binary sequence to the plant at some appropriate point). The power spectrum is useful for determining the amount of noise in a signal and for determining the best sampling rates.

3.4.1 Speech recognition system

According to Ekman and Keltner (1997), emotions are distinct, measurable, and biologically grounded. Based on his observations, Ekman classifies a small set of emotions as fundamental: fury, contempt, fear, happiness, sadness, and surprise. Emotions are tough to comprehend; as a working hypothesis, they are a mental state that generates physical and psychological changes that influence our behavior. Emotion physiology is inextricably linked to nervous system arousal, with varying levels and intensities of arousal correlating to various emotions. Emotion is also linked to a person's predilection for specific behaviors: extroverted people are more likely to be social and express their emotions, whereas introverts are more likely to be socially aloof and hide their emotions. These feelings can be communicated to others through speech tone, demeanor, and attitude. Each emotion has its own tone, which is classified as chest, mouth, or head. An example wave plot of a male neutral speech signal is shown in Fig. 1, and the corresponding spectrogram in Fig. 2a.
We present a framework for recognizing emotion from speech signals for this purpose. The entire speech emotion recognition pipeline operates as follows:

(1) Input: benchmark datasets are employed to increase the current framework's accuracy.
(2) Pre-processing: a time-series signal is used. The minimum average is calculated initially; this technique reduces the mean of the voice signal to 0, making it easier to process. In reality, the range of utterance frequencies varies among individuals, so the various speech signals must be adjusted before processing. A high-pass finite impulse response (FIR) filter performs the pre-emphasis for smoothing the speech spectra. Then the amplitude of the signal is normalized to the range -1 to 1.

3.5 Deep learning structure

We begin by importing all necessary modules, such as NumPy and matplotlib in Python; TensorFlow is also imported. TensorFlow is, in layman's terms, a deep learning framework for representing data: data can be represented as a 1D, 2D, or 3D array, and in machine learning and deep learning terms it is a multi-dimensional feature vector. By using the one_hot=True encoding argument, all categorical class labels are converted to binary vectors. The input dimensions are calculated computationally, and one important thing to note is that the inputs should be scaled to the range 0 to 1. As shown in Fig. 3, for each audio file provided as input the network works on a 40-element feature vector, which represents the compressed computational form of a 2 ms frame length of the audio signal. TensorFlow expects a specific input shape for its deep learning models, in this case a CNN model. We have used three convolutional layers (the first with 64 filters and the second and third with 128 filters, each with a 3 x 3 kernel) and two max-pooling layers of size 2 x 2 each. Usually, the learning rate is 1e-3 and is kept at this normal setting; based on it, the weights are updated by reducing the cost function toward the optimal solution. If the learning rate is too low, training will move very slowly because only small updates are made to the weights of the network, and the loss function may exhibit undesirable divergent behavior if it is too high. The training samples are distributed into batches with a rescaling factor, and each batch trains on a fixed number of inputs. One constraint is that we did not use feature selection to reduce the dimensionality of our augmented CNN, which could enhance training effectiveness. Another limitation was the use of limited data: the RAVDESS dataset has only 1440 speech files, which could explain why the data was over-fitted. It is possible that more datasets could be used.
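A minimal Keras sketch of a 1D CNN consistent with the hyper-parameters stated in this section and in Section 3 (40-element MFCC input, dropout 0.2, max pooling, softmax over 8 classes, RMSprop at 1e-3, sparse categorical cross-entropy). It is assembled from those stated values and is not the authors' released model; the kernel size of 3 and the dense-layer width of 128 are assumptions for the 1D case:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_features=40, n_classes=8):
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # assumed width of the dense layer
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
# model.fit(X_train, y_train, epochs=1000, batch_size=32, validation_data=(X_test, y_test))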
4 Experimental setup

The methodology and results of this research analysis focused on the model's ability to generate precise findings in the context of compressed speech over noisy environments. The analysis involved the development of a framework, testing different classification models, and evaluating their performance. The primary goal of the analysis and framework development was to determine whether the model's findings are precise enough to provide valuable insights for future research in the area of compressed speech over noisy environments. The first technique we employed was creating a decision tree (DT) and a random forest (RF) classifier with more than a hundred trees; these models were evaluated using the scikit-learn Python library. This evaluation involved measuring the accuracy, precision, recall, F1 score, and other relevant metrics to assess the models' performance. The consistency of identification in two separate datasets is highlighted in this section, suggesting that the model's performance remains consistent across different data sources. When determining the size of the training set, the composition and dimensions of the training samples were taken into account; that is, an appropriate proportion of data was selected for training the model while ensuring that the samples are representative of the overall dataset. The tests conducted showed that using six layers of the FLB (presumably a feature-learning block) is sufficient for handling a large database, indicating that deeper layers might not necessarily improve performance. However, in the case of smaller databases like RAVDESS and EMO-DB, the risk of overfitting emerged when using more than three layers and increasing model complexity. This implies that a more complex model might start to fit noise in the smaller dataset, leading to worse generalization.
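The decision-tree and random-forest baselines mentioned above can be set up with scikit-learn as follows. This is a sketch under the stated setup (40-feature vectors, "more than a hundred trees" taken as 100 here), not the authors' script; X_train, X_test, y_train, y_test stand for the splits produced earlier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# X_train, X_test, y_train, y_test: 40-feature vectors and emotion labels (assumed inputs)
baselines = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name, "weighted F1:", round(f1_score(y_test, y_pred, average="weighted"), 2))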
4.1 Pre-processing

Acknowledging the challenge of using deep learning models with limited data, we aimed to overcome this limitation. The dataset at hand lacks an extensive number of features, prompting the use of techniques to enrich it. To augment the dataset, we employed the OpenSMILE library. This tool enabled the extraction of new audio features from video data available in the dataset. Notably, the audio sampling rate of the video files (44.1 kHz) differs slightly from that of the original audio files (48 kHz). This introduced some noise, which we considered necessary to increase the training dimension and enhance the dataset. The dataset is then dynamically split into two subsets, a training set and a test set: the training set constituted 75% of the original data, while the remaining 25% was reserved for testing. We utilized the librosa library to extract almost 3000 MFCC vectors for the training set, each of which had 40 features. The specifics of these attributes are quite important for the model's capacity to extract pertinent patterns and information from the audio data. The test set was structured as a matrix with 432 rows and 40 columns, indicating that it contained 432 instances, each represented by a vector of 40 features. The evaluation metric was the F1 score. This measure achieves a compromise between recall and precision, and it is especially helpful when working with datasets that are unbalanced or where both false positives and false negatives have significant effects. Across all experiments, we consistently used the same training setup: 75% of the data was utilized for training, while the remaining 25% was used for testing; during training, we conducted a maximum of 1000 epochs with a batch size of 32; and the transfer-learning rate was set at 0.0001. The method exhibits a carefully considered technique for dealing with the problem of sparse data and noise. The meticulous consideration of dataset splitting, feature extraction, data augmentation, and training settings demonstrates an all-encompassing approach to experimental design, and reliable assessment and repeatability are facilitated by the use of F1 scores and consistent training setups. With the help of this technique, users may conduct tests and assess how well the deep learning models are working.

5 Experimental results

The proposed work uses a convolutional neural network (CNN) for emotion recognition. This CNN-based approach demonstrates consistent separation margins, efficient memory usage, and effectiveness in dealing with situations where there is a high degree of sparsity in the samples. Initially, the speech file is supplied as pre-processing input to the AER system until all non-speech sounds are manually deleted and a clean speech signal is obtained; the input is then pre-processed in various codec environments. We used wav files and compressed speech codecs such as AMR-wideband, AMR-narrowband, and the mp3 codec in our experiment. When used for sub-frames separated from AMR-WB, samples representing both compressed and uncompressed speech are normalized at 16 kHz across different sampling rates.

Fig. 4  Accuracy of the proposed model

5.1 Performance measure

The proposed CNN model's prediction performance was measured using the self-developed Telugu dataset and the RAVDESS dataset to demonstrate the model's effectiveness. On the Telugu dataset, Table 2 demonstrates the model's overall per-class predictive performance; the corresponding results on the RAVDESS dataset are shown in Table 3. The performance evaluation of the deep convolutional neural model shows good predictive performance, demonstrating the importance and resilience of the developed framework. Figure 6 illustrates the cumulative effect of the five emotional categories on sensitivity for the Telugu dataset. The model's prediction accuracy for anger and neutrality was good, while happiness and sadness were marginally lower; however, the overall performance of the system (81.75%) is satisfactory for the Telugu dataset. Similarly, Table 3 shows the RAVDESS dataset's prediction performance for eight (8) classes. Fearful and disgusted are sometimes intermingled with calm and angry emotions in RAVDESS, but the cumulative accuracy rate (79.5%) is adequate for the RAVDESS dataset. The tables show the training and testing accuracy of the proposed CNN models for both datasets.
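The per-class precision, recall, and F1 values of the kind reported in Tables 2 and 3, together with the confusion counts behind Fig. 6, can be produced with scikit-learn from the trained model's test-set predictions. The names model, X_test, and y_test below are placeholders for the objects built earlier; this is an evaluation sketch, not the authors' code:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# y_test: integer emotion labels (0-7); X_test: test features shaped (n, 40, 1)
y_prob = model.predict(X_test)                            # softmax outputs from the trained CNN
y_pred = np.argmax(y_prob, axis=1)

print(classification_report(y_test, y_pred, digits=3))    # per-class precision/recall/F1, macro and weighted averages
print(confusion_matrix(y_test, y_pred))                   # counts of true vs. predicted classes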
5.2 Discussion

This section discusses the accuracy of recognizing emotions on the different datasets. The size and shape of the training set are determined by the network depth to avoid overfitting. The model was evaluated based on its efficiency compared to the benchmark dataset and its performance on the RAVDESS dataset. Figure 4 illustrates the accuracy of the model, which was 54.39%, with a loss of 2.15 shown in Fig. 5. Several studies have been conducted on the RAVDESS dataset to improve the accuracy of speech emotion recognition using hybrid transformer models, CapsuleNets, convolutional neural networks, and convolutional attention-based Bi-GRUs; the accuracy achieved by these models ranges from 56% to 87.15% and depends on the specific emotion being recognized.
Table 2  Results of the proposed model on the Telugu dataset test set, per class

Emotion | Precision | Recall | F1-score | Support
Angry | 0.66 | 0.66 | 0.66 | 61
Calm | 0.65 | 0.64 | 0.64 | 58
Disgust | 0.57 | 0.67 | 0.61 | 57
Fear | 0.56 | 0.50 | 0.53 | 58
Happy | 0.43 | 0.46 | 0.44 | 57
Neutral | 0.41 | 0.67 | 0.51 | 24
Sad | 0.48 | 0.40 | 0.43 | 58
Surprise | 0.54 | 0.44 | 0.49 | 59
Accuracy | | | 0.54 | 432
Macro avg | 0.54 | 0.55 | 0.54 | 432
Weighted avg | 0.55 | 0.54 | 0.54 | 432

Table 3  Emotion recognition accuracy using RAVDESS

Emotion | Precision | Recall | F1-score | Support
Angry | 0.984 | 1.000 | 0.992 | 129
Calm | 0.951 | 0.975 | 0.963 | 78
Disgust | 0.984 | 1.000 | 0.992 | 57
Fear | 0.918 | 0.978 | 0.943 | 57
Happy | 0.943 | 0.930 | 0.936 | 66
Neutral | 0.951 | 0.951 | 0.951 | 78
Sad | 0.918 | 0.978 | 0.943 | 45
Surprise | 0.841 | 0.789 | 0.912 | 86
Accuracy | | | 0.954 | 508
Macro avg | 0.954 | 0.955 | 0.954 | 508
Weighted avg | 0.956 | 0.955 | 0.948 | 508

Fig. 5  Cost function of the deep learning model

Fig. 6  Confusion matrix for the Telugu dataset

Regarding our dataset, it can also be utilized for abnormal movement detection. Although we have benchmark event names for the abnormal videos recorded during data collection, these event names are not used in our abnormality detection technique. For motion recognition, we have used 30 videos from each event, splitting them into a training and testing ratio of 75/25. The results obtained for each class of emotion over the Telugu dataset, as shown in Table 2, indicate well-balanced accuracy. The F1 values for almost all classes are above 0.50, demonstrating the effectiveness of the model in classifying emotions into eight distinct classes. It is worth noting that a few classes show lower accuracy, which aligns with the findings described in the referenced article by Polignano et al. (2017); these classes are known to be challenging to recognize not only through speech but also by analyzing facial features and text. The narrow variance of the F1 results highlights the consistency of the model in emotion classification.

To measure the efficiency of the proposed model, we chose to compare it against the findings of two baselines, decision tree (DT) and random forest (RF), and against Iqbal and Barua (2019), the model of Zhang et al. (2015), and that of De Pinto et al. (2020). Table 3 shows the recognition accuracy on the RAVDESS dataset.
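As a quick sanity check, the macro and weighted averages reported in Table 2 can be recomputed from its per-class F1 scores and supports; the snippet below simply reproduces that arithmetic:

import numpy as np

# Per-class F1 scores and supports from Table 2 (Telugu test set)
f1 = np.array([0.66, 0.64, 0.61, 0.53, 0.44, 0.51, 0.43, 0.49])
support = np.array([61, 58, 57, 58, 57, 24, 58, 59])

macro_f1 = f1.mean()
weighted_f1 = np.average(f1, weights=support)
print(round(macro_f1, 2), round(weighted_f1, 2))   # ~0.54 and ~0.54, matching Table 2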
Figure 2a shows the clean spectrogram image of the male actor for the anxiety/fear emotion. Figure 2b shows the predicted emotional classes for the EMO-DB dataset as a pie chart, with Fear at 30%, Angry at 25%, Neutral at 7%, and other emotions making up the remainder; Fig. 2c shows the predicted accuracy of emotional classes for the self-developed Telugu dataset, also as a pie chart, with Sadness at 33%, Happy at 28%, Fear at 19%, Disgust at 17%, and other emotions also predicted. The proposed technique utilizes fusion features to improve the accuracy of the convolutional neural network (CNN) in detecting emotions in real time. LBP, ORB, and CNN features were used to fuse the input data, resulting in a higher accuracy rate of 98.13% on the CK+ dataset, outperforming previous methods. The CNN model was trained on a composite dataset and achieved exceptional training and validation accuracy rates. The proposed model exhibits superior performance compared to other techniques in recognizing human emotions. To improve emotion detection further, future research can investigate the incorporation of new features or methods with CNN and consider the effect of human variables such as age and gender. Moreover, an increase in the number of classes can lead to a decrease in accuracy, but the proposed method in this study achieves an F1 score comparable to the two tasks presented, as demonstrated in Figs. 4 and 5. Figure 3 corresponds to the true and predicted labels, respectively, and highlights ambiguity between the Angry and Happy classes when arousal and energy levels are high. The proposed model achieved 98% prediction accuracy on the EMO-DB dataset and 53% on the RAVDESS dataset, with 100% prediction accuracy for Anger and Neutral.

There are 535 instances within the EMO-DB dataset. To minimize the error, we use k-fold cross-validation, and the findings are displayed in the figure: cross-validation has an overall precision of 87.3%. Prior EMO-DB results showed that the average precision was 83.3%. Combined characteristics from MPEG-7 features, MFCCs, and timbre were used in [24]; with an SVM classifier, the overall result achieved was 83.39%. In [25], 1800 multidimensional characteristics were collected and the average efficiency for identifying six classes, excluding the disgust emotion, was 73.3%. The six emotions considered in the RAVDESS database are furious, calm, afraid, joyful, neutral, and sad. We use the same procedures as for EMO-DB and have 92 samples for each emotion. Tables 2 and 3 display the results. On RAVDESS, Anger is 98.4% accurate, Neutral 95.1%, Happy 94.3%, Fear 91.4%, and Calm 95.1%; on the Telugu dataset the model reaches 66% for anger, 65% for disgust, 56% for fear, and more than 45% for the other emotional classes.

We must mention that all resulting classifiers were trained using the RAVDESS dataset, which is one of the proposed approach's weaknesses. Only North American voices are included in this database, which might mean that performance deteriorates for persons of other ethnicities. Moreover, because most of the individual performers in the database are between the ages of 20 and 50, aging could cause inaccurate projections. As a result, when such simulations are used for toddlers or elderly persons, precision may be reduced due to anatomic differences in higher or lower age groups.

6 Conclusion

The study focuses on designing a CNN-based framework that leverages feature extraction mechanisms, adaptive thresholding, and spectrogram transformations to enhance the accuracy and efficiency of SER models. This indicates a comprehensive approach to addressing the challenges in emotion recognition from speech, which include dealing with noise, computational demands, and the complex nature of emotional cues in audio signals. Instead of employing spectrograms to learn the most salient and discriminative features in a convolutional layer, we employed strided CNN frameworks for SER, using a specific stride setting to flatten the extracted features. To enhance the precision of the convolutional neural network (CNN) used for detecting emotions in real time, various methods can be used. Fusion features were employed in the proposed technique, and the input data were fused using LBP (local binary patterns), ORB (oriented FAST and rotated BRIEF), and CNN features, which showed better accuracy (98.13%) than prior approaches on the CK+ dataset. In addition, the CNN model was trained on a composite dataset and attained a training accuracy of 95% and a validation accuracy of 91%, which rose to 95% after multiple epochs. The proposed model significantly outperforms other techniques for recognizing human emotions. Future studies can focus on examining new features or methods to be fused with CNN and investigating the influence of human variables such as age and gender on the efficiency of emotion detection.

7 Future scope

It is also worth noting that the efficiency of emotion detection through the CNN model can vary based on human variables such as age and gender. Further studies can aim to explore the impact of these factors on the performance and accuracy of the model. Additionally, new features or methods can be investigated in the future to enhance the precision and robustness of the emotion detection process in real time. Furthermore, it is important to consider other factors that may affect the efficiency of the CNN model, such as cultural differences and psychological traits. These variables can significantly influence the way individuals express and perceive emotions, potentially impacting the accuracy of the emotion detection process. Therefore, future research can focus on understanding and incorporating these elements into the model to improve its overall performance.
Additionally, exploring different datasets that represent diverse populations can also contribute to ensuring the generalizability and reliability of the CNN model in real-world scenarios. By continually refining and updating the model, we can unlock its full potential in accurately detecting and analyzing emotions in various contexts and applications. It's a challenging task that involves a blend of user experience design, machine learning skills, and ethical concerns. The resultant system may be used in a variety of industries, such as communication technology, sentiment analysis, and human-computer interaction.

Acknowledgements  We thank all the volunteers who helped us in making the Telugu database. Presently the database is under review with the committee for endorsement and will be publicly available. The RAVDESS dataset is available at https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio.

Declarations

Conflict of interest  The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.

References

Al-Ali, A. K. H., Dean, D., Senadji, B., Chandran, V., & Naik, G. R. (2017). Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions. IEEE Access, 5, 15400–15413.
Bediou, B., Krolak-Salmon, P., Saoud, M., Henaff, M.-A., Burt, M., Dalery, J., & D'Amato, T. (2005). Facial expression and sex recognition in schizophrenia and depression. The Canadian Journal of Psychiatry, 50(9), 525–533.
Boersma, P. (2011). Praat: Doing phonetics by computer [computer program]. http://www.praat.org/
Chen, M., Hao, Y., Hwang, K., Wang, L., & Wang, L. (2017). Disease prediction by machine learning over big data from healthcare communities. IEEE Access, 5, 8869–8879.
Chen, M., Zhang, Y., Qiu, M., Guizani, N., & Hao, Y. (2018). SPHA: Smart personal health advisor based on deep analytics. IEEE Communications Magazine, 56(3), 164–169.
Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.
Deshmukh, G., Gaonkar, A., Golwalkar, G., & Kulkarni, S. (2019). Speech based emotion recognition using machine learning. In 2019 3rd International conference on computing methodologies and communication (ICCMC) (pp. 812–817). IEEE.
Ekman, P., & Keltner, D. (1997). Universal facial expressions of emotion. In U. Segerstrale & P. Molnar (Eds.), Nonverbal communication: Where nature meets culture (Vol. 27, p. 46). Springer.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.
Gao, Y., Li, B., Wang, N., & Zhu, T. (2017). Speech emotion recognition using local and global features. In International conference on brain informatics (pp. 3–13). Springer.
Geethashree, A., & Ravi, D. (2018). Kannada emotional speech database: Design, development and evaluation. In Proceedings of international conference on cognition and recognition (pp. 135–143). Springer.
Global Health Data Exchange (GHDx), Institute of Health Metrics and Evaluation. (2019). GBD Results Tool | GHDx. ghdx.healthdata.org. http://ghdx.healthdata.org/gbd-results-tool?params=gbd-api-2019-permalink/d780dffbe8a381b25e1416884959e88b
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). MIT Press.
Huang, Z., Dong, M., Mao, Q., & Zhan, Y. (2014). Speech emotion recognition using CNN. In Proceedings of the 22nd ACM international conference on multimedia (pp. 801–804).
Huang, X., Acero, A., Hon, H.-W., & Reddy, R. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR.
Iqbal, A., & Barua, K. (2019). A real-time emotion recognition from speech using gradient boosting. In 2019 International conference on electrical, computer and communication engineering (ECCE) (pp. 1–5). IEEE.
Jannat, R., Tynes, I., Lime, L. L., Adorno, J., & Canavan, S. (2018). Ubiquitous emotion recognition using audio and video data. In Proceedings of the 2018 ACM international joint conference and 2018 International symposium on pervasive and ubiquitous computing and wearable computers (pp. 956–959).
Jin, B., & Liu, G. (2017). Speech emotion recognition based on hyper-prosodic features. In 2017 International conference on computer technology, electronics and communication (ICCTEC) (pp. 82–87). IEEE.
Khaleghi, B., Khamis, A., Karray, F. O., & Razavi, S. N. (2013). Multisensor data fusion: A review of the state-of-the-art. Information Fusion, 14(1), 28–44.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International conference on contemporary computing (pp. 485–492). Springer.
Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signals. In Eighth European conference on speech communication and technology.
LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), 1–35. https://doi.org/10.1371/journal.pone.0196391
Pinto, M. G., Polignano, M., Lops, P., & Semeraro, G. (2020). Emotions understanding model from spoken language using deep neural networks and mel-frequency cepstral coefficients. In 2020 IEEE conference on evolving and adaptive intelligent systems (EAIS) (pp. 1–5). IEEE.
Rajisha, T., Sunija, A., & Riyas, K. (2016). Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM. Procedia Technology, 24, 1097–1104.
Reddy, A. P., & Vijayarajan, V. (2017). Extraction of emotions from speech—a survey. International Journal of Applied Engineering Research, 12(16), 5760–5767.
Schroder, M., Bevacqua, E., Cowie, R., Eyben, F., Gunes, H., Heylen, D., Ter Maat, M., McKeown, G., Pammi, S., Pantic, M., et al. (2011). Building autonomous sensitive artificial listeners. IEEE Transactions on Affective Computing, 3(2), 165–183.
Syed, Z. S., Memon, S. A., Shah, M. S., & Syed, A. S. (2020). Introducing the Urdu-Sindhi speech emotion corpus: A novel dataset of speech recordings for emotion recognition for two low-resource languages. International Journal of Advanced Computer Science and Applications, 11(4), 1–6.
Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016). Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5200–5204). IEEE.
Vasquez-Correa, J. C., Arias-Vergara, T., Orozco-Arroyave, J. R., Vargas-Bonilla, J. F., & Noeth, E. (2016). Wavelet-based time-frequency representations for automatic recognition of emotions from speech. In Speech communication; 12. ITG symposium (pp. 1–5). VDE.
Wang, S., Soladie, C., & Seguier, R. (2019). OCAE: Organization-controlled autoencoder for unsupervised speech emotion analysis. In 2019 5th International conference on frontiers of signal processing (ICFSP) (pp. 72–76). IEEE.
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., & Vepa, J. (2018). Speech emotion recognition using spectrogram & phoneme embedding. In Interspeech (pp. 3688–3692).
Zhang, Q., An, N., Wang, K., Ren, F., & Li, L. (2013). Speech emotion recognition using combination of features. In 2013 Fourth International conference on intelligent control and information processing (ICICIP) (pp. 523–528). IEEE.
Zhang, B., Essl, G., & Provost, E. M. (2015). Recognizing emotion from singing and speaking using shared models. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 139–145). IEEE.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
