Deep Learning Structure For Emotion Prediction Using MFCC From Native Languages
https://doi.org/10.1007/s10772-023-10047-8
Received: 23 April 2023 / Accepted: 7 September 2023 / Published online: 5 October 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023
Abstract
The role of AI in speech processing has expanded to recognizing and categorizing emotions conveyed through speech. The research
employed audio recordings from different datasets, including the Ryerson Audio-Visual Database of Emotional Speech and
Song (RAVDESS), Berlin emotional data, and a self-developed Telugu dataset. The main contribution focused on using deep
neural network-based models to categorize emotional reactions elicited by spoken monologues in various situations. The goal
is to recognize eight distinct emotions: neutral, calm, happy, sad, angry, fearful, disgusted, and surprised. The evaluation of
the model’s performance was done using the F1 score, which is a measure that combines precision and recall. The model
achieved a weighted average F1 score of 0.91 on the test set and performed well in the "Angry" class with a score of 0.95.
However, the model’s performance in the "Sad" class was not as high, achieving a score of 0.87, which is still better than
the state-of-the-art results. The contribution is an effective model for recognizing emotional reactions conveyed through spoken language, utilizing neural networks and a combination of datasets to improve the understanding of emotions in speech.
and employing machine learning algorithms for emotion recognition. While this approach (Reddy & Vijayarajan, 2017; Kwon et al., 2003; El Ayadi et al., 2011) had limitations in terms of generalization and feature engineering, advances in deep learning have provided opportunities for more automated and effective emotion recognition by learning directly from raw data. However, challenges related to emotionally irrelevant variables, data quality, and ethical considerations still need to be addressed in the development of robust emotion recognition systems.

These characteristics do not rely on the lexical or speaker context. Energy-related general features include pitch, formants, zero-crossing rate (ZCR), intonation, Mel-frequency cepstral coefficients (MFCC), and linear prediction cepstral coefficients (LPCC). Furthermore, several classification methods have been implemented, including the Hidden Markov Model (HMM), support vector machine (SVM), and K-nearest neighbors (KNN). A system with many speech characteristics, including MFCC, MFCC delta, and Auto Correlation Function Coefficients (ACFC), was used in this study to assess emotions. These features and classification methods demonstrate the complexity and multi-dimensional nature of analyzing speech data for emotion recognition. While traditional methods have provided insights into emotion recognition, deep learning approaches have shown substantial promise in automatically learning relevant features from raw data and improving classification performance. These advancements, coupled with large and diverse datasets, have the potential to enhance the accuracy and generalization capabilities of emotion recognition systems.

The findings and approaches mentioned here reflect a range of strategies employed in speech emotion recognition and related fields. These include the use of GMMs, unsupervised approaches with PCA, innovative feature extraction techniques like DWT-based MFCC and hyperprosodic features, and the application of deep networks for classification. These methods collectively advance emotion recognition and speech analysis by addressing challenges related to noisy environments, varying conditions, and the nuanced nature of emotional expression in speech. By combining techniques from different domains, researchers aim to improve the accuracy and robustness of systems that can automatically detect and understand human emotions from speech data. The experiment in Zhang et al. (2013) yielded an emotional classification accuracy of 74.45% using Gaussian Mixture Models (GMM). In another work, Wang et al. (2019) proposed an unsupervised approach to modeling multiple speech emotions; it integrates the classifier with Principal Component Analysis (PCA) to obtain a precise interpretation that is preserved invariably in the organization. Al-Ali et al. (2017) evaluated the use of DWT-based MFCC features for forensic speaker authentication against the standard MFCC features. The performance of the feature is evaluated either with or without warping. Experimental findings indicate that the fused DWT-MFCC warping feature and the characteristic MFCC approach have improved performance in most environmental noise, reverberation, and rushing environments. In Jin and Liu (2017), hyperprosodic features were extracted and a deep network was used for classification. Vasquez-Correa et al. (2016) extracted features using Gaussian Mixture Models (GMM) for classification with the Bionic Wavelet Transform. While a number of prior studies on the recognition of audiovisual emotions are included in the literature, they are all of limited reliability. One of the main considerations is how these two signals can be derived and combined (Khaleghi et al., 2013). In certain cases, these hand-crafted features are derived and the two signals have their properties combined with a single weight. This article proposes using a deep network to extract and integrate features. These networks promise to be strongly non-linear. The classification is done using a CNN with TensorFlow. Deep learning is now widely used in numerous applications, including image processing, voice processing, emotion recognition, and speech analysis. The exact nature of the deep model and the availability of massive data vary across the different applications of deep learning (Chen et al., 2017). The contributions of this paper are: (i) the proposed system is trained on RAVDESS (Livingstone & Russo, 2018) and EMO-DB using deep networks in noisy environments for robust emotion recognition; (ii) a 1D Convolutional Neural Network (1D-CNN) has been implemented over compressed speech signals.

A few relevant research challenges in speech emotion recognition are briefly described here. (1) Emotion is a tough and confusing word to define accurately. The term emotion has been interpreted in various contexts by various persons. Emotion is a distinct mental state that arises intuitively rather than through deliberate effort, making it difficult to characterize scientifically. As a result, no universally acknowledged objective definition of the concept of emotion exists. This is the primary factor inhibiting progress with a scientific research approach (Schroder et al., 2011). (2) There are no standardized speech datasets that can be used to test the efficiency of recognition methodologies. The overwhelming majority of automatic emotion systems depend on comprehensive emotions, despite the fact that real-life emotional responses are pervasive and foundational. Many databases are made by professional artists, whereas others are made by moderately experienced or inexperienced people. Because most datasets do not contain a wide range of emotions, emotion recognition research is limited to 5-6 emotions. (3) Speech emotion identification systems are often unable to recognize emotions reliably in both natural and noisy environments. The rest of this paper runs through Related Work in section two, a working model
with a brief introduction to the speech datasets and feature extraction in section three, the experimental setup in section four, and discussions in section five; the last section concludes this paper with future scope.

2 Related work

The main objective is to develop a model for emotional speech recognition using libraries such as librosa, soundfile, and sklearn to construct a CNN (Convolutional Neural Network) model with TensorFlow for recognizing emotion from a speech file. The data are loaded, features are extracted, and the data collection is separated into training and test sets; the model is then built and trained, and finally its precision is measured.
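As a rough illustration of this pipeline (not the authors' released code), the sketch below loads audio files, extracts MFCC features with librosa, splits the data with scikit-learn, and trains a simple classifier; the directory layout and label extraction are hypothetical placeholders.

```python
import glob, os
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def extract_mfcc(path, n_mfcc=40):
    # Load the recording and average 40 MFCCs over time -> one 40-d vector per file
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc.T, axis=0)

# Hypothetical layout: emotion_corpus/<emotion_label>/<file>.wav
paths = glob.glob("emotion_corpus/*/*.wav")
X = np.array([extract_mfcc(p) for p in paths])
y = np.array([os.path.basename(os.path.dirname(p)) for p in paths])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=42)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```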
Many classification methods have been proposed in recent years for emotion recognition from speech. In one method introduced by Pinto et al. (2020), a 1D CNN with dense layers was used on extracted MFCC features; decision trees and a random forest with 1000 trees achieved 75% and 78% average accuracy. The proposed method's experiments were driven through the RAVDESS dataset. A new set of features was extracted with the FFMPEG library and fed to a deep learning model, achieving 91% accuracy. A method introduced by Iqbal and Barua (2019) has been used to classify gender-based distinctions, with an overall accuracy of approximately 40 to 80% depending on the particular task, using gradient boosting, KNN, and SVM; linear classification on the RAVDESS dataset was used in these works. Different datasets have been experimented with using the proposed classifiers: a 100% average was achieved with SVM and K-NN on the RAVDESS dataset for male recordings of the emotion anger, where gradient boosting performed poorly. In the approach of Gao et al. (2017), depth-first search is applied to extract the best feature, and all prosodic and spectral features are extracted, including pitch, ZCR, MFCC, and combinations of features. In this method, EMO-DB and RAVDESS are used to recognize emotions using SVM, and sequential forward floating selection (SFFS) is used for feature selection. Trigeorgis et al. (2016) proposed an end-to-end structural SER pipeline with a Long Short-Term Memory (LSTM) structure on top of convolution operations. Every 6 s segment of the raw signal is passed through an FIR filter to reduce noise and then through multiple 1D convolution layers to extract high-level features for the LSTM. In terms of state-of-the-art algorithms (Huang et al., 2014), the approach has done very well on some datasets. The short-time Fourier transform (STFT) is a well-known way to transform an input signal into image-like data called a spectrogram, and STFT is very common in this field. For high recognition accuracy, Yenigalla et al. (2018) used a 1D fully connected CNN with an input layer tuned with an SVM classifier, where a 15x60 spectrogram image is used. Jannat et al. (2018) and others trained three deep networks separately on pre-processed image data and audio waveforms: one on image data alone, another on audio waveforms alone, and one on combined image and waveform data. Zhang et al. (2015) published one of several models using the RAVDESS data collection but classifying only certain emotions, obtaining a total accuracy better than the model suggested in this paper but less precise. Four consecutive models of emotional recognition have been proposed for speech and song: a straightforward model, the hierarchical single-task model, and the hierarchical multi-task model. The basic model provides a single, domain-independent classification system, and the framework has been used to test both models. With a sampling rate of 44.1 kHz and a mono channel, the KES database (Geethashree & Ravi, 2018) was recorded in an acoustic environment to reduce HNR and SNR. Before recording, all of the speakers were given the sentences and time to prepare. To avoid influencing each other's speaking style, each speaker's speech was recorded separately in separate sessions. Linear prediction coefficients (LPC) and LFCC are the features that were obtained for the purpose of emotion classification, and the Praat tool was used to obtain these spectral coefficients. LPC is a popular speech signal analysis technique; the extraction reduces the error, in a regression sense, to find the coefficients of a forward predictor applied as a function of time. The coefficients of a p-th order model are determined, which predict the current sample from the previous p samples. They concluded from their findings that humans recognize emotions better than classifiers. The K-NN technique is ineffective for emotion classification; as a result, pattern recognition and weight adaptation algorithms are more efficient. Cepstrum coefficients outperform LPC in emotion recognition systems. Fear and grief are the most puzzling emotions, according to the KES database. In another study, Rajisha et al. (2016) used Mel frequency cepstral coefficients (MFCCs), pitch, and short-time energy to perform automated emotion recognition of four different emotions: anger, joy, sadness, and the neutral state. As part of the project, characteristics are extracted and analyzed from a Malayalam language database. The classification was done
Table 1 Previous works and comparisons. Abbreviations: H = Happiness, A = Angry, S = Sadness, N = Neutral, F = Fear, D = Disgust, SARC = Sarcastic, SUR = Surprise, AIR = All India Radio, M = Male, F = Female, Desc = Description

Previous Work | Database | Size | Desc | Emotions | Evolution | Features
Geethashree and Ravi (2018) | Kannada speech corpus | 660 sentences | 4 speakers: 28 y (M), 23 y (F), 9 y (boy, girl) | H, S, A, F | Acted | -
Deshmukh et al. (2019) | Hindi, Marathi | - | - | A, H, S | - | -
Rajisha et al. (2016) | Malayalam language | 20 sentences each in 4 emotions | 18-50 years (M, F) | N, S, H, A | Acted | -
Koolagudi et al. (2009) | Telugu language | 12,000 utterances | 25-40 yrs | A, C, D, F, H, N, SC | AIR, Acted | Prosodic study
Syed et al. (2020) | Urdu-Sindhi | 1435 audio recordings | 734 Urdu, 701 Sindhi | A, D, H, N, SARC, SAD, SUR | - | Prosody
using ANN and SVM to compare classifier outputs. The dataset investigations show that speech emotion recognition with an ANN classifier has a recognition accuracy of 88.4%, higher than SVM, which has a recognition accuracy of 78.2%. Anger, Compassion, Disgust, Fear, Happy, Neutral, Sarcasm, and Surprise are the fundamental emotions addressed in the development of the IITKGP-SESC (Koolagudi et al., 2009) in Telugu (a South Indian language). The previous works and their comparisons are shown in Table 1. Prosodic metrics from the produced speech corpus were used to investigate the basic emotions. Using basic statistical parametric models to identify emotions, the importance of prosodic properties for emotion discrimination was proven. The proposed statistical models may be incapable of capturing the complex nonlinear correlations present in prosodic data derived from longer speech segments; as a result, nonlinear models may be researched further to improve recognition performance. Subjective listening tests are used to evaluate the quality of the emotions contained in the emotional speech corpus that has been developed. The suggested emotional speech database can be utilized to characterize feelings by utilizing emotion-specific information derived from the vocal tract and excitation source. Prosodic features like duration, average pitch, standard deviation, and average energy are considered for 15 Telugu sentences. Another work, proposed by Syed et al. (2020), introduces the Urdu-Sindhi Speech Emotion Corpus, a new dataset that can be used to train machine learning methods for speech emotion recognition in two low-resource languages; the authors have made the dataset available for academic research on the Zenodo framework. They also conducted experiments to estimate the achievable unweighted average recall (UAR) using selected feature sets from the openSMILE framework, a library widely used for feature extraction in classification work. Their research shows that multilingual logistic-regression algorithms working on the ComParE set of features perform much better, increasing the accuracy and efficiency of emotion recognition in both Urdu and Sindhi.

3 Working model

The section consists of the Working Model, Speech Corpus 3.1, Feature Extraction 3.2, and Deep Learning Structure 3.5 as sub-sections. The emotion classification model presented here is built on a deep learning approach based on CNN and dense layers (LeCun & Bengio, 1995). The main assumption is that the Mel frequency cepstral coefficients (MFCC), chroma, and mel features are used to train the model; efficient, seamless, and standardized operations are then required in order to minimize noise. MFCC is derived from the Mel-Frequency Cepstrum (MFC) and was introduced as an advanced audio representation for automatic speech recognition. The ability to represent the sound wave amplitude as a discrete feature vector makes MFC coefficients useful. The proposed SER approach is presented in this section. The audio stream is divided into fragments of the same length (2.5 s). The model comprises multiple 1D convolution layers and a dense layer, followed by a fully connected layer and a softmax layer. The specifications of the proposed model are discussed below. The DFT is applied on the derived frames, and the Mel-scale amplitude spectrum is trimmed and normalized. The deep neural network used is shown in Fig. 3. Forty MFCC features were extracted for each audio recording, converted to a single-precision time sequence, and later reduced by taking means over time. The calculation of the MFCCs is detailed in Davis and Mermelstein (1980) and Huang et al. (2001). We therefore have a training set of 1 x 40 x 1 vectors fed to the 1D CNN, with a learning rate of 1e-3, dropout of 0.2, a 2x2 max-pooling function, two fully connected dense layers, and a softmax layer mapping 4224 elements down to 8. The ReLU (rectified linear unit) helps the model achieve a significant benefit by using
the feature as a good option for representing hidden units in the event of activation. In this case, max-pooling simply helps the model focus on the main features of each section of the input, invariantly boosting them by category. The entire process was compiled numerous times with different kernel values. The structure of the test set is 432 x 40, with no cross-validation. Each emotion class (Neutral, Sad, Calm, Happy, Angry, Fearful, Disgusted, and Surprised) is encoded in the range 0-7. The kernel in a 1D CNN shifts in one direction, and 1D CNN input and output data are two-dimensional; it is used mostly for time-series data. As a versatile evaluation measure, we used the F1 score for classification model quality and as a benchmark to compare our results with the state of the art. The CNN was evaluated using sparse categorical cross-entropy (Goodfellow et al., 2016) and an optimizer implementing the RMSprop algorithm, trained for 1000 epochs with a batch size of 32. We validate the model against the accuracy of the deep learning architectures during the course of our training.
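To make the evaluation measure concrete, the small sketch below computes per-class and weighted-average F1 scores with scikit-learn; the label arrays are placeholders, not the paper's actual predictions.

```python
from sklearn.metrics import f1_score

# Placeholder ground-truth and predicted labels for the 8 emotion classes (encoded 0-7)
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 4, 4, 3, 2]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 4, 3, 3, 2]

print(f1_score(y_true, y_pred, average=None))        # one F1 score per emotion class
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted average, as reported here
```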
3.1 Speech corpus

Speech enhancement tries to improve the audio signal's integrity by decreasing ambient noise. The cleanliness, coherence, and comfort of a voice signal are used to determine its quality. Voice enhancement is a step before speech recognition, synthesis, interpretation, and encoding in speech perception. Short-duration sounds, such as impulsive noise, can degrade speech signals in communication devices. Such interruptions are particularly irritating to listeners and must be eliminated in order to enhance the quality and legibility of voice signals. The majority of utterance processing techniques assume that distortion resembles a Gaussian distribution and is additive in nature; non-Gaussian distributions, on the other hand, are characteristic of impulsive noise. The existence of such noise sources significantly decreases the efficiency of speech recognition systems.

3.1.1 Self-developed Telugu dataset

For our experiments, we created a Telugu language database (Telugu is among the most widely spoken native Indian languages and a well-known southern Indian language) as a primary requirement, in order to ensure that quality, an important component, is properly managed for correct emotion acceptance. The following are the standard aspects considered when developing such a resource: (1) there are almost no standard local Telugu databases; (2) within the specific areas of Telugu speaking, the peculiarities can be found to be the same, implying that different vocal words are needed, so we developed a dataset using post-graduate and primary learners in order to incorporate them. Five men and three women voluntarily participated, ranging in age from seventeen to twenty-one years. The reason for choosing this age group (statistics taken from the WHO depression fact sheet, https://www.who.int/news-room/fact-sheets/detail/depression, and the Global Health Data Exchange (2019)) is that contemporary studies show that these demographic groups are dwindling into viciousness in a variety of situations. All audio speakers are well-trained and capable of communicating 15 phrases for every 5 different emotions while taking the recording into account.

3.1.2 RAVDESS

There are 7356 files in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (Livingstone & Russo, 2018) (total size: 24.8 GB). Two lexically matched statements are vocalized in a neutral North American accent by 24 professional actors (12 female, 12 male) in the database. Calm, happy, sad, angry, fearful, surprised, and disgusted expressions can be found in speech, whereas calm, happy, sad, angry, and fearful emotions can be found in song. Each expression has two emotional intensity levels (normal and strong), as well as a neutral expression. All three modalities are available: audio-only (16 bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz, .mp4), and video-only (720p H.264, AAC 48 kHz, .mp4, no sound). Actor 18 does not have any song files. Video files (audio-visual and video-only) are provided as separate zip downloads for each actor (01-24, 500 MB each) and are split into separate speech and song downloads. In total, the downloadable RAVDESS collection (available at https://smartlaboratory.org/ravdess/) includes 7356 files (2880 + 2024 + 1440 + 1012 files).

File naming convention: each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics: Modality (01 = full-AV, 02 = video-only, 03 = audio-only); Vocal channel (01 = speech, 02 = song); Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
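As an illustration of this naming convention, the sketch below parses the 7-part identifier into a few of its fields. The emotion-code mapping follows the public RAVDESS documentation; the intensity, statement, and repetition fields are omitted here.

```python
# Parse a RAVDESS-style filename such as "02-01-06-01-02-01-12.mp4".
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
# Emotion codes as documented for RAVDESS.
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def parse_ravdess_name(filename: str) -> dict:
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")            # 7 two-digit identifiers
    return {
        "modality": MODALITY[parts[0]],
        "vocal_channel": VOCAL_CHANNEL[parts[1]],
        "emotion": EMOTION[parts[2]],
        "actor": int(parts[6]),        # last field is the actor number (01-24)
    }

print(parse_ravdess_name("02-01-06-01-02-01-12.mp4"))
# {'modality': 'video-only', 'vocal_channel': 'speech', 'emotion': 'fearful', 'actor': 12}
```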
Fig. 1 Wave plot of a male actor's neutral emotion; (a) clean spectrogram image of a male actor's anxiety/fear emotion
3.2 Feature extraction
S(f) = \frac{1}{T}\int_{0}^{T} r_{\mathrm{auto}}(t)\,\cos\!\left(2\pi m t / F_{0}\right) dt \qquad (1)
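Since the surrounding derivation was lost in extraction, the snippet below is only one possible discrete reading of Eq. (1), assuming r_auto is the (biased) autocorrelation of an analysis frame and the integration interval T maps to the frame length N; it is a sketch, not the authors' implementation.

```python
import numpy as np

def eq1_coefficients(frame, n_coeffs=13):
    # Discrete counterpart of Eq. (1): cosine transform of the frame autocorrelation.
    N = len(frame)
    r_auto = np.correlate(frame, frame, mode="full")[N - 1:] / N  # lags 0..N-1
    n = np.arange(N)
    return np.array([np.sum(r_auto * np.cos(2 * np.pi * m * n / N)) / N
                     for m in range(n_coeffs)])

coeffs = eq1_coefficients(np.random.randn(400))  # a 400-sample frame as a stand-in
print(coeffs.shape)  # (13,)
```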
The wave plot of a male Neutral speech signal is shown in Fig. 1, and the spectrogram is shown in Fig. 2a.

We present a framework for recognizing emotion from speech signals for this purpose. The entire speech emotion recognition system may be seen in action.

(1) Input: benchmark datasets are employed to increase the current framework's accuracy.
(2) Pre-processing: a time-series signal is used. The mean is removed first; this technique reduces the voice signal's offset to 0, making it easier to work with. In reality, the utterance frequency varies among individuals, so the various speech signals must be adjusted before processing. A high-pass Finite Impulse Response (FIR) filter applies pre-emphasis to smooth the speech spectrum. The amplitude of the signal is then normalized to the range -1 to 1 (a minimal sketch of this step follows below).
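A minimal sketch of this pre-processing step, assuming the common first-order pre-emphasis filter (the coefficient 0.97 is a typical choice, not a value stated by the authors) followed by peak normalization to [-1, 1]:

```python
import numpy as np

def preprocess(signal: np.ndarray, pre_emphasis: float = 0.97) -> np.ndarray:
    # Remove the DC offset so the signal is centred on zero
    x = signal - np.mean(signal)
    # First-order high-pass FIR pre-emphasis: y[n] = x[n] - a * x[n-1]
    y = np.append(x[0], x[1:] - pre_emphasis * x[:-1])
    # Normalize the amplitude into the range [-1, 1]
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

cleaned = preprocess(np.random.randn(16000))  # one second of noise as a stand-in
print(cleaned.min(), cleaned.max())
```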
3.5 Deep learning structure

We begin by importing all necessary modules, such as NumPy and matplotlib, in Python; TensorFlow is also imported. In layman's terms, TensorFlow represents data for deep learning as tensors, which can be 1D, 2D, or 3D arrays; in machine learning and deep learning terms, this is a multi-dimensional feature vector. Using one-hot encoding (one_hot=True), all categorical class labels are converted to binary vectors. The input dimensions are calculated computationally; one important point is that the inputs should be scaled to the range 0 to 1. As shown in Fig. 3, for each audio file provided as input the network works on a 40-element feature vector, which represents the compressed computational form of 2 ms frames of the audio signal. TensorFlow expects a specific input shape for its deep learning models, in this case a CNN model. We used convolutional layers with the first layer of 64 filters of size 3 x 3 and the second and third layers of 128 filters of size 3 x 3, plus two max-pooling layers of size 2 x 2 each. The learning rate is 1e-3 and is kept at this usual setting; based on it, the weights are updated to reduce the cost function and approach the optimal solution. If the learning rate is too low, the network makes very small weight updates and training moves very slowly, while the loss function may exhibit undesirable divergent behavior if it is too high. The training samples are distributed into a rectified scale factor, and each chunk trains on a fixed batch of inputs. One constraint is that we did not use feature selection to reduce the dimensionality of our augmented CNN, which could enhance training effectiveness. Another barrier was the use of limited data: the RAVDESS dataset has only 1440 files, which could explain why the data was over-fitted. It is possible that more datasets could be used.
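A hedged Keras sketch of the kind of 1D CNN described here, under the assumption of three Conv1D blocks (64, 128, and 128 filters of width 3), 2-wide max-pooling, dropout 0.2, dense layers, an 8-way softmax, RMSprop at a 1e-3 learning rate, and sparse categorical cross-entropy. The exact layer sizes are our reading of the text, not the authors' released configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_mfcc: int = 40, n_classes: int = 8) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(n_mfcc, 1)),            # one 40-MFCC vector per clip
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Integer labels 0-7, so sparse categorical cross-entropy applies directly.
X = np.random.rand(32, 40, 1).astype("float32")   # stand-in batch of MFCC vectors
y = np.random.randint(0, 8, size=32)
build_model().fit(X, y, epochs=2, batch_size=32, verbose=0)
```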
4 Experimental setup

The methodology and results of this research analysis focused on the model's ability to generate precise findings in the context of compressed speech over noisy environments. The analysis involved the development of a framework, testing different classification models, and evaluating their performance. The primary goal of the analysis and framework development was to determine whether the model's findings are precise enough to provide valuable insights for future research on compressed speech over noisy environments. The first technique we employed was creating a decision tree (DT) and a random forest (RF) classifier with more than a hundred trees. These models were evaluated using the scikit-learn Python library; this evaluation involved measuring accuracy, precision, recall, F1-score, and other relevant metrics to assess the models' performance. The consistency of identification across two separate datasets is highlighted in this section, suggesting that the model's performance remains consistent across different data sources. When determining the size of the training set, the composition and dimensions of the training samples were taken into account; this refers to selecting an appropriate proportion of data for training the model and ensuring that the samples are representative of the overall dataset. The tests conducted showed that using six layers of the FLB (feature learning block) is sufficient for handling a large database, indicating that deeper layers might not necessarily improve performance. However, in the case of smaller databases like RAVDESS and EMO-DB, the risk of overfitting emerged when using more than three layers and increasing model complexity; a more complex model may start to fit noise in the smaller dataset, leading to worse generalization.
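A minimal scikit-learn sketch of this baseline comparison; the feature matrix and labels are placeholders for the extracted 40-MFCC vectors and their emotion codes, and "more than a hundred trees" is taken literally via n_estimators.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholders for the extracted feature matrix (n_samples x 40 MFCCs) and labels 0-7
X = np.random.rand(200, 40)
y = np.random.randint(0, 8, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(n_estimators=150, random_state=0))]:
    clf.fit(X_tr, y_tr)
    # Reports accuracy, per-class precision/recall/F1, and weighted averages
    print(name)
    print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```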
4.1 Pre-processing

Acknowledging the challenge of using deep learning models with limited data, we aimed to overcome this limitation. The dataset at hand lacks an extensive number of features, prompting the use of techniques to enrich it. To augment the dataset, we employed the openSMILE library, which enabled the extraction of new audio features from the video data available in the dataset. Notably, the audio sampling rate of the video files (44.1 kHz) differs slightly from that of the original audio files (48 kHz). This introduced some noise, which we considered necessary to increase the training dimension.
Table 2 Results of the proposed model on the Telugu dataset test set, per class (columns: Emotion, Precision, Recall, F1-score, Support)
classes of Fear, Angry, and Neutral emotions at 30%, 25%, and 7%, with the other emotions shown in the pie chart, and Fig. 2c shows the predicted accuracy of emotional classes for the self-developed Telugu dataset: Sadness 33%, Happy 28%, Fear 19%, Disgust 17%, with many other emotions also predicted, shown in the pie chart. The proposed technique utilizes fusion features to improve the accuracy of the Convolutional Neural Network (CNN) in detecting emotions in real time. LBP, ORB, and CNN features were used to fuse the input data, resulting in a higher accuracy rate of 98.13% on the CK+ dataset, outperforming previous methods. The CNN model was trained on a composite dataset, which achieved exceptional training and validation accuracy rates. The proposed model exhibits superior performance compared to other techniques in recognizing human emotions. To improve emotion detection further, future research can investigate the incorporation of new features or methods with CNN and consider the effect of human variables such as age and gender. Moreover, an increase in the number of classes can lead to a decrease in accuracy, but the proposed method in this study achieves an F1 score comparable to the two tasks presented, as demonstrated in Figs. 4 and 5. Figure 3 corresponds to the True and Predicted labels, respectively, and highlights ambiguity between the Angry and Happy classes when arousal and energy levels are high. The proposed model achieved 98% prediction accuracy on the EMO-DB dataset and 53% on the RAVDESS dataset, with 100% prediction accuracy for Anger and Neutral.

There are 535 instances within the EMO-DB dataset. To minimize the error, we use k-fold cross-validation, and the findings are displayed in the figure. As can be seen, cross-validation has an overall precision of 87.3%. Prior EMO-DB results showed an average precision of 83.3% for the combination of MPEG-7 features, MFCCs, and timbre in [24]; using the SVM classifier, the overall result achieved was 83.39%. [25] collected 1800 multidimensional characteristics and showed that the average efficiency for identifying six classes was 73.3%, excluding the disgust emotion. The six emotions in the RAVDESS database are furious, calm, afraid, joyful, neutral, and sad. We use the same procedures as for EMO-DB and have 92 samples for each emotion. Tables 2 and 3 display the results. Anger is 98.4% accurate, Neutral 95.1%, Happy 94.3%, Fear 91.4%, and Boredom 95.1%. It is 98% accurate for anger, with 66%, 65% for disgust, 56% for fear, and more than 45% achieved for the other emotional classes.
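As a brief sketch of the kind of k-fold evaluation mentioned here, the snippet below uses stratified 5-fold cross-validation from scikit-learn with a random forest as a stand-in classifier; the fold count, features, and labels are placeholders, since the paper does not state k or release the evaluation code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.rand(535, 40)            # placeholder: EMO-DB-sized set of 40-MFCC vectors
y = np.random.randint(0, 7, size=535)  # placeholder labels for 7 EMO-DB emotion classes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=150, random_state=0),
                         X, y, cv=cv, scoring="precision_weighted")
print(scores.mean())  # overall (weighted) precision across folds
```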
We must mention that all resulting classifiers were trained using the RAVDESS dataset, which is one of the proposed approach's weaknesses. Only North American voices are included in this database, which might mean that performance deteriorates for persons of other ethnicities. Moreover, because most of the individual performers in the database are between the ages of 20 and 50, age could cause inaccurate predictions. As a result, when such models are used for toddlers or elderly persons, precision may be reduced due to anatomical differences in higher or lower age groups.

6 Conclusion

The study focuses on designing a CNN-based framework that leverages feature extraction mechanisms, adaptive thresholding, and spectrogram transformations to enhance the accuracy and efficiency of SER models. This indicates a comprehensive approach to addressing the challenges in emotion recognition from speech, which include dealing with noise, computational demands, and the complex nature of emotional cues in audio signals. Instead of employing spectrograms to learn the most salient and discriminative features in a convolutional layer, we employed strided CNN frameworks for SER, using a specific stride setting to flatten the extracted features. To enhance the precision of the Convolutional Neural Network (CNN) used for detecting emotions in real time, various methods can be used. Fusion features were employed in the proposed technique, and the input data were fused using LBP (Local Binary Patterns), ORB (Oriented FAST and Rotated BRIEF), and CNN features, which showed better accuracy (98.13%) than prior approaches on the CK+ dataset. In addition, the CNN model was trained on a composite dataset that attained a training accuracy of 95% and a validation accuracy of 91%, which rose to 95% after multiple epochs. The proposed model significantly outperforms other techniques for recognizing human emotions. Future studies can focus on examining new features or methods to be fused with CNN and investigating the influence of human variables such as age and gender on the efficiency of emotion detection.

7 Future scope

It is also worth noting that the efficiency of emotion detection through the CNN model can vary based on human variables such as age and gender. Further studies can aim to explore the impact of these factors on the performance and accuracy of the model. Additionally, new features or methods can be investigated in the future to enhance the precision and robustness of the emotion detection process in real time. Furthermore, it is important to consider other factors that may affect the efficiency of the CNN model, such as cultural differences and psychological traits. These variables can significantly influence the way individuals express and perceive emotions, potentially impacting the accuracy of the emotion detection process. Therefore, future research can focus on understanding and incorporating these elements into the model to improve its overall performance.
Additionally, exploring different datasets that represent diverse populations can also contribute to ensuring the generalizability and reliability of the CNN model in real-world scenarios. By continually refining and updating the model, we can unlock its full potential in accurately detecting and analyzing emotions in various contexts and applications. It is a challenging task that involves a blend of user experience design, machine learning skills, and ethical concerns. The resultant system may be used in a variety of industries, such as communication technology, sentiment analysis, and human-computer interaction.

Acknowledgements We thank all the volunteers who helped us in making the Telugu database. Presently the database is under review with the committee for endorsement and will be publicly available. The RAVDESS dataset is available at https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio.

Declarations

Conflict of interest The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.

References

Al-Ali, A. K. H., Dean, D., Senadji, B., Chandran, V., & Naik, G. R. (2017). Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions. IEEE Access, 5, 15400-15413.
Bediou, B., Krolak-Salmon, P., Saoud, M., Henaff, M.-A., Burt, M., Dalery, J., & D'Amato, T. (2005). Facial expression and sex recognition in schizophrenia and depression. The Canadian Journal of Psychiatry, 50(9), 525-533.
Boersma, P. (2011). Praat: Doing phonetics by computer [computer program]. http://www.praat.org/
Chen, M., Hao, Y., Hwang, K., Wang, L., & Wang, L. (2017). Disease prediction by machine learning over big data from healthcare communities. IEEE Access, 5, 8869-8879.
Chen, M., Zhang, Y., Qiu, M., Guizani, N., & Hao, Y. (2018). SPHA: Smart personal health advisor based on deep analytics. IEEE Communications Magazine, 56(3), 164-169.
Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357-366.
Deshmukh, G., Gaonkar, A., Golwalkar, G., & Kulkarni, S. (2019). Speech based emotion recognition using machine learning. In 2019 3rd International conference on computing methodologies and communication (ICCMC) (pp. 812-817). IEEE.
Ekman, P., & Keltner, D. (1997). Universal facial expressions of emotion. In U. Segerstrale & P. Molnar (Eds.), Nonverbal communication: Where nature meets culture (Vol. 27, p. 46). Springer.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572-587.
Gao, Y., Li, B., Wang, N., & Zhu, T. (2017). Speech emotion recognition using local and global features. In International conference on brain informatics (pp. 3-13). Springer.
Geethashree, A., & Ravi, D. (2018). Kannada emotional speech database: Design, development and evaluation. In Proceedings of international conference on cognition and recognition (pp. 135-143). Springer.
Global Health Data Exchange (GHDx), Institute of Health Metrics and Evaluation. (2019). GBD Results Tool | GHDx. ghdx.healthdata.org. http://ghdx.healthdata.org/gbd-results-tool?params=gbd-api-2019-permalink/d780dffbe8a381b25e1416884959e88b
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). MIT Press.
Huang, Z., Dong, M., Mao, Q., & Zhan, Y. (2014). Speech emotion recognition using CNN. In Proceedings of the 22nd ACM international conference on multimedia (pp. 801-804).
Huang, X., Acero, A., Hon, H.-W., & Reddy, R. (2001). Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR.
Iqbal, A., & Barua, K. (2019). A real-time emotion recognition from speech using gradient boosting. In 2019 International conference on electrical, computer and communication engineering (ECCE) (pp. 1-5). IEEE.
Jannat, R., Tynes, I., Lime, L. L., Adorno, J., & Canavan, S. (2018). Ubiquitous emotion recognition using audio and video data. In Proceedings of the 2018 ACM international joint conference and 2018 International symposium on pervasive and ubiquitous computing and wearable computers (pp. 956-959).
Jin, B., & Liu, G. (2017). Speech emotion recognition based on hyper-prosodic features. In 2017 International conference on computer technology, electronics and communication (ICCTEC) (pp. 82-87). IEEE.
Khaleghi, B., Khamis, A., Karray, F. O., & Razavi, S. N. (2013). Multisensor data fusion: A review of the state-of-the-art. Information Fusion, 14(1), 28-44.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International conference on contemporary computing (pp. 485-492). Springer.
Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signals. In Eighth European conference on speech communication and technology.
LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), 1-35. https://doi.org/10.1371/journal.pone.0196391
Pinto, M. G., Polignano, M., Lops, P., & Semeraro, G. (2020). Emotions understanding model from spoken language using deep neural networks and mel-frequency cepstral coefficients. In 2020 IEEE conference on evolving and adaptive intelligent systems (EAIS) (pp. 1-5). IEEE.
Rajisha, T., Sunija, A., & Riyas, K. (2016). Performance analysis of Malayalam language speech emotion recognition system using ANN/SVM. Procedia Technology, 24, 1097-1104.
Reddy, A. P., & Vijayarajan, V. (2017). Extraction of emotions from speech: A survey. International Journal of Applied Engineering Research, 12(16), 5760-5767.
Schroder, M., Bevacqua, E., Cowie, R., Eyben, F., Gunes, H., Heylen, D., Ter Maat, M., McKeown, G., Pammi, S., Pantic, M., et al. (2011). Building autonomous sensitive artificial listeners. IEEE Transactions on Affective Computing, 3(2), 165-183.
Syed, Z. S., Memon, S. A., Shah, M. S., & Syed, A. S. (2020). Introducing the Urdu-Sindhi speech emotion corpus: A novel dataset of speech recordings for emotion recognition for two low-resource languages. International Journal of Advanced Computer Science and Applications, 11(4), 1-6.
Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M. A., Schuller, B., & Zafeiriou, S. (2016). Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5200-5204). IEEE.
Vasquez-Correa, J. C., Arias-Vergara, T., Orozco-Arroyave, J. R., Vargas-Bonilla, J. F., & Noeth, E. (2016). Wavelet-based time-frequency representations for automatic recognition of emotions from speech. In Speech communication; 12. ITG symposium (pp. 1-5). VDE.
Wang, S., Soladie, C., & Seguier, R. (2019). OCAE: Organization-controlled autoencoder for unsupervised speech emotion analysis. In 2019 5th International conference on frontiers of signal processing (ICFSP) (pp. 72-76). IEEE.
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., & Vepa, J. (2018). Speech emotion recognition using spectrogram & phoneme embedding. In Interspeech (pp. 3688-3692).
Zhang, Q., An, N., Wang, K., Ren, F., & Li, L. (2013). Speech emotion recognition using combination of features. In 2013 Fourth International conference on intelligent control and information processing (ICICIP) (pp. 523-528). IEEE.
Zhang, B., Essl, G., & Provost, E. M. (2015). Recognizing emotion from singing and speaking using shared models. In 2015 International conference on affective computing and intelligent interaction (ACII) (pp. 139-145). IEEE.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.