
Vol. 14 (2024) No. 3, ISSN: 2088-5334

Emotion Recognition and Multi-class Classification in Music with MFCC and Machine Learning

Gilsang Yoo a,*, Sungdae Hong b, Hyeocheol Kim a
a Creative Informatics and Computing Institute, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
b Division of Design, Seokyeong University, 124 Seogyeong-ro, Seongbuk-gu, Seoul, 02173, Republic of Korea
Corresponding author: *ksyoo@korea.ac.kr

Abstract—Background music in OTT services significantly enhances narratives and conveys emotions, yet users with hearing
impairments might not fully experience this emotional context. This paper illuminates the pivotal role of background music in user
engagement on OTT platforms. It introduces a novel system designed to mitigate the challenges the hearing-impaired face in
appreciating the emotional nuances of music. This system adeptly identifies the mood of background music and translates it into textual
subtitles, making emotional content accessible to all users. The proposed method extracts key audio features, including Mel Frequency
Cepstral Coefficients (MFCC), Root Mean Square (RMS), and Mel Spectrograms. It then applies leading machine
learning algorithms, namely Logistic Regression, Random Forest, AdaBoost, and Support Vector Classification (SVC), to analyze the emotional
traits embedded in the music and accurately identify its sentiment. Among these, the Random Forest algorithm, applied to MFCC
features, demonstrated exceptional accuracy, reaching 94.8% in our tests. The significance of this technology extends beyond mere
feature identification; it promises to revolutionize the accessibility of multimedia content. By automatically generating emotionally
resonant subtitles, this system can enrich the viewing experience for all, particularly those with hearing impairments. This advancement
not only underscores the critical role of music in storytelling and emotional engagement but also highlights the vast potential of machine
learning in enhancing the inclusivity and enjoyment of digital entertainment across diverse audiences.

Keywords—Emotion recognition; multi-class classification; machine learning; Mel spectrograms.

Manuscript received 11 Sep. 2023; revised 29 Nov. 2023; accepted 7 Jan. 2024. Date of publication 30 Jun. 2024.
IJASEIT is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION

In contemporary society, multimedia content, mainly OTT services, has become a vital component of daily life, offering diverse video content that enriches user experiences [1]. These services enhance narratives and convey emotions through background music [2]. However, users with auditory limitations, such as hearing impairment, may find it challenging to fully grasp the emotional context conveyed by these musical elements [3]–[5]. To address this issue, this study proposes a system within OTT services that automatically recognizes the mood of background music in video content and converts it into textual subtitles. This system aims to improve the accessibility and inclusiveness of multimedia content, enabling all users, including those with hearing impairments, to have a more enriching multimedia experience.

The objectives of this research are to develop techniques for accurately extracting and analyzing the characteristics of background music in video content, to explore methods for effectively converting the mood and emotional qualities of music into text, and to evaluate whether the generated subtitles can effectively convey the emotional context of the video to hearing-impaired users. To achieve these goals, the study employs machine learning to analyze background music's mood and emotional qualities and convert them into text. Fundamental techniques used in this process include Mel-Frequency Cepstral Coefficients, Root Mean Square, and Mel Spectrogram analysis. These techniques extract meaningful features from audio signals, and the music's mood is then classified and textualized through various machine learning algorithms, including Random Forest, Logistic Regression, AdaBoost, and SVC (Support Vector Classification). The application of machine learning for background music analysis represents a notable advancement in this field, contributing significantly to the understanding and processing of multimedia content.

This research is expected to enable all users, including those with hearing impairments, to experience multimedia content fully. By enhancing content accessibility and inclusiveness, this research could establish new standards for producing and delivering video content. The next part of this paper, Section II, reviews existing background music recognition and mood analysis research and presents the technical details, algorithms, and implementation method used in this study. Section III assesses the system's performance and its potential benefits for hearing-impaired users through experimental results. Finally, Section IV discusses the implications of the research findings and future research directions.

II. MATERIALS AND METHODS

A. MFCC Technique for Background Music Processing
The Mel-Frequency Cepstral Coefficients technique plays a pivotal role in speech recognition and music information retrieval, showcasing notable success in processing background music. Developed based on the human auditory system's use of the Mel scale rather than a linear frequency scale to perceive sounds, MFCC effectively extracts essential features such as timbre, pitch, and rhythm from musical signals, mirroring human auditory characteristics [6]. Initially utilized primarily in feature extraction for speech recognition systems, MFCC has expanded its application to analyzing background music characteristics through the work of various researchers, finding widespread use in areas such as music genre classification, emotion analysis, and music recommendation systems [7].

Furthermore, MFCC-based background music processing techniques have been applied to analyze the mood and emotions of background music in films and videos [8]–[10]. By classifying the emotional states of movie background music based on features extracted through MFCC, it is possible to analyze the overall emotional flow of the movie. These studies highlight MFCC's potential as a tool for music information retrieval and multimedia content analysis and understanding [11]–[13]. Building on these previous studies, this research utilizes MFCC technology for processing background music in video content and develops a system that automatically textualizes music's mood and emotional characteristics. This study aims to extend existing MFCC-based music processing techniques to enable a broader audience, including users with hearing impairments, to understand the musical elements of video content.

B. Mel Spectrogram Analysis
The Mel Spectrogram is an essential tool in audio signal analysis, widely utilized particularly in music and speech processing. This technique visualizes the changes in frequency over time, aiding in understanding the complex structures within audio signals. Designed to mimic the characteristics of human hearing, the Mel scale processes frequencies akin to how humans perceive sound, sharing similarities with MFCC analysis but emphasizing the visual representation of audio signals [14]–[15].

Recent studies have demonstrated the effectiveness of Mel Spectrograms when used in conjunction with deep learning models, especially Convolutional Neural Networks (CNN), for automatic classification of music genres, music recommendation systems, and emotion analysis based on music. For instance, CNN models utilizing Mel Spectrograms as input have been developed to classify music genres with high accuracy [16]–[17]. This approach allows deep learning models to effectively learn the complex features of audio signals, yielding superior results compared to traditional manual feature extraction methods.

Moreover, due to their ability to capture fine-grained audio signal variations, Mel Spectrograms are also valuable for analyzing the subtle emotional nuances within music [18]–[19]. This facilitates a more sophisticated understanding of the emotional impact of background music in films or videos on viewers. Thus, Mel Spectrogram analysis is a powerful tool for effectively visualizing and analyzing the complex features of audio signals, offering various applications in music processing and analysis. This study leverages the advantages of Mel Spectrograms to analyze the mood of background music in video content and convert it into subtitle information for users with hearing impairments, presenting a novel approach to this challenge.

C. Machine Learning for Background Music Analysis
Background music analysis has established itself as one of the crucial elements in understanding and processing multimedia content, with the application of machine learning emerging as a significant research focus in this area. Leveraging machine learning techniques allows researchers to automatically identify and classify the complex characteristics of background music, enhancing the content's emotional ambiance and viewer experience [20].

In early studies, machine learning was primarily applied to simple tasks, such as classifying music genres. These studies quantified various musical attributes (e.g., rhythm, melody, harmony) and, based on these attributes, trained classification algorithms (e.g., decision trees, K-nearest neighbors). While this approach proved helpful in identifying the musical features of specific genres, it faced limitations when analyzing more complex characteristics, such as the subtle emotional nuances of background music.

Recent studies have started applying more advanced machine learning models, particularly techniques like deep learning, to background music analysis to overcome these limitations. For instance, CNNs and Recurrent Neural Networks (RNNs) are particularly effective in learning temporal changes and patterns in audio signals, enabling the practical analysis of complex characteristics like the emotional mood of background music [21]–[22]. Furthermore, machine learning techniques have been utilized to analyze the interactions between background music and video content [23]–[29]. This approach has enabled a more accurate understanding of the impact of background music on viewers, allowing content creators to design more effective emotional experiences.

D. Data Exploration
The dataset, which is publicly available on Kaggle (https://www.kaggle.com/datasets/dikshashri13702/features-music-mood-classification/data), consists of 2,500 .wav files that are labeled with five emotions:

• Aggressive (500 files)
• Dramatic (500 files)
• Happy (500 files)
• Romantic (500 files)
• Sad (500 files)
Each emotion-labeled music clip is a 5-second segment, and the signal waveform for each piece of music data is shown in Fig. 1.
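As a point of reference, a minimal loading sketch is shown below. It assumes the Kaggle archive is unpacked into one subfolder per emotion and that librosa is used for audio I/O; neither the folder layout nor the library is stated in the paper, and the example file name is hypothetical.

```python
from pathlib import Path
import librosa  # assumed audio library; any WAV reader would do

DATA_DIR = Path("data")  # hypothetical root folder for the unpacked Kaggle archive
EMOTIONS = ["Aggressive", "Dramatic", "Happy", "Romantic", "Sad"]

for emotion in EMOTIONS:
    files = sorted((DATA_DIR / emotion).glob("*.wav"))
    print(f"{emotion}: {len(files)} files")  # expected: 500 files per class

# Load one clip; duration=5.0 matches the 5-second segments described above.
y, sr = librosa.load(DATA_DIR / "Happy" / "happy_001.wav", sr=22050, duration=5.0)
print(y.shape, sr)
```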
E. Feature Extraction
Audio features such as Mel Frequency Cepstral
Coefficients, Root Mean Square, and Mel Spectrograms are
extracted in the feature extraction stage. These features
numerically represent various acoustic properties of music
and are used as inputs for machine learning models. The
computational methods for each feature at this stage are as
follows:
1) Fourier Transform (FFT): A Fourier Transform is performed on each frame to convert the time-domain signal into the frequency domain and extract the energy distribution of the spectrum. The amplitude spectrum of the signal is then converted into the energy spectrum using Equation (1). The window size used for the FFT (Fast Fourier Transform) is set to 4096; a larger value increases the frequency resolution but decreases the time resolution.
X[k] = |FFT(x[n])|²    (1)

The equation X[k] = |FFT(x[n])|² represents the power spectrum of a discrete signal. Here, X[k] denotes the energy at the k-th frequency bin, while FFT(x[n]) stands for the Fourier Transform of the signal x[n]. The modulus squared |FFT(x[n])|² is used to calculate the power of the given frequency component in the signal. This is commonly used in signal processing to analyze the frequency content of signals.
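A minimal sketch of this step is shown below. It only evaluates Equation (1) for each windowed frame; the hop length of 512 samples and the Hann window are assumptions, since the paper states only the 4096-sample FFT size.

```python
import numpy as np

def power_spectrum_frames(y, n_fft=4096, hop_length=512):
    """Frame the signal, window each frame, and compute X[k] = |FFT(x[n])|^2 (Equation 1)."""
    window = np.hanning(n_fft)                        # window choice is an assumption
    n_frames = 1 + (len(y) - n_fft) // hop_length     # assumes len(y) >= n_fft
    spectra = np.empty((n_fft // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = y[t * hop_length : t * hop_length + n_fft] * window
        spectra[:, t] = np.abs(np.fft.rfft(frame)) ** 2
    return spectra
```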
2) Application of Mel Filter Bank: The human ear is more sensitive to lower frequencies than to higher frequencies, a characteristic that the Mel filter bank reflects. As frequency increases, the bandwidth of the Mel filters broadens, so that the lower frequency bands retain finer resolution and sufficient energy information. The boundaries of each filter are calculated using the standard mapping between frequency and Mel frequency. The frequency domain is divided into several bands according to the Mel scale and the energy in each band is computed; subsequent post-processing steps, including taking the logarithm and applying the Discrete Cosine Transform (DCT), transform the filtered energies into MFCC features.
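As a rough illustration of steps 2 and 3, the band energies can be obtained by multiplying a Mel filter bank with the power spectrum from Equation (1). The use of librosa's filter-bank helper and the choice of 40 Mel bands are assumptions; the paper does not name its implementation or its n_mels setting.

```python
import librosa
import numpy as np

# Toy input standing in for one framed power spectrum of shape (n_fft // 2 + 1, n_frames);
# in practice this would come from the FFT step sketched above.
spectra = np.abs(np.fft.rfft(np.random.randn(4096))).reshape(-1, 1) ** 2

# Triangular Mel filter bank; 40 bands matches the 40 coefficients described later.
mel_fb = librosa.filters.mel(sr=22050, n_fft=4096, n_mels=40)   # shape: (40, 2049)

band_energy = mel_fb @ spectra                  # (40, n_frames) energy per Mel band
log_energy = np.log(band_energy + 1e-10)        # step 3: log energy (epsilon avoids log(0))
```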
3) Log Energy: The logarithm of the energy is taken for
each band.
4) Discrete Cosine Transform (DCT): The log energy
spectrum is subjected to the DCT to calculate the MFCCs as
illustrated in Equation (2).
MFCC_i = Σ_{n=1}^{N} log(E_n) · cos[ π · i · (n − 0.5) / N ]    (2)

Fig. 1 The waveform of the signal for each music data

Equation (2) is the formula for calculating the Mel Frequency Cepstral Coefficients (MFCCs). Here, MFCC_i represents the i-th cepstral coefficient, E_n denotes the energy of the n-th Mel filter bank channel, and N is the total number of Mel filter bank channels. The expression log(E_n) indicates the logarithm of the energy, which is used to capture the non-linear way the human ear perceives sound. The cosine term is part of the Discrete Cosine Transform (DCT), which is applied to the log Mel spectrum, and i is the index of the MFCC. This transformation from the log Mel spectrum to the cepstral domain helps to decorrelate the signal and compresses the spectrum, resulting in a representation that can be used effectively in various audio processing tasks, particularly voice and speech analysis. In the final analysis, MFCC_i is stored in a 2D array containing 40 MFCC features calculated for each frame of the audio signal. Each column of the array represents a single frame of the audio signal, while each row corresponds to one of the MFCC coefficients for that frame. The transformed MFCC results effectively capture the timbre of the audio signal, as illustrated in Fig. 2, and are utilized in applications such as speech recognition and music classification.

Fig. 2 The transformed MFCC results
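A compact way to reproduce this pipeline end to end is librosa's MFCC routine. The sketch below is an assumption about tooling (the paper does not state which library it used) and only fixes the parameters the text mentions: 40 coefficients and a 4096-sample FFT window; the hop length and file path are hypothetical.

```python
import librosa

def extract_mfcc(path, sr=22050, n_mfcc=40, n_fft=4096, hop_length=512):
    """Return a (40, n_frames) MFCC matrix: one row per coefficient, one column per frame."""
    y, _ = librosa.load(path, sr=sr, duration=5.0)   # 5-second segments, as in the dataset
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)

mfcc = extract_mfcc("data/Happy/happy_001.wav")      # hypothetical path
print(mfcc.shape)   # (40, ~216) for a 5 s clip at 22.05 kHz with hop length 512
```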

5) RMS analysis: RMS, Delta RMSE, and Energy Novelty analyses are standard methods used in audio signal processing, especially in music and sound analysis. Each method examines specific aspects of a signal, helping to understand its characteristics (a minimal computational sketch follows the figure captions below):
• Root Mean Square: RMS is used to represent the average power of a signal. In audio, the RMS value roughly indicates the "loudness" or volume level of the signal. It is calculated by squaring all the sample values of the signal, averaging them, and then taking the square root of that average. RMS analysis is useful for understanding the overall energy level of audio and comparing the loudness between different audio clips or tracks.
• Delta RMSE (Root Mean Square Error): Delta RMSE represents the rate of change of RMS values over time. It can detect changes in energy levels in an audio signal and is particularly useful for analyzing dynamic range variations in music or audio clips. For instance, strong beats or sudden increases in loudness in music can manifest as sharp changes in Delta RMSE values.
• Energy Novelty: Energy Novelty analysis is used to find new points of energy change, or "novelty points," within an audio signal. It is primarily used in music structure analysis to identify significant changes within a track. Energy Novelty is obtained by calculating the energy of the signal over short periods and analyzing changes in this energy level. If the rate of change exceeds a certain threshold, that point can be considered a new change in energy.

Fig. 3 Results of RMS Analysis for Dramatic Music

Fig. 4 Results of RMS Analysis for Romantic Music
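The sketch below illustrates one plausible way to compute these three quantities with librosa and NumPy. The library choice, frame and hop sizes, thresholding rule, and file path are assumptions rather than the paper's stated implementation.

```python
import librosa
import numpy as np

y, sr = librosa.load("data/Dramatic/dramatic_001.wav", sr=22050)   # hypothetical path

# Frame-wise RMS ("loudness" per frame).
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]

# Delta RMS: frame-to-frame rate of change of the RMS curve.
delta_rms = np.diff(rms, prepend=rms[0])

# Energy novelty: keep only energy increases and flag frames above a threshold.
novelty = np.maximum(0.0, delta_rms)
threshold = novelty.mean() + 2 * novelty.std()         # assumed rule of thumb
novelty_points = np.flatnonzero(novelty > threshold)   # candidate "new energy" frames
print(len(novelty_points), "novelty points")
```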

Fig. 3 presents the results of the RMS, delta RMSE, and Energy Novelty analyses for Dramatic music, while Fig. 4 shows the analysis results for Romantic music.

F. Quantification of Acoustic Properties
The Mel Spectrogram is pivotal in machine learning, particularly in audio-related deep learning applications. It effectively extracts significant audio features through a Mel scale that mimics the human auditory system, offering these features in a format suitable for machine learning models to learn from. Transforming complex audio signals into a two-dimensional image format simplifies the high-dimensional complexity of the data into a visually and computationally manageable form. This transformation also ensures high compatibility with powerful deep learning models such as Convolutional Neural Networks for image processing. Moreover, the Mel Spectrogram captures the temporal dynamics of audio signals, effectively conveying critical information about frequency variations over time to machine learning models. Due to these characteristics, the Mel Spectrogram is widely utilized as a crucial tool in machine learning and deep learning research related to audio analysis. The Mel Spectrogram can be obtained by arranging the energy of the Mel filter banks acquired during the MFCC calculation process along the time axis. Representative Mel Spectrogram results for each piece of music are depicted in Fig. 5.

Fig. 5 Representative Mel Spectrogram Results for Each Piece of Music
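A minimal sketch of this quantification step, assuming librosa (the paper does not name its tooling) and a hypothetical file path, converts a clip into a 2D log-Mel image of shape (n_mels, n_frames):

```python
import librosa
import numpy as np

y, sr = librosa.load("data/Romantic/romantic_001.wav", sr=22050)   # hypothetical path

# Mel band energies over time (the Mel filter-bank outputs arranged along the time axis).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=4096, hop_length=512, n_mels=128)

# Log (dB) scaling gives the familiar image-like Mel Spectrogram used as model input.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)   # (128, n_frames); n_mels=128 is an assumed value, not stated in the paper
```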

III. RESULTS AND DISCUSSION


Based on the extracted features, simulations were
conducted using Logistic Regression, Random Forest,
Support Vector Classification (SVC), and AdaBoost models.
The models were trained using the training data and evaluated
using the validation data. Where necessary, adjustments to the
model's structure or optimization of hyperparameters were
performed to enhance performance. The experimental results
for each model are as follows.
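The paper does not include its training code; the following is a minimal sketch of how such a comparison could be set up with scikit-learn, assuming the extracted features are summarized into one fixed-length vector per clip and split into training and validation sets (the split ratio, feature summary, and random placeholders below are assumptions).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# X: one fixed-length feature vector per clip (e.g., per-coefficient MFCC means);
# y: emotion labels. Random placeholders stand in for the real features here.
rng = np.random.default_rng(0)
X = rng.normal(size=(2500, 40))
y = np.repeat(["Aggressive", "Dramatic", "Happy", "Romantic", "Sad"], 500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # split ratio is an assumption

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVC": SVC(kernel="linear"),        # linear kernel, as described in Section III.C
    "AdaBoost": AdaBoostClassifier(),   # scikit-learn default DecisionTreeClassifier base learner
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_val, model.predict(X_val), digits=3))
```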
A. Logistic Regression Model
Table 1 [30]–[33] displays the logistic regression model's
performance metrics for a multi-class classification task.
These metrics indicate the model's efficacy in predicting each
class based on precision, recall, and f1-score [34].
• Aggressive: This class has the highest precision and recall at 96.5%, suggesting that the model predicts this class with excellent accuracy and reliability.
• Dramatic: Precision and recall are slightly lower than for the Aggressive class, at 91.6% and 92.2%, respectively, but still indicate high performance.
• Happy: With a precision of 93.9% and a recall of 96.0%, the model can identify this class correctly.
• Romantic: Presents the lowest precision and recall, at 83.6% and 83.0%, respectively, highlighting it as the most challenging class for the model to predict accurately.
• Sad: Scores are decent, with a precision of 85.7% and a recall of 83.8%. While not as high as for the other classes, the performance is satisfactory.
The model's overall accuracy is 90.3%, with the macro
average and weighted average of precision, recall, and the f1-
score closely matching this figure. This reflects a well-
balanced performance across the classes, though there is room
for improvement, particularly for Romantic and Sad, which
show lower performance.
TABLE I
EVALUATION RESULTS OF THE LOGISTIC REGRESSION MODEL

Class         Precision  Recall  F1-Score
Aggressive    0.965      0.965   0.965
Dramatic      0.916      0.922   0.919
Happy         0.939      0.960   0.949
Romantic      0.836      0.830   0.833
Sad           0.857      0.838   0.847
accuracy      -          -       0.903
macro avg     0.902      0.903   0.903
weighted avg  0.902      0.903   0.903

B. Random Forest Model
The Random Forest model is an ensemble classification and regression technique that utilizes a collection of decision trees. Each tree in a Random Forest is trained on a random subset of the data, and the final prediction is made by aggregating (typically by voting, for classification) the predictions of the individual trees [35]–[40]. Table 2 shows the classification results for the Random Forest model, covering five different labels: Aggressive, Dramatic, Happy, Romantic, and Sad. Here is a breakdown of the performance metrics:
• Aggressive: It shows exceptional precision at 97.1% and perfect recall, meaning every instance of Aggressive in the test set was correctly identified. The F1-Score is 98.5%, indicating an excellent balance between precision and recall.
• Dramatic: High precision and recall, 96% and 95%, respectively, leading to a very high F1-Score of 95.5%, reflecting strong classification performance for this label.
• Happy: Achieved perfect precision, indicating that every prediction made as Happy was correct. The recall of 96% suggests that most, but not all, Happy instances were captured. The F1-Score is 98%, which is outstanding.
• Romantic: This label has lower precision at 87.6% but a higher recall of 92%, suggesting some false positives in the predictions. The F1-Score is 89.8%, the weakest among the labels, but still suggests good performance.
• Sad: It presents strong metrics, with precision at 93.8% and recall at 91%, resulting in a robust F1-Score of 92.4%.
The model's accuracy across all labels is 94.8%, indicating that the model correctly predicts the label 94.8% of the time across the dataset. The results reflect a highly effective classifier, particularly for the labels Aggressive and Happy, with room for improvement in the classification of Romantic.

TABLE II
EVALUATION RESULTS OF THE RANDOM FOREST MODEL

Class         Precision  Recall  F1-Score
Aggressive    0.971      1.000   0.985
Dramatic      0.960      0.950   0.955
Happy         1.000      0.960   0.980
Romantic      0.876      0.920   0.898
Sad           0.938      0.910   0.924
accuracy      -          -       0.948
macro avg     0.949      0.948   0.948
weighted avg  0.949      0.948   0.948

C. Support Vector Classifier Model
Table 3 presents the results of solving a classification problem using the SVC (Support Vector Classifier). The SVC, a classification algorithm based on Support Vector Machine principles, was implemented in this experiment with a linear kernel to construct the model. The model was then trained and evaluated on a specific dataset. The precision for the 'aggressive' class was 93.4%, and the recall was 99%, indicating a high proportion of predictions correctly identified as 'aggressive.' The F1-Score, the harmonic mean of precision and recall, was recorded at 96.1%. Overall, the SVC results demonstrate that the model performs well generally and classifies the 'aggressive' class exceptionally well. However, the comparatively lower performance in the 'romantic' and 'sad' classes suggests that further model performance enhancement is necessary.

TABLE III
EVALUATION RESULTS OF THE SVC MODEL

Class         Precision  Recall  F1-Score
Aggressive    0.934      0.990   0.961
Dramatic      0.865      0.900   0.882
Happy         0.930      0.930   0.930
Romantic      0.862      0.810   0.835
Sad           0.854      0.820   0.837
accuracy      -          -       0.890
macro avg     0.889      0.890   0.889
weighted avg  0.889      0.890   0.889
D. AdaBoost Model
The AdaBoost classifier is a machine learning model that combines multiple "weak learners" to form a robust predictive model [41]. In this case, the AdaBoostClassifier from the scikit-learn library is used, which defaults to using the DecisionTreeClassifier as its weak learner. AdaBoost is particularly sensitive to noisy data and outliers, which might explain the lower performance in some classes. Table 4 shows a breakdown of the model and its performance metrics:
• The `aggressive` class has a relatively high precision and recall, resulting in a solid f1-score of 83%.
• The `dramatic` and `sad` classes have lower precision and recall values, indicating challenges in accurately predicting these classes.
• The `happy` class has decent precision but lower recall, suggesting the model is conservative in predicting this class and misses some actual cases.
• While the `romantic` class has lower precision, it has the highest recall, indicating that the model tends to over-predict this class.
The model's overall accuracy is 53.0%, with the macro and weighted averages for precision, recall, and f1-score hovering around 52.4% to 54.7%. These results imply that while the model performs well in the `aggressive` class, it struggles with the other categories to varying degrees, leading to moderate overall performance. Fine-tuning parameters like the number of estimators and learning rate, or even using a different base estimator, could improve the model's predictive accuracy; an illustrative sketch of such tuning follows Table 4.

TABLE IV
EVALUATION RESULTS OF THE ADABOOST MODEL

Class         Precision  Recall  F1-Score
Aggressive    0.934      0.990   0.961
Dramatic      0.865      0.900   0.882
Happy         0.930      0.930   0.930
Romantic      0.862      0.810   0.835
Sad           0.854      0.820   0.837
accuracy      -          -       0.890
macro avg     0.889      0.890   0.889
weighted avg  0.889      0.890   0.889
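As an illustration of that suggestion, a grid search over the number of estimators and the learning rate might look like the sketch below; the candidate values and the placeholder data are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data; in practice, reuse the training split from the earlier sketch.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 40))
y_train = np.repeat(["Aggressive", "Dramatic", "Happy", "Romantic", "Sad"], 80)

# Candidate settings are illustrative; the paper does not report a tuning grid.
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}

# A shallow DecisionTreeClassifier remains the default weak learner; swapping in a
# different base estimator, as the text suggests, could be explored separately.
search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid,
                      scoring="f1_macro", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```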
IV. CONCLUSION

This paper underscores the importance of background music in OTT services and its impact on user experience, proposing a system that helps all users, including those with hearing impairments, better understand the emotions conveyed through background music. The system combines audio signal processing techniques such as MFCC, RMS, and Mel Spectrogram with various machine learning algorithms (Logistic Regression, Random Forest, AdaBoost, and SVC) to analyze the emotional characteristics of background music and convert them into textual subtitles. The experimental results validate the utility of this technology, with the Random Forest algorithm showing the highest accuracy. This system can be utilized to improve the accessibility of emotional elements in a wide range of multimedia content, not just OTT services, enabling all users, especially those with hearing impairments, to have a deeper understanding and enjoyment of content through music, a non-verbal communication channel.

While the proposed system for emotional analysis of background music and conversion to textual subtitles has significantly enriched user experience in OTT services, further research is anticipated in several areas. First, as the perception of emotions in background music can vary across different cultures and genres, there is a need to improve the universality of the model through research that includes a variety of cultural backgrounds and musical genres. Second, refining and expanding the emotion classification system used in the current study is essential to develop a model capable of recognizing and expressing a more sophisticated and varied range of emotional states. Third, for actual application in OTT services, creating a system capable of analyzing background music and generating subtitles in real time is planned.

ACKNOWLEDGMENT
The Basic Science Research Program supported this research through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (RS-2023-00246191).

REFERENCES
[1] K. S. Sontakke, "Trends in OTT Platforms Usage During COVID-19 Lockdown in India," Journal of Scientific Research, vol. 65, no. 08, pp. 112–114, 2021, doi: 10.37398/jsr.2021.650823.
[2] W.-H. Kim et al., "Multi-Modal Deep Learning Based Metadata Extensions for Video Clipping," International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 1, pp. 375–380, Feb. 2024, doi: 10.18517/ijaseit.14.1.19047.
[3] V. P. Gangwar, V. S. Sudhagoni, N. Adepu, and S. T. Bellamkonda, "Profiles and Preferences of OTT users in Indian Perspective," European Journal of Molecular & Clinical Medicine, vol. 7, no. 8, 2020.
[4] M. Yasen and S. Tedmori, "Movies Reviews Sentiment Analysis and Classification," 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Apr. 2019, doi: 10.1109/jeeit.2019.8717422.
[5] J. Kim, C. Nam, and M. H. Ryu, "IPTV vs. emerging video services: Dilemma of telcos to upgrade the broadband," Telecommunications Policy, vol. 44, no. 4, p. 101889, May 2020, doi: 10.1016/j.telpol.2019.101889.
[6] M. S. Nordin et al., "Stress Detection based on TEO and MFCC speech features using Convolutional Neural Networks (CNN)," 2022 IEEE International Conference on Computing (ICOCO), Kota Kinabalu, Malaysia, 2022, pp. 84–89, doi: 10.1109/ICOCO56118.2022.10031771.
[7] M. Selvaraj, R. Bhuvana, and S. Padmaja, "Human speech emotion recognition," Int. J. Eng. Technol., vol. 8, no. 1, pp. 311–323, 2016.
[8] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, "A Survey of Audio-Based Music Classification and Annotation," IEEE Transactions on Multimedia, vol. 13, no. 2, pp. 303–319, Apr. 2011, doi: 10.1109/TMM.2010.2098858.
[9] V. Bansal, G. Pahwa, and N. Kannan, "Cough Classification for COVID-19 based on audio mfcc features using Convolutional Neural Networks," 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 2020, pp. 604–608, doi: 10.1109/gucon48875.2020.9231094.
[10] S. A. A. Qadri, T. S. Gunawan, M. Kartiwi, H. Mansor, and T. M. Wani, "Speech Emotion Recognition Using Feature Fusion of TEO and MFCC on Multilingual Databases," Lecture Notes in Electrical Engineering, vol. 730, pp. 681–691, 2022.
[11] Q. Li et al., "MSP-MFCC: Energy-Efficient MFCC Feature Extraction Method With Mixed-Signal Processing Architecture for Wearable Speech Recognition Applications," IEEE Access, vol. 8, pp. 48720–48730, 2020, doi: 10.1109/access.2020.2979799.
[12] S. Masood, J. S. Nayal, and R. K. Jain, "Singer identification in Indian Hindi songs using MFCC and spectral features," 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), Delhi, India, 2016, pp. 1–5, doi: 10.1109/icpeices.2016.7853641.
[13] J. Dutta and D. Chanda, "Music Emotion Recognition in Assamese Songs using MFCC Features and MLP Classifier," 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 2021, pp. 1–5, doi: 10.1109/conit51480.2021.9498345.
[14] K. L. Ong, C. P. Lee, H. S. Lim, K. M. Lim, and A. Alqahtani, "Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers," IEEE Access, vol. 11, pp. 108571–108579, 2023, doi: 10.1109/access.2023.3321122.
[15] S. D. Handy Permana and T. K. A. Rahman, "Improved Feature Extraction for Sound Recognition Using Combined Constant-Q Transform (CQT) and Mel Spectrogram for CNN Input," 2023 International Conference on Modeling & E-Information Research, Artificial Learning and Digital Applications (ICMERALDA), Karawang, Indonesia, 2023, pp. 185–190, doi: 10.1109/icmeralda60125.2023.10458162.
[16] Y. Khasgiwala and J. Tailor, "Vision Transformer for Music Genre Classification using Mel-frequency Cepstrum Coefficient," 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), Kuala Lumpur, Malaysia, 2021, pp. 1–5, doi: 10.1109/gucon50781.2021.9573568.
[17] S.-H. Cho, Y. Park, and J. Lee, "Effective Music Genre Classification using Late Fusion Convolutional Neural Network with Multiple Spectral Features," 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Yeosu, Korea, Republic of, 2022, pp. 1–4, doi: 10.1109/icce-asia57006.2022.9954732.
[18] G. Ulutas, G. Tahaoglu, and B. Ustubioglu, "Forge Audio Detection Using Keypoint Features on Mel Spectrograms," 2022 45th International Conference on Telecommunications and Signal Processing (TSP), Prague, Czech Republic, 2022, pp. 413–416, doi: 10.1109/tsp55681.2022.9851327.
[19] W. B. Zulfikar, Y. A. Gerhana, A. Y. P. Almi, D. S. Maylawati, and M. I. A. Amin, "Mood of Song Detection Using Mel Frequency Cepstral Coefficient and Convolutional Neural Network with Tuning Hyperparameter," 2023 11th International Conference on Cyber and IT Service Management (CITSM), Makassar, Indonesia, 2023, pp. 1–6, doi: 10.1109/citsm60085.2023.10455644.
[20] K. Wang, C. Qian, and L. Zhang, "Machine learning music emotion recognition based on audio features," 2023 IEEE 6th International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 2023, pp. 215–220, doi: 10.1109/iciscae59047.2023.10392981.
[21] W. Wang, "CNN based music emotion recognition," 2021 2nd International Conference on Artificial Intelligence and Computer Engineering (ICAICE), Hangzhou, China, 2021, pp. 190–195, doi: 10.1109/icaice54393.2021.00044.
[22] M. Melinda et al., "Design and Implementation of Mobile Application for CNN-Based EEG Identification of Autism Spectrum Disorder," International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 1, pp. 57–64, Feb. 2024, doi: 10.18517/ijaseit.14.1.19676.
[23] R. Haque et al., "Classification Techniques Using Machine Learning for Graduate Student Employability Predictions," International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 1, pp. 45–56, Feb. 2024, doi: 10.18517/ijaseit.14.1.19549.
[24] S. Khade, S. Gite, S. D. Thepade, B. Pradhan, and A. Alamri, "Detection of Iris Presentation Attacks Using Hybridization of Discrete Cosine Transform and Haar Transform with Machine Learning Classifiers and Ensembles," IEEE Access, vol. 9, pp. 169231–169249, 2021, doi: 10.1109/access.2021.3138455.
[25] M. H. Baffa, M. A. Miyim, and A. S. D. Dauda, "Machine learning for predicting students' employability," UMYU Sci., vol. 2, no. 1, 2023, doi: 10.56919/usci.2123_001.
[26] L. S. Hugo, "A comparison of machine learning models predicting student employment," J. of Chemical Information and Modeling, vol. 53, no. 9, 2018. [Online]. Available: http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1544127100472053.
[27] S. Islam, T. Akter, S. Zakir, S. Sabreen, and M. I. Hossain, "Autism Spectrum Disorder Detection in Toddlers for Early Diagnosis Using Machine Learning," 2020 IEEE Asia-Pacific Conf. Comput. Sci. Data Eng. (CSDE), 2020, doi: 10.1109/csde50874.2020.9411531.
[28] M. A. Siddiqi and W. Pak, "An Agile Approach to Identify Single and Hybrid Normalization for Enhancing Machine Learning-Based Network Intrusion Detection," IEEE Access, vol. 9, pp. 137494–137513, 2021, doi: 10.1109/access.2021.3118361.
[29] T. Le Minh, L. Van Tran, and S. V. T. Dao, "A Feature Selection Approach for Fall Detection Using Various Machine Learning Classifiers," IEEE Access, vol. 9, pp. 115895–115908, 2021, doi: 10.1109/access.2021.3105581.
[30] B. Wang and J. Zhang, "Logistic Regression Analysis for LncRNA-Disease Association Prediction Based on Random Forest and Clinical Stage Data," IEEE Access, vol. 8, pp. 35004–35017, 2020, doi: 10.1109/access.2020.2974624.
[31] A. Lucas, A. T. Williams, and P. Cabrales, "Prediction of Recovery from Severe Hemorrhagic Shock Using Logistic Regression," IEEE J. Transl. Eng. Heal. Med., vol. 7, pp. 1–9, Jun. 2019, doi: 10.1109/jtehm.2019.2924011.
[32] Z. Zhang and Y. Han, "Detection of Ovarian Tumors in Obstetric Ultrasound Imaging Using Logistic Regression Classifier with an Advanced Machine Learning Approach," IEEE Access, vol. 8, pp. 44999–45008, 2020, doi: 10.1109/access.2020.2977962.
[33] J. C. Nwadiuto, S. Yoshino, H. Okuda, and T. Suzuki, "Variable Selection and Modeling of Drivers' Decision in Overtaking Behavior Based on Logistic Regression Model with Gazing Information," IEEE Access, vol. 9, pp. 127672–127684, 2021, doi: 10.1109/access.2021.3111753.
[34] J. Xu, Y. Zhang, and D. Miao, "Three-way confusion matrix for classification: A measure driven view," Inf. Sci. (Ny), vol. 507, pp. 772–794, 2020, doi: 10.1016/j.ins.2019.06.064.
[35] R. Susetyoko et al., "An Improved Accuracy of Multiclass Random Forest Classifier With Continuous Attribute Transformation Using Random Percentile Generation," International Journal on Advanced Science, Engineering and Information Technology, vol. 13, no. 3, pp. 943–5, Jun. 2023, doi: 10.18517/ijaseit.13.3.18379.
[36] R. Susetyoko, W. Yuwono, E. Purwantini, and B. N. Iman, "Characteristics of Accuracy Function on Multiclass Classification Based on Best, Average, and Worst (BAW) Subset of Random Forest Model," pp. 410–417, 2022, doi: 10.1109/ies55876.2022.9888374.
[37] M. A. Ganaie, M. Tanveer, P. N. Suganthan, and V. Snasel, "Oblique and rotation double random forest," Neural Networks, vol. 153, pp. 496–517, 2022, doi: 10.1016/j.neunet.2022.06.012.
[38] M. Gencturk, A. Anil Sinaci, and N. K. Cicekli, "BOFRF: A Novel Boosting-based Federated Random Forest Algorithm on Horizontally Partitioned Data," IEEE Access, vol. 10, pp. 89835–89851, Aug. 2022, doi: 10.1109/access.2022.3202008.
[39] C. Zou et al., "Heartbeat Classification by Random Forest With a Novel Context Feature: A Segment Label," IEEE J. Transl. Eng. Heal. Med., vol. 10, Aug. 2022, doi: 10.1109/jtehm.2022.3202749.
[40] D. A. Anggoro and N. A. Afdallah, "Grid Search CV Implementation in Random Forest Algorithm to Improve Accuracy of Breast Cancer Data," International Journal on Advanced Science, Engineering and Information Technology, vol. 12, no. 2, p. 515, Apr. 2022, doi: 10.18517/ijaseit.12.2.15487.
[41] E. Ileberi, Y. Sun, and Z. Wang, "Performance Evaluation of Machine Learning Methods for Credit Card Fraud Detection Using SMOTE and AdaBoost," IEEE Access, vol. 9, pp. 165286–165294, 2021, doi: 10.1109/access.2021.3134330.