External Report
Submitted
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE and ENGINEERING
By
K. Supraja 211FA04381
K. Thanmai Ganga Bhavani 211FA04385
May, 2025
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CERTIFICATE
This is to certify that the project report entitled “Audio Classification: Identification of
Single and Multiple Sounds in an Audio Clip” has been submitted by K. Supraja (211FA04381) and K. Thanmai Ganga Bhavani (211FA04385) in partial fulfillment of
the requirements for the Major Project course, as part of the academic curriculum of the
B.Tech. CSE Program, Department of Computer Science and Engineering (CSE) at
VFSTR Deemed to be University.
External Examiner
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
DECLARATION
We hereby declare that the project work entitled "Audio Classification: Identification of Single
and Multiple Sounds in an Audio Clip" submitted in partial fulfillment of the requirements for
the award of the degree of Bachelor of Technology (B.Tech) in Computer Science and
Engineering at VFSTR Deemed to be University is a record of our original work.
This project has been carried out under the supervision of the Department of Computer Science
and Engineering, VFSTR Deemed to be University. The work embodied in this thesis has not
been submitted previously, in part or full, to any other University or Institution for the award of
any degree or diploma.
We have duly acknowledged all sources of information and data used in the preparation of this
project report and shall abide by the principles of academic integrity and ethical guidelines.
By
K. Supraja (211FA04381)
K. Thanmai Ganga Bhavani (211FA04385)
Date:
ACKNOWLEDGEMENT
We take this opportunity to express our deep sense of gratitude to our Project Guide, Dr. G.
Balu Narasimha Rao for granting us permission to undertake this project and for his
unwavering support, valuable guidance, and constant encouragement throughout the duration
of our work.
It is our privilege to extend our heartfelt thanks to Dr. S. V. Phani Kumar, Head of the
Department, Computer Science and Engineering, VFSTR Deemed to be University, for
providing us the opportunity and necessary resources to carry out this project work. We would
also like to express our profound gratitude to Dr. K. V. Krishna Kishore, Dean, School of
Computing and Informatics, VFSTR Deemed to be University, for his encouragement and
for facilitating an environment conducive to research and innovation.
We extend our sincere appreciation to all the faculty members, programmers, and technical
staff of the Department of Computer Science and Engineering for their valuable support,
knowledge sharing, and assistance throughout our academic journey.
Finally, we are deeply thankful to our family members for their unconditional love, constant
support, and encouragement, which were crucial in sustaining our efforts and successfully
completing this project.
K. Supraja (211FA04381)
K. Thanmai Ganga Bhavani (211FA04385)
ABSTRACT
Classifying multiple sounds in continuous audio streams is a complex and critical task,
especially in real-world scenarios where different audio sources such as background
conversations, music, environmental noise, and speech can occur simultaneously. Traditional
audio classification methods often struggle in such conditions due to the complexity and
variation in the sound patterns. This project presents a deep learning-based approach to
effectively identify and categorize individual sound types within a mixed
audio recording. To process and analyze audio data, raw signals are first converted into
informative visual representations such as Mel-frequency Cepstral Coefficients (MFCCs) and
spectrograms, which capture crucial time-frequency characteristics. These representations are
then fed into Convolutional Neural Networks (CNNs), which can automatically learn complex
sound patterns and spatial information. CNNs excel at extracting hierarchical features, which
makes them well suited to tasks that require classifying continuous and overlapping sounds.
There are many real-world uses for this approach. In voice-controlled
smart homes, it can help detect unusual sounds like glass breaking or smoke alarms. In online
examination systems, it can be used to monitor background noise for potential violations such
as whispering or unauthorized conversations. Other applications include enhancing the
robustness of speech recognition systems, improving multimedia content indexing, and
supporting surveillance or security systems by identifying critical sound events in real time.
TABLE OF CONTENTS
1. Introduction 1
1.1 What is Audio Classification and what causes the need for it? 3
1.2 What are the consequences of inaccurate Audio Classification? 4
1.3 Why is identifying multiple continuous sounds important? 5
1.4 Current Methodologies 5
1.5 Applications of Deep Learning in Audio Classification 6
2. Literature Survey 8
2.1 Literature review 9
2.2 Motivation 13
3. Proposed System 14
3.1 Input dataset 15
3.2 Data Pre-processing 15
3.2.1 Resampling 16
3.2.2 Noise Reduction/Silence Removal 17
3.2.3 Feature Extraction 17
3.3 Model Training 20
3.3.1 Splitting Data 20
3.4 Methodology of the system 20
3.5 Model Evaluation 22
3.5.1 Model Summary 23
3.6 Constraints 24
3.7 Cost and Sustainability Impact 26
4. Implementation 28
4.1 Environment Setup 29
4.2 Sample code 29
5. Experimentation and Result Analysis 33
5.1 Results 34
6. Conclusion 38
7. References 40
LIST OF FIGURES
LIST OF TABLES
CHAPTER-1
INTRODUCTION
1. INTRODUCTION
1.1 What is Audio Classification and what causes the need for it?
Audio classification is the process of analyzing and categorizing different types of sounds
into specific classes such as speech, music, or everyday environmental sounds like dog
barking, horns, alarms, or children playing. This process typically begins by capturing raw
audio signals and converting them into useful data using techniques like Mel-Frequency
Cepstral Coefficients (MFCCs), spectrograms, or chroma features. Once features are
extracted, machine learning or deep learning models are used to identify and classify the
type of sound.
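As a brief illustration (a minimal sketch; the file name and parameter values are placeholders), the librosa library can derive such representations from a raw signal:

import librosa
import numpy as np

# Load an audio file (hypothetical path); librosa resamples to 22,050 Hz by default
audio, sr = librosa.load("example.wav")

# Mel spectrogram: a time-frequency representation of the signal
mel_spec = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)

# MFCCs: a compact summary of the spectral envelope
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)

# Chroma features: energy distribution over the 12 pitch classes
chroma = librosa.feature.chroma_stft(y=audio, sr=sr)

print(mel_spec.shape, mfccs.shape, chroma.shape)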
The need for audio classification has grown rapidly due to the increasing use of smart
technologies in our daily lives. Devices like smartphones, smart home systems,
surveillance cameras, and autonomous vehicles rely on sound to better understand and
respond to their surroundings. For example, a voice assistant must recognize whether
you're giving a command or playing music, while a surveillance system should be able to
detect suspicious sounds like a scream or a breaking window.
Audio classification is also crucial in areas like health monitoring (e.g., detecting coughs
in patients), wildlife conservation (monitoring animal calls), and public safety (identifying
alarms or sirens). Additionally, in entertainment and media, sound classification helps in
tagging content, enabling better content discovery and recommendation. In industrial
environments, it aids in machinery fault detection by identifying abnormal sounds. As
digital environments become more interactive, there's a growing demand for machines that
can "hear" and react accurately to various sounds in real time. This helps automate
processes, improve user experiences, and enable smarter, more responsive systems across
industries.
contextual understanding. This not only improves system accuracy but also opens the door
to innovations in areas like immersive gaming, virtual reality, real-time transcription
services, and assistive technologies for people with disabilities. As we move towards
smarter, more perceptive environments, the role of accurate, efficient, and real-time audio
classification will continue to grow in importance, driving forward the development of
truly intelligent systems.
Furthermore, in multimedia retrieval systems, wrongly labeled audio files can reduce the
efficiency and accuracy of content indexing and searching. Such inaccuracies also impact
training data for machine learning models, leading to further degradation of performance
over time. In business applications like customer service or call center monitoring,
misclassified audio inputs can affect analytics and customer satisfaction. Therefore,
maintaining high accuracy in audio classification is critical not only for operational
efficiency but also for user trust, system dependability, and real-world impact where sound
plays a vital role in communication and automation.
It is evident that inaccurate audio classification can have a profound ripple effect across
various sectors. In addition to the already highlighted domains, industries such as
automotive (e.g., driver-assist systems that rely on detecting sirens or honks), education
(e.g., transcription tools or lecture capture systems), and entertainment (e.g., automatic
tagging or sound mixing) also suffer when sound is not accurately interpreted.
Misclassification can lead to flawed decision-making, loss of valuable insights, and even
legal liabilities in cases where audio evidence plays a role. Furthermore, in environments
involving human-computer interaction, poor audio recognition diminishes accessibility for
individuals relying on voice-based systems, such as those with visual impairments.
While traditional methods like SVM and Random Forest are still in use, deep learning
models offer superior performance, especially with large datasets. Techniques like data
augmentation and ensemble learning further improve accuracy in noisy or real-world
environments. Despite progress, challenges remain in detecting low-volume or overlapping
sounds accurately, particularly in dynamic, noisy settings. Continued research in model
architectures and multimodal integration is helping to close these gaps.
In the field of smart surveillance, deep learning models are deployed to detect and classify
potentially dangerous sounds like gunshots, breaking glass, or distress calls. These systems
are widely used in urban public safety infrastructures. Technologies like ShotSpotter use
real-time audio classification to notify law enforcement of gunfire incidents with high
precision and low latency, helping authorities respond faster and more accurately.
Another critical area is environmental and noise monitoring, where urban planners and
local governments use deep learning to classify and track sound pollution sources.
Initiatives like the Sounds of New York City (SONYC) project use a network of sensors
and CNN-based models to analyze millions of audio clips, detecting sound types such as
jackhammers, honking, or sirens. These insights help authorities make informed decisions
about zoning, traffic regulation, and community well-being.
Deep learning also plays a significant role in the entertainment and music industry.
Music genre detection systems powered by CNNs and spectrogram analysis can identify
not only the genre but also instruments, tempo, and mood of a song. This automatic tagging
improves the effectiveness of recommendation engines on platforms like Spotify,
YouTube Music, and SoundCloud, personalizing user experience based on listening
history and preferences.
CHAPTER-2
LITERATURE SURVEY
2. LITERATURE SURVEY
Momynkulov et al. [2] outlined a strategy driven by deep learning utilizing a CNN-RNN
model for detecting and classifying dangerous urban sounds. Their study employed the
ESC-50 dataset, selecting approximately 300 audio samples, including gunshots,
explosions, sirens, and cries. The results indicate the effectiveness of deep learning in real-
time security applications, enabling automated sound-based surveillance in public and
confined spaces. This method enhances law enforcement and emergency response by
efficiently identifying critical auditory events.
Garcia et al. [3] introduced a fall detection system based on machine learning, utilizing the
SAFE dataset, which includes 950 audio recordings, with 475 representing simulated fall
incidents. The study explored decision trees, Gaussian Naive Bayes, and deep learning
models incorporating spectrogram analysis for classification. Findings confirmed the
efficiency of audio-based fall detection, with deep learning models exhibiting the highest
performance. This approach aims to enhance elderly care by enabling real-time monitoring
of falls in residential and healthcare environments.
vessel detection, bioacoustics monitoring, and environmental assessments.
Brunese et al. [5] introduced a deep learning-based method for detecting heart disease by
analyzing cardiac sounds from the Classifying Heart Sounds Challenge dataset. Their
approach transforms audio signals into numerical features such as MFCCs and utilizes a deep
neural network for classification tasks. The results demonstrate that deep learning models
are highly effective in distinguishing between healthy individuals and those with heart
conditions, surpassing the performance of conventional techniques. The study also
suggests that mobile applications powered by deep learning could facilitate early diagnosis
and remote health monitoring, potentially minimizing the need for frequent hospital visits.
Kho et al. [6] developed a deep learning framework using the Mini VGG Net model for
COVID-19 detection through cough sound analysis. The study utilized datasets from the
University of Cambridge, Coswara Project, and NIH Malaysia. It incorporated cough
segmentation techniques and examined data augmentation effects, finding that
segmentation improved model performance, while augmentation had no significant
impact.
Arafath et al. [8] developed a deep learning approach for breath sound detection using
speech recordings and thermal video data. Their method incorporated self-supervised
learning and CNN-BiLSTM models to enhance performance. The study suggests applications
in medical diagnostics, biometric authentication, and respiratory health monitoring.
Kim et al. [9] conducted milling experiments using varied machining parameters and
labeled chatter events using expert knowledge. They applied a CNN with an attention
block combining AlexNet outputs and cutting parameters. The model achieved 94.51%
accuracy in OOD testing, outperforming the baseline CNN model’s 88.66% accuracy.
The research carried out by Sophia et al. [10] explored how the use of audio elements like
sound effects and music enhances robotic storytelling in a range of genres. Four online
studies were conducted using stories from horror, detective, romance, and comedy genres.
The findings revealed that incorporating sound and music enhanced the enjoyment of
romantic stories and generally contributed to reduced fatigue across all genres.
Tuomas Virtanen et al. [11] used the TUT Urban Acoustic Scenes 2018 dataset to classify
short audio samples into ten acoustic scenes using a CNN-based baseline. The model
achieved 59.7% on the development set and up to 61% on evaluation, with reduced
accuracy on varied recording devices.
Yizhar Lavner et al. [12] conducted a study using a dataset comprising audio recordings
of infants aged 0 to 6 months in home settings. Their approach applied two machine
learning techniques for the automatic recognition of baby cries: a lightweight logistic regression
model and a convolutional neural network (CNN). The findings indicate that the CNN
significantly outperforms the logistic regression classifier in terms of detection accuracy.
Three publicly accessible ambient audio benchmark datasets are used in Sheryl Brahnam
et al.’s [13] study: (1) bird calls, (2) cat sounds, and (3) the Environmental Sound Classification
(ESC-50) database. Five pre-trained convolutional neural networks (CNNs) are retrained
using ensembles of classifiers that employ four signal representations and six data
augmentation approaches. The findings demonstrate that the best ensembles perform better
than, or comparably to, the best approaches described in the literature on several datasets,
including the ESC-50 dataset.
Piczak et al. [14] explored the application of convolutional neural networks (CNNs) for
the classification of brief audio segments featuring environmental sounds. Their method
employed a deep learning structure consisting of two convolutional layers, followed by
max-pooling operations and two fully connected layers. The model was trained using low-
level audio inputs, particularly segmented spectrograms enhanced with delta features. To
assess the effectiveness of their approach, the system was evaluated using three publicly
available datasets containing samples of environmental and urban audio.
Cakir et al. [15] proposed a deep learning-based method employing multi-label neural
networks to detect simultaneously occurring sound events in real-world acoustic scenes.
Their study utilized a diverse dataset consisting of over 1100 minutes of audio captured
from 10 different everyday environments, comprising 61 distinct sound event categories.
The approach incorporated log Mel-band energy features and a median filtering technique
during post-processing. The model demonstrated an accuracy of 63.8%, achieving a 19%
improvement over conventional baseline systems.
Forrest Briggs et al. [16] introduced a technique to detect multiple bird species vocalizing
simultaneously by utilizing a multi-instance multi-label (MIML) classification framework.
The model was evaluated on 548 ten-second audio samples recorded in natural forest
settings. By applying a specialized segmentation method for feature extraction, their
approach achieved a 96.1% accuracy rate, effectively handling noisy conditions with
overlapping bird calls.
The tutorial by Mesaros et al. [17] explores sound event detection using deep learning
methods like CRNNs with log mel spectrogram features. It discusses datasets such as
Audio Set and URBAN-SED, and techniques like data augmentation, transfer learning,
and weak/strong labeling. Evaluation relies on metrics such as precision, recall, the F-score, and PSDS
for accuracy.
Zaman et al. [18] conducted a review of deep learning models for audio classification.
Their work examines Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), autoencoders, transformers, and hybrid architectures. They discuss
audio datasets like ESC-50 and UrbanSound8k, and analyze the application of various
deep learning architectures to audio classification tasks.
In their work, Qamhan et al. [19] utilized the KSU-DB corpus, consisting of 3600
recordings from 3 environments and 4 recording devices, to build an acoustic source
identification system. This system, based on a hybrid CNN-LSTM model, was designed
to classify recording devices and environments. The study demonstrated that
voiced/unvoiced speech segments are effective for this classification task, with the system
reaching 98% accuracy for environment and 98.57% accuracy for microphone
classification.
Inoue et al. [20] applied anomaly detection methods that use classification confidence to
the DCASE 2020 Task 2 Challenge. Their systems ensemble two classification-based
detectors. They trained classifiers to classify sounds by machine type and ID, and to
classify transformed sounds by data-augmentation type. The dataset used was that of the
DCASE 2020 Task 2 Challenge.
2.2 Motivation
The motivation for audio classification, particularly the identification of single and
multiple sounds within an audio clip, stems from the growing demand for intelligent
systems that can understand and interact with complex acoustic environments. In today’s
world, audio-based data is ubiquitous, ranging from urban soundscapes and natural
environments to speech and industrial settings. Accurately identifying sounds from such
diverse sources has significant implications in fields like surveillance, healthcare
monitoring, smart homes, autonomous vehicles, and human-computer interaction.
Traditional audio classification methods often struggle when multiple sounds occur
simultaneously, as overlapping acoustic signals can obscure or distort key features. This
limitation hinders their effectiveness in dynamic, real-world environments. Therefore,
there is a strong need for advanced approaches that can not only recognize isolated sounds
but also disentangle and correctly label multiple concurrent sound sources. By enabling
systems to perceive and differentiate between overlapping audio events, we move closer
to creating machines that can interpret sound as effectively as humans—understanding
context, reacting appropriately, and making intelligent decisions based on the auditory
scene. This capability is crucial for building responsive AI systems that operate reliably in
uncontrolled, noisy, and complex environments.
CHAPTER-3
PROPOSED SYSTEM
3. PROPOSED SYSTEM
The entire process begins by segmenting the audio input into smaller portions. Each
segment is then fed into the CNN model, which has been pre-trained and stored in a file
named model.h5. This model operates in a multi-class classification mode, meaning it can
identify and distinguish between several sound classes simultaneously. For every segment,
the model outputs a probability distribution across the predefined sound classes.
The class with the highest probability is selected using the argmax function, effectively
assigning the most likely sound label to that segment. This predicted numerical label is
then decoded into its corresponding sound category using a LabelEncoder that maps
numeric classes back to their original names. The final output is a list of predicted sound
labels, one for each segment, providing a detailed breakdown of all the different types of
sounds detected throughout the audio clip. This method ensures accurate identification
even in audio environments with complex, overlapping auditory signals.
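A minimal sketch of this segment-wise prediction loop is shown below. It follows the description above (model.h5, 40 MFCCs, argmax, and a LabelEncoder for decoding), but the fixed five-second non-overlapping segmentation and the variable names are illustrative assumptions, and the fitted LabelEncoder from training is assumed to be available.

import numpy as np
import librosa
from tensorflow.keras.models import load_model

model = load_model("model.h5")  # pre-trained CNN described above

def predict_segments(file_path, label_encoder, segment_seconds=5):
    audio, sr = librosa.load(file_path, res_type='kaiser_fast')
    seg_len = segment_seconds * sr
    predictions = []
    for start in range(0, len(audio), seg_len):
        segment = audio[start:start + seg_len]
        if len(segment) == 0:
            continue
        # Same 40-coefficient MFCC summary used during training
        mfccs = np.mean(librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=40).T, axis=0)
        probs = model.predict(mfccs.reshape(1, 40, 1), verbose=0)
        class_id = np.argmax(probs, axis=1)  # most likely class for this segment
        predictions.append(label_encoder.inverse_transform(class_id)[0])
    return predictions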
Fig. 3.1. CNN Architecture
3.2.1 Resampling
Resampling is the process of altering the sampling rate of an audio signal, which refers to
the number of audio samples captured per second and is measured in Hertz (Hz). For
instance, audio recorded at 44,100 Hz (commonly used in CDs) contains 44,100 samples
per second. However, in many machine learning and deep learning applications, such a
high sampling rate may not be necessary and can lead to increased computational costs.
Resampling helps address this by reducing or standardizing the sampling rate (e.g.,
downsampling to 16,000 Hz or 8,000 Hz), making the audio data more manageable and
consistent across datasets. It also ensures compatibility with pre-trained models or feature
extraction methods that expect input at a specific rate. Additionally, resampling can
enhance the generalization of models by eliminating redundant frequency information that
is not critical for the classification task.
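For illustration, librosa supports both forms of resampling (the rates and the file name below are placeholders):

import librosa

# Resample to 16,000 Hz directly at load time
audio_16k, sr = librosa.load("example.wav", sr=16000)

# Or resample an already-loaded signal explicitly, e.g. from 44,100 Hz down to 8,000 Hz
audio_44k, _ = librosa.load("example.wav", sr=44100)
audio_8k = librosa.resample(audio_44k, orig_sr=44100, target_sr=8000)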
1. Pre-emphasis
Initially, the raw audio signal undergoes pre-emphasis filtering, which amplifies high-frequency
components to offset the natural attenuation introduced by the vocal tract or environmental
noise. The filtered output signal y[n] is defined by y[n] = x[n] − α·x[n−1], where x[n] is the
input signal and α is the pre-emphasis coefficient (typically between 0.95 and 0.97).
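A one-line NumPy sketch of this filter, assuming the commonly used coefficient α = 0.97, is:

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])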
The output of each filter represents the energy in that Mel band.
5. Logarithmic Compression
To mimic the human ear’s logarithmic response to loudness, the energy within each Mel
frequency band is transformed using a logarithmic scale.
6. Discrete Cosine Transform
A Discrete Cosine Transform (DCT) is then applied to the log Mel-band energies to decorrelate
them and produce the cepstral coefficients, where M is the number of Mel filters and L is the
number of MFCCs retained (typically 12–13).
7. Post-Processing
In practical scenarios, temporal patterns can be captured by computing the first and
second-order derivatives, known as delta and delta-delta coefficients. In this research, 40
MFCCs were derived from each audio file, and the mean of each coefficient across all time
frames was computed to form a fixed-size feature vector. This vector was then utilized as
input to train a deep learning model for classifying urban sounds.
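Under the settings described (40 MFCCs averaged across time frames, with optional delta and delta-delta features), this step can be sketched as follows; the file name is a placeholder:

import numpy as np
import librosa

audio, sr = librosa.load("example.wav", res_type='kaiser_fast')

mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)  # shape: (40, n_frames)
delta = librosa.feature.delta(mfccs)                     # first-order derivatives
delta2 = librosa.feature.delta(mfccs, order=2)           # second-order derivatives

# Fixed-size feature vector: mean of each coefficient across all time frames
feature_vector = np.mean(mfccs, axis=1)                  # shape: (40,)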
3.3 Model Training
The model was trained over 150 epochs using a batch size of 32. This batch size indicates
how many data samples are handled before the model adjusts its internal weights. The
training ran in verbose mode, meaning progress details such as loss and accuracy were
shown after every epoch.
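A minimal sketch of the corresponding training call is given below; it assumes the compiled model and the data splits from Chapter 4 (note that the sample listing there uses 50 epochs, whereas the configuration described here uses 150):

# 150 epochs, batch size 32, progress printed after every epoch (verbose mode)
history = model.fit(X_train, y_train,
                    batch_size=32,
                    epochs=150,
                    verbose=1,
                    validation_data=(X_test, y_test))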
The figure below illustrates the complete pipeline for audio classification using a
Convolutional Neural Network (CNN) model. The process begins with Data Collection,
which involves gathering a variety of audio recordings from relevant sources. These
recordings may include environmental sounds, speech, noise, or other acoustic events
depending on the objectives of the classification task.
Fig. 3.4.1 Methodology of the system
Once collected, the audio data undergoes Pre-processing, a crucial step aimed at
improving the quality of the input. Pre-processing typically involves resampling to a
standard frequency, normalizing audio levels, removing silence or background noise, and
trimming unnecessary sections. The cleaned audio signals are then transformed into
Spectrograms, which are time-frequency visual representations of the audio. This visual
form helps capture both spectral and temporal patterns, making it more suitable for deep
learning models like CNNs.
Following this, Feature Extraction is performed. This step involves deriving meaningful
numerical representations from the spectrograms that can effectively characterize the
audio signal. One of the most commonly used features is the Mel Frequency Cepstral
Coefficients (MFCC), which compresses the audio signal into a lower-dimensional space
while preserving essential information relevant to human auditory perception. These
extracted features are saved in a structured format—typically as a CSV (Comma-
Separated Values) file for further processing.
The CSV file is then loaded into a Data Frame, a tabular structure used for organizing
and manipulating data efficiently using libraries such as Pandas in Python. At this stage,
the dataset is divided into training and testing subsets using Train & Test Split, generally
in an 80:20 ratio. This ensures that the CNN model is trained on a majority of the data
while being tested on unseen data to evaluate its generalization performance.
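A sketch of this stage is shown below; the CSV file name and the column layout (one column per MFCC plus a class column) are illustrative assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the extracted features into a DataFrame
df = pd.read_csv("features.csv")

X = df.drop(columns=["class"]).values  # one row of MFCC features per audio file
X = X[..., None]                       # add a channel axis: (samples, 40, 1) for the CNN
y = df["class"].values                 # corresponding sound labels

# 80:20 split between training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)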
Once the data is split, it is passed into the CNN Model for training. The model learns to
identify patterns and correlations between the features and the corresponding sound labels.
After training, the CNN is used for Sound Prediction, where it classifies new, unseen
audio segments into predefined sound categories based on the learned features.
This end-to-end pipeline from raw audio input to final prediction demonstrates an effective
methodology for developing a robust audio classification system that can handle both
single and multiple sound events in real-world applications.
generalization capability, suggesting that the model has effectively learned meaningful
patterns from the training data and can apply them to new, unseen audio inputs with a fair
level of confidence.
2. Second Convolution Block:
• conv1d_1: Another convolutional layer with 64 filters, resulting in (None, 17, 64);
6,208 parameters.
• activation_1, max_pooling1d_1, and dropout_1: Again apply non-linearity,
pooling, and regularization. Pooling reduces shape to (None, 8, 64).
3. Third Convolution Block:
• conv1d_2: Final convolution layer with 128 filters, resulting in (None, 6, 128);
24,704 parameters.
• activation_2, max_pooling1d_2, dropout_2: Shape reduced to (None, 3, 128) after
pooling.
• flatten: Converts 3D tensor to 1D of shape (None, 384) for fully connected layers.
• dense: Fully connected layer with 100 units; 38,500 parameters.
• activation_3 and dropout_3: Apply non-linearity and regularization.
• dense_1: Final dense layer with 10 output classes; 1,010 parameters.
• activation_4: Applies softmax (or another final activation) to output class
probabilities.
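A Keras sketch consistent with the layer shapes and parameter counts listed above is given below. The first convolution block is not shown in the excerpt and is assumed here to use 32 filters on a (40, 1) input; the softmax output follows the summary's description, whereas the sample listing in Chapter 4 uses a sigmoid output.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout, Activation

model = Sequential([
    Conv1D(32, kernel_size=3, input_shape=(40, 1)),  # (None, 38, 32)
    Activation('relu'),
    MaxPooling1D(pool_size=2),                       # (None, 19, 32)
    Dropout(0.5),
    Conv1D(64, kernel_size=3),                       # (None, 17, 64); 6,208 parameters
    Activation('relu'),
    MaxPooling1D(pool_size=2),                       # (None, 8, 64)
    Dropout(0.5),
    Conv1D(128, kernel_size=3),                      # (None, 6, 128); 24,704 parameters
    Activation('relu'),
    MaxPooling1D(pool_size=2),                       # (None, 3, 128)
    Dropout(0.5),
    Flatten(),                                       # (None, 384)
    Dense(100),                                      # 38,500 parameters
    Activation('relu'),
    Dropout(0.5),
    Dense(10),                                       # 1,010 parameters
    Activation('softmax'),
])
model.summary()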
3.6 Constraints
1. Continuous Sound Events:
One of the most significant challenges is the presence of continuous sound events within
the same audio clip. When multiple sounds occur continuously (e.g., a dog barking while a
vehicle passes by), it becomes difficult for models to isolate and correctly classify each
sound source. Traditional classification models typically assume a single dominant sound,
which limits their effectiveness in multi-label scenarios.
depending on the surface, speed, and footwear. Such variability introduces intra-class
variation that complicates model learning and generalization.
3.7 Cost and Sustainability Impact
1. Computational Cost:
Training deep learning models such as Convolutional Neural Networks (CNNs) for audio
classification requires significant computational resources, particularly when dealing with
large datasets or high-resolution spectrograms. These costs are often associated with the
use of Graphics Processing Units (GPUs) or cloud-based services (e.g., AWS, Google
Cloud), which can incur substantial financial charges depending on training duration,
hardware configurations, and storage needs. Additionally, real-time or continuous audio
classification systems, such as those deployed in smart surveillance or monitoring systems,
demand ongoing inference capabilities, further increasing operational expenses.
resource expenditures, and enhance safety in environments such as factories, hospitals, and
wildlife monitoring zones. Moreover, early detection of critical audio events (e.g., alarms,
glass breaking, or screams) can lead to faster response times, potentially preventing
damage or saving lives, justifying the investment.
CHAPTER- 4
IMPLEMENTATION
4. IMPLEMENTATION
4.1 Environment Setup
To run the script successfully, a proper environment setup is essential. The script
is compatible with Python version 3.8 or higher. It requires several key libraries, including
NumPy, Librosa, TensorFlow, and Scikit-learn, which should be installed using pip.
Utilizing a virtual environment is recommended to manage dependencies cleanly and
avoid version conflicts with other projects. In terms of hardware, a system with a minimum
of 8 GB RAM is necessary, though 16 GB is ideal for handling large audio datasets. While
the script can be executed on a CPU, using a GPU is highly beneficial for accelerating the
training process, especially when working with deep learning models.
The dataset structure is critical for successful execution. Audio files must be organized
into subdirectories within a main dataset folder. Each subdirectory should represent a
unique class label and contain audio samples specific to that class. For example, a folder
named “dog” could include barking sounds, while another labeled “car” might contain
engine or horn sounds. This format enables the script to automatically read and label the
data correctly, facilitating efficient feature extraction, model training, and classification.
Proper setup ensures accurate audio classification and smooth workflow execution.
4.2 Feature Extraction from Audio Files
import numpy as np
import librosa

def extract_features(file_path):
    # Load the audio file and summarize it as the mean of 40 MFCCs over time
    audio, sample_rate = librosa.load(file_path, res_type='kaiser_fast')
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    return np.mean(mfccs.T, axis=0)
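The intermediate listings that build the feature matrix X and the encoded labels y_encoded used by the split below are not reproduced in this excerpt. A minimal sketch under the folder-per-class dataset layout described in the environment setup (the dataset root name is a placeholder) is:

import os
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

features, labels = [], []
dataset_path = "dataset"  # hypothetical root folder; one subfolder per class
for class_name in os.listdir(dataset_path):
    class_dir = os.path.join(dataset_path, class_name)
    if not os.path.isdir(class_dir):
        continue
    for file_name in os.listdir(class_dir):
        features.append(extract_features(os.path.join(class_dir, file_name)))
        labels.append(class_name)  # folder name serves as the class label

X = np.array(features)[..., None]  # shape (samples, 40, 1) for the Conv1D model
le = LabelEncoder()
y_encoded = to_categorical(le.fit_transform(labels))  # one-hot labels for categorical_crossentropy
num_labels = y_encoded.shape[1]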
# Split data into training and testing sets (80:20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2)
4.5 Building the Neural Network Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout, Activation

input_shape = (40, 1)  # 40 MFCC coefficients treated as a 1D sequence with one channel

model = Sequential()
model.add(Conv1D(64, kernel_size=3, input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.5))
model.add(Conv1D(128, kernel_size=3))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(100))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_labels))
model.add(Activation('sigmoid'))  # per-class scores; softmax is the usual choice for single-label output
4.6 Compiling and Training the Model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=50, validation_data=(X_test, y_test))
CHAPTER- 5
RESULTS AND ANALYSIS
5. RESULTS AND ANALYSIS
5.1 Results:
The process begins by taking the audio file located at the specified file path,
"/content/drive/MyDrive/continue/New folder/playing-dog.wav", which is stored in
Google Drive. This audio file is then fed into the trained model for analysis. After
processing the audio, the model identifies and classifies the sounds present in the file. As
shown in the output, the model successfully detects two distinct sound events: "children
playing" and "dog bark". This indicates that the model is capable of segmenting the audio
and accurately predicting multiple overlapping sound classes, demonstrating its
effectiveness in real-world sound classification tasks.
After training the model, its performance is evaluated on the test dataset to assess its
generalization capabilities. The evaluation is carried out using the evaluate() method, which
returns the loss and accuracy on the held-out data. The predicted classes are then visualized and
compared with the actual classes for each audio segment; this visualization helps to assess the
performance of the classification model.
The output graph for audio classification is generated by first dividing the input audio file
into overlapping segments, each five seconds long. For each segment, Mel-Frequency
Cepstral Coefficients (MFCCs) are extracted to capture key sound features, which are then
fed into a pre-trained deep learning model to predict the class of sound present. These
predicted class labels are collected in sequence, corresponding to each time segment of the
audio, forming the basis for visual comparison against actual labels.
The graph plots the predicted classes alongside the actual classes across all segments, with
each point representing a segment's classification result. This visual comparison helps
quickly identify where the model’s predictions match or differ from the expected outcome.
Matching lines indicate correct classification, while deviations reveal potential errors. The
plot, combined with the printed summary of actual vs. predicted values, provides a clear and
concise way to evaluate the model’s performance and analyze classification trends across the
audio segments.
The visual output, as shown in the figure, helps to quickly assess how well the model’s
predictions align with the true class labels for each audio segment.
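A sketch of such a comparison plot, assuming lists of actual and predicted class labels per segment (the function and variable names are illustrative), is:

import matplotlib.pyplot as plt

def plot_predictions(actual_labels, predicted_labels):
    segments = range(len(predicted_labels))
    plt.figure(figsize=(10, 4))
    plt.plot(segments, actual_labels, 'o-', label='Actual class')
    plt.plot(segments, predicted_labels, 'x--', label='Predicted class')
    plt.xlabel('Audio segment index')
    plt.ylabel('Class label')
    plt.title('Actual vs. predicted classes per segment')
    plt.legend()
    plt.tight_layout()
    plt.show()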
C. ROC Curve for Multi-Class Classification:
The performance of the model is explained using the Receiver Operating Characteristic (ROC)
curve and the Area Under the Curve (AUC).
The audio classification pipeline begins by dividing a continuous audio file into
overlapping segments to ensure that even brief or transitional sounds are captured. Each
segment then undergoes MFCC feature extraction, which transforms raw audio into a compact
numerical representation that
captures the spectral properties of sound. These MFCC features are reshaped and fed into
a trained deep learning model that predicts the probability of each sound class (e.g., dog
bark, siren, car horn). The predicted label for each segment is determined by selecting the
class with the highest probability. This process results in a list of predicted sound classes
for the entire audio file, which can be used for performance evaluation and visualization.
To assess the model’s effectiveness across different sound categories, a multi-class ROC
(Receiver Operating Characteristic) curve is plotted. This involves comparing the model’s
predicted probabilities against the true class labels in a one-vs-rest format by converting
the true labels into binary form for each class. The ROC curve for each class shows how
well the model distinguishes that specific sound from all others by plotting the true positive
rate against the false positive rate. The area under each curve (AUC) quantifies this
separability. The resulting visualization allows for a clear comparison of model performance
across all sound categories and helps identify which classes are classified reliably and which
may need improvement.
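A sketch of this one-vs-rest ROC computation with scikit-learn is shown below; it assumes integer-encoded true labels and the per-class probabilities produced by the model (variable names are illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def plot_multiclass_roc(y_true, y_prob, class_names):
    # Convert the true labels into binary one-vs-rest form for each class
    y_bin = label_binarize(y_true, classes=range(len(class_names)))
    plt.figure(figsize=(8, 6))
    for i, name in enumerate(class_names):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_prob[:, i])
        plt.plot(fpr, tpr, label=f'{name} (AUC = {auc(fpr, tpr):.2f})')
    plt.plot([0, 1], [0, 1], 'k--')  # chance-level reference
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curves for multi-class classification')
    plt.legend(loc='lower right')
    plt.show()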
Fig. 5.1.2. ROC Curve for Multi-Class Classification
The graphical representation in the figure above provides insight into the model’s performance
across various sound categories. The Area Under the Curve (AUC) scores reflect the model’s
ability to differentiate each class from the others, with higher AUC values indicating stronger
separation between classes.
CHAPTER-6
CONCLUSION
6. CONCLUSION
We presented a Convolutional Neural Network (CNN) model in this report that was specifically
designed to classify audio signals into several groups, including sounds from air conditioners,
automobile horns, kids playing, and other background noises. The design of the model includes
a number of layers: convolutional layers for feature extraction, activation layers to introduce
non-linearity, max pooling layers to reduce dimensionality, and dropout layers to avoid
overfitting. Together, these layers gradually create a comprehensive understanding of the audio
input, allowing the model to identify ever more intricate patterns as it progresses through the
layers.
The approach eliminates the requirement for manual feature engineering by automatically
extracting relevant characteristics from the raw audio input. Through the use of convolutional
filters, it detects important patterns that define various sound components. The pooling layers
improve the generalization of the model and reduce the computing cost, while the dropout
layers help to reduce overfitting and thereby increase the model’s ability to perform well on
unseen data. The final classification result is then produced by the fully connected layers at the
model’s output, which combine the learned features.
Performance metrics such as accuracy and the ROC curve were employed to evaluate the
model’s efficacy; the Area Under the Curve (AUC) was used to show how well the model was
able to differentiate between different audio classes. By correctly recognizing and categorizing
audio signals using the learned characteristics, the model showed strong performance. The
architecture is well suited to real-world audio categorization applications because of its
balanced design, which guarantees both strong performance and efficient processing.
In conclusion, this CNN-based model demonstrates its ability to recognize a variety of audio
signals and provides a dependable method for audio classification. Future research might focus
on improving the model’s generalization by experimenting with various hyperparameters,
deepening the network, or using data augmentation techniques. The model’s practical
applications in domains such as audio monitoring, smart home systems, and surveillance
present exciting opportunities for further development and implementation.
REFERENCES
[12] Y. Lavner, R. Rosenhouse, and A. Cohen, “Baby cry detection using CNN and logistic
regression in domestic audio recordings,” IEEE Access, 2012.
[13] S. Brahnam, L. Jain, and T. Lee, “Environmental sound classification using CNN
ensembles and data augmentation,” IEEE Access, 2012.
[14] K. J. Piczak, “Environmental sound classification with convolutional neural networks,”
in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2015.
[15] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event
detection using multi label deep neural networks,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2986–2990.
[16] F. Briggs, B. Lakshminarayanan, L. Neal, X. Z. Fern, R. Raich, S. J. K. Hadley, A. S.
Hadley, and M. G. Betts, “Acoustic classification of multiple simultaneous bird species:
A multi-instance multi-label approach,” J. Acoust. Soc. Am., vol. 131, no. 6, pp. 4640–
4650, 2012.
[17] A. Mesaros, T. Heittola, T. Virtanen, and M. D. Plumbley, “Sound Event Detection:
A Tutorial,” IEEE Signal Processing Magazine, vol. 38, no. 5, pp. 67–83, Sep. 2021, doi:
10.1109/MSP.2021.3090678.
[18] K. Zaman, M. Sah, C. Direkoglu, and M. Unoki, “A Survey of Audio Classification
Using Deep Learning,” IEEE Access, vol. 11, 2023.
[19] M. A. Qamhan, H. Altaheri, A. H. Meftah, G. Muhammad, and Y. A. Alotaibi,
“Digital Audio Forensics: Microphone and Environment Classification Using Deep
Learning,” IEEE Access, vol. 9, 2021, pp. 62719-62738.
[20] T. Inoue, P. Vinayavekhin, S. Morikuni, S. Wang, T. H. Trong, D. Wood, M.
Tatsubori, and R. Tachibana, “Detection of Anomalous Sounds for Machine Condition
Monitoring Using Classification Confidence,” Detection and Classification of Acoustic
Scenes and Events 2020, Tokyo, Japan, 2020.