
Speaker Diarization

DISSERTATION

Submitted in partial fulfillment of the requirements of the

M.Tech. Data Science and Engineering Degree programme

By

Joseph Panddian A
2018AC04523

Under the supervision of

Balaji Mahadevan
Member of Technical Staff – Global Network and Technology, Verizon

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE


Pilani (Rajasthan) INDIA

August, 2021

1
DSE CL ZG628T DISSERTATION

Speaker Diarization

Submitted in partial fulfillment of the requirements of the

M. Tech. Data Science and Engineering Degree programme

By

Joseph Panddian A
2018AC04523

Under the supervision of

Balaji Mahadevan
Member of Technical Staff – Global Network and Technology, Verizon

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE


PILANI (RAJASTHAN)

August, 2021

2
ACKNOWLEDGEMENT

I have put considerable effort into this project. However, it would not have been
possible without the kind support and help of many individuals and organizations. I would
like to extend my sincere thanks to all of them.

I am highly indebted to the Birla Institute of Technology and Science, Work Integrated
Learning Programmes, for their guidance and constant supervision, for providing the
necessary information regarding the project, and for their support in completing it.

I would like to express my gratitude towards my parents and Prof. Murali P for his continuous
feedback, direction, kind co-operation and encouragement, which helped me complete
this project.

I would like to express my special gratitude and thanks to my mentor Mr. Balaji
Mahadevan, Member of Technical Staff, GNT, Verizon, for giving me such attention and
time.

My thanks and appreciation also go to my colleagues in developing the project and to the
people who willingly helped me with their abilities.

3
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI

CERTIFICATE

This is to certify that the Dissertation entitled

Speaker Diarization

and submitted by Mr. Joseph Panddian A, ID No. 2018AC04523,

in partial fulfillment of the requirements of DSE CL ZG628T Dissertation, embodies the
work done by him under my supervision.

Signature of the Supervisor

Name: Balaji Mahadevan
Place: CHENNAI
Designation: Member of Technical Staff, Verizon

Date: 05-August-2021

4
DISSERTATION ABSTRACT

Audio analysis is a growing technology segment in which many teams are working
to help machines learn and understand natural language. For instance, IoT devices such as
watches and e-bikes, and AI assistants such as Alexa and Google smart devices, are getting
smarter at helping us with our day-to-day work and at aiding us with almost anything and
everything. This dissertation builds a hybrid Speaker Diarization model to aid natural
language processing. Speaker Diarization is the process of understanding and differentiating
between multiple speakers using factors such as loudness, pitch, timbre and frequency.
These factors shape how a dialogue is sensed and perceived. We analyze these factors and
develop a hybrid model that is trained on out-of-domain data from open sources, with the
goal of deriving a model with a lower Diarization Error Rate. The applications of this
Diarization model are limited only by our imagination. For example, the model could be used
to upgrade lie detectors; since we analyze sound waves with respect to their physical
characteristics, it could also help create pitch-perfect songs or power software that scores
contestants without the need for human judges. Speech analysis of this kind can even support
studies of a person's psychological behavior. In this project we are just scratching the surface
of a domain with high potential: we perform speaker Diarization with an LSTM, and we vary
the feature extraction, clustering, smoothing and segmentation techniques to find out who
spoke when in a given audio file.

5
LIST OF SYMBOLS & ABBREVIATIONS USED

DER Diarization Error Rate


COVID Official name for the disease caused by the SARS-CoV-2 (2019-nCoV) coronavirus

ML Machine Learning
NLP Natural Language Processing
VAD Voice Activity Detector
MFCC Mel-Frequency Cepstral Coefficients
SOE System Of Experience
CXP Customer eXPerience
SOR System Of Record
SOI System Of Insight
LSTM Long Short-Term Memory
AI Artificial Intelligence
ASR Automatic Speech Recognition
STT Speech To Text
CPU Central Processing Unit
GPU Graphics Processing Unit
RNN Recurrent Neural Network
CNN Convolutional Neural Network
NA Not Applicable
i.e. Latin phrase id est, meaning “that is.”
SOM Self-Organizing Map
DNN Deep Neural Network
API Application Programming Interface
WAV Waveform Audio File Format
MP3 MPEG-1 Audio Layer 3 Format
WMA Windows Media Audio Format
MPEG Moving Picture Experts Group
MSCD Mean Squared Cosine Distance
FA False Alarm

6
LIST OF TABLES

Tables Page
Table 1: Data Description 13
Table 2: DER % and Evaluation Metrics 18

7
LIST OF FIGURES

Figure Page
Fig 1. Business Flow 10
Fig 2. Solution Architecture 11
Fig 3. System Flow 12
Fig 4. Data Set 13
Fig 5. Audio Signal Axes 14
Fig 6.1. Typical SOM network model 15
Fig 6.2. Two-dimensional SOM network model 15
Fig 6.3. SOM and k-means clustering algorithm 16
Fig 7.1. Clustering Embedding Projection 19
Fig 7.2. Clustering Embedding Test Data 19
Fig 7.3. Clustering Projection with decision boundary 19
Fig 7.4. Prediction Bar Chart with prediction threshold 20
Fig 7.5. Prediction Data 20
Fig 7.6. Cross and normalized similarity between utterances and speakers 20

8
TABLE OF CONTENTS

S.no CONTENTS Page


1.0 Introduction 10
2.0 Project Requirement 10
3.0 System Specification 12
4.0 Implementation
4.1 Data set Collection 13
4.2 Data set analysis & Preprocessing 14
4.3 Model Development and Evaluation 15
4.4 Evaluation Result and Inference 18
5.0 Conclusion 21
6.0 Future Work 22

9
1.0 INTRODUCTION

This dissertation is inspired by a problem statement we encountered during COVID.
Consider two scenarios. In the old world, before COVID, the customer walked into the store,
interacted with the representative about the order the customer wanted to place, got the
details from the representative, and finalized the order; both the representative and the
customer then entered into an agreement by signing on an in-store kiosk or tablet to buy
devices, plans or features over a period of time. The new world works something like this:
the customer walks into the store and discusses the desired product with the representative,
and once the customer has all the information on a particular device, plan or feature, the
representative shows a QR code; the customer scans it and is redirected to a page where he
or she can sign, and the representative and customer enter into an agreement over a period
of time. There are many fallouts in the new-world scenario. Hence, we developed a bot that
captures the conversation between the representative and the customer, analyzes it, and
concludes the order based on the customer's intention to buy, so that the fallouts are
minimized. This project is the part of that bot that uses a hybrid ML model to segment and
analyze the audio and to train a model that separates the speakers from the speech and finds
out who spoke when, so that the rest of the system can be built on top of it. We use MFCC
feature extraction to pull speech features from the audio; once extracted, these features
serve various purposes, such as classifying segments into different classes. We also compute
the spectral roll-off, i.e. the frequency below which a specified percentage of a frame's total
spectral energy lies; if that percentage is set low, the feature tracks the rise of the spectral
centroid at the beginning of the signal.
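
As a minimal sketch of this feature-extraction step (assuming the librosa library and a hypothetical input file name), MFCCs and the spectral roll-off can be computed per frame:

```python
import librosa

# Load the audio; librosa resamples to 22,050 Hz mono by default.
y, sr = librosa.load("conversation.wav")  # hypothetical file name

# 13 Mel-frequency cepstral coefficients per frame: a compact
# description of the spectral envelope of the speech.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Spectral roll-off: the frequency below which 85% of the frame's
# total spectral energy lies.
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)

print(mfcc.shape)     # (13, n_frames)
print(rolloff.shape)  # (1, n_frames)
```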

2.0 PROJECT REQUIREMENT

The requirements of this project are broken into smaller pieces so that the implementation
is easy to understand, and so that a particular component can be scaled up whenever needed
for the use case.

Fig 1. Business Flow

10
The conversation between the speakers is saved as an audio stream. The stream is
partitioned into smaller homogeneous segments, each containing an individual speaker's
speech, and speech is then detected within those homogeneous segments. We segment the
stream by applying different algorithms based on SOM and K-means.

We process each audio segment to remove noise, which in this case means low-frequency
sounds such as background noise; everything up to this point happens inside a sliding window.
We then run an LSTM on the cleaned audio segments using a d-vector DNN.
The processed audio files are aggregated. The goal of this project is to build a model with a
very low DER, so that other systems can treat the output of this module as the golden copy
and apply their own transformations. A VAD module is built, and the segmented audio is
separated into overlapping windows (a minimal sketch of this step is shown below). We also
compute the MFCC coefficients, compare the cluster models we have created, and identify
the efficient model with the lowest DER.
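
The following sketch illustrates the sliding-window and voice-activity step with a naive energy-based check; the window and step sizes match the optimal values reported later, while the energy threshold and the helper names are illustrative assumptions rather than the production VAD:

```python
import numpy as np

def frame_signal(y, sr, win_ms=240, step_ms=120):
    """Split a waveform into overlapping windows (240 ms, 120 ms step)."""
    win = int(sr * win_ms / 1000)
    step = int(sr * step_ms / 1000)
    return np.stack([y[i:i + win] for i in range(0, len(y) - win + 1, step)])

def is_voiced(frame, threshold=1e-4):
    """Naive energy-based VAD: low-energy frames are treated as non-speech."""
    return float(np.mean(frame ** 2)) > threshold

# Example on a synthetic 3-second signal.
sr = 16000
y = (0.02 * np.random.default_rng(0).standard_normal(sr * 3)).astype(np.float32)
frames = frame_signal(y, sr)
voiced = [is_voiced(f) for f in frames]
print(f"{sum(voiced)} voiced windows out of {len(frames)}")
```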

Fig 2. Solution Architecture

11
3.0 SYSTEM SPECIFICATION

This project evolved from the need for a touchless process, which in turn had many
advantages: a next-generation digital experience that customers loved in the pilot, and a
reduction in transaction time through voice-assisted, no-touch order placement during COVID.
An intuitive user experience was a major factor, along with proactive assistance and fewer
fallouts, all of which contributed to cost cutting and made this a plug-and-play project that
other systems can use in their own way for their own business and use cases.

➢ SOE – UI Development – minimal to none
➢ CXP – API Development – minimal to none
➢ SOR – DB Development – minimal to none
➢ SOI – Model development – maximum

Speaker Diarization is an actively growing area, with new algorithms and methods
frequently introduced to enhance such systems. An advanced segmentation algorithm may
not work well with a given clustering algorithm, so we need to train our model on lots of
domain-independent data to tune our ML model toward a lower DER.

As Diarization is an important part of speech recognition, several systems are needed for
this development. SOE is responsible for the user-interface development that gives the bot a
face; for example, a line with frequency for Google voice, or a glowing orb for Siri. Then
comes the CXP portion of the system, which is essentially how the user interface interacts
with the model; this portion is similar to the API used by our user interface to talk to our
ML model.

We also need SOR to save the conversation to the database; the segmented data is saved as
well, so other systems can use it as testing or training data for their systems of insight. The
segmented data is also stored in the database in relation to the voice samples, so that we can
categorize it and compare the customer's actual feedback with the generated feedback to test
our system of insights. Finally, we need SOI for our model development and for testing and
training our model.

Hence the resource needed by the project can be categorized as,

Fig 3. System Flow

12
4.0 IMPLEMENTATION

4.1 DATA SET COLLECTION

As part of data collection for the demo, a video from YouTube was saved and converted
into an audio file. For the training and test datasets, we took data from LibriSpeech
(© 2014 by Vassil Panayotov) to set up test and training data for model development. We also
downloaded audio files from the references. We categorized all the downloaded audio files
by their native format: WAV, MP3 and WMA. On initial exploratory analysis, audio is simply
a wave whose amplitude changes with respect to time; we decided to represent the audio data
in the frequency domain, as it requires less computational power than other domains and can
be easily represented with Mel-frequency cepstrums as well.

We collected around 100 audio samples as training data: 10 speakers, each speaking
in 10 audio files spanning a range of pitch and frequency. We also collected around 250
audio samples for testing the model, with one speaker per audio file. The audio file
featured in the demo is an interview conversation between an adult and a child; we
converted the video file to obtain this audio, separate from the test and training data.
Both embedding models are trained on an anonymized collection of voice searches containing
around 36M utterances from 18k speakers.
The i-vector model is trained on 13 PLP coefficients with delta and delta-delta coefficients.
The GMM-UBM includes 512 Gaussians, and the total variability matrix includes 100
eigenvectors. The d-vector model is a 3-layer LSTM network with a final linear layer; each
LSTM layer has 768 nodes with a projection size of 256 (a sketch of this architecture follows
the field descriptions below). A sample record can be read as follows:

ID | SEX | SUBSET | MINUTES | NAME

810 | M | train-other-500 | 21.54 | Joseph

The meaning of the fields, in left-to-right order, is as follows:

ID: the identifier under which the speaker's data is hosted.
SEX: class value for the gender of the speaker.
SUBSET: the dataset subset the speaker's audio belongs to (e.g. train-other-500).
MINUTES: the duration of the speaker's speaking time.
NAME: the name of the speaker under which the data is hosted.
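
Below is a minimal sketch of the d-vector network described above, using TensorFlow/Keras. Keras has no built-in LSTM projection layer, so the 256-dimensional projection is approximated here with a final Dense layer; the input shape (40 frames of 13 features) and the L2 normalization are illustrative assumptions:

```python
import tensorflow as tf

def build_d_vector_model(n_frames=40, n_features=13, emb_dim=256):
    """3-layer LSTM (768 nodes per layer) with a final linear layer,
    mapping one window of features to one speaker embedding (d-vector)."""
    inputs = tf.keras.Input(shape=(n_frames, n_features))
    x = tf.keras.layers.LSTM(768, return_sequences=True)(inputs)
    x = tf.keras.layers.LSTM(768, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(768)(x)        # keep only the last state
    x = tf.keras.layers.Dense(emb_dim)(x)   # linear projection to 256 dims
    # L2-normalize so cosine similarity becomes a simple dot product.
    outputs = tf.keras.layers.Lambda(
        lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(inputs, outputs)

model = build_d_vector_model()
model.summary()
```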

Data Source | Data Type     | Total No. of Audio Samples | No. of Speakers
Online      | Training Data | 100                        | 10
Online      | Test Data     | 250                        | 250

Table 1: Data Description
Fig 4. Data Set

13
4.2 DATA SET ANALYSIS AND PREPROCESSING

As a part of data analysis, manual voice analysis was performed initially to verify
that there were no overlaps between speakers; where overlap was present, that data was
discarded. The data was then imported into a data frame. Speech or audio is usually
represented in a three-dimensional space of amplitude, time and frequency. In a
two-dimensional space we can represent it either with respect to amplitude and time (the
Time Domain) or with respect to amplitude and frequency (the Frequency Domain). This is
represented in the diagram below.

Fig 5. Audio Signal Axes

Speech segments typically contain just one speaker. We next need the best
representation of the speech or audio, which means deciding which representation will be
easiest to work with and analyze for machine learning. In other words, we must choose
between the time domain and the frequency domain for data analysis. In the time domain,
amplitude is plotted against time; in the frequency domain, amplitude is plotted against
frequency. Representing and analyzing signals is less complex in the frequency domain than
in the time domain. Stability is also an important factor: filtering out the appropriate noise or
component is much easier in the frequency domain than in the time domain. Moreover,
convolution of two signals in the time domain reduces to simple multiplication in the
frequency domain. Hence, we settled on the frequency domain, and MFCC features are
captured from the audio inputs in this model.
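
As a small illustration of the two representations, assuming NumPy and a synthetic signal, the same waveform can be moved from the time domain to the frequency domain with a real FFT:

```python
import numpy as np

sr = 16000                                   # sample rate (Hz)
t = np.arange(sr) / sr                       # 1 second of time axis
# Time domain: amplitude vs. time (a 440 Hz tone plus a 100 Hz hum).
y = 0.8 * np.sin(2 * np.pi * 440 * t) + 0.2 * np.sin(2 * np.pi * 100 * t)

# Frequency domain: amplitude vs. frequency via the real FFT.
spectrum = np.abs(np.fft.rfft(y)) / len(y)
freqs = np.fft.rfftfreq(len(y), d=1 / sr)

# The two sinusoids appear as two clear spectral peaks.
for f in freqs[spectrum > 0.05]:
    print(f"peak near {f:.0f} Hz")
```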

14
4.3 MODEL DEVELOPMENT AND EVALUATION

We started with predefined models: a typical SOM network model and a
two-dimensional SOM network model.

Fig 6.1. Typical SOM network model

Fig 6.2. Two-dimensional SOM network model

A self-organizing map takes an input with the attribute features we want to classify and
produces a new classification, i.e. the class to which the object belongs. The number of
neurons and the maximum number of epochs are the input parameters for this model, and the
learning rate depends on the displacement between epochs. The output of this model is one
possible element. The model operates by selecting a random input, computing the winning
neuron, and then updating the neurons; this process is repeated for all the input data until the
classification is complete. A minimal sketch of this training loop is shown below.
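
As a rough sketch of a SOM over audio feature vectors, assuming the third-party minisom package and random stand-in features (not the project's real MFCC data):

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

# Stand-in feature matrix: 200 frames x 13 MFCC-like features.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 13))

# A 6x6 grid of neurons over 13-dimensional inputs.
som = MiniSom(6, 6, 13, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, num_iteration=1000)  # pick input, update winner, repeat

# Each frame is classified by the grid coordinates of its winning neuron.
winners = [som.winner(x) for x in data]
print(winners[:5])
```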

15
We then worked with other models, such as a K-means speaker clustering algorithm based
on a self-organizing neural network, as explained in the flowchart below.

Fig 6.3. SOM and k-means clustering algorithm

In this model, the audio is converted into an MFCC feature set and normalized. To find the
winning neuron, we adjust the weights over time for the input feature set. Once the training
time exceeds the configured limit, we record the number of categories, i.e. the neuron weights
obtained by training the SOM network; these are then used as the initial clustering centers for
k-means, and the clustering results are output. A sketch of this handoff is shown below.
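
A minimal sketch of the SOM-to-k-means handoff, assuming scikit-learn and the `som` and `data` objects from the previous sketch; the trained SOM weights seed the k-means cluster centers:

```python
from sklearn.cluster import KMeans

# Flatten the 6x6 grid of 13-dim neuron weights into candidate centers.
weights = som.get_weights().reshape(-1, 13)

# Use a subset of the SOM weights as the initial k-means centers.
k = 2  # assumed number of speakers
init_centers = weights[:k]
kmeans = KMeans(n_clusters=k, init=init_centers, n_init=1, random_state=0)
labels = kmeans.fit_predict(data)
print(labels[:10])  # cluster (speaker) label per frame
```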

The above two models had a higher DER; algorithms such as K-means often perform
poorly for the following reasons.

16
Non-Gaussian distribution – K-means implicitly assumes roughly Gaussian, spherical
clusters, but speech embeddings are generally not Gaussian-distributed, so the assumptions
the algorithm makes about cluster shape do not hold for this data.

Cluster imbalance – The number of clusters is unknown, and the classes differ in size. Any
clustering algorithm judges the space between samples with common characteristics, and
most current approaches assume that clusters share certain properties, or that the properties
between clusters are mostly the same or similar, at least within certain boundaries, such as
inter-cluster distances diverging only up to some maximum. This becomes problematic when
one prominent, large and unevenly scattered class overshadows a less prominent, also
unevenly scattered class with very different inter-cluster distances. The less prominent
clusters may then not be found at all, as they are pushed toward the prominent class; the
SOM suffers from this to some extent in our case.

Hierarchical structure – Because the algorithm looks for similarities between data points and
uses the same notion of similarity to determine the number of clusters, the key element of
discrimination is similarity among data points. Speech data analysis is an extremely
challenging problem domain, and a conventional clustering algorithm such as K-means often
performs poorly due to the points mentioned above.

Hence, we ran experiments with all combinations of the i-vector and d-vector
models.

With respect to runtime latency, clustering algorithms fall into two categories:
1) Online clustering: a speaker label is emitted immediately once a segment is
available, without seeing or taking future segments into consideration.
2) Offline clustering: speaker labels are emitted only after the embeddings of all
segments are available.
Offline clustering usually performs better than online clustering because of the additional
contextual information available to it. The choice also depends on the system and domain:
live video analysis, such as CCTV and traffic monitoring, typically restricts the system to
online clustering algorithms.

We also analyzed four clustering algorithms.

1. Naïve online clustering – This falls under online clustering. A threshold is applied to the
similarities between segment embeddings, with cosine similarity as the metric. Each cluster
is represented by the centroid of all its corresponding embeddings. When a new segment
embedding arrives, its similarities to the centroids of the existing clusters are calculated; if
they are all smaller than the threshold, a new cluster containing only this embedding is
created, otherwise the embedding is added to the most similar cluster and that centroid is
updated. A sketch of this loop is shown below.
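
A minimal sketch of naïve online clustering, assuming unit-normalized embeddings and an illustrative threshold value:

```python
import numpy as np

def naive_online_cluster(embeddings, threshold=0.6):
    """Assign a cluster (speaker) label to each embedding as it arrives."""
    centroids, members, labels = [], [], []
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        # Cosine similarity between the new embedding and each centroid.
        sims = [float(c @ e / np.linalg.norm(c)) for c in centroids]
        if not sims or max(sims) < threshold:
            centroids.append(e.copy())                  # start a new cluster
            members.append([e])
            labels.append(len(centroids) - 1)
        else:
            k = int(np.argmax(sims))                    # join closest cluster
            members[k].append(e)
            centroids[k] = np.mean(members[k], axis=0)  # update its centroid
            labels.append(k)
    return labels
```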

2. Links online clustering – This falls under online clustering and is built upon naïve online
clustering. The cluster probability distributions are estimated, and the substructure of the
embedding vectors is modeled accordingly.

3. K-means offline clustering – This Diarization system falls under offline clustering. The
k-means clustering algorithm is integrated into the system, using k-means++ for initialization.
To determine the number of speakers (the k value), this model uses the elbow of the derivatives
17
of the conditional Mean Squared Cosine Distance (MSCD) between each embedding and its
cluster centroid.
4. Spectral offline clustering – This proved to be the most efficient of all the models
analyzed. It constructs an affinity matrix using cosine similarity between each pair of
elements, with the diagonal elements set to the maximum value in each row. Various
refinements are applied on top of it: Gaussian blur, row-wise thresholding, symmetrization,
diffusion and row-wise max normalization. These refinements denoise the data in the
similarity space.
The Gaussian blur plays the important role of smoothing the data, reducing the effect of
outliers that would otherwise distort the model's output. Row-wise thresholding removes the
affinities between embeddings belonging to different speakers. Symmetrization restores the
matrix symmetry, which is essential for the spectral clustering algorithm. The diffusion step
sharpens the matrix, producing clear boundaries between the sections of the affinity matrix
belonging to different speakers. Row-wise max normalization rescales the spectrum of the
matrix, ensuring undesirable scale effects do not occur in the subsequent spectral clustering
step. A sketch of this refinement pipeline is shown below.
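
A compact sketch of this refinement pipeline, assuming NumPy, SciPy and scikit-learn; the percentile, blur width and cluster count are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import SpectralClustering

def refined_affinity(embeddings, sigma=1.0, pct=95):
    """Build and refine a cosine affinity matrix, step by step."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = X @ X.T                                  # cosine affinity matrix
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(A, A.max(axis=1))           # diagonal = row maximum
    A = gaussian_filter(A, sigma)                # Gaussian blur (smoothing)
    # Row-wise thresholding: damp affinities below each row's percentile.
    row_thresh = np.percentile(A, pct, axis=1, keepdims=True)
    A = np.where(A < row_thresh, A * 0.01, A)
    A = np.maximum(A, A.T)                       # symmetrization
    A = A @ A.T                                  # diffusion (sharpening)
    A = A / A.max(axis=1, keepdims=True)         # row-wise max normalization
    return (A + A.T) / 2                         # re-symmetrize for sklearn

emb = np.random.default_rng(0).normal(size=(50, 256))
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(refined_affinity(emb))
```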

Table 2: DER % and Evaluation Metrics

The table above is tabulated from the results of training the models on the same data. We
factored three important components into the table: False Alarm (FA), Confusion and Miss.
FA and Miss are mostly due to Voice Activity Detection errors, and sometimes arise from
aggregating the frame-level i-vectors, or the window-level d-vectors, into segments. An
important point to note is that the difference in FA and Miss between the d-vector and
i-vector systems comes from their different window or step sizes and aggregation logic. We
can infer that the d-vector Diarization system clearly outperforms the i-vector system, and
that the optimal sliding window size is 240 ms with a 120 ms step for the d-vector
Diarization system.
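
For reference, DER aggregates these three components over the total scored speech time. A minimal sketch of the computation, with hypothetical durations in seconds:

```python
def diarization_error_rate(false_alarm, miss, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total speech."""
    return (false_alarm + miss + confusion) / total_speech

# Hypothetical component durations (seconds) for a 300-second test file.
print(f"DER = {diarization_error_rate(6.0, 9.0, 15.0, 300.0):.1%}")  # DER = 10.0%
```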

4.4 EVALUATION RESULT AND INFERENCE

To evaluate the model and measure its efficiency, a wide variety of dataset samples
was chosen, and the algorithms were run on the chosen dataset. The model was then tested
against actual production data, with results covering clustering based on embedding
projection, clustering based on the sex of the speakers, and fake-speech detection comparing
YouTube video IDs against their similarity to ground truth.

18
The clustering below shows the speakers and the audio used to test our clustering
models, how they are clustered, and how the embeddings are projected. This clustering
model proved highly efficient, with a zero error rate on this test data.

Fig 7.1. Clustering Embedding Projection    Fig 7.2. Clustering Embedding Test Data

The clustering projection below shows the clustering patterns when clustering is based on
the sex of the individual speakers; the projection indicates the number of male and female
speakers.

Fig 7.3. Clustering Projection with decision boundary

The bar chart below shows the realness and fakeness judged by the model when it is
given twelve distinct utterances, six real and six fake. The model produced one wrong
output, in which a fake input was classified as real, scoring above the prediction threshold
limit.

19
Fig 7.4. Prediction Bar Chart with prediction threshold    Fig 7.5. Prediction Data

The plot below shows the cross similarity between utterances, the normalized similarity
between utterances, the cross similarity between speakers, and the normalized similarity
between speakers.

Fig 7.6. Cross and normalized similarity between utterances and speakers
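
As a rough illustration of how such similarity matrices can be computed, assuming one embedding matrix for utterances and one for speaker centroids (random stand-ins here):

```python
import numpy as np

def similarity_matrices(utt_emb, spk_emb):
    """Cross (cosine) similarity and a row-wise normalized variant."""
    U = utt_emb / np.linalg.norm(utt_emb, axis=1, keepdims=True)
    S = spk_emb / np.linalg.norm(spk_emb, axis=1, keepdims=True)
    cross = U @ S.T                      # utterance-by-speaker cosine sims
    normalized = cross / np.abs(cross).max(axis=1, keepdims=True)
    return cross, normalized

rng = np.random.default_rng(0)
cross, normalized = similarity_matrices(rng.normal(size=(12, 256)),
                                        rng.normal(size=(4, 256)))
print(cross.shape, normalized.shape)  # (12, 4) (12, 4)
```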

20
5.0 CONCLUSION

As we look for enhancement techniques for the model we developed, extensive research is
needed, since new methodologies keep being introduced in the fields of audio segmentation
and audio clustering. We also need to obtain more data for validation, specifically
cross-domain data, so that our model can function across all domains and be used in other
use cases. We may also need to plan for potential risks and their mitigations. We started with
a deep learning model as the first approach and attained good accuracy; other, simpler
models may not yield similar results.

21
6.0 DIRECTIONS FOR FUTURE WORK

1) We may need to explore more clustering algorithms to achieve a lower DER.
2) We should train our model using our own data, or data from other domains, to extend
cross-domain functionality.
3) We may also try using pre-trained models to explore speaker embeddings further.

22
REFERENCES

1) LibriSpeech ASR corpus: http://www.openslr.org/12
2) Demo audio: https://www.youtube.com/watch?v=X2zqiX6yL3I
3) https://towardsdatascience.com/extract-features-of-music-75a3f9bc265d
4) https://www.hindawi.com/journals/mpe/2020/3608286/
5) LSTM: https://medium.com/saarthi-ai/who-spoke-when-build-your-own-speaker-diarization-module-from-scratch-e7d725ee279
6) Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, Ignacio Lopez Moreno.
"Speaker Diarization with LSTM." https://arxiv.org/abs/1710.10468
7) Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe,
Shrikanth Narayanan. "A Review of Speaker Diarization: Recent Advances with Deep
Learning." https://arxiv.org/abs/2101.09624
8) Dept. of CSE, IIT Guwahati: http://www.iitg.ac.in/cse/robotics/?page_id=2442
9) Hagai Aronowitz and Weizhong Zhu. "New advances in speaker diarization." IBM
Research Blog: https://www.ibm.com/blogs/research/2020/10/new-advances-in-speaker-diarization/
10) pyAudioAnalysis segmentation wiki: https://github.com/tyiannak/pyAudioAnalysis/wiki/5.-Segmentation
11) Audio files: https://file-examples.com/
23
APPENDICES
List of Appendices Page
Cover Page 1
Title page (inner cover) 2
Acknowledgements 3
Certificate from the Supervisor 4
Dissertation Abstract 5
List of Symbols & Abbreviation used 6
List of Tables 7
List of Figures 8
Table of contents 9
Conclusion 21
Recommendations 22
References 23
Appendices 24
Duly Completed Checklist 25
Mentor Final Evaluation 26

24
DULY COMPLETED CHECKLIST

a) Is the Cover page in proper format? Y


b) Is the Title page in proper format? Y
c) Is the Certificate from the Supervisor in proper format? Has it been signed? Y
d) Is Abstract included in the Report? Is it properly written? Y
e) Does the Table of Contents page include chapter page numbers? Y
f) Does the Report contain a summary of the literature survey? Y
i. Are the Pages numbered properly? Y
ii. Are the Figures numbered properly? Y
iii. Are the Tables numbered properly? Y
iv. Are the Captions for the Figures and Tables proper? Y
v. Are the Appendices numbered? Y
g) Does the Report have Conclusion / Recommendations of the work? Y
h) Are References/Bibliography given in the Report? Y
i) Have the References been cited in the Report? Y
j) Is the citation of References / Bibliography in proper format? Y

25
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
Work Integrated Learning Programmes Division
I SEMESTER 2020-21

DSE CL ZG628T DISSERTATION

ID No. : 2018AC04523

NAME OF THE STUDENT : Joseph Panddian A

EMAIL ADDRESS : 2018AC04523@wilp.bits-pilani.ac.in

NAME OF THE SUPERVISOR : Balaji Mahadevan

PROJECT WORK TITLE : Speaker Diarization

Project Work Final Evaluation (Please put a tick ( ) mark in the appropriate box)

S No. Evaluation Component Excellent Good Fair Poor


1. Final Project Work Report 
2. Final Seminar and Viva-Voce 

S.No. Evaluation Criteria Excellent Good Fair Poor

1 Technical/Professional Competence 
2 Work Progress and Achievements 
3 Documentation and expression 
4 Initiative and Originality 
5 Research & Innovation 
6 Relevance to the work environment 

Please ENCIRCLE the Recommended Final Grade: Excellent / Good / Fair / Poor

Supervisor Additional Examiner


Remarks of the Supervisor:
He has been an important team member for VIKI (Verizon owned knowledge interface). He
was planning to do his dissertation work on the ML part of it; as that could not be done, he
took the Diarization part of it (paper cited in report) and compared existing models to find a
suitable hybrid model to differentiate speakers in a conversation.
The original project involved differentiating a Verizon rep and a Verizon customer in a
conversation, which brought advantages to Verizon in business and drastically increased
sales during the COVID era.
In this project he is trying to differentiate the speakers, in his example an adult and a child,
with multiple models, and to train the model to reduce the DER.
Documentation could have been better, but other than that everything was up to the mark.

Name: Balaji Mahadevan
Qualification: M.TECH
Designation & Address: Member of Technical Staff - Global Network and Technology,
Verizon, Chennai.
Email Address: Balaji.mahadevan@verizon.com
Signature:
Date: 05-August-2021

26

NB: Kindly ensure that the recommended final grade is duly indicated in the above evaluation
sheet. POSTAL ADDRESS FOR ALL FUTURE CORRESPONDENCE: FILL IT IN
NEATLY IN CAPITAL LETTERS WITH PIN CODE ETC.

Address:
Joseph Panddian A, S/O J Arputharaj,
No 80/36, Mangaleri, East Coast Road,
Thiruvanmiyur,
Chennai – 600041
Tamil Nadu

Pin Code 600041

27
