Lecture – 01
Introduction to Digital Speech Processing
So welcome, welcome all. This course is Digital Speech Processing. I will teach this course in 20 hours, that is, 20 hours of lectures. The course is mainly designed around the speech part, not the soft computing part. So do not expect that I will talk about HMMs or deep learning; those things I will not cover in this course. This course mainly covers the scientific aspects of speech: how speech can be digitized, what kind of digital set-up we should use, and how we locate the different kinds of speech sounds in a recording. All of those things will be taught in this course. If you look at the course coverage, it mainly covers the aspects of the speech signal, not the aspects of soft computing and developing a system of that kind. But yes, at the end of the course I will cover some of the important speech processing applications like TTS and ASR. Many people are talking about TTS and ASR, but I believe TTS and ASR are not the only speech applications; there are other kinds also, like second language acquisition and accent conversion. Those I will discuss at the end of
the course: whatever speech processing we have done in the entire course, what is the main application of that kind of information.
So, what do I expect from every learner, beyond just covering the course? From the learner's perspective, as a learner of this course on digital speech processing, at the end of the whole course you should be able to meet the following course objectives. What are the course objectives? The first is to categorize and label the different speech sounds in a given speech signal, based on the spectrographic view and the time-domain view of the signal.
Not everything is written in the slides; I will also say something here. So, take this objective: categorize and label the different speech sounds for a given speech signal based on the spectrographic view or the time-domain signal view. What do I expect from the learner? Once the course is completed, what should the learner be able to do? The learner should be able to categorize. Suppose I give you a speech signal, or I ask you to record a speech signal, let us say your name. You should be able to record that speech on a computer. Then, using some open-source software — there are many speech processing packages available — you should be able to view the spectrogram and the time-domain signal and label the different regions of the speech signal.
So, for example, suppose I ask you to label a consonant-to-vowel transition, say a stop consonant to vowel transition. Those things you will learn here: how do I label which segment is a stop
consonant, what it looks like in the speech domain, what the transitory part is, and what the steady part is. Those things you will learn in this course, and you should be able to use that knowledge for labelling a given speech signal. So, in the examination also, sometimes I will give you a spectrogram and ask you to identify the words based on the manner of articulation of the speech.
So, I expect that you develop a skill, not only the theory — not only what a stop consonant is, or what the articulatory place of the velum or the palate is — but that, given a speech signal, you are able to categorize and label it. The next objective is to explain the psychoacoustic properties of speech perception and speech production. That is not so much a skill-oriented objective, but yes, you should know how speech is produced and how speech is perceived. There are a lot of problems we will solve using speech perception and speech production.
The next one: you know that there is a speech production mechanism, in which speech is produced using the vocal cords and a tube. You may have some experience with a flute: I can play the flute and, by pressing the different holes, create different kinds of sounds. So how can a human being produce different kinds of sound using this vocal tract? There is a tube, and the vocal cords vibrate and create the sound; if you look at the shehnai, there is a membrane-like part at the beginning and a long tube to produce the different kinds of sound.
So, you should be able to model speech sound production using the uniform tube model and implement it using signal processing. I will explain how to model the human vocal tract and how it can be implemented using digital signal processing techniques. Some part of digital signal processing I will cover, but I am assuming that you already know the basics of digital signal processing. Using those principles you should be able to implement the human vocal tract, the uniform tube model, and model it mathematically as well.
The next one is to extract the fundamental frequency and other kinds of speech parameters. Why is this required? If I ask you the
purpose of studying digital speech processing as a subject, what is it? In the modern scenario, in the 21st century, there is a lot of research in the speech domain. Why? Because human speech is the main communication medium. Take this scenario: even a person who is not literate — who does not know the script, who does not know the grammar — can still speak effectively.
So I can say that speech is the common, natural mode of communication among human beings; in fact, sound is a mode of communication among all living things, if you think about it. But considering human beings, speech is the easiest mode — every person uses speech — the most natural communication medium. So what do scientists want? Scientists want the machine to act like a human being. Today there is a lot of artificial intelligence, a lot of soft computing, a lot of talk about human intelligence. Now scientists are trying to develop an algorithm, or a system, by which a human being can talk to a machine. Think about an application: suppose you go to the railway station to buy a ticket. Instead of giving instructions to the machine by dialling — if you want this, dial 1; for that, dial 2, that kind of thing — you can replace it with the speech mode. I want to buy a ticket, say from Kharagpur to Howrah. I just tell the computer to give me the ticket; the computer asks for this much money; I give the money; done.
A lot of this kind of continuous communication is required. There are also other aspects of speech communication. Think about security and biometrics: speech can be used as one of the biometrics of a human being. That is why speaker verification, speaker recognition, human voice identification — all kinds of businesses are going on — because speech carries the speaker's biometry: every speaker produces speech to communicate information, and during production he also imposes some signature of that person. So that kind of work is going on. Similarly, think about pure communication — forget about those technologies for a moment — human communication over the telephone: I want to send a voice from one point to another point. Today the channel cost is high, so can I reduce the channel cost, reduce the bandwidth? All of that is what we call speech coding. How do I develop an efficient algorithm
so that, with a minimum cost, I can transfer a speech signal from here to there in a real-time situation? I want some kind of algorithm, some compression technique, which can compress my speech and transfer it. Think about your audio CD, the music CD, say a vocalist's song CD. Earlier a CD could contain 7 to 10 songs if they were stored in the original format; now think about MP3, an audio compression technique — with MP3 I can fit 160 songs on a single CD.
So speech coding is also another aspect. While building any of these kinds of applications, the first thing is that I have to know the speech: what are the scientific aspects of speech, and how does it behave after recording? All of those things we have to know. So the aim of the course is not to deal with soft computing; it deals with what speech is, how it is produced, which features I should exploit to develop a speech-based application, which feature carries what kind of information, how human beings produce speech, and how human beings perceive speech. All of those things will be covered.
So, in this course I will cover the extraction of different speech parameters. The details I will discuss during speech parameter extraction: what do you mean by a parameter, and what kind of parameter is suitable. Those kinds of things we will discuss. There are signal processing algorithms we use, and why we use a particular algorithm is very important. Do not just read whatever is written, copy it, and give the exam — not that. Why a thing is important is itself very important; as an engineer or as a scientist you should know why I am doing this thing.
So I will explain different kinds of speech parameter extraction. The parameter extraction algorithms are available in the books — I have referred to two books, and you can go to those books where all the algorithms are explained — but in the class what I explain is why we compute these kinds of parameters, what the advantages are, and what the disadvantages are. Those kinds of things I will cover in this class.
So, at the end of the course you should be able to write the algorithms to find the different kinds of speech parameters; that is my expectation. Then, extraction of spectral and time-domain parameters of the speech signal; then I will cover the design of a simple TTS and ASR system, and some part of prosody modelling, because today prosody modelling is one of the important aspects of speech processing.
If you look, a lot of work is going on in speech prosody; a lot of segmental speech work has already been done. If you look at TTS, most TTS systems — even in their own languages; I have developed a TTS in an Indian language using different methods, and I will explain that also — are not as natural as a human being. What we are missing is the speech prosody. So today, if you visit any speech scientist, any speech lab, you will find a lot of people working on speech prosody. What do you mean by speech prosody, and what are its different applications? Let me give you one example that I have seen — it may be a very important research problem also. Since I am part of a centre for educational technology, I have seen many recorded lecture videos, even recorded foreign lecture videos, and you find there is difficulty in understanding the speech of speakers of different languages, even when all the speakers are speaking the same language, say English. Suppose a Japanese person is giving a lecture in English, and a Chinese person is also giving a lecture in English, and I am sitting there, but I am not fully understanding his or her English, because it comes with a Japanese accent — because English is not the first language of that Japanese speaker.
Suppose I am watching lectures by a North Indian or a South Indian speaker; their English is a little different from Bengali English. Even my English may not be 100 percent understandable; the intelligibility of my English may not be that good. If you are a speaker of American English, when you listen to this English you may say the intelligibility of the speech is not that good. Now think: can I make a device so that I speak Bengali-accented English and it converts it to, let us say, American-accented English?
Then the intelligibility of the speech is increased. Think about it: suppose I have developed such a system and deployed it here, so that when you download the lectures and listen to the English, the accent of the English is converted as per your preference. I am not talking about language transformation — somebody giving a lecture in English and I listen in Bengali — I am not talking about speech-to-speech conversion. I am talking about simple accent conversion: I am speaking Bengali-accented speech, and some American listener hears it as American-accented speech. Similarly, an American speaker gives a lecture in American-accented English and I want to listen to it in Bengali-accented English.
So there are those kinds of tremendous applications; second language acquisition is also an application of this kind of speech technology. Even within speech research — I told you that a lot of people are doing research in ASR, automatic speech recognition, and you may have heard about the HMM, the hidden Markov model (Refer Time: 16:46) — people are now saying that this model is not sufficient for languages whose resources are very limited, resource-constrained languages. So what kind of alternative speech technology can I develop which uses speech science, so that I can think about a new kind of model? For example, the exploitation of speech prosody in ASR is a serious line of research being pursued by different groups.
So think about those kinds of speech applications in your mind; you have to develop some expertise, some skill, in speech itself, so that you can think about what kind of algorithm, technology, or soft-computing method has to be used to do these kinds of things. If you look at my course outcomes, or course objectives, I have never written the soft computing part. So I am not covering that part; that part may be covered in another subject.
Now, since this is the introductory class, let us talk about how human beings produce speech. Suppose I am giving a lecture: how am I producing the speech? What kind of activity is going on in my body to produce this kind of speech? If you see the slides — you can take a close look; the slides will be shared with you — what happens is that once I want to speak, some message formulation happens in the mind. If I want to speak a sentence, what I want to say is created — it comes from the mind — and that is called message planning.
Once the message planning is done, it goes to the language level: what kind of language code should be used. If it is Bengali, then Bengali words will be selected; if it is English, some other words will be selected. Based on that linguistic coding, it comes to neuro-muscular action, so that the coding can be executed by the human vocal apparatus; different kinds of muscle activity are involved.
So a muscle command — the neuro-muscular action command — is generated, and that command goes to the whole vocal system of the body. This has two parts, one called the source and the other the acoustic system, and together they produce the different kinds of speech. So: message planning, then the linguistic part. Let me explain this on a block diagram basis.
So, message planning involves the rules of grammar — which words I should say — but, well, I can produce speech unknowingly, without grammar. So grammar is not essential to message planning: a person who does not know the grammar of a language can still speak that language. Although there are rules of grammar, their use is automatic; grammar is not the primary criterion. Grammar was discovered by us to explain the phenomena of what is going on in message planning.
So speech message planning is lexical, syntactic, semantic, and pragmatic. Then comes the rule of prosody, the utterance planning: how do I produce the speech? I can produce it in a very excited manner, in a very low manner, maybe in a very sad manner. Those kinds of prosody planning come under utterance planning. Then the motor command is generated, and the speech production system produces the speech.
If you see, in message planning there are lexical, syntactic, semantic, and pragmatic aspects; in utterance planning there are paralinguistic aspects — intention, attitude, style — every human being has a different speaking style. Then there are non-linguistic parameters as well, which also come under motor and utterance planning. Physiological factors are there too: somebody has very thick and, you can say, short vocal cords, so he produces a voice whose fundamental frequency is very low; somebody has a different style of speaking; somebody has a stammering problem. All of these come from the motor command generation and tract planning, and then the speech is produced.
(Refer Slide Time: 21:36)
So, once the speech is produced, it is radiated either from the mouth or from the nose. Once I have produced the sound using the vocal cords — if you see, there is a velum inside; the details I will discuss later — the velum will be either closed or open. If the oral cavity is completely closed, the sound comes out through the nasal cavity; if the nasal cavity is completely closed, it comes out through the oral cavity.
So the acoustic wave which is generated here is radiated from the mouth or nose cavity and propagates as an acoustic wave; that is why speech acoustics is important. Speech is produced by a human being, and the acoustic wave travels in the medium — you know that an acoustic wave cannot travel without a medium. So the acoustic wave enters the medium and is transmitted. Once the acoustic wave is transmitted, for a listener who is present to listen to that voice, the acoustic wave strikes his ear, the hearing system. From there the acoustic wave has to be converted back into a neural signal; then the neural signal has to fire the language code, and from the language code the brain deciphers the intended meaning, or intended message, of the speaker. So the listener again tries to find out what the speaker wants to communicate. Now, this is the process, but within this process a lot of optimisation is possible. Human beings have a tendency — in a biological system we always conserve
energy. So, suppose I am speaking to some students, and from a student's facial gesture, or once I realise that he is tuned in to the same topic, much of the message planning is shortened.
Or you can say that a complete, linguistically complete message is not required to be transmitted to the listener. So it may not be a linguistically correct sentence, or linguistically correct words; I may not even speak whole words, but the listener understands. Once the listener understands, the speaker feels his purpose is complete, so he does not need to produce the whole sentence.
So how this kind of variability, this optimisation, can be handled in speech science is also an important aspect. This is the speech production and perception mechanism of the human being; the details I will cover in the following lectures.
Now, think about the engineering aspects. A person is speaking — you can say I have generated an acoustic wave from my mouth. Suppose I want to transmit this acoustic wave from this point to another point, and these two points are far away. This acoustic signal cannot reach there. Do you understand? Suppose I am speaking in this room; the people sitting in this room can hear it. But suppose you are sitting at home and you want to listen to the same acoustic wave. Then I have to take the help of technology. What kind of technology? The speaker has radiated the speech as an acoustic wave.
Now I have to use communication technology, which is electrical communication technology.
So, somehow I have to convert this acoustic speech wave into an electrical signal, transmit the electrical signal, and on the other side convert the electrical signal back into an acoustic wave. The conversion from acoustic wave to electrical signal is the microphone, and the conversion from electrical signal to acoustic wave is the loudspeaker.
So by this mechanism I can transmit the speech from one point to another point. Now I can deal with this electrical signal — coding technology, all kinds of things I can do with it. That is why I say it is digital speech processing; I am not saying speech processing in the analogue domain. I said digital speech processing: once I take this electrical signal into the digital domain, what kind of processing should we use?
So that we can develop different kinds of technology. If you see the slides: the source of information is the human speaker; then there is measurement, or observation — the acoustic wave is captured and observed. Then the signal representation has to be done, the signal processing; and then there is the human listener, whom I also have to model. Now, with technology I may want to replace the speaker.
In that case the path from message planning to acoustic wave generation has to be done by a machine; I can say it is text, or an idea, converted to an acoustic wave. On the other side, I want a machine which can convert the signal that arrives — the acoustic wave — into an understanding of the speaker's message. That is not just ASR; I do not say it is automatic speech recognition only — understanding is also a part of it. Likewise, on the production side the modelling is not simply reading a text: when we talk, we do not read a text; we understand the text and generate it as an acoustic wave.
So I will discuss what all is involved there. The technology can sit at the speaker's side, in between, or at the listener's side. To develop those kinds of speech technology, we should know what speech is and how it is produced, ok.
(Refer Slide Time: 28:23)
Now, if you see, speech processing involves different dimensions, or you can say different disciplines. Take language communication: there are different kinds of dimensions. Speech processing involves algorithms, psychoacoustics, room acoustics, speech production — the acoustics part; then the information theory part, the phonetics part, the signal processing part, statistical signal processing. All kinds of multi-disciplinary areas are involved in developing speech technology, ok.
I will cover some of these disciplines, not all of them. I will mainly focus on the acoustics part of speech, the phonetics part, and the signal processing part of speech. So, thank you. In the next lecture I will discuss how to record speech, because before you go into anything in speech processing, you should know how to record your speech and how to see it on your (Refer Time: 29:29) computer, so that whatever I say, you can play with it. Because, as I said, this is not a theory type of class where you read the definitions and give the exam — not that. Develop the skill so that, after seeing the speech, you are able to say that this is probably a voiced signal and it is coming from this region. Those kinds of expertise I want, ok.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 02
Digitization and Recording
So, welcome back. This lecture deals with digitization and recording.
Before you go into any speech processing, you should be familiar with how to record speech. Many people know how to record speech, but at least some of the skill involved I will explain here — not all the details of digitization, but what you mean by digitization and related things I should explain.
(Refer Slide Time: 00:53)
So, if you see, normal human hearing is 20 hertz to 20 kilohertz; we know that. Human speech is roughly 50 hertz to 8 kilohertz, and standard phone circuits, or telephone speech, cover 300 hertz to 3 kilohertz, or 3.3 kilohertz — sometimes we say 3.3 kilohertz. So these are the bandwidths of human hearing and human speech, ok.
Now, I want the digitization of the speech. As I said, whatever a human being produces, the speech is an acoustic wave. Once the human being produces the speech, for any kind of speech processing I have to digitize the speech, because we said digital speech processing. So when I say something, I have to convert that acoustic wave into the digital speech domain: first an analogue signal, then analogue-to-digital conversion. How do we do that? We use a microphone and an amplifier, then analogue-to-digital conversion, to convert the acoustic wave into a digital speech signal.
So how do we do that? You know the basic theory — it is called the sampling theorem — and this is the complete circuit.
(Refer Slide Time: 02:25)
This is a normal DSP system. There is analogue speech — I am not detailing every block; those belong to the signal processing domain — but you should know what is there. The analogue speech coming from the microphone goes to the anti-aliasing filter, then to the sample-and-hold, then to the ADC, the analogue-to-digital conversion. Then you do the digital signal processing. Then again, a human listener cannot listen to that digital signal, so I convert the digital signal back to an analogue signal, and the analogue signal is played through the loudspeaker to produce the acoustic wave.
Now, in analogue-to-digital conversion, the sampling frequency is one of the important issues. Maybe some of you do not know digital signal processing or analogue-to-digital conversion, so let me explain it very briefly.
Let xa(t) be a time-domain signal. What do you mean by a time-domain signal? Here is the time axis and here is the amplitude, and the signal varies like this; that is the time-domain representation of the analogue signal. An analogue signal is continuous — you can see this is a continuous signal. Now I want to convert this signal to a digital signal, x[n]. Instead of xa(t), I get x[n], indexed by the integer n: a digital signal is nothing but a sequence of numbers.
So how do you do it? First, the time axis is continuous; I have to convert this continuous time into instants with a fixed interval, Δt. So I am sampling xa(t) — think of a switch: xa(t) is coming in, and I close the switch periodically, once every Δt. So at this Δt I take the signal, then again I take the signal here, and here, and here.
So what am I doing? I am replacing the continuous time t by nΔt: n = 0 means here, 1 means here, 2 means here, 3, 4, 5 and so on. If I think about the frequency of operation of this switch, it is 1/Δt, which is nothing but Fs, the sampling frequency.
Now, the sampling theorem — it is written on this slide, so I am not repeating it word for word — says that it is possible to completely recover the analogue signal from this sampled signal, subject to a condition. What is that condition? If the baseband bandwidth of the signal is B — that is, the highest frequency component of the signal is Fm — then I must sample the signal with Fs greater than or equal to 2Fm. That is the condition: it is possible to recover the signal if Fs ≥ 2Fm. Why greater than, and not always equal to? Think about a pure sinusoidal wave.
If Fs = 2Fm, then for a pure sinusoid all the samples can land at the same points — for example, right at the zero crossings — and I cannot recover the signal. So in that case I require Fs > 2Fm, ok.
So, instead of taking the analogue signal continuously, I am taking samples of the analogue signal at certain instants.
How do we take the sample instants? By multiplying the signal with an impulse train whose frequency is Fs. So when an impulse is present, I take a measurement of the signal; that rate is Fs. Now, if Fs increases, this gap decreases: if Fs increases, Δt decreases.
So, if my sampling frequency is much larger, I can capture the signal more accurately. The lower limit, the guideline, is that Fs must be greater than or equal to 2Fm, but above that I can take any sampling frequency. Say the audible range is 20 hertz to 20 kilohertz; I can take the sampling frequency as 44 kilohertz, which is above 2 times 20 kilohertz, that is, above 40 kilohertz. And if I increase the sampling frequency further, as I said, I capture the signal more accurately, ok.
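To make the sampling idea concrete, here is a minimal Python sketch — my own illustration, not from the lecture; the tone frequency and the two sampling rates are just assumed example values. It samples a 5 kHz sinusoid once at a rate well above 2Fm and once below it, and the second case aliases to a lower frequency.

```python
import numpy as np

def sample_tone(f_tone, fs, duration=0.01):
    """Sample a pure sinusoid of frequency f_tone (Hz) at rate fs (Hz)."""
    n = np.arange(int(duration * fs))          # sample indices n = 0, 1, 2, ...
    return np.cos(2 * np.pi * f_tone * n / fs)

f_tone = 5000                                   # 5 kHz tone, so 2*Fm = 10 kHz
fs_high, fs_low = 44000, 8000                   # one rate above 2*Fm, one below

x_good = sample_tone(f_tone, fs_high)           # Fs > 2*Fm: recoverable
x_bad = sample_tone(f_tone, fs_low)             # Fs < 2*Fm: aliased

# The undersampled tone shows up near |f_tone - fs_low| = 3 kHz instead of 5 kHz.
for label, x, fs in [("44 kHz", x_good, fs_high), ("8 kHz", x_bad, fs_low)]:
    spectrum = np.abs(np.fft.rfft(x))
    peak_hz = np.argmax(spectrum) * fs / len(x)  # frequency of the largest FFT bin
    print(f"sampled at {label}: spectral peak near {peak_hz:.0f} Hz")
```

Running it shows the peak at 5000 Hz for the 44 kHz case and near 3000 Hz for the 8 kHz case, which is exactly the aliasing the theorem warns about.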
Now, sometimes you see that although the audible range is 20 hertz to 20 kilohertz, we take a sampling frequency of, say, 16 kilohertz. What does that mean? It means I am restricting the speech signal to the band I am interested in. If the sampling frequency is 16 kilohertz, the highest frequency of the speech signal that can be represented is 8 kilohertz. So the speech has to pass through a low-pass filter whose cut-off frequency is 8 kilohertz; everything above 8 kilohertz is brought down to zero, or you can say those frequencies are discarded. So, based on the sampling frequency, you know up to what frequency you can capture the speech signal. That is the idea of the sampling frequency. In speech we sometimes use 16 kilohertz, which means the highest frequency component of the signal is 8 kilohertz; if it is 22 kilohertz, then it is 11 kilohertz. So that is the sampling frequency; the other aspect is quantization.
So, quantization: as I said, I am taking instantaneous values of the signal. Each value is nothing but a voltage — the amplitude is a voltage, let us say from +5 volts to -5 volts. This voltage range must be converted into steps, and each step can be represented by a binary number, because your computer works with (Refer Time: 10:55) binary numbers. So let us say the signal varies from -5 volts to +5 volts — let me draw it neatly so that you can understand: I have a signal which varies between +5 volts and -5 volts, ok.
(Refer Slide Time: 11:09)
So, I divide this whole 10-volt range into some levels. How do we decide the levels? Let us say I represent these 10 volts with an 8-bit number. What is the highest count with 8 bits? 2 to the power 8, so 2^8 = 256 levels. If the signal is both negative and positive, one bit effectively goes to the sign.
So the range is about -127 to +127 (or -128 to +127 including 0 — I am not going into that detail), with 0 in the middle, positive on one side and negative on the other. So with an 8-bit number I have divided these 10 volts into 256 levels, a plus side and a minus side. Instead of 8 bits I can use 16 bits, so I can divide the range into 2^16 levels.
Once I divide the range into much smaller gaps, the accuracy of the quantization increases, or the quantization error decreases. I am not discussing the quantization error in detail here. What I am saying is that if I record the speech signal with high quantization resolution, the quantization error of the signal is reduced. So, when I go for a good-quality recording, the quantization level should be high — more than 8 bits.
(Refer Slide Time: 13:26)
If I use 12 bits or 16 bits, the quantization error is less, but there is another aspect.
What is that other aspect? Once I increase the quantization level, the data size increases: if it is 8-bit, one sample is represented by 8 bits; if it is 16-bit, a single sample is represented by 16 bits. Similarly, if the sampling rate is 8 kilohertz, then in one second I generate 8 k samples. If each sample is represented by 8 bits, then for one second of speech recording I get 8 kilobytes, because 8 bits is 1 byte. Similarly, if I quantize with 16 bits, it will be 16 kilobytes, because 16 bits is 2 bytes.
So when I increase the quantization level, my signal-to-noise ratio increases — that is, the quantization error decreases — but my memory size increases. Similarly, once I increase the sampling frequency, the accuracy of the digital signal increases, but the size of the file also increases: instead of 8 k, if it is 16 k, then in one second I generate 16 k samples, and if each sample is 8 bits that is 16 kilobytes. So there is a trade-off: what quantization level should I use, and what sampling frequency should I use? For example, when you record the speech signal, suppose you are recording telephone speech, whose bandwidth is 300 hertz to 3.3 kilohertz.
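Here is a minimal sketch of that trade-off — my own illustration, not from the lecture; the 440 Hz test tone and the ±5 volt range are assumed example values. It quantizes the same signal with 8 and 16 bits and reports the quantization error and the memory needed for one second of mono PCM.

```python
import numpy as np

def quantize(x, n_bits, v_max=5.0):
    """Uniformly quantize x (volts) to n_bits over the range [-v_max, +v_max]."""
    levels = 2 ** n_bits
    step = 2 * v_max / levels                  # width of one quantization step
    q = np.round(x / step) * step              # snap each sample to the nearest level
    return np.clip(q, -v_max, v_max - step)

fs = 8000                                      # 8 kHz sampling, as in the lecture
t = np.arange(fs) / fs                         # one second of sample instants
x = 5.0 * np.sin(2 * np.pi * 440 * t)          # a 440 Hz tone swinging +/- 5 V

for bits in (8, 16):
    err = x - quantize(x, bits)
    rms_err = np.sqrt(np.mean(err ** 2))
    bytes_per_second = fs * bits // 8          # storage for one second, mono PCM
    print(f"{bits}-bit: RMS quantization error {rms_err:.5f} V, "
          f"{bytes_per_second} bytes per second")
```

The 16-bit version has a far smaller error but needs twice the memory, which is exactly the trade-off described above.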
(Refer Slide Time: 15:29)
So, if my signal is band-limited to 3.3 kilohertz, I need not sample this signal with a very high sampling frequency; 8 kilohertz is sufficient — roughly double the highest frequency — to sample it.
That is why, for the telephone bandwidth, if the sampling frequency is 8 kilohertz, then 4 kilohertz is the maximum frequency which can be present in the speech. So now I give you a problem — I am not going into all the details of digitization; this is just a problem.
An audio signal is recorded using the following format; to store 50 milliseconds of signal in PCM WAV format, how much memory is required? The sampling frequency is 8 kilohertz, encoded with 16 bits, and recorded in a mono channel. If it is 8 kilohertz, then in one second it generates 8 k samples; so for 50 milliseconds I must work out how many samples are generated and, knowing how many bytes each sample takes, I can calculate how much memory is required. Similarly, there is another conversion I should teach here: sometimes the conversion from number of samples to time is very important. Suppose I have a speech signal whose sampling frequency is 16 kilohertz.
How many samples will there be within a 20-millisecond window? In one second there are 16 k samples, so in 20 milliseconds there are 16000 × 0.020 = 320 samples. So 320 samples represent 20 milliseconds. If it is 8 kilohertz, how many samples will there be? Half of this, because the rate is half. If it is 44 kilohertz, then you can calculate how many samples there will be.
Instead of 20 milliseconds I can say 30 milliseconds; so once I say a window size of 40 milliseconds, you know how many samples will be there. Similarly, if I take 320 samples at 16 kilohertz, the time-domain duration is 20 milliseconds — I can convert in both directions. This conversion between number of samples and time will be used frequently in speech processing.
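Here is a minimal sketch of the two calculations just posed; the numbers follow the lecture's own examples, and it is only arithmetic, but it is the kind of conversion you will do constantly.

```python
# Memory for 50 ms of PCM WAV: 8 kHz sampling, 16-bit samples, mono channel.
fs = 8000                                  # samples per second
bits = 16                                  # bits per sample
duration = 0.050                           # 50 milliseconds
n_samples = int(fs * duration)             # 400 samples in 50 ms
memory_bytes = n_samples * bits // 8       # 2 bytes per 16-bit sample -> 800 bytes
print(n_samples, "samples,", memory_bytes, "bytes (plus the WAV header)")

# Samples <-> time conversion: a 20 ms window at 16 kHz.
fs = 16000
window_ms = 20
samples_in_window = fs * window_ms // 1000  # 320 samples, as worked out above
print(samples_in_window, "samples in a", window_ms, "ms window at", fs, "Hz")
```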
So this is the basic idea of the digitization of speech. Now, some expertise is required in how to record the speech.
For normal recording, if you see, I have a collar mic connected — since it is a wireless mic, it is connected to a wireless transmitter — and there is a sound card. On one side this mic converts the talk into an electrical signal; this electrical signal goes to the sound card, the sound card digitizes the signal, and it is recorded. So there is a microphone, then an amplifier if required — sometimes the microphone is directly connected to the sound card, which contains the amplifier — then the digital recording device, which converts the analogue signal to a digital signal. Then, if I want to play it back, I require a speaker or headphones, those kinds of things, ok.
So, what digital recording does is the digitization and storing of the sound. Digitization requires the sampling frequency and the number of quantization bits, and a choice of stereo or mono. I have not discussed that in detail, but in most cases sound recording for our purpose is done in mono, since it is a single channel — I want to record a single human voice. So when you analyse your voice on the computer, you should record the signal in mono format, not stereo format; stereo is not required,
because there are not two sources; a single source is speaking, so mono is sufficient. As for how this is stored in the computer, there are different kinds of file formats. What is stereo and what is mono? Stereo means two channels; mono means a single channel. I have not explained stereo sound in detail; there is stereo sound, there is 5.1-channel sound, and there is Dolby Digital. Dolby Digital is again a format — you can say it is a sound compression procedure rather than a channel layout — so Dolby Digital is a format, not a stereo/mono kind of thing. We have heard 5.1-channel sound, mono sound, stereo sound: stereo means two channels, mono means a single channel, 5.1 means 5.1 channels. I am not discussing that part here; there is another course, Audio System Engineering, where I have discussed mono and stereo and those things.
So, what kinds of file formats are there? There are lots of file formats: .wav, .nsp, .mp3, .mp4, .wma, .dss, .au, etc.
Some of these are coded (compressed) speech formats and some are non-coded, because they just store the samples. MP3, for example, is compressed. So sound can be stored either compressed or uncompressed. If I store the signal — say it generates 8 kilobytes in one second — I can compress it and store it, or store it with no compression; and when I want to play the sound, I decompress the signal and play it as a waveform.
Similarly, when I process the sound, I have to access it sample by sample. So if it is stored in a compressed format, I must decompress it, process it sample by sample, and then I can compress it again. Now, in most audio compression — mp3, mp4, wma, dss — there is a loss: if I compress the sound, I lose some of it, and I cannot recover those parts once I decompress it and process the sound sample by sample. So, if I want to store a simple recording sample by sample, I use PCM WAV. There is a PCM WAV format header — the Microsoft PCM WAV format header.
So, once I get a sound file in .wav format, it is a binary file that contains the whole recording of the sound, and it has a header. If you look at this header, different chunk positions carry different information. I have to know the sampling frequency of this sound and the encoding bit depth of this sound: if it is 16-bit, then 2 bytes represent one sample; if it is 8-bit, then 8 bits represent one sample. Whether it is stereo or mono also matters: if it is stereo, there is a two-channel recording; if it is mono, there is a single-channel recording. The sampling frequency is important because unless I know the sampling frequency I cannot process it — digital processing of this sound file requires the sampling frequency. So the sampling frequency is also stored in the header, along with the sample rate, byte rate, block size, and the sub-chunks. Those format details are given in the slides. So I have described the uncompressed and compressed digital formats, and then I can go on to microphone selection.
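Before moving on to microphone selection, here is a minimal sketch of how you might read exactly this header information (sampling frequency, bit depth, channels) and the raw samples from an uncompressed PCM .wav file, using Python's standard wave module; the file name "my_name.wav" is just a placeholder for whatever you have recorded.

```python
import wave
import numpy as np

# "my_name.wav" is a placeholder: any uncompressed PCM WAV you have recorded.
with wave.open("my_name.wav", "rb") as wav_file:
    fs = wav_file.getframerate()             # sampling frequency stored in the header
    n_channels = wav_file.getnchannels()     # 1 = mono, 2 = stereo
    sample_width = wav_file.getsampwidth()   # bytes per sample: 1 -> 8-bit, 2 -> 16-bit
    n_frames = wav_file.getnframes()
    raw = wav_file.readframes(n_frames)

print(f"fs = {fs} Hz, channels = {n_channels}, {8 * sample_width}-bit, "
      f"{n_frames} samples = {n_frames / fs:.3f} s")

# For 16-bit PCM, every sample is a signed 2-byte integer.
if sample_width == 2:
    samples = np.frombuffer(raw, dtype=np.int16)
```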
(Refer Slide Time: 23:39)
Suppose I ask you to record your name on your computer. How do you record it? You may use your headset microphone connected to the computer sound card and record the sound. But once a human being produces speech, the range is 20 hertz to 20 kilohertz; the acoustic wave contains sound from 20 hertz to 20 kilohertz.
Once I put the microphone in front of the mouth, the microphone has limitations. The microphone has its own properties with which it converts the acoustic wave to an electrical signal: it has a sensitivity, it has a frequency response, it has a pickup pattern — all kinds of restrictions are there. Those who know the details of microphones know that there are a lot of microphone parameters.
Suppose I am recording speech with a microphone whose frequency response is, say, 100 hertz to 10 kilohertz. Then I cannot capture the acoustic wave above 10 kilohertz: even when it is present in the human speech, I cannot get it, because my microphone has a limitation — it can only transfer the acoustic wave to an electrical signal within 100 hertz to 10 kilohertz.
So if I apply a 12-kilohertz acoustic wave, it will not produce any signal, and I get no information in the electrical signal that I then pass to the digitizer. So the selection of the microphone — its frequency response, its type, how to connect it — is all very important. Which microphone I should use depends on what kind of application I want to develop.
Suppose I want to develop speech coding for a telephone channel; then I need not use a very high-end microphone whose frequency response extends to 20 kilohertz, because my signal is band-limited to 4 kilohertz. A microphone whose frequency response is good up to 10 kilohertz is fine for recording the telephone bandwidth; even a microphone with a response up to 5 kilohertz is enough for recording telephone-channel signals. Now suppose I want to develop a speech corpus for research purposes. If I want that corpus to be used only for scientific research, and I want it to contain the whole frequency range of human speech, then I should use a very high-end microphone, whose frequency response may be 15 kilohertz or even 20 kilohertz, ok.
So there is the microphone, and then there is the amplifier; the amplifier's frequency response is also important. The overall frequency response of the amplifier and microphone together should be higher than the frequency range I want to record. And, as you may know, there are different kinds of microphones — hand-held microphones, head-mounted microphones, and so on.
(Refer Slide Time: 27:19)
Each microphone has its advantages and disadvantages; I am not going into the details of microphones. There are carbon mics, piezoelectric mics, dynamic mics, condenser mics, ribbon mics; every mic has its own frequency response and its own advantages and disadvantages. Then there is the connector, and then there is the directivity.
A microphone may be omnidirectional, bidirectional, or unidirectional; the directionality tells you from which directions the microphone picks up sound. If I use an omnidirectional microphone, it can pick up sound from any direction; if it is unidirectional, it can only pick up sound from one direction. That is depicted in the polar (Refer Time: 28:11) pattern in the slides, and there is also an example of the frequency response of a microphone.
Then there are recording issues: keep the mic-to-mouth distance relatively constant within and across recording sessions. Setting your recording level is very important. There is one very important issue called clipping, which I will show you in software. Clipping is a very important issue during recording, and very often people who record a signal find that the signal is clipped. What do you mean by clipping?
(Refer Slide Time: 28:52)
Now, suppose I have my ADC, and the recording volume controls the level going into the ADC.
Suppose this ADC can handle +5 volts to -5 volts. Now, if I amplify my sound to +7 volts to -7 volts — suppose within the ±5-volt range there is a sine wave varying like this — then, since my limit is 5 volts, if I apply 7 volts, everything above 5 volts becomes flat. After 5 volts the response is flat.
So what happens if it is flat? I do not get this portion of the waveform; I get a flat top. If it is clipped, it means you have created something like a square wave there, so I cannot get the original frequency content of the signal; instead I get all the extra frequency content of the squared-off wave. So if you record a signal which is clipped, the whole purpose of your recording is gone, because clipping also introduces a lot of spurious frequency content. So be sure that the signal is not clipped — that the recording level does not clip. And environmental noise is also very important.
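A quick way to check for the clipping problem just described: here is a minimal sketch — my own check, not a tool from the lecture — that flags samples sitting at, or very near, the largest value the ADC can represent.

```python
import numpy as np

def fraction_clipped(samples, n_bits=16, margin=2):
    """Fraction of samples at (or within `margin` counts of) full scale.

    `samples` is an integer PCM array; a clipped recording shows flat runs at
    the extreme values, so a noticeable fraction of samples sits there.
    """
    full_scale = 2 ** (n_bits - 1) - 1                 # e.g. 32767 for 16-bit audio
    clipped = np.abs(samples.astype(np.int64)) >= full_scale - margin
    return clipped.mean()

# Synthetic example: a sine wave amplified beyond full scale and then clipped.
full_scale = 32767
t = np.arange(8000) / 8000.0
x = np.clip(1.4 * full_scale * np.sin(2 * np.pi * 200 * t), -full_scale, full_scale)
print(f"{100 * fraction_clipped(x.astype(np.int16)):.1f}% of samples are clipped")
```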
(Refer Slide Time: 30:39)
And there are free software tools for recording speech: I have used Cool Edit Pro, I have used WaveSurfer, Praat — you can use any of them. You can just go to Google, type Praat, download it, and install it.
Now, what homework, or you can say what practice, can I give? Use this recording procedure and record your name, let us say 5 times — record your name 5 times —
such that the signal is not clipped, and see the highest frequency content of your signal. In the next class I will show you, using this software, how to see the highest frequency content of the signal and all those kinds of things, and see how much space it takes. Then think about this: if I tell you to cut a window from this sample number to that sample number, how do you do that in a program? That is also very important. So please practise those things, ok.
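For that last exercise — cutting out a window from sample number n1 to n2 in a program — here is one minimal way it could be done, assuming a mono 16-bit PCM recording; the file name and the sample indices are placeholders, not values from the lecture.

```python
import wave
import numpy as np

def cut_window(path, start_sample, end_sample):
    """Return samples[start_sample:end_sample] from a mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as f:
        fs = f.getframerate()
        samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    window = samples[start_sample:end_sample]
    print(f"cut {len(window)} samples = {1000 * len(window) / fs:.1f} ms at {fs} Hz")
    return window

# e.g. samples 1600 to 1920 of "my_name.wav" -> a 20 ms window at 16 kHz
# window = cut_window("my_name.wav", 1600, 1920)
```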
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 03
Review of DSP Concepts
So, before we start digital speech processing, some signal processing concepts must be reviewed, because they will be used frequently in many places; otherwise you may find it difficult to conceptualise a DSP problem. So what I will do is just review some DSP concepts, without going into the details, because the details belong to a separate subject called digital signal processing; here it is just the base needed for digital speech processing. Whatever concerns acoustic phonetics and speech production is different — that does not require much DSP — but once I start explaining digital speech processing and modelling, discrete-time modelling, some DSP concepts are required.
So I will quickly go through some DSP concepts. I will not discuss the detailed mathematics; I will only discuss the concepts behind the DSP algorithms. So let us start with the continuous signal.
(Refer Slide Time: 01:25)
So, if you know, this Ω always represents the continuous frequency. If I write a sinusoidal signal, let A be the amplitude: A cos(Ωt + θ), where θ is the phase. This Ω is the continuous frequency. Now, what are its properties? Different values of Ω give distinct frequencies: if it is a 500-hertz sinusoid, the frequency is 500 hertz. In the time domain, suppose the signal looks like this; you can see it has an oscillation, and the number of periods per second gives me the frequency.
So there is an oscillation. Now, if I increase the frequency, what happens? The number of oscillations, the number of periods per second, increases. So high frequency means more oscillations; that is the property — more periods are included in one second, more periods are included in a given time. So this is the low-frequency signal, and this is the high-frequency signal. That is the concept for the analogue signal. Now, to make it a digital signal, I take this cos(Ωt), sample it, and store it as a digital signal.
So, if xa(t) is a continuous signal, then what is x[n]? x[n] is the digital signal. x[n] is nothing but A cos(ωn + θ): instead of the continuous Ω we write this ω, which is called the discrete frequency. The continuous frequency is in radians per second, and this discrete frequency is per sample; I will come to the details.
So it is written like this. Now, what are the properties? A discrete-time signal is periodic only if its frequency f is a rational number. Why has the small f come in, and what is ω? ω is nothing but 2πf. For the digital signal, what I have done is replace t by nT, so Ωt becomes ΩnT, and the discrete frequency ω turns out to be nothing but 2πf:
x[n] = A cos(ΩnT + θ) = A cos(2πF nT + θ)
ωn = ΩnT = 2πF nT = 2πF n/Fs
ω = 2πF/Fs = 2πf,   f = F/Fs
So the discrete-time signal is periodic only if this f is a rational number — I am not going into the number system and what a rational number is, ok. Next, discrete-time sinusoids whose frequencies are separated by an integer multiple of 2π are identical — very simple: two discrete sinusoids whose frequencies differ by an integer multiple of 2π are the same signal. So 2π is the maximum range, and there is a highest rate of oscillation, a highest value of the small f. What is the highest value of small f?
Fs = 2F  ⇒  f = F/Fs = F/2F = 1/2
−∞ ≤ Ω ≤ ∞
−1/2 ≤ f ≤ 1/2
ω = 2πf  ⇒  −π ≤ ω ≤ π
(Refer Slide Time: 06:49)
So, small ω represents the discrete frequency and capital Ω represents the analogue frequency: this one is in radians per second and this one is in radians per sample. Sometimes this ω is also called the normalized frequency.
For example, if I have a signal whose frequency F is 50 kilohertz and the sampling frequency is 100 kilohertz, what is the value of ω? ω is nothing but 2πF/Fs, which is 2π × 50/100, which is nothing but π.
As another example, with F = 40 kilohertz and Fs = 100 kilohertz:
ω = 2πF/Fs = 2π × 40/100 = 4π/5
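As a quick check of this relation, the conversion from analogue frequency F to discrete frequency ω is a one-liner; this is just a sketch of the formula above, with the lecture's two example values.

```python
import math

def discrete_frequency(f_hz, fs_hz):
    """omega = 2*pi*F/Fs, in radians per sample."""
    return 2 * math.pi * f_hz / fs_hz

print(discrete_frequency(50_000, 100_000) / math.pi)   # 1.0 -> omega = pi
print(discrete_frequency(40_000, 100_000) / math.pi)   # 0.8 -> omega = 4*pi/5
```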
So this ω is called the discrete frequency. This concept is very important; it will be used in many places. So this Ω is continuous, this ω is discrete; this one is radians per second, this one is radians per sample — is that clear? The next concept is the complex signal. Everybody knows what a complex signal is: it is a vector — everybody knows a complex vector — and a complex signal has an amplitude and a phase.
r = √(α² + β²)
θ = tan⁻¹(β/α)
(Refer Slide Time: 09:13)
x[n] = ∑_{k=−∞}^{∞} x[k] δ[n − k]
So these a coefficients multiply shifted impulses, and the δ terms give their positions: at one position the sample has one amplitude, at another position it has another. A term like a δ[n + 3] puts a sample of that amplitude at the point n = −3, another term puts a sample at position 1, another at position 2, and in that way the whole sequence is represented.
So this is the impulse representation of a sequence. If I have a speech signal x[n] with, say, 100 samples, and the samples run from 0 to 100, then the first sample is nothing but its amplitude a1 times δ[n − 0], the next is a2 δ[n − 1], and so on. This representation will be used at various times when you process the digital signal.
Now, the next concept is the classification of signals — the different types of signals: energy signals and power signals. I am not describing these in detail; you know that if the energy is finite it is called an energy signal, and if the power is finite it is called a power signal. Then periodic and aperiodic signals: a signal is periodic if it repeats its nature after a certain time interval. If, in the time domain, it repeats after N samples, then I can say N samples is the period of the signal. So x[n] is periodic with period N if and only if x[n+N] = x[n].
(Refer Slide Time: 11:30)
So, if this point and the next repeating point are the same, then I can say the signal repeats itself, and that is a period. For a digital signal, if N is the period, then x[n+N] = x[n], and you call the signal periodic. Then there are symmetric and anti-symmetric signals: a real-valued signal x[n] is called symmetric if x[-n] = x[n]; on the other hand, it is anti-symmetric if x[-n] = -x[n].
So those are signal properties; next, discrete-time systems. Signals can be periodic or aperiodic, symmetric or anti-symmetric; and there are also stationary and non-stationary signals, which I have not yet discussed. What do you mean by a stationary signal? If the signal does not change its properties over time, then I can say it is stationary. Suppose I have a 500-hertz sinusoid: whether I take this stretch of time or that stretch of time, it is a 500-hertz sinusoid all the time. So I can say the signal is stationary; it does not change its properties along time.
Now take the speech signal: if I say a sentence, along time the signal properties vary — sometimes it is silence, sometimes it is voiced, sometimes it is silence again, sometimes it is frication, then maybe silence, then maybe voice. So along time the signal changes its properties; it is a time-varying signal. Later on you will learn that the signal processing algorithms are applied to a stationary signal, so I must consider how to handle the speech signal, which is non-stationary — we treat it as stationary over short segments. Those kinds of considerations we will take. So that is stationary and non-stationary signals.
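Since the algorithms assume stationarity, a common trick — sketched below with typical window values that I am assuming, not values fixed by the lecture — is to cut speech into short frames over which it can be treated as quasi-stationary.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, shift_ms=10):
    """Split x into short overlapping frames, each assumed quasi-stationary."""
    frame_len = int(fs * frame_ms / 1000)      # e.g. 320 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)          # hop between successive frames
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])

fs = 16000
x = np.random.randn(fs)                        # one second of stand-in "speech"
frames = frame_signal(x, fs)
print(frames.shape)                            # (99, 320): 99 frames of 20 ms each
```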
Then I come to systems — here, discrete systems; I am not going into analogue systems, which you know from the signals and systems class.
y[n] = H{x[n]}
y[n] = 0.8y[n−1] + 0.5x[n] + 0.9x[n−1]
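As a concrete example of a discrete-time system, the difference equation above can be run sample by sample; this is a minimal sketch based only on that equation, nothing more from the slides.

```python
def run_system(x):
    """y[n] = 0.8*y[n-1] + 0.5*x[n] + 0.9*x[n-1], with zero initial conditions."""
    y = []
    y_prev, x_prev = 0.0, 0.0
    for x_n in x:
        y_n = 0.8 * y_prev + 0.5 * x_n + 0.9 * x_prev
        y.append(y_n)
        y_prev, x_prev = y_n, x_n
    return y

# Impulse input: the output is the impulse response of this recursive system.
print(run_system([1.0, 0.0, 0.0, 0.0, 0.0]))
```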
44
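As a rough sketch (my own, not from the slides), the example difference equation above can be simulated sample by sample; the initial conditions y[−1] = x[−1] = 0 are my assumption (system at rest):

import numpy as np

def run_system(x):
    """Simulate y[n] = 0.8*y[n-1] + 0.5*x[n] + 0.9*x[n-1] sample by sample."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y_prev = y[n - 1] if n >= 1 else 0.0   # y[-1] assumed zero
        x_prev = x[n - 1] if n >= 1 else 0.0   # x[-1] assumed zero
        y[n] = 0.8 * y_prev + 0.5 * x[n] + 0.9 * x_prev
    return y

print(run_system(np.array([1.0, 0.0, 0.0, 0.0])))   # first few samples of the impulse response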
(Refer Slide Time: 16:22)
Now, a system has certain properties; I will come to implementation later on. What are the properties? It may be a static or a dynamic system. A discrete system is called static, or memoryless, if its output at any instant n depends at most on the input sample at the same instant, but not on past or future samples of the input.

So, if my y[n] depends only on x[n], then I can say my system is static, a memoryless system. If it also depends on past samples such as x[n−1], or on future samples, then it is a system with memory, that is, a
dynamic system. Then come time-invariant and time-variant systems; for signals the corresponding notions are stationary and non-stationary, but for systems we speak of time-variant and time-invariant.

So, if the system does not change its properties over time, I call it a time-invariant system; if it changes along with time, it is a time-variant system. If I consider the vocal tract as a system, it changes its configuration along time, so it is a time-variant system. Suppose instead the system is a squaring operator: whatever input comes, it will always be squared, so the property does not change along time and it is a time-invariant system.

How do you say it mathematically? If x[n] is the input of a system H and the output is y[n], then the system is time invariant if applying x[n−k] gives y[n−k]; along time the system does not change its property.
(Refer Slide Time: 19:04)
So, if the system supports the superposition principle, I say the system is a linear system. If the system is both linear and time invariant, then we call it an LTI system, a linear time-invariant system; that means the system is linear and does not change its properties along time. Then you know causal and non-causal systems. If the output does not depend on future inputs, I call the system causal: y[n] depends only on x[n], x[n−1] and so on, but not on future inputs such as x[n+1]. If it does not depend on x[n+1], it is a causal system; if it does depend on it, it is a non-causal system.
Then stable and unstable systems. What is stability? If I apply a bounded input to the system, it should produce a bounded output. This is BIBO stability: bounded input, bounded output. If every bounded input produces a bounded output, the system is stable; otherwise it is unstable.
(Refer Slide Time: 20:55)
Then there are recursive and non-recursive systems; that slide is self-explanatory. If the output depends on previous outputs, we call it a recursive system; I am not describing the details. Then, the most commonly used signal processing operation in speech is convolution. Convolution is used whenever I want to find the output of a system for a given input: the system impulse response is convolved with the input to produce the output of the system. So, convolution is the most used operation in DSP.
y[n] = ∑_{k=−∞}^{∞} h[k] x[n − k]
So, convolution means: suppose this is my system and I apply a signal here; the property of the system modifies the input signal to produce the output. The whole impulse response of the system acts on every input sample, that is, the whole h[k] modifies each input. So, this is convolution.
(Refer Slide Time: 23:02)
How is it done? The operations involved are folding, shifting, multiplication and summation. If you look at this equation, for n = 0 the output is nothing but ∑ h[k]·x[−k]; so x[k] is folded to produce x[−k].
So, I can give an example. Let this be my signal, and it has to be convolved with this one; the green one, say, is the system. Sorry, to avoid a wrong interpretation, let us fix that this is h and this is x. So, x is convolved with h. What I said is that one of them has to be folded and shifted: to get y at t = 0, fold first, then multiply sample by sample and sum; then shift the folded signal by one sample for t = 1 and again take the product and sum; then for t = 2 shift again, product and sum. That way the output will be produced.
Let us take an example. Let x[n] be the samples 1, 1, 1, 1 at positions n = 0, 1, 2, 3, and let h[n] be 1, 1, 1; I am not giving the same example which is in the slide. So, there are two signals. Which signal has to be folded? Let us fold x[n]. If I draw it as a pictogram, this is x[n]: first sample, second sample, third sample, fourth sample at positions 0, 1, 2, 3. If I fold it, the samples go to the other side, and if I plot h[n], it has three samples at positions 1, 2, 3. So, for y[0] I first fold x and multiply it sample by sample with h; let me draw it neatly with this pen.
So, this is my x[n]: sample numbers 1, 2, 3 and 4; and this is my h[n]: sample number 1, sample number 2, sample number 3, three samples. Now, to produce the convolution, x[n] has to be folded, so I fold x[n] and draw it here: folding sends samples 2, 3 and 4 to the other side, and that gives x[−k]. Then I place h[k] with its three samples. Now, what is my y? At y[0] I multiply the overlapping samples; wherever the folded signal projects onto a position where the other signal is zero, the product is simply 0.
So, I take the products and sum them: this sample multiplied with this, plus that sample multiplied with this, and so on, for example 0·1 + 0·1 + 0·1 + 1·1 + ...; adding them up gives the first output sample y[0]. To get the next output sample, I shift the folded signal by one position, so the first, second, third and fourth samples each move by one; then again I multiply the overlapping pairs and add them together to get y[1]. Continuing that way I get the whole y[n]. So, that is why convolution is folding, shifting, product and sum.
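As a small sketch (my own, not from the slides), the same fold–shift–multiply–sum procedure can be checked against numpy's built-in convolution, using the signal values of the example above:

import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])    # four-sample input
h = np.array([1.0, 1.0, 1.0])         # three-sample impulse response

# Manual convolution: y[n] = sum_k h[k] * x[n - k]  (fold, shift, multiply, sum)
y = np.zeros(len(x) + len(h) - 1)
for n in range(len(y)):
    for k in range(len(h)):
        if 0 <= n - k < len(x):
            y[n] += h[k] * x[n - k]

print(y)                               # [1. 2. 3. 3. 2. 1.]
print(np.convolve(x, h))               # same result from the library routine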
Then there is a term called circular convolution. Circular convolution of x[n] and h[n] is defined as the convolution of h[n] with a periodic extension xp[n] of the signal; if the signal is treated as periodic, then we can do a circular convolution. The mathematics is on the slide; I am not going into the details of circular convolution.
(Refer Slide Time: 29:51)
And suppose there are two signals and we want to find out whether they are similar or not; for that we use correlation, the relation between, let us say, x[n] and another signal x1[n]. So, the correlation between x[n] and x1[n] is nothing but the similarity between these two signals. How do we find it? Since they are digital signals, I take them sample by sample, multiply the two samples and take the sum; the product and sum is the correlation.
So, the l-th correlation coefficient r[l] is nothing but the product and sum r[l] = ∑_{n=−∞}^{∞} x[n]·x1[n − l], or equivalently x[n]·x1[n + l]; it does not matter whether I shift the signal this way or that way, both give the correlation. If it is the 0-th coefficient, the signals are compared without any shift; if l = 5, I shift the second signal by five samples, so a few samples are left out and I compare the remaining portion with the other signal; if l = 10, ten samples are left out and I compare the rest with the whole signal. So, that is the correlation.
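A minimal numerical sketch of this product-and-sum definition (my own illustration, with numpy's correlate shown for comparison; the sample values are arbitrary):

import numpy as np

x  = np.array([1.0, 2.0, 3.0, 4.0])
x1 = np.array([0.0, 1.0, 2.0, 3.0])

def corr_at_lag(x, x1, lag):
    """r[l] = sum_n x[n] * x1[n - l], samples outside the arrays treated as zero."""
    total = 0.0
    for n in range(len(x)):
        if 0 <= n - lag < len(x1):
            total += x[n] * x1[n - lag]
    return total

lags = range(-(len(x1) - 1), len(x))
print([corr_at_lag(x, x1, l) for l in lags])
print(np.correlate(x, x1, mode="full"))     # all lags at once, for comparison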
For details there is a more thorough discussion in signal processing books; you can refer to any signal processing book. The algorithm is also written down in the slides; you can go through them and find these things.
(Refer Slide Time: 31:54)
Then, after convolution and correlation, comes the LTI discrete-time system, which we have to consider because in digital speech processing we model the speech production system as an LTI system: a linear time-invariant discrete system. Linearity you know, time invariance you know, and the response of an LTI system to any input is simply the convolution of the input with its impulse response; I am not going into the details of why we take an LTI system.
(Refer Slide Time: 32:35)
Now, this is the same kind of system arrangement you have already done: if x[n] is the input, h1 and h2 are the two system transfer functions and y[n] is the output, then the cascade is commutable, h2 can come first and h1 second; all those properties you know. In a block diagram I can either add the branches in parallel or convolve them in cascade. Then I can draw the overall transfer function of this one: these two are in parallel, this one is in series with those two systems, and then again this whole combination is in parallel.
(Refer Slide Time: 34:09)
You can practice it. Now, implementation is important. How do we implement a discrete system? There are two standard structures: one is called direct form I and the other is called direct form II.
So, I can say the input is multiplied by c1, added with a delayed sample (a z⁻¹ block) multiplied by b1, and the output y[n] also comes back through a delay z⁻¹ multiplied by a1, which is added to the input. This is one implementation of the system.

Now, I said that if a system has, let us say, input x[n] going through a block H1 and then another block H2 giving y[n], then I can interchange the two blocks: H2 can come first and H1 second, and it does not affect the output. So, I can swap the two halves of the structure: x[n] first goes through the feedback branch with a1 z⁻¹, then through the feed-forward branch with z⁻¹ and b1, and finally c1 gives y[n]. The first structure is called direct form I because it requires two separate delays, an input delay and an output delay. After the interchange, the signal entering the two delay branches is the same, so instead of two delays I can replace them by a single delay line with coefficients a1 and b1, with c1 at the end, x[n] as input and y[n] as output. That is the advantage: the first one is direct form I, and this one is called the direct form II implementation. It is there in any DSP book.
(Refer Slide Time: 37:42)
As I said, this is one block and that is the other block; once you interchange them, the same signal appears at both delay inputs, so instead of two z⁻¹ delays I can club them together and put a single delay. That is called the direct form II implementation.
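As a hedged sketch (not part of the lecture), the example system y[n] = 0.8·y[n−1] + 0.5·x[n] + 0.9·x[n−1] can be run through scipy's lfilter, which internally uses a transposed direct form II structure; the b (feed-forward) and a (feedback) names follow the usual convention:

import numpy as np
from scipy.signal import lfilter

b = [0.5, 0.9]          # feed-forward coefficients: 0.5*x[n] + 0.9*x[n-1]
a = [1.0, -0.8]         # feedback coefficients: y[n] - 0.8*y[n-1] = ...
x = np.zeros(8)
x[0] = 1.0              # unit impulse input

y = lfilter(b, a, x)    # impulse response of the system
print(y)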
We will use this when we implement our vocal system using a discrete-time system. So, next class I will just complete these DSP topics.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 04
Review of DSP Concepts (Contd.)
Now, let us come to the frequency-domain representation of discrete signals and LTI systems.
So, what is there? Suppose I have an LTI system here whose impulse response is h[n]. If I provide an input x[n], I get an output y[n]. Now I want to know the frequency response of this h[n]. Mathematically I can do many things, so what is y[n]?

If I excite the system with one individual frequency at a time, each and every frequency, let us say 1 hertz, 2 hertz, 3 hertz, 4 hertz and so on, and find the output for each of them, I get the frequency response of the system. So, instead of an arbitrary x[n], let us input a single complex sinusoid:

x[n] = e^{jωn}.

I apply x[n] = e^{jωn} with ω fixed, let us say corresponding to 1 hertz; then I change ω for 2 hertz, then for 3 hertz, and so on. Here n is the sample index.
Then

y[n] = ∑_{k=−∞}^{∞} h[k] e^{jω(n−k)} = (∑_{k=−∞}^{∞} h[k] e^{−jωk}) · e^{jωn}.
(Refer Slide Time: 04:50)
So, I can write y[n] = e^{jωn} · H(e^{jω}), where H(e^{jω}) is the frequency response of that LTI system.

If you see, H(e^{jω}) is the frequency response of the said LTI system, and it is a complex quantity, so it has a magnitude and a phase: H(e^{jω}) = |H(e^{jω})| e^{jφ(ω)}, where φ(ω) is the phase. It also has a real part and an imaginary part; again, by complex number properties, the real part is the cosine part and the imaginary part is the sine part.
(Refer Slide Time: 05:31)
So, the magnitude response is obtained as for any complex number, √(real² + imaginary²), and the phase is the arc tangent of the imaginary part divided by the real part.

And the group delay function is τ(ω) = −dφ(ω)/dω.

These things will be used frequently in the signal processing of speech signals. So, these are the LTI system and signal concepts from DSP that will be used later on.
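As an illustrative sketch (my own, under the assumption that we reuse the earlier example system), the magnitude, phase and group delay of H(e^{jω}) can be computed with scipy:

import numpy as np
from scipy.signal import freqz, group_delay

b, a = [0.5, 0.9], [1.0, -0.8]             # example system from the earlier slide

w, H = freqz(b, a, worN=512)               # H(e^{jw}) on a grid of w in [0, pi)
magnitude = np.abs(H)
phase = np.unwrap(np.angle(H))

w_gd, tau = group_delay((b, a), w=512)     # tau(w) = -d(phase)/dw, in samples

print(magnitude[:3], phase[:3], tau[:3])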
(Refer Slide Time: 06:23)
There is another important DSP concept that will be used: the Fourier transforms of discrete signals.

In the discrete-time Fourier transform (DTFT), the time-domain signal is discrete but the frequency domain is continuous. In the discrete Fourier transform (DFT), the input is discrete and the frequency domain is also represented as discrete. The synthesis (inverse) relation of the DTFT is

x[n] = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω
So, in the DTFT, x[n] is discrete and X(e^{jω}) is continuous in ω, while in the DFT, x[n] is discrete and X[k] is also discrete; sometimes we write X[k] with a discrete frequency index k instead of ω. So, what is the concept behind this? I will describe it for the DFT.
(Refer Slide Time: 07:56)
X(e^{jω}) = ∑_{n=−∞}^{∞} x[n] e^{−jωn}
Now, when the frequency domain is also made discrete, that is called the DFT (discrete Fourier transform). So, it is not the discrete-time Fourier transform; it is the discrete Fourier transform.
So, ω is replaced by the discrete index k. The DTFT is

X(e^{jω}) = ∑_{n=−∞}^{∞} x[n] e^{−jωn}

and the DFT is

X[k] = ∑_{n=0}^{N−1} x[n] e^{−j2πnk/N}.
Why does X(e^{jω}) sum over −∞ to +∞ while X[k] runs over k = 0 to N−1? Because the frequency axis is also discrete: I am sampling the frequency scale. Suppose the frequency range is Fs and I want to discretize it; I divide this range into N discrete values. N is then the period: after N, the spectrum repeats itself.

So, X[k] has period N, and N is the number of discrete samples with which I have represented the frequency response: the continuous frequency axis is discretized into N values, and the DFT sum runs from n = 0 to N−1 as written above.
(Refer Slide Time: 11:02)
Now, suppose I have a signal with sampling frequency Fs = 8 kHz and N = 1000. I divide the whole frequency range of 8 kHz into 1000 equal pieces, so k varies from 0 to 999; that means consecutive k values, k = 0 and k = 1, are separated by 8 Hz. This spacing between two k values, 8 Hz here, is called the frequency resolution.

So, the frequency resolution is Fs/N.

Suppose I want to find the analogue frequency value at k = 10. Then

f = (Fs/N)·k = (8000/1000)·10 = 80 Hz.
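A small sketch (my own) of this k-to-frequency bookkeeping using numpy's FFT helpers; the 440 Hz test tone is a placeholder signal:

import numpy as np

fs = 8000                               # sampling frequency in Hz
N = 1000                                # DFT length
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 440.0 * t)       # a 440 Hz test tone

X = np.fft.rfft(x)                      # DFT of the real signal (bins 0 .. N/2)
freqs = np.fft.rfftfreq(N, d=1/fs)      # analogue frequency of each bin: k * fs / N

print(fs / N)                           # frequency resolution: 8.0 Hz
print(freqs[10])                        # bin k = 10 corresponds to 80.0 Hz
print(freqs[np.argmax(np.abs(X))])      # the peak lands at the bin nearest 440 Hz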
(Refer Slide Time: 13:06)
X[k] = ∑_{n=0}^{N−1} x[n] e^{−j2πnk/N}
Now, X[k] is a complex quantity, so it has an amplitude and a phase. Each value is of the form a + jb, so X[k] has a magnitude |X[k]| = √(a² + b²) and a phase θ = tan⁻¹(b/a).

Suppose I have a signal x[n] and I want its magnitude frequency response: I have to plot |X[k]|. After computing the DFT I get a + jb for each k, and I compute √(a² + b²); that is the amplitude. If it is a voltage-like amplitude, it has to be squared to make it a power spectrum.
So now, if I want to draw it, the axes are k and |X[k]|, with k varying from 0 to N−1. We can also label the axis with f instead of k; I can replace this axis by f very easily, because f = (8 kHz/1000)·k gives the corresponding frequency.
So, instead of 0, 1, 2, ... on the axis, I get the analogue frequencies f1, f2, f3, f4 for the different k values. Now, at k = 0 we have f = 0, so X[0] is nothing but the DC component of the signal. Those properties you have to know: the frequency resolution, the DC component, and how to do the k-to-f conversion. These things will be used in the implementation of digital speech processing.

The DFT also has a symmetry property: after the DFT, X[k] of a real signal is symmetric.
(Refer Slide Time: 16:53)
That means, if I draw |X[k]| versus k, the bin k = N−1 corresponds to (nearly) Fs. Fs/2 is the band limit of the signal, so the spectrum repeats, mirror-like, about the N/2 point; the bins 0 to (N/2 − 1) already give all the information for the baseband signal. So, if I use a 1000-point DFT, the first 500 points are repeated in the second half in a mirrored, symmetric fashion.
DFT magnitude: when the real input signal contains a sine wave component of peak amplitude A0 completing an integer number of cycles over the N input samples, the output magnitude at that bin is Mr = A0·N/2.

So, the amplitude of the component is magnified by N/2 for a real-valued input (for a complex exponential input it is magnified by N). So, when you do the inverse DFT, or interpret the magnitude, you should normalise the amplitude accordingly for a real input signal. And there is DFT leakage, which involves two kinds of effects, as you will see.
(Refer Slide Time: 18:51)
Suppose I have sampled a signal at 8 kHz, and the signal is a 50 hertz sinusoid. If I take N = 1000 samples, then the frequency resolution is 8 Hz, so the bins correspond to 0, 8, 16, 24, ... 40, 48 hertz and then the next one is 56 hertz.

So, my bin spacing does not hit 50 hertz exactly, but the signal has its power only at 50 hertz. So, if I take the DFT, instead of a single spike at 50 hertz, the power is distributed over the adjacent computation points and the spectrum looks smeared. This is the DFT computational leakage. There will be another effect, the windowing effect, which I will discuss later on when I design the filter.
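A quick sketch of that leakage effect (my own illustration): a 50 Hz tone at Fs = 8 kHz with N = 1000 falls between bins, so its energy spreads, while a 48 Hz tone sits exactly on a bin:

import numpy as np

fs, N = 8000, 1000
t = np.arange(N) / fs

for f0 in (48.0, 50.0):                      # 48 Hz is on a bin, 50 Hz is not
    x = np.sin(2 * np.pi * f0 * t)
    mag = np.abs(np.fft.rfft(x))
    top = np.argsort(mag)[-3:][::-1]         # three strongest bins
    print(f0, [(k * fs / N, round(mag[k], 1)) for k in top])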
Now, as you know, the implementation of the DFT (discrete Fourier transform) is done using the fast Fourier transform algorithm.
So, the FFT is not a different transform; the FFT is an algorithm to implement the DFT. That is why it is called the fast Fourier transform algorithm. The DFT gives X[k]; now I want to reduce the number of complex multiplications so that it can be computed much faster. Instead of the N × N multiplications of the direct computation, can I use some other method which reduces the computational cost of the DFT? One such method is the fast Fourier transform.

I will discuss only the radix-2 algorithm; the others you can study from a DSP book. I will just discuss the philosophy; this radix-2 algorithm is the most commonly used one.

If I want to reduce the N × N cost, I can argue from a computational background: if I have a search space and I use binary search, the cost becomes log₂ N. That means, instead of computing using the whole signal at once, if I can decimate the signal into small chunks and compute on those, it does help.

So, in the FFT radix-2 algorithm, the length of the DFT has to be expressible as a power of two (N = 2^γ); this restriction lets me divide the whole signal in a binary fashion, into 2 parts, then 4 parts, then 8 parts, like a binary search where the whole set is divided in the middle into a first part and a second part, and then I work on the first part and then on the second part. That is how I want to do it.
So, what I want is that the length of the DFT must be expressible as a power of two (N = 2^γ). I cannot take N = 1000, because it cannot be expressed as 2 to the power of something; but if I take N = 1024, then it is nothing but 2¹⁰. This is the restriction of the radix-2 algorithm, and that is why it is called the radix-2 algorithm.
Then the whole computation can be split either in the time domain, by breaking x[n] into smaller chunks, or in the frequency domain, by computing the output X[k] in smaller chunks. So, either x[n] can be divided or X[k] can be divided; there are two ways to split the computation. Let us start with the decimation-in-time algorithm.

Here, instead of computing the whole DFT at a time, I divide the input signal into two parts, the even-indexed samples and the odd-indexed samples. Then again each of these is divided into even and odd, and so on, until I reach 2-point pieces which cannot be divided further. This algorithm is described in the book; I will just explain it with an 8-point example (N = 8).
(Refer Slide Time: 24:54)
So, N = 8 can be expressed as 2³, which means I can divide it in 3 stages. Let the signal be x[0], x[1], x[2], x[3], x[4], x[5], x[6] and x[7]; 8 samples, N = 8. I divide this signal into two sets, the even-indexed samples and the odd-indexed samples: 0, 2, 4, 6 and 1, 3, 5, 7. So, instead of taking the samples in their natural order, I take the upper part 0, 2, 4, 6 and then 1, 3, 5, 7. Then again each can be divided into two parts: even 0, 4 and odd 2, 6; and even 1, 5 and odd 3, 7.

Once I am down to 2-point pieces, I compute their DFTs and combine them; this is the butterfly structure. At the output I get X(0), X(1), X(2), X(3), X(4), X(5), X(6), X(7). Now you may ask how to divide the signal into odd and even parts when the signal size is 1024. There is a trick for that, called bit reversal.
000 – 0  →  000 – 0
001 – 1  →  100 – 4
010 – 2  →  010 – 2
011 – 3  →  110 – 6
100 – 4  →  001 – 1
101 – 5  →  101 – 5
110 – 6  →  011 – 3
111 – 7  →  111 – 7
So, this input arrangement is obtained automatically if I access the signal with the bits of its index reversed. If N = 1024, I require 10 bits to represent the index; when I access the signal, I reverse those bits to get the reordered input, and then do whatever calculation is required for the butterfly stages. So, that is how the DFT is computed via the FFT.
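A small sketch (my own) of that bit-reversed reordering for an arbitrary power-of-two length:

def bit_reversed_order(n_bits):
    """Return the bit-reversed permutation of the indices 0 .. 2**n_bits - 1."""
    size = 1 << n_bits
    order = []
    for i in range(size):
        rev = 0
        for b in range(n_bits):
            if i & (1 << b):                     # if bit b of i is set,
                rev |= 1 << (n_bits - 1 - b)     # set the mirrored bit of rev
        order.append(rev)
    return order

print(bit_reversed_order(3))   # [0, 4, 2, 6, 1, 5, 3, 7] -- matches the 8-point table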
Now, I come to the application side. Once I use the DFT, or any frequency transform, on speech: suppose I have a time-domain signal, I select a portion of it and take the frequency response of that portion. Now there is a point: if I take N = 1024, the order of the DFT is 1024, so I take a 1024-sample portion of the signal.

If I have a long signal which is stationary, then whichever portion I take, the signal properties remain the same, and I do not have any problem.

But suppose the signal is changing along time; the signal properties are changing. Then if I take a 1024-sample portion, whatever frequency response I get is the average frequency response over that whole portion: I am assuming that during this time the signal does not change its properties. So, what am I losing? I am losing the time-domain resolution of the signal.
Now, if I want to increase the time-domain resolution, instead of N = 1024 let me take N = 256. Then my time-domain resolution increases, because my window size is much smaller, so I get better time resolution. But what happens? Since the frequency spacing is Δf = Fs/N, if N decreases the spacing increases, so the frequency resolution becomes poorer. So, once I increase the time-domain resolution, my frequency-domain resolution decreases: with a long segment I lose time resolution, with a short segment I lose frequency resolution. This is the limitation of the Fourier transform, and that is why people also use a number of other transforms, like the wavelet transform, for such problems.
So, let us record a voice or a signal and design an algorithm to compute and draw its frequency response, the spectrum of the signal; I am not saying spectrogram. The amplitude-versus-frequency representation is called the spectrum: it is nothing but |X[k]| versus frequency, and that drawing is called the spectrum of the signal.

So, if I record a vowel, take N = 1024 points and draw its spectrum, I get the frequency response of the vowel; I plot |X[k]| against k and convert k to frequency. This can be drawn in a linear view or a log view; log view means I take the output on a log scale, and then it is a log spectrum, otherwise it is a normal spectrum. Once I get the spectrum, I can locate the formant frequencies; that means I know the formant representation of the signal. All the details will come when we discuss the STFT. This is called the spectrum.
Now, when you draw this kind of time–frequency representation, it is called a spectrogram.
(Refer Slide Time: 34:19)
So, the difference is that the spectrum is frequency versus power, the power at each frequency, while in a spectrogram the x axis is time, the y axis is frequency, and the power is converted into intensity or colour. So, it is really a 3-D plot. In the representation you see vertical stripes where the same column repeats; why is that? I want this axis to be time, this axis to be frequency, and intensity to carry the amplitude. I have a long signal, so I take one chunk of the signal from here, find its spectrum, and plot it here; for that whole chunk the frequency representation is the same. Then I take the next chunk and plot it next to it. So, if the window is 10 milliseconds, each 10-millisecond column is constant, and that is why you see these stripes. This is called a spectrogram, and the earlier one is called a spectrum.
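A hedged sketch (my own) of computing a spectrum of one chunk and a spectrogram of the whole signal; the synthetic two-tone signal stands in for a recorded vowel:

import numpy as np
from scipy.signal import spectrogram

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)  # stand-in signal

# Spectrum: |X[k]| of one 1024-sample chunk versus frequency
chunk = x[:1024]
spectrum = np.abs(np.fft.rfft(chunk))
freqs = np.fft.rfftfreq(1024, d=1/fs)

# Spectrogram: short chunks along time; the frequency content of each chunk is one column
f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)

print(freqs[np.argmax(spectrum)])   # strongest frequency in the chunk
print(Sxx.shape)                    # (frequency bins, time frames)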
So, next is the digital filter; this is the last portion of the DSP review. For the digital filter, forget about filtering for a moment and consider any transfer function.
Let us say I have a transfer function H(z), where

H(z) = Y(z)/X(z)

is the transfer function. So, let us write H(z) = P(z)/Q(z).
The roots of P(z) give me the zero positions and the roots of Q(z) give me the pole positions. So, H(z) is a pole-zero filter, and the poles give the formants (resonances) while the zeros give the anti-formants (anti-resonances).

Any filter can be implemented using such a pole-zero function: I can implement P(z) and Q(z) and hence H(z). How does it look? If you see the mathematics, there are M zeros and N poles, and if the impulse response is real-valued, every complex pole and every complex zero occurs together with its complex conjugate; I will discuss that later on.

H(z) = A · ∏_{r=1}^{M} (1 − c_r z⁻¹) / ∏_{k=1}^{N} (1 − d_k z⁻¹)
So, if there is a complex pole here, its conjugate is also there. This is the unit circle; if all the poles lie inside the unit circle, then you can say the filter is causal and stable. Now, how do you implement a filter in the digital domain?

H(z) in the time domain is nothing but the impulse response h[n]. This impulse response can be finite or infinite. If it is finite, the implementation is called an FIR filter (finite impulse response filter). If I implement it using an infinite impulse response, extending from minus infinity to plus infinity, then I call it an IIR filter (infinite impulse response filter).
So, let us discuss the FIR filter, the finite impulse response filter.
A filter means there is something unwanted that I want to discard. Suppose I have a signal in which all frequencies are present, and I want to cut everything above 500 hertz: the system should not produce any response above that. That means the filter frequency response should be flat up to 500 hertz, and after 500 hertz I do not want anything.

So, if all frequency components from 0 to 500 hertz pass to the output, and after 500 hertz none of the components pass, it is called a low pass filter, because the low frequency components are passed and the high frequency components are attenuated to zero. Or I may want only the components between 500 hertz and 1.5 kilohertz to pass, with the portions below and above becoming zero; that is a band pass filter. Or I may want a filter where all frequencies above 1.5 kilohertz pass and nothing below 1.5 kilohertz remains; that is called a high pass filter.
So, these are the low pass, band pass and high pass filters, and those are the ideal frequency responses I want. But if I implement it with a finite impulse response filter, then instead of a perfectly flat pass band of gain one I get some variation, which is called pass band ripple; instead of a sharp cut-off I get a gradual transition region; and after the cut-off, instead of exactly zero, I get some residual response, which is called stop band ripple. So, there are pass band ripple, stop band ripple, and the transition bandwidth: the frequency range over which the amplitude falls from one to zero is the transition region, and that range is called the transition bandwidth. If you normalise, you speak of the peak ripple in the pass band, the peak ripple in the stop band and the normalised transition bandwidth, from the pass band edge frequency to the stop band edge frequency where I want the filter to stop.
If I know the time-domain impulse response of the filter, then I convolve the input signal with it and get the output of the filter. Or, what else can I do? I can take the frequency response of the filter, H[k], and the frequency response of the input signal, X[k]; then I get Y[k] = H[k]·X[k], and I take the IDFT of Y[k] to get y[n]. So, I can either implement it in the time domain, or transform the filter to H[k], take the time-domain signal to the frequency domain, multiply both, and take the IDFT to get the output, as sketched below.
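A rough sketch (my own) of the two equivalent routes: direct convolution with h[n] versus multiplication of DFTs followed by an inverse DFT, zero-padding both to the full output length so that the circular product matches the linear convolution:

import numpy as np

x = np.random.randn(64)            # some input signal
h = np.array([0.25, 0.5, 0.25])    # a short FIR impulse response

# Time-domain route: y[n] = sum_k h[k] x[n-k]
y_time = np.convolve(x, h)

# Frequency-domain route: Y[k] = H[k] * X[k], then inverse DFT
n_out = len(x) + len(h) - 1
Y = np.fft.fft(x, n_out) * np.fft.fft(h, n_out)
y_freq = np.fft.ifft(Y).real

print(np.allclose(y_time, y_freq))   # True: both implementations agree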
So, these are the two implementations: a frequency-domain implementation and a time-domain implementation; I can implement the filter in either domain. Now, whichever domain I choose, how do I actually implement it for a long signal? This is the FIR filter implementation issue. Suppose I have a long signal; I do not want to take the whole signal at a time. Then what I have to do is cut the signal into chunks, apply the filter to one chunk, then take the next chunk, and proceed that way.
Now, once I cut the signal into a chunk, whether for the DFT or for the filter, what am I doing? If the signal is x[n], I am multiplying x[n] with a window function w[n], that is, x[n]·w[n], where the window is one within the interval, say 0 to L−1, and zero outside. A window which is a rectangle of amplitude one over 0 to L−1 and zero after that is effectively no shaping at all; this is called a rectangular window.
Now, once I apply the rectangular window, the frequency response of the output, Y[k], also contains the frequency response of the window function. So, I am not getting the response of the filter and the input signal alone; it is combined with the window's frequency response, and that is why I get the different kinds of ripple. To reduce the stop band and pass band ripple, different window functions have been defined, such as the Blackman window, the Hamming window and the Kaiser window.

Every window function has its own shape; it is multiplied with the input signal, and then the result goes through the Fourier transform or the filter to give the output.
(Refer Slide Time: 46:52)
So, if I implement the filter, these are the common windows' frequency responses: the frequency response of the rectangular window looks like this, and its ripple is much larger; the Bartlett and Blackman windows look like this, and the Hamming and Hanning windows like this. In most cases we use the Hanning window, whose ripple is very small.
So, different window functions have different frequency responses, and this response combines with the frequency response of the signal that is passed to the filter. It is a multiplication in the time domain, the window multiplied with the signal, so in the frequency domain it becomes a convolution: X[k] is convolved with the window function W[k], and then multiplied with the frequency response of the filter.
So, this is what happens; the details I will discuss when we come to the STFT, the short-term Fourier transform. Now, look at the example implementation for a time-domain signal: suppose I want to design a low pass filter with cut-off frequency ω_c. This is the desired frequency response of the filter; usually a filter is specified by its frequency response. If I want to design it in the time domain, I have to derive the time-domain impulse response of the filter and then convolve it with the input signal; if I do that, I get the response, and I can implement it in the time domain. If you see the book, different windows have different transition bandwidths and peak sidelobe levels; the ripple is the peak sidelobe, and this is the transition bandwidth:

transition bandwidth (Δ) = 4π/M
So, M is the order of the filter when I implement it as an FIR filter. I said finite impulse response: the ideal filter has an infinite impulse response, and I truncate that infinite impulse response to a finite number of samples; if that number is M, then M is called the order of the filter. Ideally, a low pass filter has an infinite impulse response; once I truncate it, I cannot get the sharp cut-off.

So, I get a transition region, and this transition bandwidth depends on what kind of window I have used to truncate the impulse response. If I use the rectangular window, then Δ = 4π/M, and for the Hamming window Δ = 8π/M.
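A hedged sketch (my own, not the lecture's design) of a windowed FIR low-pass design using scipy; the 500 Hz cut-off and 8 kHz sampling rate follow the running example, and the window choice changes the transition width and ripple as described above:

import numpy as np
from scipy.signal import firwin, freqz

fs = 8000            # sampling frequency, Hz
cutoff = 500         # low-pass cut-off, Hz
M = 101              # number of taps (filter order)

h_rect = firwin(M, cutoff, fs=fs, window="boxcar")    # rectangular window
h_hamm = firwin(M, cutoff, fs=fs, window="hamming")   # Hamming window

for name, h in [("rect", h_rect), ("hamming", h_hamm)]:
    w, H = freqz(h, worN=2048, fs=fs)
    stopband = np.abs(H)[w > 1000]                    # response well above the cut-off
    print(name, "worst stop-band leakage:", 20 * np.log10(stopband.max()), "dB")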
So, depending on the requirement, I should decide which kind of window to apply. Next, how do I implement a band pass filter? Suppose I have implemented a low pass filter with a cut-off at 500 hertz, so its bandwidth is 500 hertz.

Now I want a band pass filter of 500 hertz bandwidth, located between 1.5 kilohertz and 2 kilohertz. What will I do? I will just shift the frequency response of this low pass filter up to 1.5 kilohertz, and I get the band pass filter.

So, it is nothing but a frequency shifting of the low pass filter: any band pass filter can be implemented by designing a low pass filter of the desired bandwidth and shifting its frequency response to where you want the band pass filter. Similarly, suppose I want to design a high pass filter with cut-off at 1.5 kilohertz. If the sampling frequency of the signal is 8 kilohertz, then the maximum baseband frequency I can get is 4 kilohertz.
So, the high pass filter is nothing but a band pass filter from 1.5 kilohertz to 4 kilohertz: if I design a filter of that bandwidth and shift it to 1.5 kilohertz, I get the high pass filter. That way I can design the filters. I am not going into detailed filter design here, because that is a prerequisite of this course: if you want to know digital filter design, you have to study digital signal processing.
I am just giving you the concepts, an overview of DSP. How do you implement an IIR filter? There are many methodologies. For example, I can derive the analogue filter transfer function in the s domain, then use the bilinear transformation to take it to the z domain, and once I have the z-domain transfer function, I can realise it with a discrete system. That way you can implement it.
So, this is the prerequisite part which I will refer to many times during the speech processing classes. This is the overview of DSP; I have summarised the required DSP material here. Some higher-end DSP algorithms will also be required, and I will describe those when I use them.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 05
Human Speech Production And Source Filter Model
Good morning.
So, let us continue with our course. In the first class we gave the introduction to speech processing, and in the second class the recording part. Now let us start with human speech production and its source-filter modelling, since the objective is to model the human speech production system.

So, let us describe the human speech production model and how speech can be synthesised using the source-filter model. As we explained, human speech production goes through several steps. The first one is message planning, which involves the linguistic parameters; then the rules of prosody and utterance planning; then the psychological constraints and the motor command generation; then the production of the speech sounds; and finally the radiation of the speech from the mouth.
Now, if you record a speech signal and display it, it will look like this. In this portion I have not spoken anything. In this portion, where I started speaking, there is an unvoiced sound, some noise-like sound; then a voiced sound is generated. So, normally in speech I can identify three parts: silence, unvoiced and voiced. Silence means I am not saying anything and nothing is coming out from the mouth. Then there may be unvoiced portions, aspiration or friction, and there will be voiced sound.
Now, how do I produce these kinds of sound, and what do they mean? When I communicate a message, what different voiced sounds, silences and unvoiced sounds do I produce? You have to know these things; that is the speech production system: how the different kinds of sound are produced by a human being. Now, somebody may ask for the basic definition.
Once I say sound, what is a sound? Sounds, or phonemes, serve as a symbolic representation of information to be shared between human beings. If I want to communicate a message from one human being to another, or from a human being to a machine, or machine to machine, then what I require is speech, which is a composition of different sounds.

Suppose I say Kolkata: there are the sounds k, o, l, k, a, t, a. So, k is a sound, o is a sound, then l, then again k, then a, then t, then again a. If those sounds are pronounced and composed in a sequence, then the message, that I am saying Kolkata, is transmitted from the talker to the listener. The composition of sounds is governed by the language rules; I cannot compose them in an arbitrary order and still make sense. Although each individual sound has its own identity, the composition is governed by the rules of the language.
Take Japanese or Chinese; suppose I do not know Chinese. If a speaker speaks Chinese, he is producing sounds in some composition; maybe I recognise some of the sounds, but I cannot identify the intended message he wants to communicate, because I do not know the language rules by which the sequence of sounds is arranged. That is linguistics.

So, linguistics is the study of those rules: how human beings produce different sequences of sounds to communicate different kinds of messages; the rules may vary from language to language. The discipline which studies those things is called linguistics. Then, how the different sounds are produced, the study of the individual sounds, is called phonetics.
Suppose 'ka' is an individual sound; the study of individual sounds is phonetics, while how the sounds are governed by the language rules to make valid compositions is the study of language, that is, linguistics. Here I will explain the details of phonetics. Phonetics deals with the sounds as produced by a human being and as acoustic waveforms.

So, I will cover acoustic phonetics and articulatory phonetics. Once the sound is travelling in the air it is acoustics, so I have to know the acoustic part of phonetics; and how we produce the different sounds using the vocal tract is articulatory phonetics. So, phonetics has two parts, acoustic phonetics and articulatory phonetics, and I will cover those in detail, because we have to understand the sounds. The composition of sounds is a study of linguistics: there is phonology, there are language rules and grammar, and there are pragmatics and paralinguistics, all kinds of rules.
Now, let us look at how the human vocal tract looks. This is an X-ray of the whole human vocal system: there are the vocal cords and there is a tube, the articulatory tube.

If I take an MRI (this one is taken from Professor Shri Narayanan), you see the MRI of the human vocal tract looks like this. This is the lip, then there is the tongue, whose tip is free while the back portion is fixed; then there are the vocal cords here. If I represent it schematically, it looks like this: the lip, then the tongue with the tip free and the back and bottom fixed to the lower jaw; then the palate; then the velum, because we have a nasal cavity as well as an oral cavity, so the velum may be closed or opened; then the pharyngeal wall, the epiglottis, and the glottis section.
Now, the schematic representation of the speech production system looks like this; this is how speech is produced by a human being. There are the lungs. What is the source of the sound? Forget about speech for a moment: the basic source of any sound is vibration; sound is nothing but a mechanical oscillation.

So, a mechanical body vibrates and an acoustic wave is generated. Vibration of a mechanical body requires an external force; I need a force to make the body vibrate. If I take a thin membrane and pass air across it with force, the membrane will vibrate and sound will be produced. In childhood you may have seen a kite flying in the sky with a membrane attached to it; the membrane vibrates in the air and produces a sound. In India people used to do that in childhood.
So, an external force is required as the source of energy. During speech production, I inhale air into the lungs and gradually exhale it, producing different sounds by making constrictions at different locations of the tube. So, the lungs provide the air pressure which is required for the production of sound; this is the force. And once the lungs push the air upwards, it reaches the vocal cords.

The vocal cords are generally open when we inhale. If I want to speak something, the vocal cords have to be closed; if they remain open, no voiced sound comes from the mouth. If the vocal cords are closed, then when the air passes through them, they act as a membrane, and when this membrane is exposed to a high velocity of air, it starts vibrating. That vibration causes the sound, which then passes through the cavity.
Now, consider the tongue. The tongue separates the cavity into two parts, a back cavity and a front cavity. And we have the velum, which can be either closed or opened. If the velum is open and the oral cavity is closed, the air passes through the nose, so the nasal cavity is involved; if the velum is closed, the sound passes only through the mouth, the oral cavity. So, there are two kinds of sound: if the velum is open, the sound becomes nasal because the nasal cavity is involved; if the velum is closed, the nasal cavity is disconnected from the tract and it is an oral sound.
So, the sound is produced by this mechanism: the lungs create the pressure, the air goes upwards, and if the vocal cords are closed, a vibration is produced; that vibration creates the sound, and the sound is modified, depending on the shape of the tube, as it passes through the tube. So, the human speech production system has two parts. One is the source: the source of sound is the vocal cords, and the lungs only provide the pressure.

So, one part is the source, and once the sound is produced at the source it passes through the cavity, this tube, and is modified to create the different kinds of sound. Depending on the structure of the tube, the source sound is modified to produce different sounds. So, I have a source and I have a tube; these are the two parts of sound production.
So, now let us go to the source: what are the vocal cords and how do they look? The vocal cords look like this picture. Air comes up from the lungs; if the vocal cords are closed, the air flow is obstructed and the air velocity changes, because the pressure below increases and the pressure above decreases. Once the cords open a little, the air passes through the vocal cords and creates the vibration.

So, that is the sound. This is the actual picture, an endoscopic picture of the vocal cords: this is the closed position and this is the open position of the vocal cords.
Now, when the vocal cords close, it is not a sideways wave movement: one end of the vocal cords is always fixed, and they close and open at the other end, and that creates the sound. If I take a schematic view of this, it looks like this: different cartilages, with one end closed. The movement of the vocal cords depends on the muscle movement; the front end is fixed, and the back end opens and closes.
Now, the vocal cords can be fully closed or totally open. If they are totally open there is no sound, only breathing; if they are closed and vibrating, we get voiced sound; and if they are slightly open, the air through the tube can produce a noise-like sound, which is called an unvoiced sound. So, unvoiced, voiced and silence regions, as we saw in the first slide.
Now, look at the glottal flow, how the air flows at the glottis. When the vocal cords are fully closed, no air can flow. They gradually open and close, open and close: when they are closed the air flow stops, as they gradually open the flow increases and increases, and then it decreases again and we reach the closed phase; then they open again and close again. So, in the flow diagram the air flow increases, then decreases, then there is a closed phase, and the cycle repeats. Either complete opening to opening, or closing to closing, is one period. And the derivative of the flow is what creates the sound vibration: if I differentiate the flow, it looks like this, and that is the excitation associated with the vocal cord vibration. Tip to tip is one period, either closed to closed or open to open.
So, this is the source. I can show you the diagram here; this is called the engineering model of the source. What is the engineering model? These are the lungs. The vocal cords, like any sound production, amount to a mechanical vibration.

So, what is the form of a mechanical oscillator? There is a spring, there is a mass and there is mechanical damping: this is the mass m, this is the spring constant s, and this is the mechanical damping Rm.
The air acts as a force on this mechanical oscillation. So, the two vocal cords can be treated as a mechanical oscillator, driven by the force which comes from the lungs. And the output is Ug: the mechanical oscillation produces an acoustic wave, which is a pressure wave with a volume velocity; Ug is the glottal volume velocity, the particle velocity taken over a fixed area.

Now, this air flow, the volume velocity Ug, passes through the vocal tract, whatever shape it takes. So, in the whole tract, the source is nothing but a mechanical oscillator creating the vibration, something like an impulse train; so the source can be modelled using an impulse source. This then has to pass through the tube.
Now, the tube can form different structures using the movements of the tongue, lips and velum. I can move the tongue upward, the back of the tongue can be raised, the tongue tip can touch the upper palate. When you speak, the upper palate and the upper jaw are fixed and the lower jaw moves. So, I can move the lips, close or open them, or move the tongue to divide the cavity into different structures.

So, when this vibration passes through different tube structures, a cascade of tubes, it can produce different kinds of sound. If I consider that, I can say this is the source of the sound production, and the source sound passes through different configurations of the cavity at different times to produce the different sound sequences. So, the source is like an impulse train that passes through a filter, and depending on the filter structure different sounds are generated. In the engineering model of the human production system, this is a source-filter model.
So, the vocal cords produce the source, which is nothing but an impulse train; either the source is present or it is absent. Then it passes through a filter, a time-varying filter: at different times the structure of the tube is different, producing different kinds of sound, and the output, let us call it s[n], is the speech. The digital circuits I will come to later on. So, the whole human speech production system can be modelled using a source-filter model: I can study the source, I can study the filter, and combining both I get the speech. What produces the different kinds of sound is the structure of the filter. As for the source, either the vocal cords vibrate or they do not: if the source vibration is present, the sound is voiced; if it is absent, the sound is unvoiced. When a voiced source passes through the filter, it can produce different kinds of voiced sound, different voiced speech.
Looking at this once again (I will come back to it when I talk about tube modelling), the impulse train is nothing but the excitation; the excitation is generated and goes to a linear time-varying system, the filter, and I get the speech. So, I have to model this filter and I have to model this excitation part. Now, what are the properties of this source? If it is an impulse train, it has a period, which gives the fundamental frequency of the source. Human speech is a quasi-periodic signal, not exactly periodic, and human beings produce different source fundamental frequencies depending on the requirement.
When we speak, we never speak at a single fundamental frequency; that would sound too mechanical. The fundamental frequency moves gradually to provide the melody of the speech. Think about singing: the filter produces the different sounds for the content, but the melody is defined by the source itself. In singing, the control of the fundamental frequency, sa re ga ma, all the notes, is nothing but the movement of the fundamental frequency, and that control is generated by the source, the movement of the vocal cords. I will discuss how the vocal cords produce different fundamental frequencies, by varying the tension and the position of the vocal cords, in a later class during prosody modelling.
The vocal cord tension can change in two ways: one is the forward and backward movement, and the other is the vertical movement. So, the fundamental frequency is a property of the source, and the filter is responsible for producing the different kinds of sound. You know how a digital filter is characterised by its frequency response.

So, if the digital filter is represented by H(s) (the Laplace transform) or by h[n], then H(s) represents the frequency response of the filter. Suppose the frequency response of the filter looks like this, and the source impulse train contains all the frequencies at, let us say, the same height; here this axis is time, and here this axis is frequency and this is power.
As per the signal processing model, when the source passes through the filter, the source signal is convolved with the filter's impulse response to produce the speech; in the frequency domain, convolution becomes multiplication. So, in signal processing terms, if the source is e(t) and the filter is h(t), the output speech is s(t), and s(t) is h(t) convolved with e(t). Taking the Laplace transform (or the z-transform in the digital case), in the Laplace domain S(s) = H(s) × E(s).
So, the frequency response of the filter is multiplied by the spectrum of the source to produce the speech. The different sounds are the responsibility of the filter; the source is responsible only for whether there is voicing or no voicing, whether there is a voicing signal or not. And in the frequency domain this is nothing but a multiplication.
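To make the time-domain/frequency-domain statement concrete, here is a minimal sketch (the sampling rate, F0 and pole locations are illustrative assumptions, not values from the lecture): an impulse-train source is passed through a simple all-pole filter, and the spectrum of the output is, up to edge effects, the product of the source spectrum and the filter response.

```python
import numpy as np
from scipy import signal

fs = 16000          # sampling rate (Hz), assumed for illustration
f0 = 120            # fundamental frequency of the source (Hz), assumed
dur = 0.5           # duration in seconds

# Source e[n]: an impulse train at the pitch period (voiced excitation)
n = np.arange(int(fs * dur))
period = int(fs / f0)
e = np.zeros(len(n))
e[::period] = 1.0

# Filter h[n]: an all-pole (resonant) system standing in for the vocal tract.
# The pole frequencies/radii below are illustrative, not measured formants.
poles = [0.97 * np.exp(1j * 2 * np.pi * f / fs) for f in (500, 1500, 2500)]
a = np.poly(np.concatenate([poles, np.conj(poles)])).real   # denominator
b = [1.0]                                                    # numerator

# Time domain: speech s[n] = e[n] convolved with h[n] (applied via the filter)
s = signal.lfilter(b, a, e)

# Frequency domain: S = E x H (multiplication)
E = np.fft.rfft(e)
freqs = np.fft.rfftfreq(len(e), 1.0 / fs)
_, H = signal.freqz(b, a, worN=freqs, fs=fs)
S = E * H        # closely approximates np.fft.rfft(s)
```

The product E × H is the synthesised speech spectrum: the flat harmonic comb of the impulse train is shaped by the peaks of the filter response, which is exactly the multiplication described above.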
But the source provides the fundamental frequency of the speech. So in singing, when you sing the note sa [FL], you are actually controlling the fundamental frequency of the vocal cords. The control of the fundamental frequency I will cover in the prosody modelling part, where a detailed mathematical explanation is given of how the fundamental frequency of the vocal cords can be changed.
Now, singers practise in the morning to touch the notes, and you hear comments like: your upper note is not that clear, your lower note is clear. What that means is the changing of the fundamental frequency from low to high or from high to low. So for the upper and lower notes this changing, this controlling, of the vocal cords is important. You practise controlling the vocal cords through riyaz, so that you can touch the perfect note. During speaking also, we change the vocal cord fundamental frequency with respect to time to provide the melody of the speech.
Now, some facts about the sound source and the sound as shaped by the filter: the acoustics of male and female vowels differ reliably along two different dimensions.
The speech s(t) depends on the source and on the filter. For a voiced sound the source exists and provides the fundamental frequency, and the filter modifies that source to produce the different voiced sounds; that may be a vowel, or it may also be a voiced consonant. So it is not necessarily a vowel; it may be a voiced consonant as well.
So, the source modified by the filter produces the different kinds of sound. The fundamental frequency, F0, is sometimes called pitch, but pitch is a perceptual parameter while F0 is the physical one: F0 can be measured, whereas pitch is perceptual and cannot be measured directly. So F0 depends on the source, and the different sounds depend on the filter. Now compare a male and a female speaker.
F0 depends on the vocal cord length and the vocal cord mass. If the length of the vocal cords increases, what happens? The fundamental frequency decreases. If the mass of the vocal cords increases, the fundamental frequency also decreases. Think of a drum: a huge membrane, with large mass and area, gives a low fundamental frequency, while a small membrane gives a sharp sound with a higher fundamental frequency.
So, for a woman, since the vocal cord length is shorter, the fundamental frequency is higher. For women singers people say, for example, your best scale is B sharp; what is meant is the level of the average fundamental frequency. So for a woman the average F0 is high compared to a male, while for a male, since the vocal cord length is longer, the fundamental frequency, that is, the average F0, is lower.
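Because F0 is a measurable physical quantity, it can be estimated directly from a voiced frame. Below is a rough autocorrelation-based sketch; the 60-400 Hz search range is an assumption meant to cover typical male, female and child voices, not a value given in the lecture.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude autocorrelation-based F0 estimate for one voiced frame."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)           # shortest pitch period accepted
    lag_max = int(fs / f0_min)           # longest pitch period accepted
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag                      # period in samples -> F0 in Hz
```

On a steady vowel the peak of the autocorrelation sits at the pitch period, so a longer vocal-fold cycle (a typical male voice) gives a larger lag and hence a lower F0 than a female or child voice.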
So, the average fundamental frequency of a male is lower than the average fundamental frequency of a female. Then comes the filter. The filter is characterised by its frequency response. If H(s) is the transfer function of the filter, and if you are an electrical engineering student or know signal processing, you know that H(s) is nothing but the system function. That system has, as its properties, some poles and zeros, and from those I will define the formant frequency in the next class. That is called the formant, or resonance frequency, and based on the resonance frequencies different sounds are produced.
I will explain in the next class what is meant by a pole and a resonance frequency; in speech these resonances are called formants. In the case of a woman, since the vocal tract is shorter, the formant frequencies are a little higher; in a man it is longer, so the formant frequencies are a little lower. That is the difference between male and female speech. For a child the fundamental frequency is much higher because the vocal cords are much shorter, and the formant frequencies are also higher because the vocal tract is shorter. So I can say that the formant frequencies depend on the length of the vocal tract, along with some other properties that we will discuss later on when we model the vocal tract.
So, if the length changes, the formant frequencies will be different. In that sense speech is personalised: my vocal cords, or the length of my vocal tract, the length of this tube, may not be exactly equal to another person's. If it is not equal, then I should have a formant structure which is different from other people's. So there may be information there from which I can identify the person, and I can exploit that information to recognise the biometric signature of the person.
Similarly for the source: my vocal cord length and mass will be different from somebody else's, so there may also be parameters there by which I can identify the person. So a biometric signature may exist in the speech, either in the filter portion or in the source portion. Speech also depends on the composition of the sounds, and sometimes this composition pattern also provides information which can be exploited to identify the person. In the next class I will discuss what a formant is and how it is extracted.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology Kharagpur
Lecture – 06
Place and Manner of Articulation
In the last class we discussed the engineering model of the speech production system, where we said that the vocal tract tube can be modelled as a filter and the glottal vibration can be modelled as a source. So it is called a source-filter model.
Let the filter be H(s) and the source E(s), or e[n] and h[n] in the digital system, and the output is the speech S(s) or s[n].
So, if the source exists, we get the voiced sounds: the source is modified by the filter. The source has essentially one property, the fundamental frequency contour, and it is nothing but an impulse train; the excitation generator produces nothing but impulses. That excitation passes through a filter modelled as a linear filter, in which the convolution of e[n] and h[n] produces the speech s[n], as shown in the figure above. So in the time domain it is a convolution and in the frequency domain it is a multiplication.
Classification of voiced sounds: voiced sounds are classified based on two properties. One is the fundamental frequency of the excitation, which is common to all voiced sounds; the other is the filter, since different configurations of the filter produce different voiced sounds.
H(s) = P(s) / Q(s)
Any such model is a combination of zeros and poles: the roots of P(s) give the zeros and the roots of Q(s) give the poles. The poles are related to the formant frequencies, later called resonance frequencies: for every pole pair there is a resonance frequency. If Q(s) is a first-order system there is only a single pole; similarly, if it is an nth-order system there will be n resonance frequencies.
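As a small illustration of the pole-resonance link (the filter coefficients below are placeholders, not measured speech data), the resonance frequencies of a digital all-pole filter can be read off from the angles of the poles of its denominator:

```python
import numpy as np

fs = 16000                      # assumed sampling rate (Hz)

# Example all-pole filter H(z) = 1 / Q(z); the coefficients are illustrative.
q = np.poly([0.97 * np.exp(1j * 2 * np.pi * 500 / fs),
             0.97 * np.exp(-1j * 2 * np.pi * 500 / fs),
             0.95 * np.exp(1j * 2 * np.pi * 1500 / fs),
             0.95 * np.exp(-1j * 2 * np.pi * 1500 / fs)]).real

poles = np.roots(q)                          # roots of Q(z) are the poles
poles = poles[np.imag(poles) > 0]            # keep one pole of each conjugate pair
formants = sorted(np.angle(poles) * fs / (2 * np.pi))
print(formants)                              # approximately [500.0, 1500.0] Hz
```

Each conjugate pole pair contributes one peak in the frequency response, which is exactly the resonance, or formant, being described.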
If you look at the frequency response shown in the figure above, every peak is called a formant. All voiced sounds are characterized by those formants, and the number of formants is the number of peaks. In the frequency domain, with omega the analog frequency, the output is nothing but a multiplication: the spectrum of the unit impulse train is modified by the frequency response of the filter.
The details of what the formant structure should be will become clear when we model the speech production system as a tube (Refer Time: 03:53). All voiced sounds are characterized by their formant frequencies; different voiced sounds have different formant frequencies, and that is why they sound different.
There are typical F0, F0 min and F0 max values for men, women and children, as shown in the table below. If you compare the average fundamental frequencies of the three categories (men, women and children), you will see why the voices of the different categories sound different. In singing, somebody is said to have, for example, a B flat scale; that scale is nothing but a match with the average F0.
The minimum formant frequencies for humans, and all those variations, I will come to when we discuss the uniform tube modelling of the speech production system.
So, any voiced sound is characterized by its formant frequencies, or resonance frequencies; the formant frequency is nothing but the resonance frequency. Any voiced sound can be represented in terms of frequency and amplitude, as shown in the figure above. Now, the voiced sound is characterized by its formant frequencies, and the formant frequencies depend on the structure of the vocal tract: they change as H(s) changes, and H(s) is determined by the structure of the whole tube. Different structures therefore produce different voiced sounds with different formant frequencies. Now let me relate this to unvoiced sounds; note that there is a difference between an unvoiced sound and silence.
Unvoiced sound: - Aspiration and frication are called unvoiced sound, which means the vocal cords do not vibrate but there is still a noise-like sound. The sources of unvoiced sound are aspiration and friction.
How is this noise produced? Suppose you have a tube and you create a narrow constriction in it. The lungs produce air pressure, and when a narrow constriction is made with the tongue, the forced air tries to find a path through it and a friction-like noise is generated. Such sounds are called aspiration and frication sounds.
The human vocal tract produces different kinds of sound, and every speech sound that comes out of the mouth has two properties. One is the manner: how the tongue moves to produce that sound. The other is the place, the position where the sound is produced, which is called the place of articulation. During articulation the airstream through the vocal tract must be obstructed in some way so that we can produce the sound, and the place where the obstruction takes place is called the place of articulation.
For example, suppose I want to produce a vowel: the vocal cords vibrate and the body of the tongue is raised. So the place where the obstruction happens is called the place of articulation, and the manner means how we produce that sound: whether the air flows freely, is stopped completely, or passes through a narrow cavity. All these airflow mechanisms define the manner of articulation.
The manner of articulation also involves a major distinction in the source: the glottal source can be present or absent, that is, the glottis can vibrate or not. When the glottis vibrates, a voiced sound is produced, while when it is held fully open an unvoiced sound is produced. So almost any articulation can be voiced or unvoiced.
Now look at the upper part of the human speech production system, shown in the figure. The upper cavity can be divided into different places such as bilabial, dental and labiodental, that is, involving the lips and the teeth.
When the tip of the tongue touches the teeth, the sound is called a dental sound; when the block happens at the labial position, it is called a labial sound. All these places can be seen in the schematic diagram: each place corresponds to a different kind of sound production.
a. Bilabial: - Bilabial sounds are produced when the two lips make the constriction.
b. Labiodental: - These sounds are produced by a constriction of the lower lip against the upper teeth.
c. Dental: - These sounds are produced by a constriction of the tip or blade of the tongue against the upper teeth.
d. Alveolar: - These sounds are produced with the tip of the tongue in contact with the alveolar ridge, the bony prominence immediately behind the upper teeth.
e. Post-alveolar: - Sounds articulated by the tip of the tongue against the back area of the alveolar ridge.
f. Retroflex: - Sounds made when the tip of the tongue is curled back towards the front part of the hard palate. Depending on how far the tongue is curled back, the retroflex can be apico-postalveolar or apico-palatal.
g. Palatal: - These sounds are produced when the constriction is made by the front part of the tongue against the hard palate.
h. Velar: - A sound made by the back of the tongue against the soft palate.
i. Pharyngeal: - These sounds are produced in the pharynx, the tubular cavity that constitutes the throat above the larynx.
j. Glottal: - These sounds are made in the larynx by the closure or narrowing of the glottis.
(Refer Slide Time: 12:01)
So, depending on the place of the constriction, sounds are described, and that description is called the place of articulation, because the constriction is made at different positions. The tube structure is different in each case, so the tube produces different frequencies, and that is why the sounds differ.
Manner of articulation: -
Depending on the source and the glottis there are three broad manners: (1) voiced, (2) unvoiced and (3) aspirated. Apart from these, there are further manners of articulation, as follows: -
Plosive sound: - This sound is produced when the tongue (or the lips) closes the upper cavity; behind the closure air pressure builds up, and when the closure is suddenly released a burst of sound comes out.
If the consonant is followed by a vowel, the pressure built up behind the closure creates a plosion on release; that is why it is called a plosive sound. Example: - pa
When a vowel is followed by a consonant, the vocal tract producing the vowel suddenly has to create a constriction, a stop; then it is called a stop sound. Example: - ak
Nasal stop: - In nasal sounds the velum (soft palate) is lowered, blocking off the oral cavity so that the air flows through the nose. Example: - /m/, /n/ etc.
Fricative: - It is produced when air forces its way through a narrow gap between two articulators at a steady pace. Example: - /s/, /z/ etc.
Affricate: - It has two parts: the first part is a plosion, with complete blockage of the airstream in the oral cavity, and the second part is friction; that is why it is called an affricate sound.
Lateral: - It is produced when the tip of the tongue touches the upper cavity (the alveolar ridge) and the sound comes out along the sides of the tongue. If the sound comes out on both sides it is called bilateral; if it comes through a single side it is called unilateral.
Trill: - It is produced when the tip of the tongue touches the upper cavity and vibrates.
Flap or tap: - It is produced when the tip of the tongue briefly strikes the upper cavity, acting like a flap or tap on it; that is why it is called a flap or tap sound.
Classification of sound in linguistically distinct speech
A basic sound is unique and does not belong to any particular language: for example, the sound ka is the same in Hindi, Bengali and English; the language rules only come in when sound sequences are put together. So if we want to describe each individual sound produced by a human being, we try to find a symbol for each one. Depending on the place and manner of articulation, a symbol is assigned to each particular sound: every sound produced by a different constriction and a different mechanism of the human vocal tract is assigned a particular symbol. That set of symbols is called the IPA (International Phonetic Alphabet).
(Refer Slide Time: 21:47)
A sound produced while the airflow goes from the lungs to the mouth is called a pulmonic sound; during the production of a pulmonic sound air flows from the lungs to the mouth. When the airflow is reversed, we call it a non-pulmonic sound.
Now, pulmonic sounds are assigned different symbols depending on their place and manner of articulation. In the IPA consonant chart, the columns are the places of articulation and the rows are the manners of articulation. From the pulmonic consonant chart, suppose we want to produce a bilabial plosive sound: the lips are closed, so the place of articulation is bilabial, and the glottis can either vibrate or not. If it does not vibrate, the sound is called unvoiced; if it vibrates, it is called voiced.
So, any bilabial plosive sound depends on two things: the manner of articulation, and the source, the vocal cords, which make it either voiced or unvoiced. In the pulmonic consonant chart, p is written for the unvoiced one and b for the voiced one, and similarly for the remaining plosive sounds. So the consonant symbols in the IPA chart depend on the place, the manner, and whether the sound is voiced or unvoiced.
Vowel sound: -
In the production of a vowel only tongue movement takes place; no constriction is made. The tongue can move back or front, and it can be raised or lowered. So, depending on the tongue movement (backward and forward) and the tongue height (how far the tongue is raised towards the palate), different vowel sounds are produced. During the production of a vowel the upper palate remains fixed, but the lower jaw moves along with the movement of the tongue.
If the tongue is raised high, the vowel is called close; when the tongue is low, it is called open. Depending on the tongue height and the tongue frontness, different symbols are used for the different vowel sounds. So vowels are classified, and given symbols, based on their production position rather than on a constriction.
Some vowels also require lip movement: sometimes we round the lips and sometimes we do not, as shown in the figure below. If you say "oo" the lips are rounded. So lip rounding together with the tongue position describes the vowel sound.
There is thus a relation between tongue height, tongue movement (frontness and backness) and the formants (F). As reference values, vowels are often described around F1 = 500 Hz, F2 = 1.5 kHz and F3 = 2.5 kHz; however, by changing the shape of the vocal tract we get different resonance frequencies.
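Those reference values are roughly what a uniform tube, closed at the glottis and open at the lips, predicts. As a quick check (the 17.5 cm length and 350 m/s speed of sound are typical assumed values, not measurements from the lecture), the odd quarter-wavelength resonances come out near 500, 1500 and 2500 Hz:

```python
c = 350.0      # speed of sound in warm, moist air (m/s), approximate
L = 0.175      # vocal tract length (m), a typical adult value

# Quarter-wavelength resonances of a tube closed at one end, open at the other:
# F_n = (2n - 1) * c / (4L)
formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
print(formants)    # [500.0, 1500.0, 2500.0] Hz
```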
(Refer Slide Time: 29:29).
Now there is a relationship: the tongue height is related to the first formant (F1), the tongue movement (front versus back) is related to the second formant (F2), and the lip rounding is related to the third formant (F3). So there is a correlation between F1 and tongue height, between F2 and tongue frontness, and between F3 and lip rounding. With the values of F1 and F2 we can classify whether a vowel is high or low and front or back. So the articulatory space can be simulated, or modelled, in a formant measurement plane: we can measure the frequencies and plot them in the F1-F2 plane (Refer Time: 30:52). F1 and F2 are mainly responsible for which vowel has been produced; F3 (lip rounding) can be taken into account depending on the requirement and the classification of vowels.
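As a sketch of that idea, a measured (F1, F2) pair can simply be assigned to the nearest reference vowel in the formant plane. The reference values below are rough, textbook-style numbers used only for illustration; they are not measurements from the lecture and vary greatly with speaker:

```python
import numpy as np

# Rough illustrative reference formants (Hz); real values vary with the speaker.
REFERENCE = {"i": (300, 2300), "a": (750, 1200), "u": (350, 800)}

def classify_vowel(f1, f2):
    """Return the reference vowel closest to (F1, F2) in the formant plane."""
    def dist(ref):
        rf1, rf2 = ref
        return np.hypot(f1 - rf1, f2 - rf2)
    return min(REFERENCE, key=lambda v: dist(REFERENCE[v]))

print(classify_vowel(320, 2200))   # -> "i" (close front vowel)
print(classify_vowel(700, 1100))   # -> "a" (open vowel)
```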
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 07
Articulatory and Acoustic Phonetics
So, we have been describing the places of articulation. It is better if I show you some pictures; let me show you these pictures.
Can you tell me what the place of articulation is here? As you can see, the back of the tongue touches the back cavity, the velar region; so the place of articulation is velar. The vocal cords are open, so I can say it is unvoiced. What is the symbol for the unvoiced velar sound? If you look at the IPA chart, it is k, the unvoiced velar. There is another variety where the vocal cords are closed and vibrating: the place of constriction is the same, but during the constriction the vocal cords are also closed. If the vocal cords are closed, I call it the voiced counterpart of the velar.
So, it is ga. Now, ka is a plosive sound. The place of articulation of both is velar: both consonants are articulated at the velar position. The difference is in the manner of articulation, as determined by the glottis, the vocal cords: one is called unvoiced and the other is called voiced. Physically, a constriction happens in the tube and air pressure builds up behind the back of the tongue: the pressure builds up while the constriction is held.
Now, when the air pressure builds up it suddenly wants to burst out; that is why we call it a plosive sound. During that portion there are two kinds of movement: the articulator has to be released, and when it is released a plosion happens. Notice that we cannot produce a plosive consonant on its own, because during the production of ka the articulator is completely closed and the vocal cords are open, which means no signal is coming out of the mouth. Then how do you know it is a ka? None of these consonants can produce a signal by itself; it has to be produced with a vowel. So k plus a gives ka. Think about the production sequence: to produce ka, during the k the back of the tongue touches the velar position and the cavity is totally closed; then suddenly it has to burst, and when the constriction opens a burst of sound is released.
So if I look at the sound, which I will also show you in the waveform diagram, the acoustic signal is first silence, no signal; suddenly there is a burst, when the tube constriction is released. But then I have to produce the a, so the vocal cords have to close. So there is a synchronization problem between the articulator closing and releasing and the vocal cords closing and opening, and therefore there is a time gap.
So let me draw a state diagram. Take the articulator: the velar closure means the articulator is closed, so no signal; let this be the articulator's open position and this its closed position. And here are the vocal cords, the glottis: this is the open position and this is the closed position. Now, during the production of ka the articulator is closed, so for that period the articulator trace stays closed. Then it suddenly wants to open, but the opening is not instantaneous, so it opens gradually; and then the vowel a has to be produced, so the tongue has to move to the position for the vowel a. Let us say the articulator is closed up to here, and at this point it is completely open.
Now, the vocal cords are open during the production of k, so for that period the glottis trace stays open. Once the a has to be produced, the vocal cords have to close; say the closing starts here and is complete here, and the complete a, with the tongue in position, is reached here. So at this point the burst occurs: first silence, then the burst; after the burst the articulator is not yet completely open and the sound is not yet exactly a. The vocal cords are still open here, not yet closed; they close here, the vibration starts from here, and here I reach the vowel. Let me show you one picture.
If you look at this picture: during the closure the articulator is closed, the vocal cords are open and no sound comes out; this period is called the occlusion.
Then the articulator is suddenly released and a burst occurs, and the articulators start moving towards the position for the vowel a. At the same time the glottis may not yet be closed, so there is no voicing yet. So there is a gradual gap between the articulator release and the vocal cord closure, a settling time, and that is called the voice onset time, VOT. After that there is a transitory part of the vowel, and then the steady-state vowel is reached.
So here you see: first there is no signal at all, the occlusion; then there is a burst; then there is a time gap between the articulator opening and the vocal cord closing, which is called the VOT, the voice onset time. During this period the glottis may produce an aspiration sound, in which case the sound is called aspirated; if there is no aspiration we call it un-aspirated. So the VOT may be aspirated or un-aspirated, and depending on that we have four varieties of the velar sound: k and kʰ, then g and gʰ.
The superscript h is the symbol for aspiration in the IPA notation. If the aspiration is voiced we call it voiced aspiration; if it is unvoiced we call it unvoiced aspiration. So I have four varieties of the velar sound: ka, aspirated ka (kʰ), ga and gʰ. All are plosives, or stops: k is velar, kʰ is velar, g is velar, gʰ is velar; the only difference is the manner of articulation. Take ka, which is unvoiced and un-aspirated: during the production of ka the back of the tongue touches the velar position, but the vocal cords are open.
So there is no voicing during the occlusion; that is why it is called an unvoiced closure, an unvoiced velar. Since the VOT does not carry any aspiration, it is called un-aspirated. So the full description of ka is: unvoiced un-aspirated velar plosive. To write it down we put the symbol between two slashes to identify the sound; that is the convention for writing any vowel, consonant or other sound in IPA notation. So once I write /k/, I immediately know its properties: it is an unvoiced un-aspirated velar plosive. Now let me go through this, giving examples mainly from Indian languages.
Look at the vocal tract like this: this is my upper cavity, this is my lower cavity and this is the velum. Along the upper cavity the places are bilabial, dental, alveolar, post-alveolar, palatal, velar, uvular, pharyngeal and glottal; coming from the back, uvular, velar, palatal, post-alveolar, alveolar, then dental, then bilabial. Going in that order, in Bengali, Hindi and most Indian languages we start with the velar consonants [FL]. So you start from the velar sounds, the velar consonants [FL]; the difference among them is unvoiced un-aspirated, unvoiced aspirated, voiced un-aspirated, voiced aspirated. [FL] Then there is a nasal stop at the velar place; when that nasal is produced we call it the velar nasal.
So there is a nasal sound, the velar nasal. I have just forgotten the symbol for the velar nasal; I can go to the IPA chart and find it. The velar nasal symbol looks like an n whose tail is curled back: that is the velar nasal. After the velars, leaving the affricates and the lateral [FL] for later, we have the retroflex sounds in the post-alveolar to palatal region: the tip of the tongue curls back and touches the upper cavity, and that is called a retroflex sound. If I go to the animation, this is the retroflex sound, and the retroflex series also has [FL].
The retroflex sounds are written like ta, but with the bottom of the symbol curled back: that is the retroflex mark. So it is nothing but ta as unvoiced un-aspirated retroflex, unvoiced aspirated retroflex, voiced un-aspirated retroflex and voiced aspirated retroflex [FL]. So depending on the place and the manner I write the symbol. For the retroflex nasal I can write a retroflex n, which is nothing but an n with the hook curled back. After the retroflex series we have the dental sounds [FL]. So we have [FL] in Bengali, Hindi and all the Indian languages: unvoiced un-aspirated, unvoiced aspirated, voiced un-aspirated, voiced aspirated; this series is velar, this is retroflex, this is dental or alveolar.
Now look at the affricate sounds: an affricate is a combination of two sounds. When I produce cha, the tip of the tongue first touches the dental/alveolar region and then glides back to produce the friction. So the tip of the tongue makes the closure to produce the plosive part, and after the plosion the tongue is pulled back and touches the post-alveolar region to produce the friction.
Then we can write the symbol: t is the plosion part, and the 'sh' sound is the friction part, which is post-alveolar or palatal. So an affricate is a plosion plus a friction, two kinds of sound, and that is why it is written with two symbols [FL]: two symbols, two sounds.
Similarly, if it is voiced I can write d together with the voiced friction part: the d is the voiced plosion and the other symbol is the voiced friction. For cha, after the friction there may be a VOT which is aspirated or un-aspirated, so I can write ca and cha [FL] accordingly; with voiced aspiration it becomes jha. In this way all the consonants can be described: a consonant may consist of two symbols, or of a single symbol with a diacritic mark which indicates whether aspiration is present. This aspiration h has two varieties: if it is the plain h it is called unvoiced aspiration, and for voiced aspiration I have to use the symbol of the voiced glottal fricative.
All of that is described here; now I will show you some pictures. If somebody gives me this picture and asks which sound this probably is, I can say it is nothing but a voiced bilabial stop. If it is a voiced bilabial stop, then it is either ba or bha: it may be an aspirated voiced bilabial stop or an un-aspirated voiced bilabial stop. Now, if I record the sound and look at the spectrogram together with the picture, I can see that this one is ka.
Here you can see kha: a burst, then the VOT is aspirated and the VOT is longer; for all aspirated consonants the VOT is longer. So: occlusion, burst, aspiration. If the sound is voiced, then during the occlusion period the vocal cords are vibrating, so there is a voice-bar kind of signal; that is why we say the occlusion is voiced, and this is the voiced stop ga. If it is gha, then in addition there is aspiration, an aspirated VOT; everything else is the same as for ga. Similarly for an affricate: occlusion, then friction, then VOT. If the occlusion is unvoiced, friction is present and the VOT is un-aspirated, it is ca; if the VOT is aspirated, it is cha; if the occlusion portion is voiced, it is jha. In the next class I will open a spectrogram and classify what kind of sound it is: a fricative looks like this, and a vowel appears as a periodic sound.
These are the shapes of the different Bengali vowels; you can look at the other vowel shapes as well. I can then show you the phonetic charts for English and for Bengali: you can see which phonemes exist in English and which in Bengali; I have tabulated all the phonemes of English and all the phonemes of Bengali. There is also a thesis in which we studied the similarity between American English and Bengali English, where we took these charts and worked out the similarities and dissimilarities. Here you can see the two phonemic charts.
In English the plosives are only p, t, k, b, d, g, but in an Indian language such as Bengali there are [FL]: four varieties at each place of articulation. On the other hand Bengali has only 7 vowels, while English has many more; count them in the chart, around 15. Then there are the affricates, voiced and unvoiced fricatives, nasals, laterals, trills and flaps, and in English the nasals, affricates, fricatives, laterals and approximants; there are similar sounds and dissimilar sounds, and you can go through all of that. What matters is how to write them: once I know the symbol, I can describe the place and manner of articulation, and if I know the place and manner of articulation, I can write down the symbol. It has to be practised both ways: if I say a consonant is an unvoiced un-aspirated velar plosive, then I immediately write k, and if I see k, then I immediately know it is an un-aspirated unvoiced velar plosive.
So this is the sound description, and this is how the place and manner of articulation are described in acoustic phonetics. Besides plain vowels and consonants there are other sounds as well, such as approximants, and vowel combinations: semivowels or glides, diphthongs, hiatus, all kinds of sounds.
As an engineer you should know the signal-level difference between a semivowel, a glide, a hiatus and a diphthong. Not so much the linguistic definitions of hiatus and diphthong, which exist, but: if there is a diphthong or a semivowel in the signal, what does the signal look like, because I have to recognize it from the signal. I will show you the signals with Bengali examples. But there is another important thing.
I said that in the F1-F2 plane all the cardinal vowels can be placed; this is called the cardinal vowel diagram. If I place the Bengali vowels, they look like this. Back versus front is captured by F2, and the height, close versus open, is captured by F1. So it is the F1-F2 plane, and I can see that the different vowels, described by their production mechanism in terms of tongue position and tongue height, can be mapped onto the F1-F2 plane.
When I describe the engineering model I will use these facts. Suppose I want to classify between u and e: the distance between them is large, because one is a front vowel and one is a back vowel, so the F2 of one is very low and the F2 of the other is very high, and the F1 values also differ. So depending on the F1 and F2 values I can place the vowels and immediately understand which vowels can be classified from which with better accuracy. You can also plot the cardinal vowel diagram of English: compute F1 and F2 for different samples, plot them in the F1-F2 plane, and find out where each vowel clusters, what a central vowel means and how it changes. All such cluster positions can be analyzed, and that is also important.
Now, for diphthong and hiatus I am not going through the slides; you can read them, they give the linguistic definitions, and there are some examples from Bengali. If you want, I can give you English examples as well; they are there.
If you look at the signal, the red, green, blue and yellow lines are the formant-movement tracks. You can see that the movement of the formants is different for the different vowel-to-vowel combinations, and that is what produces a hiatus, a diphthong and so on. This is the phonetic chart for Bengali, which I prepared for this purpose. In the next class I will open a speech signal in the spectrogram and time-domain views and try to find out what kind of signal it is: whether it is ta, da, dha, pha or wa, we want to classify it.
So thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 08
Hands-on on Acoustic Phonetics
So, in the last class I talked about the manner and place of articulation; I described what the manner and the place of articulation are. Based on the manner and the place of articulation we discussed the IPA symbols and how an IPA symbol is assigned. We also said that all the vowels are classified based on tongue position and tongue height, and I discussed their relationship with F1, F2 and F3.
Today, instead of giving a lecture, I will give you a demonstration of viewing voiced sound. We cannot identify the place of articulation just by viewing the speech samples, but we can identify the manner of articulation, and from there we can see what kind of consonant it is, what kind of vowel it is, how the formants are moving and so on; that is what I will explain.
There are a number of open-source and freely available speech tools; Cool Edit is one of them, and I will list in one slide the many speech processing software packages that are available. Cool Edit Pro, for instance, can be downloaded from the net. Now, when you open a recording; I have recorded one, so let me give you an example.
Let us go through it: this is an example of a recording of English speech. Let us listen to it; this is a voice recorded by an English speaker, some English sentences.
Now, the X axis here shows the samples, the sample number; as discussed in the signal processing part, once you record digital speech the X axis is the sample number, or time, and the Y axis is the amplitude.
Look at the amplitude: in this portion there is no speech and the amplitude of the samples is very low, almost on the zero line; this is the zero-amplitude line. The file is recorded at 44 kHz, 16 bit, mono, so each sample is quantized using 16 bits and has some 16-bit value, and the Y axis shows that value. This is the zero axis, this side is negative, this side is positive, all the samples are plotted, and this is what a sampled speech signal looks like.
This portion is a silence portion, and here there is a voiced portion, and there is a noise portion, a noisy kind of signal; I can magnify it. If you listen, there are some noise signals there. Now, this is the time-domain representation of the speech signal; there is another kind of representation, called the spectral view, which looks rather like an X-ray plate. It is a three-dimensional plot of the speech signal. If you do not have a signal processing background, you can think about it broadly like this: suppose I want a frequency analysis of the signal.
I have a time-domain speech signal s[n], plotted as time versus sample value, or amplitude. Now I want to plot which frequencies the signal contains: say the X axis is frequency, and I want to know the power of each frequency component. That gives a two-dimensional plot, frequency versus power, whereas the first plot is time versus sample amplitude. The problem is that if I plot it this way I lose the time information, and I want the time information to be there. So what do I do? I make a spectrographic plot. How? I discussed it in the signal processing class: on the X axis is time (or sample index), and on the Y axis is frequency. The amplitude, or power, of each frequency at each time is then represented by a colour; it may be colour or black and white. Here black means the power is high and white means there is no power. I am not going into the details of how the spectrogram is made; that I will discuss in the signal processing class.
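If you want to build the same view yourself instead of using the editor, a minimal sketch with SciPy is shown below; the file name is a placeholder, and the 512-sample window is just one reasonable choice, not a value taken from the lecture.

```python
import numpy as np
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt

fs, x = wavfile.read("speech.wav")          # placeholder file name
if x.ndim > 1:
    x = x[:, 0]                             # keep one channel if stereo
x = x.astype(float)

# Short-time Fourier analysis: time on X, frequency on Y, power as colour.
f, t, Sxx = signal.spectrogram(x, fs=fs, nperseg=512, noverlap=384)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto", cmap="gray_r")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram: dark regions carry more power")
plt.show()
```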
So, in this picture some portions are very dark; those portions contain high power at those frequencies. The Y axis is the frequency axis: 2 kHz, 4 kHz, 6 kHz, 8 kHz and so on; since the sampling frequency is around 44 kHz, the axis goes up to about 22 kHz, which is fs/2. Now, the black portions contain the high power, and within 2 kHz most of the regions are black, so that region contains high power. Next, take a noise-like sound; if you listen to it, it is noise. For a noise-like sound the power is distributed over all frequencies: some portions are darker, some lighter, but more or less all frequencies carry some power. So I can say that for random noise all frequencies have some power; this is random noise.
Now, which manner of articulation generates that kind of random noise? A fricative, or aspiration: if it is a fricative there will be friction noise. So I can easily say this is nothing but a fricative. Here, on the other hand, there is no power at all at any frequency; if you look at the time domain, the sample values are zero, which means this portion is silence. So this portion is silence and this portion is a fricative. If I zoom out: this is a fricative, and this portion is voiced. It may contain a voiced consonant, a vocalic sound, a vowel, a diphthong; I do not know exactly, but it contains some vocalic sound, and a vocalic sound looks like this.
Again, if someone asks you to spot the fricative sounds in this spectrogram, I can easily say this was a fricative sound and this was a fricative sound. Asked to find the vocalic regions, I can say from here to here is one vocalic region, from here to here is another, and from here to here is another.
Now, interestingly, if you go to the time domain here, there is a silence and then a burst-like noise. Going to the spectrogram: no voicing, then a burst, then aspiration, then voicing. So I can say this is probably a plosive consonant region; I do not know which one it is, but I can say it is a consonant of plosive nature, because there is a burst, and since there is no aspiration here I can say it is an un-aspirated plosive. So by looking at the spectrogram and the time-domain waveform of the signal, I can find out some aspects of the manner of articulation of the phoneme.
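The same reasoning can be turned into a very rough frame labeller: low energy suggests silence, high energy with a high zero-crossing rate suggests a noise-like (fricative) region, and high energy with a low zero-crossing rate suggests a vocalic region. The thresholds below are arbitrary assumptions and would need tuning to the recording:

```python
import numpy as np

def label_frames(x, fs, frame_ms=25, energy_thresh=1e-3, zcr_thresh=0.15):
    """Very rough per-frame labels: 'silence', 'fricative-like' or 'vocalic'."""
    n = int(fs * frame_ms / 1000)
    labels = []
    for start in range(0, len(x) - n, n):
        frame = x[start:start + n]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # crossings per sample
        if energy < energy_thresh:
            labels.append("silence")
        elif zcr > zcr_thresh:
            labels.append("fricative-like")   # noise-like, power spread widely
        else:
            labels.append("vocalic")          # periodic, power mostly below ~2 kHz
        # (a burst would show up as one isolated high-energy frame after silence)
    return labels
```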
Let me show you: suppose I open another recording like this one. If you listen to it [FL], there is a lot of noise, which is why the higher frequencies are present everywhere. If I ignore the noise and consider only this portion, it is a silence portion. Let me do noise removal; the tool provides a facility to remove the noise.
There may be a 50 Hz hum here, so I will use the noise-reduction provision: Transform, Noise Reduction; it is a subtraction-based noise reduction procedure, and I am not going into the details, but it is there. I select the silence portion and mark it as the noise portion, then Transform, Noise Reduction, then Get Profile. This selection is not sufficient to get the profile, so I edit the selection size; I reduce it, and it is still not sufficient, so I select as much as possible, try the noise reduction again, relocate and adjust until it works, then Get Profile, and I get the noise profile. Then I close it, press Ctrl+A to select all, and run Transform, Noise Reduction again.
then task one noise reduction.
So, what it will do; whole signal, this signal is very long that is why it will take time.
Whole signal noise will be subtracted as per the sample noise I have given to the system.
So, there are lot more details are there, signal processing details are there I am not going
that that much of detail; since the signals is noisy that is why I just cleaning the signal
fast.
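What such a tool does is essentially spectral subtraction: estimate the noise magnitude spectrum from the region you marked as silence and subtract it from every frame. A bare-bones sketch is shown below; it ignores windowing, overlap-add and all the refinements a real noise-reduction tool applies, and it assumes the noise region is at least one frame long.

```python
import numpy as np

def spectral_subtract(x, noise, frame=512):
    """Subtract the average noise magnitude spectrum from each frame of x."""
    # Noise profile: average magnitude spectrum over full frames of the noise region.
    nframes = len(noise) // frame
    noise_mag = np.mean([np.abs(np.fft.rfft(noise[i * frame:(i + 1) * frame]))
                         for i in range(nframes)], axis=0)

    out = np.zeros(len(x))
    for start in range(0, len(x) - frame, frame):
        X = np.fft.rfft(x[start:start + frame])
        mag = np.maximum(np.abs(X) - noise_mag, 0.0)          # subtract, floor at 0
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(X)))
    return out
```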
Then I will show you what the [k] looks like: what the occlusion is, what the burst is. Look at this portion now: the noise is reduced, almost gone. Now listen: [k]; it is clear now. He pronounced [k] followed by a vowel, again [k] followed by a vowel, again [k] followed by a vowel. Let me zoom into this portion, and un-zoom a little.
From this portion to this portion is the transitory part, and this is the steady-state vowel part. The black colour gradually increases and then becomes almost steady; then, if you follow the mouse, it starts changing again. So this portion is nothing but the transitory portion: after the vowel, the speaker wants to pronounce [k] again, so this is the transition between vowel and consonant.
So if I am here, in the vowel, and I want to produce a consonant, there will be a vowel-to-consonant transition: the articulators are producing the vowel, and then they want to produce [k]. What is the effect? The vocal cords have to open and the articulator has to close. The vocal cord opening has not started yet, but the articulator is already moving towards closure to produce the [k]; as the articulator moves towards [k] the airflow is reduced, and you can see the formant structure changing from the steady-state vowel towards the consonant.
Then after this there is no noise, no signal, almost silence; this part I call the occlusion. Then there is a burst here, a burst. After the burst there is a VOT, a voice onset time, and here it is very short; if I zoom in you can see that after the burst the vowel has not started yet, so there is a delay between the articulator opening and the start of vocal cord vibration.
So what happens: in this portion there is no signal, then the signal starts, and then the tract again wants to go from [k] to [a], so the articulator has to move away from the [k] position, giving a transitory movement before the steady state is reached. So there is a VOT, and then there is a consonant-to-vowel transition. Now, if I give you a signal, you can say: this portion is the transitory portion, this is the occlusion, this is the burst, this is the VOT, and then this is the consonant-to-vowel transition followed by the steady-state vowel. If you look at the waveform of the steady-state vowel, almost all the periods are the same.
I can see two peaks in each period; so from here to here is one period, or from here to here, and almost all the periods are the same. In the transitory portion the structure is not yet complete, so from here to here is nothing but the consonant-to-vowel transition, after which the vowel reaches its steady state; then it moves again from the steady state towards the next consonant. So just by looking at the waveform or the spectrogram I can identify these regions. I am not showing the formant tracks here, because I have not yet introduced formants in the spectrogram; later on I can show you.
But you can see that the dark bands start breaking up in this portion, so this portion is nothing but a consonant-to-vowel transition. So after seeing a spectrogram I know the structure of a stop consonant: for [k] there is occlusion, burst, VOT, then the vowel, then again occlusion, burst, VOT. Now compare [k] with [kha]: if I zoom into this portion, for [kha] you again see the vowel-to-consonant transition and the occlusion period, then the burst, and then an aspirated VOT, a long VOT; compared to [k] it is a long VOT which is fully aspirated, a noise-like aspiration.
So if there is a long aspiration, I say this is [kha]. This is unvoiced aspiration, which is why it is [kha], with the h marking the aspiration. The difference between [k] and [kha] is that in both the VOT is unvoiced, but for [k] the VOT is un-aspirated and in the case of [kha] the VOT is aspirated; I can tell just by looking. Without listening I cannot say for certain whether it is [k] or [kha], but from the spectrogram I can at least say this is an aspirated plosive consonant or an un-aspirated plosive consonant. I do not know the place of articulation, because to know the place I would have to know how the tongue moved. What is the difference between [k] and [t]? [t] is a dental plosive and [k] is a velar plosive.
So the only difference between [k] and [t] is the tongue movement: one involves a velar closure, while in the other the tip of the tongue touches the teeth, so the cavity structure is different at that time. Looking at the spectrogram I can at least say what the manner of the consonant is. So this is [kha]. Let me go into a little more detail: there is [g], let us look at [g]. [g] is the same in Hindi, Bengali or English: it is a voiced, un-aspirated plosive.
Voiced means there is voicing during the occlusion. So if you look at the occlusion period, there is voicing, and since the voicing is there it is voiced. There is also a burst, and the VOT is short. Now look at [gha]: for [gha] the occlusion period is voiced, the burst is there, and then the aspiration is also voiced; this pronunciation has voiced aspiration, [gha]. Then, I think this one is [cha]: for [cha] you see occlusion, friction, then a very short VOT.
Then you see occlusion, friction, occlusion, friction. Now for [FL] the only difference is that it is an affricate: this is the occlusion, this is the friction part, and it is followed by an aspiration as well [FL]. And for [FL], what is the difference? Only that the occlusion part is voiced instead of unvoiced, and then there is aspiration. So [FL], then [FL]: I cannot tell the place of articulation, but from the manner of articulation seen in the spectrogram, I can say what kind of consonant it is.
Look at the retroflex: I do not know whether this retroflex is [FL] or [FL]; it may be [FL]. Again there is a burst and an unvoiced occlusion. So from this waveform I cannot say whether it is [FL], [FL] or [FL], but I can say, yes, it is an unvoiced, un-aspirated plosive.
So at least the manner I can tell. If you record different consonants with a vowel, you can see what each looks like. If you watch the spectrogram closely, then in the case of a retroflex the formant movement, the formant structure, is distinctive; I cannot show it here, but in the tool it is possible to draw the formants too. The formant movement will be different, so consonants such as [FL], [FL], [FL] and [FL] can only be differentiated by the transitory movement from consonant to vowel or from vowel to consonant.
So the difference between [FL] and [FL] lies only in the vowel-to-consonant and consonant-to-vowel transitory parts: since the movement of the tongue is different, the transition movement of the formants, the frequency movement in the transitory region, will be different. So I have to rely only on this to find out whether it is [FL], [FL] or [FL], and a lot of confusion can therefore happen in signal processing within this group. [FL] differs from [FL] only by aspiration, so if I am able to detect the aspiration, then yes, I can differentiate between [FL] and [FL], because elsewhere there is no signal, the occlusion is silence.
So by studying the manner you can learn what kind of signal processing you should use for what kind of application; you have to know what kind of signal you are getting and how the phonemes behave. The consonant recognition results for English and for Bengali may therefore be different, because Bengali has more stop and plosive consonants than English, and even Hindi has more stop consonants than English. Identifying voiced consonants is easier because sound frequency components are present, but for [FL] the difference lies only in the transitory part: this portion is silence, there is a burst, there is a VOT, and the difference between [FL] and [FL] relies only on the aspiration.
So, for those pairs we may not get high accuracy, for example between [FL] and ka. Now there is another aspect: the coarticulatory effect. What I showed was a carefully pronounced nonsense sequence [FL], where the structure is very well behaved; in continuous pronunciation it changes. Let me show you a Bengali example.
Take continuous speech, say this Bangla utterance [FL]. When I write it there are two words, but in the continuous speech there is no word boundary; it is just continuous speech. This may be a vowel, that may be [FL]; and if I ask what kind of consonant this is, it may again be [FL], because the occlusion period is voiced, there is a burst and there is a transition; I do not know which vowel it is until I listen to it or look at it. Then this may be [FL], because there is friction; in Bengali the palatal pronunciation is used in most cases, so it may be the palatal [FL]. So, as you can see, there is no word boundary, no gap between two words when we speak continuously, although when we write there is a gap between the words.
If I show you English example; if you see there is a no gap I can give you the number all
you can understand. So, if I show you number 7, 5, 4, 6, 1, 2, 9 equal numbers. Now I
can ask you based on the manner and seeing the spectrogram; can you identify the
number? So, this kind of problem people may ask; you that let us we do the serial
crossing later on find out the speech crossing later on. But seeing the number; if I say
that find out what should be the number this one and what should be the number this
one? So, if it is a 7; it should be start by a friction.
144
So, in that way you have to find out whether there is a noticeable difference or not. Now, one more thing I can show you: listen to this speech [FL]. When you hear it, if you do not know Japanese you cannot tell what is spoken, nor recognize the words or the sentence. But if you listen for the basic sounds, you can still identify them. That is why I said earlier that to decipher a language it is important to find the sequence of phonemes that produces a meaningful pronunciation.
So, all the basic sounds are the same, and you can verify it yourself. I will share this sound file; you will find all the basic sounds: there may be a [FL], there may be a [FL], there may be a [FL]. But once I play it, I cannot recognize the words because I do not know the language; the sound structure, however, is the same. That is why we use the IPA, the International Phonetic Alphabet, an alphabet which represents only the sounds and does not depend on the language.
So, [FL] may exist in Bengali, [FL] may exist in Hindi, [FL] may exist in Tamil; that sound may exist in Malayalam, in Japanese, in Chinese, and it can exist in German also. Phonemes are the sounds, and a sequence of phonemes produces the message. Now, there is a problem: some phonemes may exist in Bangla but not in Hindi or in English.
Some phonemes exist in English but not in Bengali. So, suppose I produce speech in English, with a sequence of sounds that follows English grammar; still my English cannot be called British English, the way the British pronounce it. Because my mother tongue, my first language, is Bengali, I learned Bengali phoneme production first, and when I learn English I try to copy that production onto the English language. That is called L1–L2 acquisition.
The details I am not discussing here. There is another part also, segmental and suprasegmental properties; that I will come to later on. What I am describing is a segmental property: what a single segment looks like. So, I request all of you to record or analyze the speech of a few numbers, view it as a spectrogram or time-domain signal, and try to find the manner of articulation of that production. You will also see the vowel-to-vowel transitory effect; I think I have a Bengali sound for this, I will show it.
Here is a Bangla sample with two words [FL]. You can see that one word ends and another begins, but there is no gap; it is coarticulated like a single word. Those details I will come to later on.
The main thing is that there is no word boundary defined in the speech signal; there is no silence marking where one word ends, even though there is a gap between words in writing. All words are linked to each other by the coarticulation effect; prosody then comes in, and silences appear based on the prosodic structure. I am not going into that part now. So, the phonemes have a coarticulation effect: [FL] followed by [FL], so there is a [FL], then a transition to [FL], then a transition to o. All kinds of transitory movements will be there. Now, there exist some combinations which the articulators cannot produce; those combinations will not exist in that language.
For example, you cannot find a case where two aspirated sounds are clubbed together; that is not pronounceable. If there are two, one will be unaspirated and one aspirated. So, depending on the restrictions of tongue movement, some combinations do not exist for a particular language; they are not valid. Those kinds of things you also learn. Then there is what is called phonology: the written word does not always agree with the pronunciation. That part I will discuss with TTS, when I talk about the word-to-phoneme conversion.
So, now I request all of you to record the sounds, listen to them, and find out what kind of consonant each one is; that is how you build up the skill. I said one outcome of this course is that, for a given signal, you should be able to label it. For that skill development you should practice. In the examination I sometimes give this kind of example: some spectrograms and some words, say one, two and seven; given the spectrograms of the three words, identify which is 1, which is 2 and which is 7.
Now, all three are vocalic. One has a nasal murmur going into a vowel; another starts with a stop consonant or plosive followed by two vocalic segments; the third has a fricative sound at the start, a nasal murmur at the end, and a voiced consonant in between. So, I can say the one starting with a fricative is 7, the one starting with a plosive is 2, and the one starting with a vocalic segment is 1; that kind of question I can ask you. Another thing: I have just closed the tool and opened it again; I can show you how to see the pitch, the fundamental frequency. Let us see here; this is the vocalic region. The detailed signal processing I will discuss later.
If you analyze the frequency, this is the frequency analysis. Instead of a linear view I can make it a log view. Now, this axis is the frequency and this is the power in dB, the power of each particular frequency. Let us scan it with a little zoom.
So, this is the first peak; the first peak may be called the fundamental frequency. The methodologies for finding the fundamental frequency of a signal I will discuss later; this is one way of seeing it. Now, I can move the cursor here and scan again; the fundamental frequency will have changed. So, the fundamental frequency moves with time. That I will discuss later on; let us not do it now.
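For practicing this on your own, here is a minimal sketch (Python, using NumPy/SciPy/Matplotlib; the file name and frame position are placeholders, not from the lecture) that plots a spectrogram and the log-power spectrum of one voiced frame, so you can look for the first spectral peak, i.e. the fundamental frequency, in the same way as in the tool shown above.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs, x = wavfile.read("your_recording.wav")    # placeholder file name
x = x.astype(float)
if x.ndim > 1:
    x = x[:, 0]                               # keep one channel if stereo

# Wide-band spectrogram of the whole utterance
f, t, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
plt.pcolormesh(t, f, 10 * np.log10(S + 1e-12), shading="auto")
plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)"); plt.show()

# Log-power spectrum of one 40 ms frame (assumed to lie in a voiced region)
n0, nw = int(0.5 * fs), int(0.04 * fs)
frame = x[n0:n0 + nw] * np.hanning(nw)
spec = 20 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
freqs = np.fft.rfftfreq(nw, 1 / fs)
plt.plot(freqs, spec); plt.xlim(0, 1000)      # the first peak ~ fundamental frequency
plt.xlabel("Frequency (Hz)"); plt.ylabel("Power (dB)"); plt.show()
```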
But you should practice these things: visualize different consonants, label the occlusion period, label the burst, label the VOT, label the steady-state portion of the vowels, then label a diphthong and label vowel–vowel combinations. Those things are in the slides, and once you practice you will be able to relate to them. If you give feedback that you have difficulties, I will take a tutorial again to show you those kinds of consonant and vowel combinations.
Thank you.
148
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 09
Uniform Tube Modeling of Speech Processing Part – 1
The purpose of this part of the course on speech processing is that we should know how human speech is produced. Knowing that, we can model the human speech production system mathematically, which will help us to study the system properties; we will know what kind of speech properties appear in the acoustic signal.
Now, we have said that the human speech production system has two parts. One is the glottal vibration, which is the source; the sound or vibration then passes through the vocal (oral) cavity and produces different kinds of sound depending on the tube or cavity structure. Now we want to develop the model: given that the glottis produces sound, how can we model this tube?
If we design a mathematical model of the human vocal tract system, then by changing its different parameters we can obtain an exact, or at least approximate, model of the human speech production system and its properties.
The human speech production system is complex; it is not a simple single tube. In a single tube, the resonance frequencies depend on the length of the tube.
But in the human vocal tract the cross-sectional area is not fixed as in a rigid tube, nor is it the same throughout the tube. Starting from the lips, the upper cavity is almost fixed, while in the lower cavity the tip of the tongue and the back of the tongue move, as shown in the above figure. So, depending on the sound being produced, the cross-sectional areas of the different portions of the cavity are different.
So, there are two kinds of variation. When we produce speech, the speech signal changes along time; that means the constriction, or the structure of the tube, changes along time. In other words, for a particular time t the cross-sectional area of the tube is not the same in all sections. Let x = 0 at the glottis (vocal cords) and x = L at the lips; the length of the tube is L, but along this length the cross-sectional area is not constant, as shown in the above figure.
For example, when we produce 'k' the back of the tongue touches the upper cavity, and when we produce 'p' the lips close together, so the cavity structure is different. So, for different sounds, produced continuously, the cavity structure also changes with respect to time.
For simplification, let us consider a particular sound with a fixed cavity structure.
Assume the whole human vocal system is nothing but a single tube of length L whose cross-sectional area is fixed, as shown in the above figure; that is, from the glottis (vocal cords) to the mouth the cavity has a uniform cross-sectional area A. This is called uniform tube modeling.
The walls of the cavity are not really rigid: when the air pressure increases the wall deforms and the cross-sectional area changes. For simplification we consider the walls rigid, so the cross-sectional area is uniform throughout the length L. We also neglect friction and viscous losses and assume there is no thermal conduction.
To derive the equations of this human vocal tract as a uniform tube model, let the input be x[n] and the output y[n]; in the transform domain the transfer function is
\[ H(z) = \frac{Y(z)}{X(z)} \]
So, we know how the sound is propagated: the input is provided by the glottis (vocal cords), which after vibrating injects air with a particular velocity; the wave then propagates along the cavity tube and is radiated from the mouth. We consider that the pressure wave propagates linearly (a linear wave equation): the amplitude of the sound is not large enough for the propagation to become non-linear. So, the amplitude is within the limit, the sound propagates linearly, and the medium is homogeneous. Then we can derive the linear wave equation.
Think of the glottis as a piston that produces pressure in the air, and that pressure propagates along the tube. Let us consider a small cubical element inside the tube whose dimensions are Δx, Δy and Δz.
Initially there is atmospheric pressure (P0) inside the tube. When a pressure (P1) is applied, the pressure propagates, and the change in pressure depends on position and time; so the change of pressure is a function of time and position.
When pressure is applied the particles start moving, but there is no net flow, because the medium is homogeneous: when a sound wave propagates, the medium itself does not move. Particle velocity exists, but the average velocity is zero. Let the particle velocity be v(x, t). Since it is a pressure wave, with condensation and rarefaction, the density also changes, say ρ(x, t).
So, we assume a linear wave equation. Now we have to describe the propagation of pressure from x = 0 to x = L: the mathematical equation for the change of pressure with respect to position and time, the equation for particle velocity with respect to position and time, and the relation between the pressure wave and the particle velocity (Refer Time: 14:26).
Assume the inlet pressure is P1 over area A1 and the outlet pressure is P2 over the same area A1 (the tube is uniform).
The difference of force (ΔF) is the cause of motion of this element: from Newton's second law (F = m·a), the resultant force causes an acceleration of the element.
Next, the gas law of thermodynamics relates pressure, volume and temperature under the adiabatic condition. If we apply a pressure on this element the volume changes but the mass does not; the density increases, by conservation of mass. The volume changes because the element is deformed by the applied pressure: if the pressure increases, the volume decreases, while the mass remains the same, under an adiabatic condition.
With the help of these three relations, we will derive the linear wave equation.
(Refer Slide Time: 17:28)
\[ P_1 = p(x,t), \qquad P_2 = p(x+\Delta x, t) \]
\[ p(x+\Delta x, t) = p(x,t) + \frac{\partial p}{\partial x}\Delta x \]
\[ \text{Net force} = A_1\big[p(x,t) - p(x+\Delta x,t)\big] = -A_1\frac{\partial p}{\partial x}\Delta x = -\frac{\partial p}{\partial x}\Delta x\,\Delta y\,\Delta z \qquad (\text{since } A_1 = \Delta y\,\Delta z) \]
Since F = m·a,
\[ -\frac{\partial p}{\partial x}\Delta x\,\Delta y\,\Delta z = m\,\frac{dv}{dt} \]
\[ \frac{dv}{dt} = \frac{\partial v}{\partial t} + v\frac{\partial v}{\partial x}; \quad \text{the term } v\frac{\partial v}{\partial x} \text{ is negligible, so } \frac{dv}{dt} \approx \frac{\partial v}{\partial t} \]
Now, with \(m = \rho\,\Delta x\,\Delta y\,\Delta z\),
\[ -\frac{\partial p}{\partial x}\Delta x\,\Delta y\,\Delta z = \rho\,\Delta x\,\Delta y\,\Delta z\,\frac{\partial v}{\partial t} \]
\[ -\frac{\partial p}{\partial x} = \rho\,\frac{\partial v}{\partial t} \qquad \ldots (1) \]
(Refer Slide Time: 20:31)
Gas law:
\[ PV = nRT \]
where n = moles of air, R = gas constant = 8.314 J/(K·mol), T = temperature (K).
\[ PV = \frac{\rho V}{M}RT \;\Rightarrow\; P = \frac{RT}{M}\rho \]
Since \(c^2 = \dfrac{\gamma RT}{M}\) (γ = specific heat ratio of air = 1.4), this gives
\[ P\gamma = \rho c^2 \]
Under the adiabatic condition,
\[ PV^\gamma = \text{constant} \;\Rightarrow\; \frac{d}{dt}\big[PV^\gamma\big] = 0 \]
\[ V^\gamma\,\frac{\partial p}{\partial t} + P\gamma\,V^{\gamma-1}\,\frac{\partial V}{\partial t} = 0 \]
The element volume is V = A·Δx, so
\[ \frac{\partial V}{\partial t} = A\,\frac{\partial(\Delta x)}{\partial t} = A\,\Delta x\,\frac{\partial v}{\partial x} = V\,\frac{\partial v}{\partial x} \]
So we get
\[ V^\gamma\,\frac{\partial p}{\partial t} + p\gamma\,V^{\gamma-1}\,V\,\frac{\partial v}{\partial x} = 0 \;\Rightarrow\; \frac{\partial p}{\partial t} + p\gamma\,\frac{\partial v}{\partial x} = 0 \]
\[ \frac{\partial p}{\partial t} = -p\gamma\,\frac{\partial v}{\partial x} = -\rho c^2\,\frac{\partial v}{\partial x} \qquad \ldots (2) \quad (\text{since } p\gamma = \rho c^2) \]
Collecting the two equations:
\[ -\frac{\partial p}{\partial x} = \rho\,\frac{\partial v}{\partial t} \qquad \ldots (1) \]
\[ \frac{\partial p}{\partial t} = -\rho c^2\,\frac{\partial v}{\partial x} \qquad \ldots (2) \]
Differentiating (1) with respect to x and (2) with respect to t:
\[ -\frac{\partial^2 p}{\partial x^2} = \rho\,\frac{\partial^2 v}{\partial x\,\partial t}, \qquad \frac{\partial^2 p}{\partial t^2} = -\rho c^2\,\frac{\partial^2 v}{\partial x\,\partial t} \]
Comparing both equations, we get
\[ \frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2} \]
Similarly,
\[ \frac{\partial^2 v}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 v}{\partial t^2} \]
So, this is the wave equation when the wave propagates along the x-axis only. Now we know how the pressure wave is propagated. Next we have to solve for p: this is a second-order differential equation, and solving it gives the expression for p. From that we will try to find the transfer function of this vocal tract tube. So, in the next class we will try to derive the transfer function of the single tube.
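As a quick numerical sanity check of this derivation, the sketch below (Python/NumPy; the Gaussian pulse shape and evaluation point are arbitrary choices, not from the lecture) verifies that a travelling wave of the form p(x, t) = f(t − x/c) satisfies ∂²p/∂x² = (1/c²) ∂²p/∂t² by comparing finite-difference estimates of both sides.

```python
import numpy as np

c = 35000.0                                   # speed of sound in cm/s, as used later in the lecture
f = lambda s: np.exp(-(s * 4000.0) ** 2)      # arbitrary smooth pulse shape

def p(x, t):
    """Forward travelling wave p(x, t) = f(t - x/c)."""
    return f(t - x / c)

x0, t0 = 8.0, 2.5e-4            # evaluation point near the pulse peak (cm, s)
dx, dt = 1e-3, 1e-7             # finite-difference steps

d2p_dx2 = (p(x0 + dx, t0) - 2 * p(x0, t0) + p(x0 - dx, t0)) / dx**2
d2p_dt2 = (p(x0, t0 + dt) - 2 * p(x0, t0) + p(x0, t0 - dt)) / dt**2

print(d2p_dx2, d2p_dt2 / c**2)  # the two numbers should agree closely
```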
Thank you.
160
Digital Speech Processing
Prof. S. K. Das Mandal
Center for Educational technology
Indian Institute of Technology, Kharagpur
Lecture - 10
Uniform Tube Modeling of Speech Processing Part – II
\[ \frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2}, \qquad \frac{\partial^2 v}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 v}{\partial t^2} \]
The volume velocity through the tube is \(u = v\,A\), i.e. \(v = u/A\). Substituting into
\[ \frac{\partial p}{\partial x} = -\rho\,\frac{\partial v}{\partial t} \]
we get
\[ \frac{\partial p}{\partial x} = -\frac{\rho}{A}\,\frac{\partial u}{\partial t} \qquad \ldots (1) \]
(Refer Slide Time: 01:17)
Similarly,
\[ \frac{\partial p}{\partial t} = -\rho c^2\,\frac{\partial v}{\partial x} = -\frac{\rho c^2}{A}\,\frac{\partial u}{\partial x} \quad\Rightarrow\quad \frac{\partial u}{\partial x} = -\frac{A}{\rho c^2}\,\frac{\partial p}{\partial t} \qquad \ldots (2) \]
Equations (1) and (2) are analogous to the electrical transmission line equations:
\[ -\frac{\partial V}{\partial x} = L\,\frac{\partial i}{\partial t}, \qquad -\frac{\partial i}{\partial x} = C\,\frac{\partial V}{\partial t} \]
(Refer Slide Time: 02:04)
We can say that the pressure p in the acoustic domain is analogous to the voltage V in the electrical domain, and the volume velocity u is analogous to the current i.
\[ L = \frac{\rho}{A} \quad\text{(acoustic inductance)}, \qquad C = \frac{A}{\rho c^2} \quad\text{(acoustic capacitance)} \]
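As a small numerical illustration of this transmission-line analogy, the sketch below (Python/NumPy) computes the acoustic inductance, acoustic capacitance and the characteristic impedance ρc/A introduced just below, for the tube area and sound speed used later in the lecture; the density of air is a standard value and an assumption here.

```python
import numpy as np

rho = 1.2e-3      # density of air, g/cm^3 (standard value, not from the lecture notes)
c   = 35000.0     # speed of sound, cm/s
A   = 5.0         # cross-sectional area of the uniform tube, cm^2

L_ac = rho / A            # acoustic inductance  (analogue of L)
C_ac = A / (rho * c**2)   # acoustic capacitance (analogue of C)
z_T  = rho * c / A        # characteristic acoustic impedance of the tube

print(f"L = {L_ac:.3e}, C = {C_ac:.3e}, z_T = {z_T:.3e} (cgs acoustic units)")
print(np.isclose(z_T, np.sqrt(L_ac / C_ac)))   # True: z_T = sqrt(L/C), as in a lossless line
```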
Consider the vocal cords acting as a piston driving a tube of uniform cross-sectional area A, with x = 0 where the force is applied (the glottis) and x = l where the tube radiates the sound energy into the air, as shown in the above figure.
Now we try to find the travelling wave solution in the tube. What is the travelling wave solution?
\[ u(x,t) = u^{+}\!\left(t - \frac{x}{c}\right) - u^{-}\!\left(t + \frac{x}{c}\right) \]
where \(u^{+}(t - x/c)\) is called the forward wave and \(u^{-}(t + x/c)\) is called the backward wave.
Similarly, for the pressure,
\[ p(x,t) = \frac{\rho c}{A}\left[\,u^{+}\!\left(t - \frac{x}{c}\right) + u^{-}\!\left(t + \frac{x}{c}\right)\right] = z_T\left[\,u^{+}\!\left(t - \frac{x}{c}\right) + u^{-}\!\left(t + \frac{x}{c}\right)\right] \]
\[ z_T = \frac{\rho c}{A} \]
This is the characteristic acoustic impedance of the tube (the analogue of the transmission-line impedance).
(Refer Slide Time: 08:51)
Here Ω is the continuous (analog) frequency. If we excite the tube at the glottis with a unit-amplitude sinusoid, we have to find the response at the output for every frequency.
At the lips, assume the low-frequency radiation load is completely absent, so the pressure there is atmospheric: p(l, t) = 0 at x = l.
\[ p(l,t) = z_T\left[\,u^{+}\!\left(t - \frac{l}{c}\right) + u^{-}\!\left(t + \frac{l}{c}\right)\right] = 0 \qquad \ldots (4) \]
Since the differential equations are linear with constant coefficients, the solutions must be complex exponentials, with \(k^{+}\) and \(k^{-}\) the amplitudes of the forward and backward waves:
\[ u^{+}\!\left(t - \frac{x}{c}\right) = k^{+} e^{\,j\Omega(t - x/c)}, \qquad u^{-}\!\left(t + \frac{x}{c}\right) = k^{-} e^{\,j\Omega(t + x/c)} \]
At the glottis the excitation is \(u(0,t) = U_G(\Omega)\,e^{j\Omega t}\), which gives
\[ k^{+} - k^{-} = U_G(\Omega) \qquad \ldots (5) \]
Similarly, applying the lip condition (4):
\[ 0 = z_T\left[\,k^{+} e^{\,j\Omega(t - l/c)} + k^{-} e^{\,j\Omega(t + l/c)}\right] \qquad \ldots (6) \]
Solving (5) and (6) for the two amplitudes:
\[ k^{+} = U_G(\Omega)\,\frac{e^{\,2j\Omega l/c}}{1 + e^{\,2j\Omega l/c}}, \qquad k^{-} = -\,U_G(\Omega)\,\frac{1}{1 + e^{\,2j\Omega l/c}} \]
After putting the above values into the travelling-wave solution, we get
\[ u(x,t) = U_G(\Omega)\,e^{j\Omega t}\,\frac{e^{\,j\Omega(2l - x)/c} + e^{\,j\Omega x/c}}{1 + e^{\,2j\Omega l/c}} = U_G(\Omega)\,e^{j\Omega t}\,\frac{\cos\!\big(\Omega(l - x)/c\big)}{\cos\!\big(\Omega l/c\big)} \]
Similarly,
\[ p(x,t) = j\,z_T\,U_G(\Omega)\,e^{j\Omega t}\,\frac{\sin\!\big(\Omega(l - x)/c\big)}{\cos\!\big(\Omega l/c\big)}, \qquad z_T = \frac{\rho c}{A} \]
The acoustic impedance at any point x inside the tube is therefore
\[ Z = \frac{p(x,t)}{u(x,t)} = j\,z_T\,\tan\!\left[\frac{\Omega(l - x)}{c}\right] \]
For a short section this behaves as
\[ Z \approx j\,\frac{\rho}{A}\,\Delta x\,\Omega \]
so \(\dfrac{\rho\,\Delta x}{A}\) can be thought of as an acoustic mass.
Now we want to interpret how the pressure-wave amplitude and the volume-velocity amplitude vary inside the tube.
Along the x-axis, one quantity varies as a sine and the other as a cosine: the volume velocity follows the cosine and the pressure follows the sine, so where the volume velocity is maximum the pressure is minimum.
\[ u(x,t) = U_G(\Omega)\,e^{j\Omega t}\,\frac{\cos\!\big(\Omega(l - x)/c\big)}{\cos\!\big(\Omega l/c\big)} \]
\[ \mathrm{Re}\big[u(x,t)\big] = U_G(\Omega)\,\frac{\cos\!\big(\Omega(l - x)/c\big)}{\cos\!\big(\Omega l/c\big)}\,\cos(\Omega t) \]
At x = l we get u(l, t), and the transfer function of the tube is
\[ V_a(\Omega) = \frac{u(l,t)}{u(0,t)} \]
\[ u(l,t) = \frac{1}{\cos\!\big(\Omega l/c\big)}\,U_G(\Omega)\,e^{j\Omega t}, \qquad u(0,t) = U_G(\Omega)\,e^{j\Omega t} \]
So,
\[ V_a(\Omega) = \frac{1}{\cos\!\big(\Omega l/c\big)} \]
What do we mean by a pole position? For any transfer function
\[ H(\Omega) = \frac{P(\Omega)}{Q(\Omega)} \]
the solutions of P(Ω) = 0 give the zeros and the solutions of Q(Ω) = 0 give the poles.
When Q(Ω) = 0 the amplitude becomes infinite; that means at a pole the system resonates, and a resonance occurs at every pole. If we have 5 poles, we get 5 resonance frequencies, because Q is of 5th order, giving 5 solutions and hence 5 resonance frequencies.
\[ \cos\frac{\Omega l}{c} = 0 \;\Rightarrow\; \frac{\Omega l}{c} = (2n+1)\frac{\pi}{2}, \qquad n = 0, 1, 2, 3, \ldots \]
(Refer Slide Time: 27:18)
So, at each solution point of Q(Ω), i.e. at every pole, the system response is infinite, meaning the system resonates; and at every solution of P(Ω), which is a zero, the system response is 0. That is why they are called poles and zeros.
Every pole corresponds to a resonance of the tube; in speech those resonances are called formants.
\[ V_a(\Omega) = \frac{1}{\cos\!\big(\Omega l/c\big)} \]
Resonance occurs when \(\cos\dfrac{\Omega l}{c} = 0\). So, at which frequencies is \(\cos\dfrac{\Omega l}{c} = 0\)?
\[ \Omega = 2\pi f, \qquad \cos\theta = 0 \text{ at } \theta = \frac{\pi}{2} \]
\[ \frac{\Omega l}{c} = \frac{\pi}{2} \;\Rightarrow\; \frac{2\pi f l}{c} = \frac{\pi}{2} \;\Rightarrow\; f = \frac{c}{4l} \]
\[ f_1 = \frac{35000}{4 \times 17.5} = 500\ \text{Hz} \]
\[ \text{If } \frac{\Omega l}{c} = \frac{3\pi}{2}, \text{ then } f_2 = \frac{3c}{4l} = 1500\ \text{Hz}; \qquad \text{if } \frac{\Omega l}{c} = \frac{5\pi}{2}, \text{ then } f_3 = \frac{5c}{4l} = 2500\ \text{Hz} \]
So, we can see that the distance between the first formant (f1) and the second formant (f2) is 1 kHz, and similarly between the second formant (f2) and the third formant (f3) it is 1 kHz. It is fixed because the tube length is fixed.
This agrees with the earlier physics result for a tube closed at one end: the first resonance occurs at λ/4, i.e. l = λ/4, so
\[ f = \frac{c}{4l} \]
and the resonances occur at odd multiples:
\[ f = (2n+1)\,\frac{c}{4l} \]
Now draw the frequency response of the uniform tube, assuming there is no loss of any kind. On the frequency axis we mark 500 Hz, 1.5 kHz and 2.5 kHz; at 500 Hz the power is a maximum, and so on.
So, we can put an arrow at the top: it is infinite power, an impulse at 500 Hz with zero bandwidth. Infinite-power resonances occur at 500 Hz, 1.5 kHz and 2.5 kHz when the length of the vocal tube is 17.5 centimeters and the velocity of sound is 35000 centimeters per second.
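As a small check of these numbers, the sketch below (Python/NumPy) evaluates the lossless-tube resonances f = (2n+1)c/4l and the magnitude of V_a for l = 17.5 cm and c = 35000 cm/s.

```python
import numpy as np

c = 35000.0   # speed of sound, cm/s
l = 17.5      # vocal tract length, cm

# Resonance (formant) frequencies of the lossless uniform tube: f = (2n+1) c / (4 l)
for n in range(5):
    print(f"f{n+1} = {(2*n + 1) * c / (4*l):.0f} Hz")   # 500, 1500, 2500, 3500, 4500 Hz

# |V_a(f)| = |1 / cos(2*pi*f*l/c)| blows up (infinite peak, zero bandwidth) at those frequencies
f = np.linspace(0, 5000, 10001)
Va = 1.0 / np.cos(2 * np.pi * f * l / c)
print(f[np.argmax(np.abs(Va))])   # close to one of the resonance frequencies
```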
So, this is the frequency response of the uniform tube model without considering any kind of loss, with rigid walls. In the next lecture I will discuss how this response is affected if we consider one loss at a time. Then ultimately we can obtain the frequency response of this vocal tract tube and its transfer function.
Thank you.
175
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 11
Uniform Tube Modeling Of Speech Processing Part – III
We have derived the transfer function of the tube without considering any loss: infinite power at 500 Hz, 1.5 kHz and 2.5 kHz.
Now, we consider that the tube wall is not rigid. The tube wall can deform if the pressure is high, and because of that the cross-sectional area of the tube changes. Now the area A is also a function of x and t.
So, at any time the cross-sectional area is
\[ A(x,t) = A_0(x,t) + \delta A(x,t) \]
Here \(A_0(x,t)\) is the nominal (fixed) area and \(\delta A(x,t)\) is the perturbation due to the wall motion.
Neglecting second-order terms in \(u/A\) and \(\rho A\), the wave equation becomes
\[ -\frac{\partial p}{\partial x} = \rho\,\frac{\partial\big(u/A_0\big)}{\partial t} \]
The tube wall is made of muscle, and muscle behaves like a spring-mass system. So, it is nothing but a spring-mass mechanical oscillator; writing its second-order equation gives the differential relationship between the area perturbation δA(x, t) and the pressure variation p(x, t).
\[ m_w\,\frac{d^2(\delta A)}{dt^2} + b_w\,\frac{d(\delta A)}{dt} + K_w\,(\delta A) = p(x,t) \qquad \ldots (1) \]
\[ -\frac{\partial p}{\partial x} = \frac{\rho}{A_0}\,\frac{\partial u}{\partial t} \qquad \ldots (2) \]
\[ -\frac{\partial u}{\partial x} = \frac{A_0}{\rho c^2}\,\frac{\partial p}{\partial t} + \frac{\partial(\delta A)}{\partial t} \qquad \ldots (3) \]
Solving these three equations numerically gives the frequency response, represented as
\[ V_a(\Omega) = \frac{U(\Omega)}{U_g(\Omega)} \]
Now, take the length of the tube as 17.5 centimeters and the cross-sectional area as 5 square centimeters, with m_w = 0.5 g/cm², b_w = 6500 dyne·s/cm³ and k_w = 0, where m_w is the mass per unit length, b_w the damping per unit length and k_w the stiffness per unit length of the vocal tract wall.
With these values, instead of the earlier frequency response with infinite energy at 500 Hz, 1.5 kHz and 2.5 kHz, we find complex poles with non-zero bandwidth. Earlier the bandwidth was zero; once the loss is introduced, some bandwidth is generated, so the poles are complex with non-zero bandwidth. The formants also move slightly higher in frequency: instead of 500 Hz it may be about 505 Hz. The formant positions are shifted slightly towards higher frequency, and the effect is largest in the lower band: the low-frequency bandwidths increase more than the high-frequency ones.
Earlier we did not consider friction loss, viscous loss and thermal loss. If we now include friction, thermal conduction at the wall and viscosity, we find that these losses increase the bandwidths of the complex poles and slightly decrease the resonance frequencies. So, ultimately, if we consider the losses and the non-rigid wall, then instead of zero-bandwidth infinite impulses at every resonance frequency we get finite bandwidths, with the formant positions slightly shifted.
But more or less the first resonance remains around 500 Hz, only slightly shifted. So, for a fully open uniform tube the first formant is about 500 Hz, the second 1.5 kHz, the third 2.5 kHz, the fourth 3.5 kHz and the fifth 4.5 kHz. If a signal is band-limited to 4 kHz, we can therefore get up to the 4th formant; we cannot get the fifth formant, because it lies at 4.5 kHz.
Now the second consideration is the effect of radiation at the lips. The air is radiated from the mouth, so at the open end of the tube the acoustic wave is radiated.
180
(Refer Slide Time: 07:00)
So, what are the radiation losses, and what kind of effect do we get due to this radiation? Until now we assumed p(l, t) = 0 at the lips; the acoustical analogue of this is a short circuit. In a transmission line, P = 0 means the voltage is 0, so the output is short-circuited: there is no acoustic load. That is the ideal condition, but it may not hold in practice. So, what is the effect of the air load at the lips, i.e. the radiation effect, on the acoustic wave transmission?
If we put a microphone at the mouth, we can see what kind of signal comes out. Let us say this is the tube and this is the mouth: the sound is radiated here, so the load at the lips is the atmospheric (radiation) load. This load is nothing but an inductive and a resistive load in parallel, L_r and R_r respectively. Now p(l, Ω) is developed across this load, which is represented as
\[ P(l,\Omega) = z_R\,U(l,\Omega), \qquad p(l,t) = z_R\,u(l,t) \]
\[ z_R = \frac{j\Omega L_r\,R_r}{R_r + j\Omega L_r} \]
At very low frequency (Ω → 0), z_R → 0: there is no acoustic or radiation load, it is effectively a short circuit, which is the ideal condition p(l, t) = 0.
So, at low frequency the response obtained after considering the internal losses remains essentially unchanged: low-frequency components are not affected by the radiation load, because at low frequency it does not modify the frequency response of the acoustic tube. The low frequencies are therefore less affected.
At high frequency, z_R → R_r: the load is essentially resistive. A resistive radiation load means there is radiation loss, so the high frequencies are attenuated by the radiation loss.
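A small numerical sketch of this behaviour (Python/NumPy): evaluating the parallel R–L radiation load at several frequencies shows |z_R| → 0 at low frequency (short circuit) and |z_R| → R_r at high frequency (resistive, hence lossy). The parameter values are hypothetical, purely for illustration.

```python
import numpy as np

R_r = 1.0     # radiation resistance (normalized, hypothetical value)
L_r = 1e-4    # radiation inductance (normalized, hypothetical value)

def z_R(omega):
    """Parallel R-L radiation load z_R = jwL_r R_r / (R_r + jwL_r)."""
    jwL = 1j * omega * L_r
    return jwL * R_r / (R_r + jwL)

for f in (50, 500, 5000, 50000, 500000):          # Hz
    w = 2 * np.pi * f
    print(f"f = {f:>6} Hz   |z_R| = {abs(z_R(w)):.4f}")
# |z_R| -> 0 at low frequency (short circuit, p(l,t) ~ 0)
# |z_R| -> R_r at high frequency (resistive load => radiation loss)
```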
182
(Refer Slide Time: 11:55)
So, the frequency response obtained after considering the internal losses, with its finite bandwidths, is further modified by the radiation load: the low frequencies are not affected much, but the high frequencies are attenuated. Let me explain it here.
If we consider the wall vibration, the bandwidths increase and the formants shift slightly upward in frequency, and the bandwidth increase is largest at low frequencies.
183
If we consider the viscous and friction losses, then again the bandwidths increase and the formant frequencies decrease slightly.
Now, if we consider the radiation loss: at high frequencies the radiation load is resistive, so the high-frequency part of the spectrum is suppressed, as shown in the figure below. Due to the radiation loss, the amplitude of the high-frequency components decreases.
184
\[ H(\Omega) = \frac{P(l,\Omega)}{U_g(\Omega)} = \frac{P(l,\Omega)}{U(l,\Omega)}\cdot\frac{U(l,\Omega)}{U_g(\Omega)} = z_R(\Omega)\,V_a(\Omega) \]
When Ω is very high, there is a lot of attenuation at high frequency. From the above graph, if we look at the speech signal, the high-frequency amplitudes are lower, and that is due to the radiation loss.
So, this is the frequency response of the uniform tube when the whole vocal tract is modeled as a single tube: the formant frequencies are 500 Hz, 1.5 kHz and 2.5 kHz, the high frequencies are attenuated due to the radiation loss, and the bandwidths are non-zero because we considered the losses inside the tube.
So, the vocal tract can be characterized by a set of resonances that depend on the vocal tract area function, with shifts due to the losses and the radiation.
The bandwidths of the two lowest resonances, F1 and F2, depend primarily on the vocal tract wall losses. The bandwidths of the higher resonances depend primarily on the viscous friction loss, the thermal loss and the radiation loss, because the wall vibration mostly broadens the low-frequency bandwidths, while the viscous and thermal losses mostly broaden the high-frequency bandwidths.
The next effect is nasal coupling. The oral tract is one cavity; if a cavity produces a sound and we open a hole somewhere in it, the sound changes. Think of a flute: there are holes along its top, and by closing and opening the holes we change the sound.
186
So, if we open a side branch in the cavity, the frequency response changes. In the vocal tract there is such a nasal coupling: once the velum opens, the nasal cavity is coupled with the oral tract, the overall structure changes and the frequency response changes.
At the coupling point the sound pressure is the same at the input of each tube, and the volume velocity is the sum of the volume velocities at the inputs of the nasal and oral cavities.
A closed oral cavity can trap energy at certain frequencies, preventing those frequencies from appearing in the nasal output.
Think about producing a nasal consonant: the oral cavity is totally closed and the air comes out through the nose. The sound travels along the nasal path, and some sound pressure is trapped by the closed oral cavity; that trapped energy produces anti-resonance frequencies (zeros).
Nasal resonances also have broader bandwidths than non-nasal voiced sounds, due to the greater viscous friction and thermal loss caused by the large surface area of the nasal cavity.
So, anti-resonances are introduced (points where the response touches zero), and compared to oral sounds the nasal formant bandwidths are larger.
Let us stop here. In the next class I will start the next topic: how the excitation excites the vocal tract. We will derive the two boundary conditions, at the lips and at the glottis; after deriving them, we will derive the total transfer function in the digital domain and try to implement the uniform tube model. Then we will go for the multi-tube modeling of the human vocal tract.
Thank you.
187
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 12
Uniform Tube Modeling of Speech Processing Part – IV
In the last class we discussed the effect of the different kinds of losses and of nasal coupling on the sound spectrum, and also derived the transfer function of the uniform single-tube vocal tract.
188
As we can see in the above picture, the airflow from the lungs passes through the vocal cords; the vocal cords create an obstruction in the airflow, and due to this airflow the vocal cords vibrate. This vibration passes through the vocal tract and produces the sound. So, this is the vocal tract model, excited by the vocal cord vibration, and the vocal cord vibration can be simulated electrically as shown in the above figure. That means the air pressure coming from the lungs builds up while the vocal cords are closed; once the vocal cords start opening, the increased pressure forces airflow through the opening, that airflow causes the vocal cords to vibrate, and that vibration creates the sound.
As we know, the volume velocity u is analogous to the electrical current i. If the vocal tract is excited by a pressure P_g coming from the lungs, that pressure will cause a volume velocity u in the vocal tract.
Let us assume the resistive and inductive impedance of the glottis is R_G and L_G respectively. When the pressure passes through that impedance it creates the glottal volume velocity u_g, which is the input to the vocal tract system. The glottis is a time-varying system: when it is completely closed, u_g = 0, meaning no volume velocity because the airflow is completely stopped, and the impedance there is infinite.
\[ \text{Impedance} = \frac{1}{A_G} \]
Now we know the vocal tract transfer function, the load (boundary) condition at the lips, which is the radiation load, and the input condition, the vocal cord load into the tube. Can we now implement that tube with circuits, or in the digital domain?
189
(Refer Slide Time: 05:12)
So, let the vocal tract be a uniform tube of length L; there is a radiation load z_L at the lips, and at the glottis a pressure source P_g with impedance z_g. This is the complete electrical model, as shown in the above figure. Now, if P_g is replaced by a current source u_g, then the impedance z_g comes in parallel. This is the complete model.
Considering all of this, can we derive the mathematical model of this tube using the two boundary conditions?
190
(Refer Slide Time: 08:00)
From the above figure, the current coming out is u_T(l, t), made up of the forward wave u_T⁺(l, t) and the backward wave u_T⁻(l, t). So, u_T(l, t) passes through z_L and creates the pressure p_L(l, t) across the load.
At x = l,
\[ p_L(l,t) = p(l,t) \]
\[ z_L\left[u_T^{+}\!\left(t-\tfrac{l}{c}\right) - u_T^{-}\!\left(t+\tfrac{l}{c}\right)\right] = z_T\left[u_T^{+}\!\left(t-\tfrac{l}{c}\right) + u_T^{-}\!\left(t+\tfrac{l}{c}\right)\right] \]
From this equation we want to find the backward wave \(u_T^{-}(t + l/c)\).
191
(Refer Slide Time: 11:20)
\[ z_T\,u_T^{-}\!\left(t+\tfrac{l}{c}\right) + z_L\,u_T^{-}\!\left(t+\tfrac{l}{c}\right) = z_L\,u_T^{+}\!\left(t-\tfrac{l}{c}\right) - z_T\,u_T^{+}\!\left(t-\tfrac{l}{c}\right) \]
\[ u_T^{-}\!\left(t+\tfrac{l}{c}\right)\big[z_T + z_L\big] = u_T^{+}\!\left(t-\tfrac{l}{c}\right)\big[z_L - z_T\big] \]
\[ u_T^{-}\!\left(t+\tfrac{l}{c}\right) = \frac{z_L - z_T}{z_T + z_L}\,u_T^{+}\!\left(t-\tfrac{l}{c}\right) \]
Say
\[ R_L = \frac{z_T - z_L}{z_T + z_L} \]
So,
\[ u_T^{-}\!\left(t+\tfrac{l}{c}\right) = -R_L\,u_T^{+}\!\left(t-\tfrac{l}{c}\right) \]
\[ u_L(t) = u_T^{+}\!\left(t-\tfrac{l}{c}\right) - u_T^{-}\!\left(t+\tfrac{l}{c}\right) = \big[1 + R_L\big]\,u_T^{+}\!\left(t-\tfrac{l}{c}\right) \]
192
(Refer Slide Time: 12:21)
193
(Refer Slide Time: 16:51)
Let us say the tube is nothing but a delay; this is the vocal tract, and it carries a forward wave and a backward wave.
In the delay line for the forward wave, u_T⁺(0, t) at x = 0 becomes, after the delay, u_T⁺(l, t) at x = l; for the backward wave, u_T⁻(l, t) at x = l becomes, after the delay, u_T⁻(0, t) at x = 0, as we can see in the above figure.
\[ u_L(t) = u_T(l,t) = \big[1 + R_L\big]\,u_T^{+}\!\left(t - \tfrac{l}{c}\right) \]
So, at the lip boundary, [1 + R_L] multiplies the forward wave and [−R_L] multiplies the wave reflected back, as we can see in the above figure.
194
Now, from the above slide, if Z_L → 0 (no radiation load), then R_L = 1, where
\[ R_L = \frac{z_T - Z_L}{z_T + Z_L} \]
The lip side is characterized by Z_L; at the glottis side the pressure source is replaced by a current source, which is nothing but the volume velocity u_g(0, t) or u_g(t), and the glottal impedance z_g comes in parallel. The signal flow is again a forward wave and a backward wave.
\[ u_g(0,t) = u_g(t) - \frac{p(0,t)}{z_g} = u_g(t) - \frac{1}{z_g}\,z_T\big[u_T^{+}(0,t) + u_T^{-}(0,t)\big] = u_g(t) - \frac{z_T}{z_g}\big[u_T^{+}(t) + u_T^{-}(t)\big] \]
195
(Refer Slide Time: 24:22)
\[ u_T^{+}(t) - u_T^{-}(t) = u_g(t) - \frac{z_T}{z_g}\big[u_T^{+}(t) + u_T^{-}(t)\big] \]
\[ u_T^{+}(t) = \frac{u_g(t)}{1 + \dfrac{z_T}{z_g}} + \frac{1 - \dfrac{z_T}{z_g}}{1 + \dfrac{z_T}{z_g}}\,u_T^{-}(t) = \frac{z_g}{z_g + z_T}\,u_g(t) + \frac{z_g - z_T}{z_g + z_T}\,u_T^{-}(t) \]
Now, with
\[ R_g = \frac{z_g - z_T}{z_g + z_T}, \qquad \frac{z_g}{z_g + z_T} = \frac{1 + R_g}{2}, \]
\[ u_T^{+}(t) = \frac{1 + R_g}{2}\,u_g(t) + R_g\,u_T^{-}(t) \]
(Refer Slide Time: 27:19)
So, this is the tube-and-delay signal-flow diagram for the forward and backward waves, incorporating the boundary conditions at the lips and at the glottis. Here each delay can be replaced by z⁻ⁿ, where n is the number of sample delays.
Suppose I have a 17.5-centimeter-long tube and the sound velocity is 35000 centimeters per second; the time taken to traverse the tube length is then:
\[ t = \frac{17.5}{35000} = \frac{1}{2000}\ \text{s} = 0.5\ \text{ms} \]
Now, to implement the above circuit digitally: if the signal is sampled at 8 kHz, a single sample delay is 1/8000 s = 0.125 ms, but we require 0.5 ms, so we can calculate the number of unit delays required (delay = z⁻ⁿ; here n = 0.5/0.125 = 4). So, we get the digital circuit if we know R_L and R_g.
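A minimal sketch of this arithmetic (Python), computing the one-way tube delay and the number of unit delays z⁻¹ needed at an 8 kHz sampling rate:

```python
l = 17.5        # tube length, cm
c = 35000.0     # speed of sound, cm/s
fs = 8000.0     # sampling rate, Hz

t_tube = l / c                 # one-way propagation time through the tube, s
t_sample = 1.0 / fs            # duration of one sample delay, s
n = t_tube / t_sample          # number of unit delays z^-1 needed

print(t_tube * 1e3, "ms")      # 0.5 ms
print(n)                       # 4.0 -> the tube delay is z^-4 at 8 kHz
```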
So, next class will discuss about the multi tube modeling.
Thank you.
198
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 13
Uniform Tube Modeling of Speech Processing Part – V
If we look at the speech production system of a human being as shown in the above figure, the red mark is our vocal tract. We can see that the cross-sectional area of the vocal tract is not uniform throughout the tube: constrictions are made, so from x = 0 to x = l the cross-sectional area is not uniform.
199
Now, instead of a single uniform tube with a fixed cross-sectional area A, we want to build the model from multiple tubes of different cross-sectional areas connected to each other, as shown in the figure below.
Let the left side be the glottis and the right side the lips, and the whole structure the vocal tract. So, the whole vocal tract, instead of being modeled by a single uniform tube, can be modeled as a number of uniform tubes connected to each other: N tubes, the k-th one of length l_k.
We have assumed 5 tubes of lengths l1, l2, l3, l4 and l5 and areas A1, A2, A3, A4 and A5 respectively, as shown in the above figure, where the lengths of all tubes are the same (l1 = l2 = l3 = l4 = l5) but the cross-sectional areas are different (A1 ≠ A2 ≠ A3 ≠ A4 ≠ A5).
Now we consider the whole structure as a set of connected tubes instead of a single tube. If it were a single tube, we could derive the signal-flow equations from input to output and find the transfer function. But now there are junctions as well. So, the main purpose of the multi-tube modeling is to find out what happens at each junction: when the cross-sectional area changes, what happens to the volume velocity and the pressure wave at that junction.
201
So, let there be two tubes, the k-th tube and the (k+1)-th tube, of lengths l_k, l_{k+1} and areas A_k, A_{k+1} respectively. We want to know what happens at the junction of these two tubes, as shown in the above figure.
The forward-wave volume velocity at the inlet of the k-th tube is u_k⁺(t), and at the junction it is u_k⁺(l_k, t). For the backward wave, the volume velocity at the junction is u_k⁻(l_k, t), and at the inlet it is u_k⁻(t).
At the inlet of the (k+1)-th tube, the forward-wave volume velocity is u_{k+1}⁺(t) and the backward-wave volume velocity is u_{k+1}⁻(t); at its outlet, the forward-wave volume velocity is u_{k+1}⁺(l_{k+1}, t) and the backward-wave volume velocity is u_{k+1}⁻(l_{k+1}, t).
\[ u_k^{+}(l_k, t) = u_k^{+}(t - \tau_k), \qquad \tau_k = \frac{l_k}{c}, \quad \tau_{k+1} = \frac{l_{k+1}}{c} \]
The same holds for the backward wave, as shown in the figure below.
202
(Refer Slide Time: 07:39)
Suppose water pipes of different cross-sectional areas are joined together and water flows from one point to another. At the junction, some volume velocity is injected from the k-th tube into the (k+1)-th tube in the forward direction, while in the backward direction some volume velocity is injected from the (k+1)-th tube into the k-th tube.
So, the purpose is to find how much forward wave is injected by the k-th tube into the (k+1)-th tube, and how much backward wave is injected by the (k+1)-th tube into the k-th tube, i.e. u_{k+1}⁺(t) and u_k⁻(t + τ_k).
203
(Refer Slide Time: 09:49)
\[ u_{k+1}(x,t) = u_{k+1}^{+}(t) - u_{k+1}^{-}(t) \]
At x = 0:
\[ u_{k+1}(0,t) = u_{k+1}^{+}(t) - u_{k+1}^{-}(t) \qquad \ldots (1) \]
Pressure wave:
\[ p_{k+1}(0,t) = z_{k+1}\big[u_{k+1}^{+}(t) + u_{k+1}^{-}(t)\big] \qquad \ldots (2) \]
At x = l_k (the outlet of the k-th tube):
\[ u_k(l_k,t) = u_k^{+}(t - \tau_k) - u_k^{-}(t + \tau_k) \qquad \ldots (3) \]
\[ p_k(l_k,t) = z_k\big[u_k^{+}(t - \tau_k) + u_k^{-}(t + \tau_k)\big] \qquad \ldots (4) \]
where
\[ z_{k+1} = \frac{\rho c}{A_{k+1}}, \qquad z_k = \frac{\rho c}{A_k} \]
Now, the volume velocity exiting the k-th tube is the volume velocity entering the (k+1)-th tube (i.e. equation (1) = equation (3)):
\[ u_{k+1}^{+}(t) - u_{k+1}^{-}(t) = u_k^{+}(t - \tau_k) - u_k^{-}(t + \tau_k) \qquad \ldots (5) \]
Similarly, the output pressure of the k-th tube equals the input pressure of the (k+1)-th tube (equation (2) = equation (4)):
\[ z_{k+1}\big[u_{k+1}^{+}(t) + u_{k+1}^{-}(t)\big] = z_k\big[u_k^{+}(t - \tau_k) + u_k^{-}(t + \tau_k)\big] \qquad \ldots (6) \]
We have to solve equations (5) and (6) to find u_{k+1}⁺(t) and u_k⁻(t + τ_k):
\[ u_{k+1}^{+}(t) = \frac{2 z_k}{z_k + z_{k+1}}\,u_k^{+}(t - \tau_k) + \frac{z_k - z_{k+1}}{z_k + z_{k+1}}\,u_{k+1}^{-}(t) \qquad \ldots (7) \]
\[ r_k = \frac{z_k - z_{k+1}}{z_k + z_{k+1}} \]
where r_k is called the reflection coefficient. Then
\[ u_{k+1}^{+}(t) = \big[1 + r_k\big]\,u_k^{+}(t - \tau_k) + r_k\,u_{k+1}^{-}(t) \]
206
(Refer Slide Time: 19:35)
\[ u_{k+1}^{+}(t) = \big[1 + r_k\big]\,u_k^{+}(t - \tau_k) + r_k\,u_{k+1}^{-}(t) \]
\[ u_k^{-}(t + \tau_k) = -r_k\,u_k^{+}(t - \tau_k) + \big[1 - r_k\big]\,u_{k+1}^{-}(t) \]
Now we want to draw the junction diagram from these two equations. We consider four points at the junction: u_{k+1}⁺(t) goes in the forward direction (forward-wave output volume velocity) and u_k⁻(t + τ_k) goes in the backward direction (backward-wave output volume velocity). The input in the forward direction is u_k⁺(t − τ_k) and the input in the backward direction is u_{k+1}⁻(t). From the above expressions, u_k⁺(t − τ_k) is multiplied by (1 + r_k) and u_{k+1}⁻(t) by r_k, and they are added to give the forward output volume velocity. Similarly, for the backward output volume velocity, u_{k+1}⁻(t) is multiplied by (1 − r_k) and u_k⁺(t − τ_k) by −r_k, and they are added. This gives the junction diagram shown in the above figure.
What is r_k?
\[ r_k = \frac{z_k - z_{k+1}}{z_k + z_{k+1}} \]
Since \(z_{k+1} = \dfrac{\rho c}{A_{k+1}}\) and \(z_k = \dfrac{\rho c}{A_k}\),
\[ r_k = \frac{\dfrac{\rho c}{A_k} - \dfrac{\rho c}{A_{k+1}}}{\dfrac{\rho c}{A_k} + \dfrac{\rho c}{A_{k+1}}} = \frac{A_{k+1} - A_k}{A_{k+1} + A_k} \]
From the diagram, the k-th tube has length l_k, so the delay is τ_k in both the upper and lower lines. The inputs of the k-th tube are u_k⁺(t) and u_k⁻(t); the delay of the (k+1)-th tube is τ_{k+1}, and its outputs are the correspondingly delayed forward and backward waves. So, this is the complete junction signal-flow diagram, as we can see in the above figure.
If there are N tubes, then there are N − 1 junctions. Now, let us consider the limiting values of r_k.
208
\[ r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k} \]
If \(A_{k+1} \gg A_k\), then \(r_k \to 1\): when the cross-sectional area of the second tube is much larger than that of the first tube, the reflection coefficient at the junction approaches 1.
If \(A_{k+1} \ll A_k\), then \(r_k \to -1\).
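A small sketch of these limits (Python; the area values are arbitrary illustrations, not from the lecture):

```python
def reflection_coefficient(A_k, A_k1):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) at the junction of two tube sections."""
    return (A_k1 - A_k) / (A_k1 + A_k)

# Arbitrary illustrative areas in cm^2
print(reflection_coefficient(1.0, 1.0))     #  0.0  -> equal areas, no reflection
print(reflection_coefficient(1.0, 8.0))     #  0.78 -> widening tube, r_k moves toward +1
print(reflection_coefficient(8.0, 1.0))     # -0.78 -> narrowing tube, r_k moves toward -1
print(reflection_coefficient(1.0, 1e6))     #  ~1.0 (open-end limit)
print(reflection_coefficient(1e6, 1.0))     # ~-1.0 (closed-end limit)
```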
Now suppose we have N tubes of equal length; what should the diagram look like? As we know, the delay is τ = l/c, and since l1 = l2 = l3 = … = l_N, we have τ1 = τ2 = τ3 = … = τ_N.
209
Let the sampling period be T₁ = τ; for example, at an 8 kHz sampling rate, T₁ = 1/8 ms.
210
So, in the above figure, each delay τ_k is replaced by z⁻¹.
Now, considering the boundary conditions, we can derive the complete transfer function of this tube model in the z-domain.
In the next class we will derive the total vocal tract transfer function for the N-tube model: instead of a single tube, we consider multiple tubes of different cross-sectional areas connected to each other.
Thank you.
211
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 14
Uniform Tube Modeling of Speech Processing Part – VI
In the last class we considered the vocal tract as nothing but several tubes of different cross-sectional areas joined at junctions, and we discussed the junction effect.
Here we have assumed that each tube has a fixed length Δx, and the delay line τ can be represented by z⁻¹ because the sampling period T₁ = τ. So, the above diagram is the junction signal-flow diagram.
212
Now consider that the tube has N sections and the total tube length is l.
\[ \text{Length of each equal section: } x = \frac{l}{N}, \qquad \text{delay: } \tau = \frac{x}{c} = \frac{l}{Nc} \]
If there are N tubes, there are N − 1 junctions. So, we can draw the diagram as shown in the above figure, where u_G[n] is the input, then the first junction, the second junction, up to the (N − 1)-th junction. There are two boundary conditions, one at the lips and another at the glottis.
Each section creates a delay τ, so the N sections create a total delay Nτ: the first arrival of the signal at the output is delayed by Nτ. Because of back propagation, each subsequent arrival is delayed by a further 2τ (Refer Time: 04:23); the arrivals occur at Nτ plus integer multiples of 2τ, where 2τ is the round-trip delay of one section.
213
(Refer Slide Time: 04:48)
So, we take the sampling period T = 2τ. Suppose we want to handle a band-limited signal whose maximum frequency is 5 kHz. What is the sampling frequency? By the Nyquist criterion it is 2 × 5 = 10 kHz. So, for a 5 kHz band-limited signal the Nyquist rate is 10 kHz, and to meet the Nyquist criterion with this 2τ delay, 2τ must equal the sampling period T. Then how do we derive V(z)?
214
where
\[ V(z) = \frac{U_L(z)}{U_G(z)} = ? \]
If z⁻¹ is a single-sample delay and the sample period is T = 2τ, then a delay of τ corresponds to z⁻¹ᐟ²; so all the τ delays are replaced by z⁻¹ᐟ², and we get the signal-flow diagram shown above.
Once we have this signal-flow diagram, we can derive the transfer function V(z) of the tube, which is nothing but output over input, U_L(z)/U_G(z), where U_G(z) is the glottal excitation (e.g. an impulse) and U_L(z) is the output at the lips. So, we can derive this transfer function.
We consider that each junction acts as a lattice section; as you can see, they are symmetrical to each other, from r1 to r_N all of the same form.
Now let U_g[n] come to the first junction; we first draw that junction and write down the signal flows. At the glottis there is a termination, then a delay z⁻¹ᐟ², then the next junction, another delay z⁻¹ᐟ², and so on up to the (N−1)-th and N-th junctions. At the end is the lip termination; these give the boundary conditions.
We already derived, in the last class, the volume velocity injected from one tube into the next; in the z-domain:
\[ U_{k+1}^{+}(z) = (1 + r_k)\,z^{-\frac{1}{2}}\,U_k^{+}(z) + r_k\,U_{k+1}^{-}(z) \]
215
So,
\[ U_k^{+}(z) = \frac{z^{\frac{1}{2}}}{1 + r_k}\,U_{k+1}^{+}(z) - \frac{r_k\,z^{\frac{1}{2}}}{1 + r_k}\,U_{k+1}^{-}(z) \]
Similarly, from the backward-wave junction equation (the backward wave picks up the same half-sample delay travelling back through the k-th section),
\[ U_k^{-}(z) = -r_k\,z^{-1}\,U_k^{+}(z) + (1 - r_k)\,z^{-\frac{1}{2}}\,U_{k+1}^{-}(z) \]
we obtain
\[ U_k^{-}(z) = \frac{-r_k\,z^{-\frac{1}{2}}}{1 + r_k}\,U_{k+1}^{+}(z) + \frac{z^{-\frac{1}{2}}}{1 + r_k}\,U_{k+1}^{-}(z) \]
Writing the forward and backward waves as a vector,
\[ U_k = \begin{bmatrix} U_k^{+}(z) \\ U_k^{-}(z) \end{bmatrix}, \qquad U_{k+1} = \begin{bmatrix} U_{k+1}^{+}(z) \\ U_{k+1}^{-}(z) \end{bmatrix} \]
\[ Q_k = \frac{1}{1 + r_k}\begin{bmatrix} z^{\frac{1}{2}} & -r_k\,z^{\frac{1}{2}} \\ -r_k\,z^{-\frac{1}{2}} & z^{-\frac{1}{2}} \end{bmatrix} = \frac{z^{\frac{1}{2}}}{1 + r_k}\begin{bmatrix} 1 & -r_k \\ -r_k\,z^{-1} & z^{-1} \end{bmatrix} = \frac{z^{\frac{1}{2}}}{1 + r_k}\,\hat{R}_k \]
\[ R_k = \frac{z^{\frac{1}{2}}}{1 + r_k}\,\hat{R}_k, \qquad Q_k = R_k \]
Now, \(U_k = R_k\,U_{k+1}\), and therefore
\[ U_1 = R_1 R_2 R_3 R_4 \cdots R_N\,U_{N+1} \]
(Refer Slide Time: 12:29)
(Refer Slide Time: 16:31)
Now, in the signal-flow diagram there are N − 1 symmetric junctions. At the lip end the termination is not a symmetric junction; the only difference is that there is no backward wave coming in there. Let us set that incoming backward wave to 0; then one extra block can be added, shown by the dotted line in the figure below. So now there are N blocks instead of N − 1, with the input backward wave of the last block equal to 0.
218
So, we can write
\[ U_{N+1} = \begin{bmatrix} U_L(z) \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} U_L(z) \]
\[ U_1 = R_1 R_2 R_3 R_4 \cdots R_N\,U_{N+1} = R_1 R_2 R_3 R_4 \cdots R_N \begin{bmatrix} 1 \\ 0 \end{bmatrix} U_L(z) \]
At the glottis,
\[ U_G(z) = \frac{2}{1 + r_G}\,U_1^{+}(z) - \frac{2 r_G}{1 + r_G}\,U_1^{-}(z) = \frac{2}{1 + r_G}\begin{bmatrix} 1 & -r_G \end{bmatrix} U_1(z) \]
\[ U_G(z) = \frac{2}{1 + r_G}\begin{bmatrix} 1 & -r_G \end{bmatrix} R_1 R_2 R_3 R_4 \cdots R_N \begin{bmatrix} 1 \\ 0 \end{bmatrix} U_L(z) \]
We have to find \(V(z) = \dfrac{U_L(z)}{U_G(z)}\), so
\[ \frac{1}{V(z)} = \frac{U_G(z)}{U_L(z)} = \frac{2}{1 + r_G}\begin{bmatrix} 1 & -r_G \end{bmatrix} \left(z^{N/2}\prod_{k=1}^{N}\frac{1}{1 + r_k}\right)\left(\prod_{k=1}^{N}\hat{R}_k\right)\begin{bmatrix} 1 \\ 0 \end{bmatrix} \]
where
\[ \hat{R}_k = \begin{bmatrix} 1 & -r_k \\ -r_k\,z^{-1} & z^{-1} \end{bmatrix} \]
219
(Refer Slide Time: 19:42)
220
(Refer Slide Time: 23:43)
So, for N = 2:
\[ \frac{1}{V(z)} = \frac{2}{1 + r_G}\,z^{2/2}\,\frac{1}{(1 + r_1)(1 + r_2)}\begin{bmatrix} 1 & -r_G \end{bmatrix}\begin{bmatrix} 1 & -r_1 \\ -r_1 z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 & -r_2 \\ -r_2 z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} \]
\[ = \frac{2\,z}{(1 + r_G)(1 + r_1)(1 + r_2)}\begin{bmatrix} 1 & -r_G \end{bmatrix}\begin{bmatrix} 1 + r_1 r_2 z^{-1} \\ -r_1 z^{-1} - r_2 z^{-2} \end{bmatrix} = \frac{2\,z\left[1 + (r_1 r_2 + r_G r_1)\,z^{-1} + r_G r_2\,z^{-2}\right]}{(1 + r_G)(1 + r_1)(1 + r_2)} \]
221
This gives V(z) for the two-tube vocal tract.
If you look at this transfer function, it has zeros only at the origin, but it has 2 poles, i.e. a second-order denominator. So, for an N-tube model the denominator is of order N, while all the zeros sit at the origin (they contribute only delay); the whole V(z) can therefore be treated as an all-pole model.
\[ \hat V(z) = \frac{G}{D(z)}, \qquad D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
So, instead of keeping the z⁻¹ᐟ² (half-sample) delays and the trivial zeros explicitly, we can redraw the signal-flow diagram as an all-pole model:
\[ \hat V(z) = z^{N/2}\,V(z) = \frac{G}{1 + \sum_{k=1}^{N}\alpha_k z^{-k}} \]
So, we can write down the signal-flow diagram as an all-pole model. Mathematically it is proved that the vocal tract transfer function can be simplified to an all-pole digital filter if we know the α_k, which are expressed in terms of the reflection coefficients. If we know all the reflection coefficients, then we can compute this D(z); so it is possible to implement D(z) using a digital filter.
223
(Refer Slide Time: 29:58)
We know
\[ D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
If \(r_G = 1\) (i.e. \(z_G = \infty\)), the recursion for D(z) starts from
\[ D_0(z) = 1 \]
and after N steps gives \(D(z) = D_N(z)\).
Thank you.
224
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 15
Uniform Tube Modeling of Speech Processing Part – VII
So, we have derived the all-pole model for the vocal tract:
\[ \hat V(z) = \frac{G}{D(z)}, \qquad D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
225
(Refer Slide Time: 00:26)
So, for the two-tube model,
\[ D(z) = 1 + (r_1 r_2 + r_1 r_G)\,z^{-1} + r_2 r_G\,z^{-2}, \qquad r_G = \frac{z_G - z_T}{z_G + z_T} \]
With \(r_G = 1\),
\[ D(z) = 1 + (r_1 r_2 + r_1)\,z^{-1} + r_2\,z^{-2} \]
The denominator can be built up recursively:
\[ D_0(z) = 1 \]
\[ D_1(z) = 1 + r_1 z^{-1} = D_0(z) + r_1\,z^{-1} D_0(z^{-1}) \]
226
For the two-tube model,
\[ D_2(z) = 1 + (r_1 r_2 + r_1)\,z^{-1} + r_2\,z^{-2} = D_1(z) + r_2\,z^{-2} D_1(z^{-1}) \]
and in general
\[ D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
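A minimal sketch of this recursion in Python: starting from D₀(z) = 1, each step applies D_k(z) = D_{k−1}(z) + r_k z⁻ᵏ D_{k−1}(z⁻¹), assuming r_G = 1; the reflection-coefficient values below are arbitrary illustrations, not from the lecture.

```python
import numpy as np

def step_up(reflection_coeffs):
    """Build D(z) = 1 + a1*z^-1 + ... + aN*z^-N from reflection coefficients
    using D_k(z) = D_{k-1}(z) + r_k * z^-k * D_{k-1}(z^-1)."""
    D = np.array([1.0])                    # D_0(z) = 1
    for r in reflection_coeffs:
        padded = np.append(D, 0.0)         # D_{k-1}(z) with the order raised by one
        D = padded + r * padded[::-1]      # add r_k * z^-k * D_{k-1}(z^-1)
    return D

# Two-tube check against the closed form D_2(z) = 1 + (r1*r2 + r1) z^-1 + r2 z^-2
r1, r2 = 0.3, -0.5                 # arbitrary illustrative reflection coefficients
print(step_up([r1, r2]))           # -> [1.0, 0.15, -0.5] = [1, r1*r2 + r1, r2]
```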
227
(Refer Slide Time: 03:26)
228
(Refer Slide Time: 06:18)
\[ D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
If we know the reflection coefficients, we can implement the tube digitally. On the other hand, if we know the signal coming out of the tube, from it we can estimate the area function at the different junctions, because
\[ r_N = \frac{A_{N+1} - A_N}{A_{N+1} + A_N} \]
If we know the cross-sectional areas, we can derive the reflection coefficients and model the tube in the digital domain. Conversely, if we know the speech signal and can estimate the coefficients α_k (equivalently the reflection coefficients), then we can find the cross-sectional areas of the different tubes.
Each vowel has a different cross-sectional area function; the area function can generate the vowel, and if we know the vowel we can derive its area function.
230
(Refer Slide Time: 10:00)
Now we want to implement the tube digitally, so we have to implement D(z), where
\[ D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
So, if an impulse is applied at the input we get the output u_L[n]. In the implementation there is a simple delay z⁻¹ whose output is weighted by α₁ and added in, and similarly up to z⁻ᴺ weighted by α_N, as shown in the above figure.
231
So, we can implement it as a digital filter: the structure is nothing but a digital filter, and the equation looks like a digital filter equation. So, we can easily implement the vocal tract this way.
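A minimal sketch of that implementation (Python, using scipy.signal.lfilter): the all-pole filter G/D(z) driven by an impulse. The α values here are arbitrary illustrative numbers (a stable second-order resonator), not values from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

G = 1.0
alphas = [-1.2, 0.9]                 # hypothetical a1, a2 for D(z) = 1 + a1 z^-1 + a2 z^-2
b = [G]                              # numerator
a = [1.0] + alphas                   # denominator coefficients of D(z)

u_G = np.zeros(64); u_G[0] = 1.0     # glottal impulse u_G[n]
u_L = lfilter(b, a, u_G)             # output at the lips u_L[n]
print(u_L[:8])                       # impulse response of the tube model
```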
Now let us look at the poles of the vocal tract. We know the vocal tract tube is an all-pole model.
232
We can say that V(z) is modeled as
\[ V(z) = \frac{G}{D(z)}, \qquad D(z) = 1 + \sum_{k=1}^{N}\alpha_k z^{-k} \]
If D(z) is of order N, then V(z) is an all-pole model with N poles. So, for a 10-junction or 10-section tube, there are 10 poles. The poles of D(z) are either real or occur in complex-conjugate pairs; so if D(z) has N poles, there can be N/2 complex-conjugate pole pairs.
So,
\[ V(z) = \frac{G}{\prod_{k=1}^{N/2}\big(1 - \alpha_k z^{-1}\big)\big(1 - \alpha_k^{*} z^{-1}\big)} \]
Suppose we draw the unit circle, with a real axis and an imaginary axis. If a pole occurs at an angle θ, there is another pole, its conjugate, at the angle −θ, as shown in the figure below.
233
So, every complex pole comes with its complex conjugate; a complex pole pair means two poles. Writing the pole with angle θ_k and radius r_k:
\[ \alpha_k = r_k e^{\,j\theta_k}, \qquad \alpha_k^{*} = r_k e^{-j\theta_k} \]
\[ V(z) = \frac{G}{\prod_{k=1}^{N/2}\big(1 - 2 r_k\cos\theta_k\,z^{-1} + r_k^{2} z^{-2}\big)} \]
Let
\[ V_k(z) = \frac{1}{1 - 2 r_k\cos\theta_k\,z^{-1} + r_k^{2} z^{-2}} \]
and write \(r_k = e^{-b_k}\), i.e. \(b_k = -\ln r_k\); then
\[ V_k(z) = \frac{1}{1 - 2 e^{-b_k}\cos\theta_k\,z^{-1} + e^{-2 b_k} z^{-2}} \]
234
(Refer Slide Time: 16:07)
\[ V_k(z) = \frac{1}{1 - 2 e^{-b_k}\cos\theta_k\,z^{-1} + e^{-2 b_k} z^{-2}} \]
Now, the importance of this is that b_k produces the bandwidth and θ_k gives the formant position.
When the value of r_k approaches unity, the pole comes close to the unit circle and we get a sharp resonance at the formant frequency. If b_k is non-zero, the pole has a non-zero bandwidth: there is a formant with a certain bandwidth. So, b_k provides the bandwidth and θ_k provides the formant position. This information will be used when we develop the model using linear prediction for the speech production system.
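A small sketch of this correspondence (Python), mapping a formant frequency and bandwidth in Hz onto the pole pair r·e^{±jθ} and the second-order denominator above; this is the standard digital-resonator mapping, used here only as an illustration, and the bandwidth value is hypothetical.

```python
import numpy as np

def formant_to_pole(F, B, fs):
    """Map formant frequency F (Hz) and bandwidth B (Hz) to a complex-conjugate
    pole pair r*exp(+-j*theta) and the denominator 1 - 2 r cos(theta) z^-1 + r^2 z^-2."""
    theta = 2 * np.pi * F / fs       # pole angle theta_k
    r = np.exp(-np.pi * B / fs)      # pole radius r_k = e^{-b_k}
    denom = [1.0, -2 * r * np.cos(theta), r ** 2]
    return r, theta, denom

# Example: a 500 Hz formant with a 60 Hz bandwidth at a 10 kHz sampling rate
print(formant_to_pole(500.0, 60.0, 10000.0))
```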
So, if we know θ_k and r_k, we can model the system: θ_k gives the formant frequency position, and as r_k tends to one, i.e. the pole moves close to the unit circle, we get a sharp formant. If there are N poles, there are N/2 complex-conjugate pole pairs, and each pair gives one formant frequency. So, if a spectrum of a speech signal shows, say, 4 formants F1, F2, F3, F4, then 4 complex-conjugate pole pairs are present. For an N-tube model there are N poles in the transfer function, hence N/2 complex-conjugate pairs, and we can get N/2 formant frequencies in the spectrogram. So, if we can find the formant frequencies and formant bandwidths for a speech event, we can derive the vocal tract transfer function for that event.
Now suppose a speech event is given: if we take a steady-state vowel, analyze its spectrum and find the formant frequencies and formant bandwidths, then we can find the transfer function of the vocal tract. Do not confuse the number of complex-conjugate pole pairs with the total number of poles: for an N-tube model we get N/2 complex-conjugate pairs, so there are N/2 formants.
For example, if N = 10, we get 10/2 = 5 complex-conjugate pole pairs, so we can get 5 formant frequencies.
Let the length of the vocal tract be l = 17.5 centimeters and the velocity of sound c = 350 m/s (35000 cm/s). Find the number of sections required to model a voiced signal of 5 kHz bandwidth.
\[ \tau = \frac{x}{c}, \qquad x = \text{length of each section} = \frac{l}{N} \]
\[ \tau = \frac{l}{Nc} = \frac{17.5}{N\times 35000} = \frac{1}{2N\times 10^{3}} \]
Since \(T = \dfrac{1}{F_s} = 2\tau\),
\[ \tau = \frac{1}{2F_s} = \frac{1}{2N\times 10^{3}} \;\Rightarrow\; N = \frac{F_s}{1000} = \frac{10000}{1000} = 10 \]
With similar mathematics, if we want to model a 4 kHz bandwidth signal with F_s = 8 kHz sampling (Refer Time: 24:32), we can find the number of sections required in the same way as for 5 kHz and model it.
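A one-line sketch of this exercise (Python), applying the same relation N = 2·l·F_s/c to both cases:

```python
def num_sections(fs_hz, l_cm=17.5, c_cm_per_s=35000.0):
    """Number of equal-length lossless tube sections N such that the per-section
    delay tau = l/(N*c) satisfies the sampling relation 1/fs = 2*tau."""
    return fs_hz * 2 * l_cm / c_cm_per_s

print(num_sections(10000.0))   # 5 kHz bandwidth -> Fs = 10 kHz -> N = 10
print(num_sections(8000.0))    # 4 kHz bandwidth -> Fs =  8 kHz -> N = 8
```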
238
(Refer Slide Time: 24:57)
So, the whole V(z) can be implemented digitally in the same way, with each second-order section representing one formant frequency: formant 1, formant 2, formant 3, and so on. Each stage represents one formant; if there are N/2 stages, there are N/2 formant frequencies. So, the lossless vocal tube model can be realized by a linear system, and V(z) is that linear system.
239
(Refer Slide Time: 26:07)
In summary, the vocal tract is V(z), the lip radiation is R(z) and the glottal pulse is G(z). The total transfer function is H(z), represented by
\[ H(z) = G(z)\,V(z)\,R(z) \]
If we want to obtain H(z) from the speech, it is the product of the glottal pulse transfer function, the vocal tract transfer function and the lip radiation transfer function. For voiced speech, an impulse train is shaped by the glottal transfer function (Refer Time: 27:00), the shaped signal is fed to the vocal tract, and after lip radiation we get the speech signal. For unvoiced speech, the source is random noise, which is modified only by the vocal tract and the lip radiation to give the speech signal.
\[ G(z) = \frac{1}{\big(1 - e^{-cT} z^{-1}\big)^{2}}, \qquad R(z) = R_0\big(1 - z^{-1}\big) \]
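A minimal source–filter synthesis sketch of this chain in Python (scipy.signal.lfilter). The sampling rate, pitch, formant bandwidths and the value used for e^{-cT} are all hypothetical choices for illustration; only the formant frequencies 500/1500/2500 Hz correspond to numbers used in the lectures.

```python
import numpy as np
from scipy.signal import lfilter

fs = 10000.0                          # sampling rate in Hz (illustrative)
n = np.arange(int(0.2 * fs))          # 200 ms of samples

# Voiced excitation: impulse train at a 100 Hz pitch (hypothetical pitch value)
excitation = ((n % int(fs / 100)) == 0).astype(float)

# G(z) = 1 / (1 - e^{-cT} z^-1)^2 : two coincident real poles; e^{-cT} = 0.95 is a guess
g = 0.95
u_g = lfilter([1.0], np.convolve([1.0, -g], [1.0, -g]), excitation)

# V(z): cascade of second-order resonators, one per formant (bandwidths hypothetical)
x = u_g
for F, B in [(500, 60), (1500, 90), (2500, 120)]:
    r, th = np.exp(-np.pi * B / fs), 2 * np.pi * F / fs
    x = lfilter([1.0], [1.0, -2 * r * np.cos(th), r * r], x)

# R(z) = R0 (1 - z^-1): lip radiation as a first difference (R0 = 1)
speech = lfilter([1.0, -1.0], [1.0], x)
```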
So, this is the whole vocal tract tube modelling. In summary, we can say that the human vocal tract can be modelled using digital signal processing, and all of it can be implemented using a digital linear filter based on the requirement.
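To tie the pieces together, here is a rough sketch of that source-filter cascade H(z) = G(z)V(z)R(z) in Python. The pitch, the formants and bandwidths, and the constant inside e^(−cT) are all assumed values for illustration only; this is not the exact parameterization used in the lecture.

import numpy as np
from scipy.signal import lfilter

fs = 10000.0
f0 = 100.0                                      # assumed pitch for the voiced case
source = np.zeros(int(0.5 * fs))
source[::int(fs / f0)] = 1.0                    # impulse train (voiced excitation)

a = np.exp(-2.0 * np.pi * 100.0 / fs)           # e^{-cT} with an assumed constant
glottal = lfilter([1.0], np.convolve([1.0, -a], [1.0, -a]), source)   # G(z) = 1/(1 - e^{-cT} z^{-1})^2

den = np.array([1.0])                           # V(z): cascade of formant resonators
for F, B in [(500, 60), (1500, 90), (2500, 120)]:
    r, th = np.exp(-np.pi * B / fs), 2 * np.pi * F / fs
    den = np.convolve(den, [1.0, -2 * r * np.cos(th), r * r])
vocal = lfilter([1.0], den, glottal)

speech = lfilter([1.0, -1.0], [1.0], vocal)     # R(z) = R0 (1 - z^{-1}), with R0 = 1
# 'speech' is a crude, buzzy vowel-like signal produced by the cascade.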
So, we have shown that if we consider this vocal tract as (Refer Time: 28:38) simulated using a number of junctions or sections, lossless tube sections, let us say 'N' lossless tube sections, then it can be implemented using a linear system, which is V(z), and which can be implemented in the digital domain. That is why this is called uniform tube modelling: throughout each section the cross-sectional area of the vocal tract is
241
uniform. So, the whole vocal tract can be a single tube, or a 2-tube vocal tract, or N tubes, where each tube's cross-sectional area is constant. Once it is implementable by a V(z), can we then think that the output speech, which has been collected using a microphone, can be analyzed using linear signal processing? Yes.
So, from there the concept of linear prediction analysis comes. This system can be linearly modelled. So, if we know the signal, it is possible to find out the cross-sectional areas of the different sections; for an N-tube model, the different cross-sectional areas can be found from the behaviour of the output speech. If we know the area function, we can implement it digitally, and if we excite it with an impulse train, we can produce the speech. So, this is the tube modelling, and it is called the uniform tube model or lossless tube modelling of the speech production system.
Thank you.
242
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 16
Speech Perception – Part I
So, let us discuss speech perception. For this class, maybe 2 lectures will be required; during these 2 lectures we will discuss how human beings perceive speech. You may ask why we have to know about human speech perception: if I am talking about digital speech processing, why do you need to know about speech perception at all?
Now, ultimately we want to copy the human being through a machine, or you can say we want to develop an algorithm or technology by which a machine can act as a human being; we use speech communication among human beings, and that communication we want to establish with the machine.
Now, if I want to do that, look at speech communication: there is speech production and speech perception, 2 parts. A human being produces the speech and listeners perceive it. The better we understand how human beings perceive speech, the better we can model and process the speech signal. Put another way, if I produce speech and a human listener perceives it, it is very important to know how the listener processes the speech to perceive it, because unless we know what we perceive, the development of the technology will not be possible. So, we want to know what kind of speech processing happens in our brain, and we try to follow that kind of processing in the digital speech domain, so that we can develop speech technologies like speech coding.
243
Take the examples of speech recognition and speech coding; let us take speech recognition. Whatever the speaker produces is heard by the listener, and he perceives it. The same capability we want to reproduce in a machine: the speaker speaks in front of a microphone, and from the microphone signal, using signal processing techniques, I want to do what the human being does when perceiving speech.
So, if I want to develop that technology, we should know the signal processing involved in the human brain, what kind of signal processing is involved, and that is speech perception: how we perceive the speech.
Forget about the speech chain for a moment; look at the auditory system, because we perceive speech through the auditory system of the human being. A human listener first converts the acoustical signal to some nerve signal, or neural response we can say, and that requires a transduction. So, a transducer converts the input acoustical signal to a neural signal, and that neural signal, the signal which is in the nerve, goes to the brain, which processes it and perceives the speech.
So, we can say there are 2 parts, or you can say three parts: an acoustical-to-neural converter, neural transduction, and neural processing. The acoustical signal has to be converted to a neural signal, and the neural transduction has to carry the signal from that conversion point to the central brain for
244
processing. Here the acoustical signal is converted to a neural representation by the ear. So, we have the ear.
So, the function of the ear is to convert the input acoustic signal to a neural signal; that is the process. Then neural transduction takes place between the output of the inner ear and the neural pathway; whatever output comes from the inner ear has to be carried to the processing unit as nerve firing signals. The human brain processes those firing signals and perceives or understands the speech. So, perceiving and understanding happen in the brain.
Now, we have to know what kind of transduction happens and what kind of processing happens in the human brain; those 2 things we have to know, to know human perception. Or put it this way: if you study the microphone, the purpose of the microphone is to convert the acoustical signal to an electrical signal. I have to know how efficiently this microphone can convert the acoustical signal to the electrical signal, or, I can say, how accurately the properties of the acoustical signal are imposed on the electrical signal.
So, the electrical signal is equivalent to the acoustical signal only if all the properties of the acoustical signal, its frequency or frequency composition, its amplitude, all of those things, are represented in the electrical signal; that is the important point. How efficiently a microphone converts the acoustical signal to the electrical signal depends on the construction mechanism of the microphone.
Similarly, a human being has ears, 2 ears. So, how efficiently can the human ear convert the input acoustical signal to a nerve signal, a neural signal? I want to know what kind of limitation this system imposes on speech perception, what kind of constraint it puts on the ear's conversion of the speech signal, so that we can exploit that kind of constraint in speech processing; an error that a human being cannot hear can be ignored.
245
So, suppose a microphone has a frequency response; let us discuss the frequency response. The frequency response is nothing but a frequency versus amplitude plot; that means, for a sound of a particular intensity, it shows how different frequencies are represented in the electrical signal. Suppose the microphone has a frequency response whose flat region runs from, let us say, 30 hertz to 10 kilohertz; that data sheet is given for this microphone. So, this microphone can efficiently convert the acoustical signal between 30 hertz and 10 kilohertz and produce the electrical signal.
So, if an acoustical signal contains a 12 kilohertz component, the electrical signal will not really contain that component, because the response of the microphone at 12 kilohertz is almost 0; let us say the response dies down here, at 12 kilohertz, so it is almost 0, and I cannot get a 12 kilohertz response there. That means the limitation which the microphone imposes on the electrical signal is 30 hertz to 10 kilohertz.
Similarly, the human ear converts the acoustical signal into a neural signal. So, the limitation of our perception of that signal is the limitation of the conversion by the ear to the neural signal. How weak a signal we can perceive, which frequencies we cannot perceive, all those limitations are imposed here. So, to start with, we can hear frequencies from 20 hertz to 20 kilohertz: the ear can convert only the acoustical signals which lie between 20 hertz and 20 kilohertz. If I apply 22 kilohertz to the human ear, it will not be perceived; it may not be converted to the nervous system, or the nervous system cannot represent it or produce a response in the human brain.
246
So, I have to know how this acoustical signal is converted to a neural signal; that is the physiological mechanism, so we have to understand the physiological mechanism of the human ear. I will come to this later, but a human has 2 ears. Why does a human have 2 ears; if I had only one ear, what would be the problem? The 2 ears help to localize sounds, sound localization: if you close your eyes and a sound comes from some direction, you can tell that the sound is coming from that direction. Without seeing the source you can identify the direction of the source.
So, localization of sound is possible because we have 2 ears. If you look at home audio, earlier there was mono, then came stereo, then came surround sound; all of these are effects of sound localization, because with stereo sound we can identify the direction of the source. So, source direction from the sound: Dolby surround sound. Suppose you are watching a movie where a train is coming from the left corner of the screen to the right corner; a three-dimensional scene is being visualized in a two-dimensional picture.
Now, suppose I am standing here, and a train is coming from this direction and going in that direction. What will be the effect on the sound? If the train sound is coming from this direction, I should be able to localize that it is coming from my right-hand side, the sound intensity will increase as the train approaches my place, and when the train is going away I can tell the sound is now coming from the other direction. So, if I am able to manipulate this effect, I can say it is surround sound, because the sound appears to come from those directions and I can understand which direction it is coming from.
247
So, sound localization is one of the major issues, and because humans have 2 ears we can localize the sound.
The second one is focusing attention on a particular sound, or you can say noise cancellation. If you have studied radar signal processing in electronics, you know how clutter noise is rejected and that kind of thing. Having 2 ears helps a human being to attend to, or focus on, a particular sound. If in this class 5 people are talking together, even if their audio levels are almost equal, I can direct my attention to a particular student speaking; that means I can focus on a particular sound, and that is one of the most powerful properties of human perception. I can easily suppress the noise relative to the signal.
What is noise? An unwanted signal is nothing but noise. Suppose I want to listen to student A; then the sound student B produces is noise with respect to that. I can easily ignore student B's voice and perceive student A's voice. So, I can easily focus on a particular sound, and that too is because we have 2 ears. Now, once you put headphones on your ears, the localization is within the headphones, not outside: the sound is produced right there, so whatever sound the ear perceives, the localization happens inside the headphones only. So, there is a sound localization issue.
248
Now, suppose I want to make an auditory model, an auditory system model. The whole auditory system model includes the conversion of the acoustical signal to a neural signal and the perception of that nerve signal by the brain. So, I want to model the human auditory system, or human perception of sound. Now, how can I model it? A human being perceives the sound, but I cannot measure at each and every point what is happening at the signal level; I cannot do that.
So, I can think of human sound perception as a black box. Now, how do you find out the properties of a black-box system? Suppose I have a system and I do not know anything about it; it is a black box, I do not know the system properties, nothing. How do you determine them? We excite the system with a particular known signal and observe the output. Let me give an example: I do not know the system, but if I put in 2, it gives output 4; if I put in 3, it gives output 9; if I put in 4, it gives output 16. If I observe this behaviour and plot 2 against 4, 3 against 9, then from the plot I know the behaviour of this black box; it is nothing but a square. So, if x is the input and y is the output, y equals x squared. We can derive the system property by observing the behaviour for known inputs.
The same black-box approach can be applied to the human auditory system or human speech perception: suppose I apply a sound of a particular frequency with a particular intensity, and then I observe the human response. Here I cannot get a physical output as in the earlier example; instead I give a physical input,
249
which is nothing but an acoustical sound, observe the physiological response, and try to correlate what kind of stimulus (the stimulus is just the input signal) produces what kind of physiological response. Then I try to find out what this black box is, how the human being perceives the sound.
So, what I said is: I excite this black box with different kinds of known stimuli, then I find the physiological observation of the human, try to correlate what kind of stimulus produces what kind of observation, and try to draw a conclusion about what kind of processing is done by the human being. That is why we say there are 2 dimensions of sound parameters: one is called the physical dimension, another is called the perceptual dimension.
Let us take intensity as an example. The intensity of an acoustic wave I can easily measure: you know I is nothing but p squared by 2 rho c, where rho is the density, c is the velocity of sound, and p is the pressure amplitude of the pressure wave. So, I can measure the sound intensity, in decibels with a dB meter (I will come to dB later); I can easily measure the intensity of the sound. But when I say loudness, when I say this sound is louder than the previous sound, it does not mean the intensity of this sound is twice that of the previous sound. If I say this sound is twice as loud as the previous one, I cannot conclude that, if the previous sound had intensity I, the current one is 2I. The perception of loudness is a human perception, but intensity is a physical parameter.
So, there are 2 dimensions. One is the physical dimension, where I can directly measure the quantity, directly measure the parameters:
250
intensity, frequency, all are directly measurable. But loudness is a human behaviour, a physiological behaviour of a human: I stimulate the human being with an intensity and observe his behaviour; the observed behaviour is called loudness and the input is called intensity. So, there is a physical dimension and there is a perceptual dimension. I will come to the details of both later.
Now, let us start with the physiology, or you can say the anatomy, of the human ear, what is there in the human ear. This is a nice colour picture of the human ear. Consider the outer ear, which is this part only; this is called the pinna. What is the function of the pinna? If a sound wave is travelling in this room, in the air medium, and I put a plate here, the sound wave will strike the plate. Now, if the intensity of the sound wave is I, what is the total power of the acoustical wave striking this plate? If the area of the plate is A, then the power is I.A. So, the working principle of the pinna is that it
251
collects the acoustical wave and channels it to the middle ear. If I increase the pinna area, the channelled acoustical power will increase; that is why, when we want to listen to a very weak sound, we put our hand around the ear.
Once we put our hand there, it increases the effective pinna area, so more acoustical signal is channelled to the ear. If you watch a rabbit, to listen to a sound the rabbit can move its pinna; we cannot move our pinna, which is why we have head movement. When I want to listen to a sound I rotate my head, but the rabbit can rotate its pinna toward the direction of the sound. So, the job of the pinna is to channel the acoustical wave power to the middle ear; that is the working of the pinna.
If it is large, more acoustical power enters; if the area is reduced, the power is reduced. That is the working principle of the outer ear, or pinna. Then there is the middle ear: the acoustical signal passes into the middle ear, which consists of a membrane and bones; I will come to what their work is. Then the inner ear has the cochlea; you can see there is a cochlea.
Now, the acoustic wave comes in through the auditory canal (this is called the auditory canal) and strikes this membrane. Once the acoustic wave strikes the membrane, it produces a vibration, and that vibration is transmitted to the cochlea by a conduction process. There are bones, and these bones act as a
252
mechanical amplifier, or you can say a resonator: they mechanically amplify the sound and transmit it to the cochlea. Think of a guitar, not an electric guitar: there is a mechanical resonator, the body; if it were not there and I plucked a string, the sound would be weak, but since it is there it acts as a mechanical resonator and the sound becomes much larger. The same kind of effect happens in the ear also: this part acts mechanically in that way and transfers the sound to the cochlea. So, the vibration goes to the cochlea.
Now, the cochlea. I am not reading the slide; you can read it. The inner ear is the cochlea: there is the semicircular canal and there is the cochlea. You know that this whole system has 2 functions, one is balancing the body and the other is listening. Inside the cochlea there is a basilar membrane, which is responsible for hearing the sound, and then there is the semicircular canal. The cochlea is full of liquid; you can say it is nothing but a packet of liquid.
So, if I vibrate the outside wall of that liquid, the vibration transfers to the liquid, and the liquid spreads the vibration everywhere; that is the mechanism by which the inner ear perceives the sound. The semicircular canal is also full of liquid, and that serves body balance. It is like what we do when levelling a floor: we use a water level, a tube full of water, and from the water level we say
253
whether the plane is correct or not, where the slope is. Similarly, we have 2 semicircular canals, one on each side, which are responsible for sensing body balance, like that water level.
So, this is a basic description of the functions of the different parts of the human ear. Now I go into the details. The physiology we understand; now, for an engineering model, I should know what kind of action happens in the pinna, that is the outer ear, what kind of response happens in the middle ear, and finally what happens in the inner ear.
254
Look at the frequency response of the outer ear, the pinna: it does not channel all frequencies with the same gain. In this plot, this axis is the frequency and this is the response in dB; the gain is not the same at all frequencies. Say at 0.3 kilohertz, that is 300 hertz, and at 1 kilohertz, a tone of the same input level will not produce the same response, because the amplification of the power by the outer ear is not the same at all frequencies.
Similarly, the amplification, or the frequency response, of the middle ear is not flat across frequency either; different frequencies get a different response. If I combine these two, the middle ear and the outer ear, the combined response comes out as the red curve. So, I can say the human ear is not equally sensitive to all frequencies.
So, we are not equally sensitive: let us say a 5 dB sound at 100 hertz and a 5 dB sound at 2 kilohertz do not produce the same intensity sensation, or loudness sensation, in the human brain. One may be louder than the other because of this response. If I invert this curve, the curve will look like this, with frequency on this axis and dB amplitude on that axis. So, not all frequencies get an equal response in the human ear; that is the frequency response of the human ear, just like the frequency response of a microphone. If the microphone response is flat from 300 hertz to 10 kilohertz, the microphone is equally sensitive to frequencies between 300 hertz and 10 kilohertz, but for other frequencies the conversion from the acoustical signal to the electrical signal is either very low or almost 0.
255
Similarly, the human ear is not equally sensitive to all frequencies. Maybe the best region is around 1 kilohertz, and after about 5 kilohertz the curve rises again. So, 5 dB at 100 hertz and 5 dB at 2 kilohertz will be perceived with different loudness by the human being; that is because of this combined curve. This is related to what is called the threshold of hearing; I will come to that later.
Now, I am not describing the function of the cochlea and all those things; you can read the slide, there is nothing more to it.
256
So, this is the schematic representation. The pinna collects the acoustical signal and channels it into the auditory canal, the middle ear converts the acoustical signal to a mechanical vibration, and that mechanical vibration is transferred to the cochlea. Inside the cochlea there is the basilar membrane. This is the base and this is the apex: this is the beginning of the cochlea, this is the apex of the cochlea. Inside the cochlea, the basilar membrane is responsible for converting that mechanical motion into a neural signal.
How does it do it? I am not discussing that here; in the next class I will discuss how the basilar membrane converts, or is responsible for converting, that mechanical vibration into the neural signal.
Thank you.
257
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 17
Speech Perception - Part II
If you see this picture, this is the basilar membrane, which is inside the cochlea. The cochlea is full of liquid and the basilar membrane floats there. You can see the basilar membrane divides the cochlea into 2 parts, an upper and a lower part; this is the beginning part and this is the end part of the basilar membrane, this is the apex and this is the base. Along the basilar membrane there is a kind of neural arrangement, what we call the inner hair cells.
258
So, there are inner hair cells along the basilar membrane; think of them as neural sensors, there is a sensor kind of thing there. Along the basilar membrane, the sensing of high frequencies is present here, at the base, and the low frequencies are there, at the apex. Why? Because, if you remember the speech production side, due to the radiation load
the low frequency components have high amplitude compared to the high frequency components. So, in the speech wave which is coming in, the high frequency components have less amplitude than the low frequency ones, and therefore the high frequency mechanical motions have smaller amplitude.
That is why the high frequency sensors are at the beginning and the low frequency sensors are at the apex. Each of these sensors is responsible for a particular frequency range. So, let us say this is the basilar membrane, divided among these sensors, and each sensor is responsible for a particular frequency band: let us say this one perceives one band, the next one perceives the next band, and so on, with the bands of neighbouring sensors overlapping. So, there is an overlapping frequency band perceived by each and every sensor.
So, I can view the basilar membrane as the output of a filter bank. In an engineering model, I can think of the human conversion of the acoustical, or mechanical, signal to the neural signal as a set of sensors that are nothing but tank circuits: each sensor is a filter, a tank circuit, active for a particular band, so we can call it a tank circuit.
259
So, it is a tank circuit: this is the h1 sensor, there will be an h2 sensor, an h3 sensor. h1, h2, h3 are nothing but band-pass filters, I can consider. So, it is nothing but a filter-bank analysis by the human basilar membrane: whatever input signal comes is passed through a particular band of filters, and ultimately the response of each band is the neural signal that goes to the central nervous system, so that it can be processed for human cognition, ok.
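As a toy illustration of that filter-bank view, here is a minimal sketch in Python; the number of bands, their edges and the filter order are assumptions chosen only to show the idea, not physiological values.

import numpy as np
from scipy.signal import butter, lfilter

fs = 16000.0
bands = [(100, 200), (200, 400), (400, 800), (800, 1600), (1600, 3200)]   # Hz, illustrative

def filter_bank_outputs(signal):
    outputs = []
    for low, high in bands:
        b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="band")
        outputs.append(lfilter(b, a, signal))    # response of one "hair cell" channel
    return outputs

t = np.arange(int(0.1 * fs)) / fs
test = np.sin(2 * np.pi * 500 * t)               # a 500 Hz tone
energies = [np.sum(y ** 2) for y in filter_bank_outputs(test)]
print(energies)                                  # the 400-800 Hz channel dominates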
The conversion mechanism is the inner hair cell. Each of these sensors has, I think, around 10 auditory nerve fibres of different diameters; one inner hair cell connects to about 10 fibres, and that does the conversion. So, now we know how the human being perceives the sound, how the basilar membrane converts mechanical motion to neural signals; now think about an engineering model, as we said. The basilar membrane responds maximally to different input frequencies; frequency tuning occurs at the basilar membrane. So, each of these tank circuits is responsible for a particular frequency band, like a constant-Q filter: this one is responsible for a particular frequency band, this one also for a particular band, and then I want to know whether these bands are linear or non-linear.
So, is the bandwidth of each particular filter linearly spaced and overlapping, or non-linearly spaced and overlapping? How do we find out whether it is linear or non-linear? We can do some experiments; that will come later. These bands are described by what is called the bark scale; I will come later to what the bark scale is. So, physiologically, the frequency which is heard, or the sound which is sensed, by a human being is resolved on the basilar membrane, which behaves as a set of tank circuits for different frequencies with non-linear bandwidths; we will come later to how that non-linear scale is defined.
260
Then how do we perceive the frequency? We said each sensor covers a particular band; if it is a band, then how do we perceive a specific frequency? Suppose this nerve fibre is for high frequencies and this one is for low frequencies, and suppose the frequency response covers 20 hertz to 20 kilohertz with, say, 100 tank circuits.
I know the bandwidth of each tank circuit; now I want to know how a specific frequency is perceived, whether it goes 20 hertz, 30 hertz and so on in a linear way or in a non-linear way, and how these fibres are responsible for a particular frequency. There are 2 theories: one is called the temporal theory and the other is called the place theory.
The first theory says that the basis of the temporal theory of pitch perception is the timing of neural firing, which occurs in response to the vibration of the basilar membrane. The timing of the neural firing is perceived as frequency: the periodic stimulation of the membrane matches the frequency of the sound, with one electrical impulse at every peak. The next one is called the place theory.
261
In the place theory, as the travelling wave moves down the basilar membrane, the stimulation increases to a peak and then quickly tapers off. If I stimulate the cochlea, the motion builds up: let us say this is my basilar membrane, and when the mechanical wave is set up it moves along the basilar membrane as a travelling wave.
So, the stimulus builds up to a peak and quickly tapers: this portion increases and then tapers, increases and tapers; that kind of motion happens. The location of the peak depends on the frequency of the sound; a lower frequency means the peak is farther away, toward the apex. So, if the peak occurs here, or occurs there, the location of the peak is what conveys the frequency. One theory is based on timing and the other on the location of the peak; that is why one is called the place theory and the other the frequency or temporal theory, ok.
Now, when a human being perceives a sound, the perception of an input sound has 2 parts: one is the perception of the frequency and the other is the perception of the amplitude. If I produce, let us say, a 500 hertz acoustic signal, it has a frequency and it has an amplitude; the human being perceives the frequency and also perceives the amplitude of that particular frequency, so 2 perceptions are involved. Now I have to know how accurate we are in the perception of amplitude and frequency. What do you mean by accuracy? You can think of it as resolution, ok.
Suppose I have a straight line and I want to measure it with some small scale, or say there is a curved line like this and I want to measure it with a small scale. I put the scale along the line; if the scale is large, I put it here, and
262
again I put it there. If I approximate a long section in one go, I may lose some detail or introduce some error. That means: how accurately can we perceive the sound? Does a human being perceive every single hertz, so that we would distinguish 1 hertz, 2 hertz, 3 hertz, 4 hertz and so on, or only a frequency within a band?
Similarly, if I increase the intensity, are we able to perceive the increase linearly, and on which scale? Can I differentiate between 2 dB and 2.5 dB, or 2.112 dB, or not? That is the resolving power of the human being that I have to know: how good is our estimation of, for example, the fundamental frequency.
So, in the frequency domain and the intensity domain there are some parameters. For the fundamental frequency, the resolving power is about 0.3 to 0.5 percent. So, if my f0 is 200 hertz and somebody else's f0 is, let us say, 201 hertz, am I able to differentiate between 200 hertz and 201 hertz? The figure 0.3 to 0.5 percent means that if the difference lies within that range, we cannot differentiate the 2 frequencies; we cannot separate them.
Let us take 0.3 percent: 0.3/100 × 200 = 0.6 hertz, and for 300 hertz it is 0.9 hertz. So, approximately, a 201 hertz tone and a 200 hertz tone I cannot differentiate. If it is 1 kilohertz, then 0.3/100 × 1000 is nothing but 3 hertz. So, the perception of
263
frequency is different in different frequency ranges; this is not a linear scale, it is a non-linear perception.
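Just to see those numbers at a glance, here is a tiny check of that 0.3 percent figure; the list of test frequencies is arbitrary.

for f in [200.0, 500.0, 1000.0, 4000.0]:
    jnd = 0.3 / 100.0 * f                 # smallest perceivable difference in Hz
    print(f, "Hz ->", jnd, "Hz")
# 200 Hz -> 0.6 Hz and 1000 Hz -> 3 Hz: the resolution in hertz gets coarser as
# the frequency goes up, which is why the perceptual scale is non-linear.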
So, how do we perceive it? At high frequencies the perception is rougher; the perceptual bandwidth is larger. At low frequencies my resolving power is very high, but at high frequencies the resolving power is very low. That means at high frequencies I cannot separate two close tones, say 1 kilohertz and roughly 1003 hertz; I cannot separate them.
But at low frequencies I can notice almost a one hertz difference. So, at low frequencies the human resolution power is much greater, and at high frequencies we do not have that much resolving power. So, regarding accuracy at high frequencies: suppose I want to copy an instrument; then the fine details at high frequencies are not that much required, because a human being cannot perceive the high frequencies at high resolution.
After the fundamental, there is the formant frequency; you know the formant frequency is nothing but the formant position in the speech spectrum. The first formant for a uniform tube is around 500 hertz, so let us take a first formant of about 500 hertz.
264
The resolution for the formant frequency is about 3 percent, so 3/100 × 500 is about 15 hertz; a formant error of about 15 hertz a human being cannot differentiate. So, suppose I extract the formant frequency of a particular vowel as 450 hertz; if it differs by about 10 hertz, that is, if it is 460 hertz instead of 450 hertz, a human being still cannot distinguish the difference, because we do not have the resolving power for that formant position. Then the formant bandwidth has a 20 to 40 percent tolerance: suppose [FL], let us say [FL], has a certain first formant bandwidth.
265
Now I come to the physical dimension of the sound and the physiological, or you can say psychological, perceptual dimension of the sound.
The physical side is obvious: you know the amplitude, the frequency, the wavelength and their properties. Now, auditory perception.
Auditory perception is a branch of psychophysics: it relates perception to the physical properties of stimuli. The physical dimension, as we have said, is measurable, but the perceptual dimension is the mental response, a physiological output: if I excite with a physical stimulus and observe what kind of output results, based on those characteristics I can measure the perceptual dimension parameters.
266
I have not discussed the human visual system in detail; there is visual psychophysics and auditory psychophysics. Pitch, loudness and timbre are called the perceptual dimensions of speech, or of a sound wave, in human sound perception; the corresponding physical properties are the fundamental frequency, the intensity, and the spectral or amplitude envelope. What is pitch? Many of you may practise music; your guru may say, set the harmonium at the flat B scale, because your pitch is around, let us say, 200 hertz, or 180 hertz for flat B.
That is nothing but the average pitch of the speech; but pitch is a perceptual parameter, and 180 hertz is really the average fundamental frequency of that singer, not the pitch itself. Sometimes we explain pitch with this example: suppose I play a tanpura, a harmonium and, say, a guitar, all playing [FL] at the same frequency; the harmonium [FL] is 200 hertz, the tanpura [FL]
is also 200 hertz, and the guitar is also at 200 hertz. All the instruments are producing [FL] at the same frequency, but even with my eyes closed I can identify that this sound is coming from the harmonium, this from the tanpura, this from, let us say, the guitar.
So, how do we tell them apart, if all the pitches are the same and I say the fundamental frequency is the pitch? Pitch is a perceptual parameter; it may include something more, by which we understand that the sources of the sound are different even though they produce the same fundamental frequency. That is pitch in the perceptual domain; the main physical-domain parameter behind pitch is the fundamental frequency, or f0. In many places I refer to f0, the fundamental frequency of the sound. Then loudness: the physical parameter is intensity, and I can measure the intensity, 5 dB, 6 dB, 7 dB, but
267
loudness is a perceptual parameter; I can only say this sound is louder than the previous sound.
Then there is the production mechanism, the complexity: its physical dimension is the spectral structure. Timbre, or the complexity of the sound, is a perceptual dimension. I can talk about this complexity, but if you ask how to measure it, one way is through the spectral envelope: the spectral composition of this segment and of that segment will be different. So, pitch, loudness and timbre are the perceptual dimensions, and fundamental frequency, intensity and spectral envelope are the corresponding physical-dimension properties of the sound, ok.
Similarly, on the visual side there are hue, brightness and sharpness in the perceptual dimension, and wavelength, luminance and contrast as the physical quantities; that is how a human being perceives a picture, the visual perception of the human being. We again have limitations in human perception of the visual dimensions, just as in the speech dimensions. So, I have to find out those limitations; there are experiments to find them. I am not describing that again; now, the human range of hearing.
268
We perceive both sound frequency and sound direction. Next, the threshold of hearing: there is a human threshold of hearing.
The threshold of hearing means the minimum acoustical pressure which can create a sensation in our nervous system. Suppose I produce a 1 kilohertz acoustical signal of pressure P that just creates a sensation in the ears of a normal-hearing person; the person must have normal hearing, not a hearing defect, and I am not discussing hearing defects here. So, for a normal-hearing person, the amount of pressure required to create just a sensation in the human ear is called the threshold of
269
hearing. Now, if you remember, I said the intensity perception of the human being differs across frequencies because of the outer ear and middle ear frequency responses; let me draw that.
So, the amount of pressure required to create a sensation for a 1 kilohertz signal and for a 300 hertz signal may be different: the 300 hertz signal requires more power to create the sensation, and the 1 kilohertz signal requires less. So, to define the threshold of hearing for measuring intensity, I have to fix a frequency, and we define it at 1 kilohertz: the sound intensity required to just create a sensation for a 1 kilohertz signal is called 0 dB, the threshold of hearing.
So, for a 1 kilohertz acoustical signal, a 1 kilohertz pure tone, the amount of intensity required to just create a sensation in the human ear is called 0 dB. When I say 0 dB it is not defined for all frequencies: 0 dB means I take a 1 kilohertz acoustical signal, a pure tone, and find the intensity I that just creates a sensation; that I call 0 dB. That intensity is, I think, 10 to the power minus 12 watt per metre square; that is the threshold of hearing, and it is defined as 0 dB. What is dB? dB is the decibel. One bel is log10 of I by I reference, where I reference is that 10 to the power minus 12 watt per metre square which we defined as 0 dB.
Deci means 1 by 10, so a decibel is one tenth of a bel, and the level in decibels is 10 log10 (I / I reference). So, if the intensity of the sound is I,
then converting it to dB is nothing but 10 log10 (I / I_ref), where I_ref is equal to 10 to the power minus 12 watt per metre square.
270
So, the sound intensity is a physical quantity that can be measured and quantified. The acoustic intensity I is defined as the average flow of energy through a unit area, so you can say it is the power per unit area; power per unit area is nothing but the intensity, in watt per square metre, and the reference I0 is 10 to the power minus 12 watt per square metre, ok.
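As a tiny illustration of that definition, here is a sketch; the comment about 60 dB is only an approximate everyday reference point, not a value from the slide.

import math

def intensity_to_db(i_watt_per_m2, i_ref=1e-12):
    return 10.0 * math.log10(i_watt_per_m2 / i_ref)

print(intensity_to_db(1e-12))   # 0 dB, the threshold of hearing at 1 kHz
print(intensity_to_db(1e-6))    # 60 dB, roughly a conversational speech level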
271
The reference intensity is defined for air; the threshold of hearing I mentioned is the amount of sound pressure required to create a sensation in the human ear for a 1 kilohertz acoustic wave, and it depends on the medium that conducts the intensity.
If the medium is water instead of air, the reference amount changes. I am not going into the details; this is the acoustics part. There is also some mathematics for converting dB to a percentage and so on, which you can look at. Then, the loudness curve.
Now, if you remember the combined frequency response of the outer ear and middle ear, it looks like this: this axis is the frequency and this axis is the level in dB, with the reference 10 to the power minus 12 watt. Look at different frequencies: let us say this is 1 kilohertz, this is 5 kilohertz, and this is 200 hertz.
The intensity required to create a sensation for a 200 hertz signal is a high amplitude; for a 300 hertz signal it is also fairly high; for 1 kilohertz it is this amplitude, and for 5 kilohertz this amplitude. So, this is called the threshold of hearing curve across frequency: the minimum intensity required to create a sensation in the human ear at each frequency. Since the frequency response differs across frequencies, the intensity needed for perception differs too; that is why I said that at 200 hertz I require a larger intensity to reach the threshold of hearing compared to 1 kilohertz.
Similarly, take, let us say, the 10 dB curve: at 1 kilohertz it is 10 dB, but to perceive the same loudness at 200 hertz I require a larger intensity, and at 300 hertz I also require a larger intensity; yet along this line the loudness which I perceive is the same. Even if at 1 kilohertz it is 10 dB and at 300 hertz maybe 20 dB of
272
intensity is required, the perceived loudness of the sound is the same. So, along this line the loudness is equal, and that is why it is called an equal loudness curve, or phon curve. Along the line the loudness is equal; let us say this one is the 20 dB curve, ok.
So, this is about the perception of sound: the intensity is different, for 1 kilohertz it is 10 dB and here maybe the intensity is 25 dB, but the perceived loudness is the same; that is why this line is called an equal loudness curve. Loudness is a perceptual quantity that is related to the physical sound pressure level, or intensity, but it is not the intensity directly. This curve is called the phon curve, or equal loudness curve.
Now, how do we then measure loudness? The loudness level is given in phons; next I want to find out the relation between the intensity and the loudness.
How do we do it? I design a perceptual experiment: I take a human being with normal hearing capability, and on this axis I put the intensity and on the other the loudness that I want to derive.
273
So, I take, let us say, a 1 kilohertz acoustical signal, or work along an equal loudness curve.
Then I take some intensity, a dB, and produce the sound, and the listener perceives it; then I produce a1 dB and the listener perceives it. I tell the listener: when you perceive the sound as twice as loud as the a dB sound, raise your hand. So, I first play the a dB sound, then a1 dB, and I gradually increase the level, telling the listener to raise his hand when the loudness is perceived as double. Then I can note where the perceived loudness doubles, doubles again, and so on; I can draw that curve, derive the equation of the curve, and find out the loudness scale and its relation to intensity: this axis is intensity and this is loudness.
So, this is the derivation of the loudness-intensity relation; I am not deriving it again. L = 445 · I^0.333. So, if I know the intensity, I can compute the loudness in sones, that many sones. L = 445 · I^0.333: if I know the intensity, I can calculate the loudness. This is the loudness (sones) curve.
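A quick numerical sketch of that relation, assuming I is in watts per square metre so that a 40 dB tone comes out near 1 sone; this just evaluates the lecture's formula and is not a calibrated loudness model.

def loudness_sones(i_watt_per_m2):
    return 445.0 * i_watt_per_m2 ** 0.333

print(loudness_sones(1e-8))   # about 1 sone, i.e. roughly a 40 dB tone at 1 kHz
print(loudness_sones(1e-7))   # ten times the intensity is only about 2.1 times louder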
Then there is an example, the sound pressure level of different sources: 160 dB for a jet engine close up, threshold of pain around 130 dB. So, you see, we should not expose ourselves to very loud sound, a high amplitude of sound; if the amplitude is high it can create pain in our ears, ok.
274
So, the threshold of pain is around 120 to 130 dB. If I am exposed to a sound intensity of 130 dB it can create pain in the ears, and if I am exposed to more than that my ear can even be permanently damaged. The pollution control board says do not produce sound of more than 60 dB; above 60 dB it may cause hearing problems, and it is not good to be exposed all the time to sound above 60 dB.
Today you see a lot of people wearing headphones much of the time, let us say 8 or 12 hours a day. What is the problem, what can be the effect? One thing is that once you are wearing headphones, sound localization is lost: if you wear headphones and walk on a road, you cannot judge whether a car is coming from your front or your back; forget that, if you just close your eyes you cannot even hear the horn, or the car noise, the car sound. With headphones on, the sound localization happens inside your ear. The other thing is continuous exposure: you are listening to music with a continuous stimulus for hours on end.
So, your human ear is exposed to a continuous vibration, and that can create an elasticity problem in your middle ear. The elasticity of your middle ear may be lost, and then the response curve will change, so your ability to distinguish sounds, your listening capability, changes. You may not be able to hear low intensity sounds; you cannot hear the low intensity sound. So, your threshold of
275
hearing will be increased by continuous exposure to sound. That is why, if you look at a person who works in a factory environment, his threshold of hearing is increased; he cannot hear low intensity sounds. So, the sensitivity of your ear decreases; that is the problem.
So, do not expose your ear continuously to a noisy environment, and do not expose your ear to very high intensity, very loud sound, because that can permanently damage your ear. There is also sound propagation: even if the source produces 120 dB, if you are, say, 10 feet away, you know the 1 by r squared law; I reduces as 1 by r squared, so every doubling of the distance brings the level down by about 6 dB. So, at 10 feet from the source the sound intensity is reduced, but still, do not stay exposed to loud sound, or your ear sensitivity will be lost. The threshold of hearing and threshold of pain are all shown in this curve of sound intensity level: the black line is the threshold of hearing, this is the music region, this is the speech region, the loudness of the speech region, this is the contour of damage risk, and the threshold of pain is here.
So, next class we will discuss about the frequency perception, ok.
Thank you.
276
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 18
Speech Perception – Part III
So, in the last class we discussed the perception of amplitude, or we can say the perception of intensity, that is loudness: how loudness relates to dB, what 0 dB is, what the equal loudness curve is, what the sone curve is. That we have discussed. Today, we want to discuss how human beings perceive frequency.
You know we talk about pitch; we generally say the pitch of this sound is very high, the pitch of that sound is very low, your voice is very high pitched.
So, what is pitch? Pitch is a perceptual parameter, whereas the frequency, which is related to the pitch, is the physical parameter: frequency can be measured, but pitch is a perceptual attribute which is only perceptually measurable. So, pitch and frequency are not the same. Pitch is the perception of a particular frequency, and if the sound is not a pure tone, the pitch may differ from the frequency. To give an example you have heard: suppose there is a harmonium, a guitar, a sitar, different kinds of string instruments, and every instrument is playing the base sa at a particular frequency, let us say 250 hertz.
But, if you perceive the sound, you can easily tell which sound is coming from the guitar, which from the sitar, and which from another string
277
instrument. That means the perception of frequency is not simply the physical frequency. So, pitch is mainly the perceptual dimension of frequency, and pitch mainly corresponds to the fundamental frequency, but it is not a one-to-one correspondence.
Now we are interested in this: human beings perceive frequency, yes, but how well do we perceive it; that means, on which scale do we perceive the frequency?
So, suppose I have an x axis with the physical frequency in hertz. Now, if we perceive frequency, what is my resolving power; on which scale does a normal-hearing human being perceive the frequency? That scale is called the Mel scale, and it is a logarithmic scale. How is it derived?
Suppose I do a perception test: I vary the frequency of a pure tone, play the signal over a loudspeaker, and tell the listeners to respond when they perceive that the frequency has doubled, just as with amplitude. From that I can develop a scale on which human beings perceive frequency. If you plot it, it looks like this; this scale is called the Mel scale, which is a logarithmic scale.
So, the human perception of frequency I can derive by experiment; this axis is the frequency. I generate pure tones of different frequencies: suppose I generate a pure tone of 50 hertz, and I ask you to raise your hand when you perceive the frequency as doubled; let us say that comes at around 100 hertz. Then I start from 55 hertz, and so on, playing all kinds of variations, and the listener raises his hand here, here, and
278
here, maybe here, and maybe here. From those points I can draw a curve and find that the human perception of frequency is not linear.
In the initial region, maybe up to about 500 hertz, the scale is linear, but after that it becomes non-linear. This scale is called the Mel scale.
So, for the Mel scale I can derive the mathematical equation of this curve, the pitch in mel. There are two equations, and I can use either one of them:

pitch(mel) = 3322 log10 (1 + f/1000)   or   pitch(mel) = 1127 loge (1 + f/700)

These equations fit this curve, so they are called the pitch-in-mel equations, the pitch in the Mel scale. Suppose f = 2 kHz; then I can find the value of f in the Mel scale: I put it into the equation, let us say pitch(mel) = 1127 loge (1 + 2000/700), and I can find the Mel frequency. So, this is called the Mel scale, the conversion of hertz to a linearly perceived frequency scale.
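A small sketch of both conversions; the function names are just for illustration.

import math

def hz_to_mel_log10(f):
    return 3322.0 * math.log10(1.0 + f / 1000.0)

def hz_to_mel_ln(f):
    return 1127.0 * math.log(1.0 + f / 700.0)

print(hz_to_mel_ln(2000.0))      # the 2 kHz example worked out above (~1521 mel)
print(hz_to_mel_log10(1000.0))   # about 1000 mel at 1 kHz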
279
Next, as I discussed for the basilar membrane, each sensor along the basilar membrane corresponds to a particular band of frequencies; that means the ear cannot distinguish sounds within the same band that occur simultaneously.
In the cochlea, along the basilar membrane, a particular band of frequencies is perceived by one sensor, so if two frequencies occur within this band, I cannot distinguish the difference between them.
This band is called the critical band. The auditory system can be roughly modelled as a filter bank consisting of about 25 overlapping band-pass filters covering roughly 0 to 20 kilohertz. So, instead of the human auditory system I can think of an engineering model of band-pass filters whose bands are non-linear: 25 overlapping band-pass filters with non-linear bandwidths can roughly model the human auditory system.
Now, the bandwidth of each critical band is about 100 hertz for signals below 500 hertz, where it is linear, and above 500 hertz it becomes non-linear. That bandwidth defines the bark: one bark is equal to the width of one critical band. So, the bandwidth in the bark scale is

bark = f / 100, if f ≤ 500 hertz;   bark = 9 + 4 log2 (f / 1000), if f > 500 hertz.

That means, within 500 hertz the bandwidths are linear, 100 hertz wide, with about 50 percent overlap: one 100 hertz band, then from 50 hertz another 100 hertz band, and so on. Up to 500 hertz the bandwidths are a linear 100 hertz; after
280
500 hertz the bandwidth is non-linear. From this I can find out how many critical bands are required to cover 0 to 20 kilohertz. That scale is called the bark scale.
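A minimal sketch of that piecewise mapping; the sample frequencies printed at the end are arbitrary.

import math

def hz_to_bark(f):
    if f <= 500.0:
        return f / 100.0                      # linear region, 100 Hz bands
    return 9.0 + 4.0 * math.log2(f / 1000.0)  # logarithmic region above 500 Hz

for f in [100.0, 500.0, 1000.0, 4000.0, 8000.0]:
    print(f, "Hz ->", hz_to_bark(f), "bark")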
So, that is the picture of the critical bands, and I will discuss it again when we come to MFCC kinds of things.
So, this is the bark scale. The next phenomenon is called frequency masking. So, there is the perception of frequency and there is masking. As I said, human beings have a threshold of hearing, the threshold of hearing curve: this is the frequency and this is the threshold of hearing.
281
Now, it is said that presence of a particular tone, if some tone is present in here; pure
tone suppose some tone is high tone is present in here then it is said that nearby threshold
of earing bandwidth will change; that means, if there is a strong particular tone is present
less f = 1 Khz then nearby; frequency threshold of earing is shifted upwards.
So, I cannot perceive if it is this tone is not present, I can perceive this frequency, but
since this tone is present and if I have frequency amplitude of like this I cannot perceive
it required a amplitude to cross this limit. So, that means, I can hide this frequency
because of presence of this tone. So, this phenomenon is called frequency masking. And
this is utilize in speech coding to hide the noise. So, suppose the coding generate a noise.
If the noise is within this limit, then I can say this noise is not persuadable.
So, to hide this noise, frequency masking is very useful. I am not going into the details of frequency masking; there is a lot of detail on it. If you are studying speech coding, then auditory masking is very important.
The next topic is the different views of auditory perception. I can say auditory perception has two views. One is called the functional view, which means: suppose I cannot know what is happening inside — I cannot do the anatomy, I cannot measure the biological system — but I consider the human auditory perception as nothing but a black box.
As I discussed in the first class, I can then stimulate the system: I excite the system with a known stimulus and measure the behaviour coming out of the system. Clearly, the development of the Mel scale is an example of functional modelling. The other one is called structural modelling, based on the study of physiology and anatomy: how the various body parts work, with emphasis on the neural processing of sound. So, that is another kind of study; in a structural study I have to analyse the human anatomy, excite it with a signal, and measure the nervous signals — all of that can be done, and it is called structural analysis.
So, functional analysis is like this: how a human being perceives frequency is found by functional modelling. I play different sounds and the human being listens to them; I excite the human perception with a known external stimulus and observe the output, and then try to model how the human being perceives frequency — that is the Mel scale. So, that is a functional view of auditory perception.
The second question is why we have to know about perceptual modelling, or how human beings perceive frequency. The perceptual effects included in most auditory models are: spectral analysis on a non-linear frequency scale, spectral amplitude compression, and loudness compression via a logarithmic scale. So, we can say that physical intensity is not equivalent to perceptual loudness; there is a logarithmic compression.
Then there is the change of sensitivity with frequency: at the lower frequencies the critical bands are narrow, so the frequency resolution there is fine, while at the higher frequencies the critical bands are very wide, so the human resolution of the higher frequencies is rough — roughly approximate, I can say. So, you can say the perception of frequency is nearly linear in the lower range and non-linear, with coarser resolution, in the upper range.
Then, utilization of temporal features and auditory masking of tones — those phenomena can be used in auditory modelling. When we extract speech parameters from the speech, we have to include these perceptual effects, so that the parameters directly represent what human speech perception does inside the ear.
So, that is called perceptual modelling, and different auditory models are available. One is perceptual linear prediction, commonly known as PLP; its details I will cover during the linear prediction analysis. Then there are the Seneff auditory model, Lyon's cochlear model, and the gamma-tone filter bank model of the inner ear or inner hair cells. All of these are called auditory models of human speech processing. So, let me just discuss one or two models and then you can study the rest.
One model is the Seneff auditory model. What is an auditory model? How human beings perceive the speech sound has to be implemented in the model. If you look at it, there are stage 1, stage 2 and stage 3. What does stage 1 do? We have a series of basilar membrane filters.
So, I can implement a basilar membrane model with a filter bank — a non-linear filter bank. There is also a pre-filtering stage: since it is a digitized signal, I can apply a pre-filter after the ADC, so that the out-of-band high and low frequency components are removed. Then I pass the signal through a bank of band pass filters, which is called the basilar critical band filter bank.
So, I have a speech signal; let us forget the pre-filtering and pass this speech signal through the various frequency band filters. Each critical band filter output gives me the response of one sensor of the basilar membrane filter bank. Now I have to know which sensor is firing; this is called hair cell firing. I have to find out which filter is firing based on the energy collected at the output of each filter band. So, the second stage is the modelling of the hair cells — this is called hair cell modelling.
So, there is half wave rectification to find out the energy, then short-term adaptation, synchrony processing, and rapid AGC (automatic gain control); then I get the envelope detector's mean-rate spectrum and the synchrony detector's synchrony spectrum. So, at the output of this stage I get a spectrum of how the human being perceives frequency. The details of stage 1, stage 2 and stage 3 are here, which you can read.
So, the idea is nothing but this: I have an input speech signal, I pass it through a bank of band pass filters, which are actually critical band filters, and each filter output is nothing but a sensor response. I calculate the response, and I have to adjust the firing — the amplitude compression part has to be done. Once that is done, I estimate the firing of the hair cells to perceive the frequency. That is the Seneff auditory model.
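A very rough sketch of this stage-1/stage-2 idea — not the actual Seneff implementation — might look as follows: a small bank of band pass filters, half-wave rectification, and per-band energy. The band edges, filter order and function name here are my own arbitrary assumptions.

import numpy as np
from scipy.signal import butter, sosfilt

def band_energies(x, fs, edges=((100, 400), (400, 1000), (1000, 2500), (2500, 6000))):
    """Crude critical-band style front end: band-pass, half-wave rectify, sum energy."""
    energies = []
    for lo, hi in edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        y = sosfilt(sos, x)          # basilar-membrane-like band response
        y = np.maximum(y, 0.0)       # half-wave rectification (hair-cell stage, crudely)
        energies.append(float(np.sum(y ** 2)))
    return energies

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 800 * t)   # an 800 Hz tone should dominate the 400-1000 Hz band
    print(band_energies(tone, fs))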
Then, Lyon's cochlear model does similar things: acoustic signal, outer ear, middle ear, pre-emphasis. The acoustic signal passes through this kind of frequency response of the middle ear and outer ear, so the signal can be pre-emphasized using the corresponding inverse response. After the pre-emphasis I pass the signal through a bank of filters; here it is designed with 86 cochlear filter banks on a mel or bark scale.
Then it is passed to a half wave rectifier to detect the amplitude, and then to AGC (automatic gain control), to find the frequency response of this acoustic signal as per the auditory response of a human being.
So, this is Lyon's cochleagram; I will come back to it later when I discuss the spectrogram. Then there is another model.
That is the ensemble interval histogram (EIH) model, a model of cochlear hair cell transduction. Along the basilar membrane there are cochlear hair cell sensors, and each sensor consists of ten fibres.
So, the EIH — ensemble interval histogram — is nothing but a model of cochlear hair cell transduction: how the basilar membrane motion is transduced into neural activity is what is modelled here. How is it modelled? There are 165 channels equally spaced on a log frequency scale between 150 and 7000 Hertz — 165 channel filters. Each cochlear filter is designed as a minimum phase filter matching the neural tuning curves measured for cats. An array of level crossing detectors models the transduction of motion to neural activity in the inner hair cells, and then we sum it and get the response; you can study the details if required.
Then there is the cochlear filter design — how it is designed. The EIH measures the spatial extent of coherent neural activity across the auditory nerve.
Among these auditory models we mainly use PLP, which we will discuss during linear predictive coding (LP); when we discuss LP, at that time we will discuss PLP in detail. Now, why are these auditory models important in human speech perception? First, the non-linear frequency scale: suppose I have a speech signal and I extract parameters on a linear frequency scale; then physically I am extracting the physical frequencies, but how the human being perceives them is also important to incorporate. If I want to incorporate that, then the frequency scale of the analysis must be non-linear — a logarithmic scale, either the mel scale or the bark scale.
Then spectral amplitude dynamic range: timbre is an important parameter. Spectral amplitude compression, or loudness compression using the logarithm. Each tone has an amplitude, and the perception of that tone at that particular frequency depends on the threshold of hearing. So, the perception of amplitude is not the same for all frequencies — I have to go through the equal loudness curve. Spectral amplitude compression as per the equal loudness curve has to be done: the dynamic range, or amplitude, of the spectrum has to be compressed as per the threshold of hearing.
The sensitivity of the human ear to amplitude is not equal at all frequencies, and since the sensitivity changes along the frequency axis, that change has to be modelled in the spectral envelope as well — the equal loudness curve, log spectral integration. Then there are the temporal features: temporal structure is very important in the speech signal. How the spectral dynamics change is also part of the speech signal, also a feature of the speech signal, so it is very important for speech perception.
Timbre is one example. Suppose I am producing a sound and somebody else is producing the same tone or the same thing — the example I have already given: suppose I am singing a song that follows the same notation, the same lyrics, even the same durations, exactly copying my guru; still, my guru's sound and my sound cannot be equal, because of the complexity of the speech. So, the dynamics of the spectrum are very important for capturing the complexity of the speech.
Then, what do we learn from the auditory model? When I talk about speech, I can talk about segmental properties, and I can say that speech is not a stationary signal; it is not a pure tone, it is a non-stationary signal.
So, it changes along the time axis. I can take a short duration, say 20 milliseconds for a phone, and a longer duration for a speech segment; along time the properties of the speech signal change — the dynamics of the speech signal change. So, from the 20 millisecond window we can get some kind of parameter, and from a longer interval we can also get some kind of parameter, which is also important for speech processing.
So, temporal structure is very important. I can say that speech contains not only segmental information; supra-segmental information, across-segment information, is also very important. You can do an experiment: record a speech signal and find out the fundamental frequency of a segment. Let us say the fundamental frequency is 200 Hertz. Now design a filter that cuts the 0 to 200 Hertz band out of the signal.
If you play the filtered signal, you still perceive 200 Hertz as the fundamental frequency. How did you get it? The spectral structure told you that 200 Hertz is the fundamental frequency. So, the structure along the spectrum is also important, and you can see that both segmental and supra-segmental information are necessary. Dynamic features are also necessary: how the spectral dynamics change matters. So, dynamic features, compression of loudness, and the non-linear scaling of frequency are all important things that we learn from speech perception.
So, in summary: how do you perceive intensity? Intensity is a physical parameter, so how intensity relates to perceived loudness is important. How is frequency, which is a physical parameter, related to the perception of the pitch of the signal? Human perception of frequency, human perception of loudness, and their mathematical models will be used in speech processing, so that when I do parameter extraction it follows the human auditory system. That is what is called speech perception.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 19
Time Domain Method In Speech Processing
So, let us start; this is time domain methods in speech processing. Here we want to discuss some methods which work in the time domain — we are not analyzing the frequency content of the speech — and these kinds of methods are used in many cases. First of all, once the speech is digitized and taken into the computer, we are doing digital signal processing: the speech signal is now in the digital domain.
So, we always denote the speech signal by x[n]; it is a digital signal and we know its sampling frequency Fs. Now, speech is a time varying signal; that means along the time axis — if this is the time axis — the speech properties change: this part may be voiced, that may be unvoiced, then there may be voiced again, then there may be noise, then voiced again, then noise, and so on. So, speech is not a stationary signal.
So, if I take this portion of the speech the properties are different, and if I take that portion the properties are different; speech changes along the time axis. Now suppose I want to find out the energy of the whole signal. I can easily find the energy E = Σ (n = −∞ to ∞) x²[n]: whatever signal is there, I take the whole signal at a time and sum the squares — that is the energy.
Now, I want to know that this portion is voiced (so the energy here is high), this portion is unvoiced (the energy is low), this portion is noise (the energy is low compared to the voiced part), and this portion is again voiced (high). If I take the whole signal at a time and find the energy, I cannot get that information; it gives me only the total energy, or, if I divide by the number of samples, the average energy of the whole signal. So, the whole-signal energy does not give me any kind of parameter which I can use for some purpose in speech processing.
So, instead of processing the whole signal at a time, we analyze the signal for a particular window — a particular segment. I take this segment, find the parameters, and move to the next segment, and so on. If Fs is the sampling frequency, then in 1 second I get Fs samples. Now if I window the signal with, let us say, a 20 millisecond window, how many frames do I get from 1 second of signal? 50 frames: if I slice the whole signal into 20 millisecond windows, I get 50 slices per second.
So, instead of Fs samples per second, for every window I get one set of parameters; the parameters are now sampled at 50 windows per second. If each window is represented by one value, then I get 50 values instead of Fs samples. Now the problem is that if I do it that way, a boundary effect comes into the processing: a window boundary may fall inside an important event of the signal, so part of that event is included in this window and part in the next, and the energy of each window will be reduced.
So, I want to remove this boundary effect as well. What we will do is take a 20 millisecond window but slide the window by, let us say, 50 percent of its length — effectively 10 milliseconds. Then the frame rate here is 100 frames per second: if the frame rate is Fr, for a 10 millisecond shift of a 20 millisecond window I get 100 frames per second, whereas at the sampling frequency I get Fs samples per second.
Now, if for every 10 milliseconds I take one point, then I get 100 points in 1 second of signal; so in effect I have down-sampled the signal. Processing the signal sample by sample is difficult, because if Fs is 8 kilohertz that is 8k samples. So instead of 8k samples, I divide the whole 1 second of signal into 100 frames, and for every 10 milliseconds I get a parameter, sliding the window as I go. Effectively what I am doing is windowing the signal with an overlap method.
So, this is the window length, let us say 20 milliseconds, and this is the frame shift, 10 milliseconds. From 0 to 20 milliseconds I take the signal, process it, and find a parameter vector — let us call it a1. Then I shift the window by 10 milliseconds, again take 20 milliseconds, and find a second vector a2; then I again shift by 10 milliseconds and take 20 milliseconds for a3. In that case, for 1 second I will get a1, a2, ..., a100. Now, why do I take 20 milliseconds and 10 milliseconds — what is the logic behind it?
Now, speech is a time varying signal: along the time axis the speech signal changes its colour. If I want to maximize the time resolution, I should analyze the signal for every sample, but that analysis does not provide me any useful information. What I want is a small window. Suppose I take a 5 to 10 millisecond window — a very short segment. Due to the small amount of data, if you consider the pitch, within 5 milliseconds there may be only 1 or 2 fundamental periods, or, if it is a male voice, perhaps not even a complete period.
So, analyzing such a short window is of little use. Now if I take, let us say, 100 milliseconds, then what happens? I may lose the transitory portion of a consonant-to-vowel transition, because speech is changing; for example, for a 'ka' there is the burst and then the VOT, that burst-to-voicing transition. If I take a 100 millisecond window, my window may fall around here, so the information which is very important — this transition information — is lost, because it is averaged together with the vowel and some part of the consonant.
So, I lose the transitory part. Thus, with a very small window I have a signal processing problem (too little data), and with a very large window I lose time resolution. What I want is an optimum window length, so that neither effect hurts too much. Say 20 to 25 milliseconds: one transitory period is roughly 40 to 60 milliseconds — even in faster speech it may be around 30 milliseconds — so at least I can ensure that my window length is shorter than the transitory part.
Then I can say yes, I can track the detailed signal information along time if I take the window length as 20 to 25 milliseconds. That is why, in the whole processing of speech, we use 20 to 25 milliseconds as the window length and shift the frame by 10 milliseconds, so that I get 100 frames per second. You can shift it by 15 milliseconds also, but then for every 15 milliseconds I get one vector, whereas here for every 10 milliseconds I get one parameter vector for analysis.
What kind of parameters can I analyze? For every 10 milliseconds I get one value; if it is fundamental frequency, then for every 10 milliseconds I get one pitch parameter. It may contain the average of three or four pitch periods inside the window — maybe 3 to 5 pitch periods if it is a female voice. So, now we are framing the speech signal and shifting the window by 10 milliseconds.
So, this is the example: this is interval 1, and then the frame is shifted by 10 milliseconds, or by 50 percent — if it is a 20 millisecond window, 50 percent overlap means a 10 millisecond shift.
So, I get 100 frames per second. I can also take 25 milliseconds as the window and shift the frame by 10 milliseconds; that also gives 100 frames per second.
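As a small illustration of this framing scheme, here is a sketch (my own, not from the lecture) that slices a signal into 20 millisecond frames with a 10 millisecond shift; the function name and numbers are only examples.

import numpy as np

def frame_signal(x, fs, win_ms=20.0, shift_ms=10.0):
    """Slice x into overlapping frames: rows are frames of win_ms length with shift_ms hop."""
    win = int(fs * win_ms / 1000.0)      # e.g. 320 samples at 16 kHz
    hop = int(fs * shift_ms / 1000.0)    # e.g. 160 samples -> about 100 frames per second
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])

if __name__ == "__main__":
    fs = 16000
    x = np.random.randn(fs)              # 1 second of dummy signal
    frames = frame_signal(x, fs)
    print(frames.shape)                  # roughly (99, 320): ~100 frames of 320 samples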
Now, I have discussed the processing part; next, what kinds of time domain parameters are there? First, short-time energy: for every window I can find the short-time energy of the speech signal. Let us come to the application side. Suppose I have a speech signal: there is a voiced part, then a noisy part, then a silence part, again a voiced part, and then again a silence part.
Now, suppose I want to find out which part is voiced and which part is unvoiced or silence; I want to distinguish between them and mark this point, this point and this point. Suppose I compute the short-time energy over the whole speech signal: for this window I find the energy E1, then with 50 percent overlap I take the next window and find E2, then E3, and so on. Since a voiced segment has high energy, a noisy segment has a lower value, and silence is lower still, those E values give me a plot like this: high, high, then coming down, low, then low, then going high again.
There will also be transitions in the curve, because a window may cover partly this portion and partly that portion, so it averages out the energy; that is why you also get these transitory parts. From that curve I can find out that up to this point may be the voiced part and up to this point may be the silence part.
So, that I can easily understand: short-time energy is one kind of parameter. Then there is the short-time average magnitude.
Short-time energy, or short-time average energy, versus short-time average magnitude: what is the problem with energy? When I calculate energy, the signal x[n] has to be squared. Consider a 16 bit speech sample, so the maximum possible values are about +32000 and −32000; suppose a sample value comes to around 28000 — if you square 28000, the integer value becomes very large.
So, instead of squaring the signal, I can find the short-time average magnitude. Average magnitude means: if x[n] is my signal, the magnitude is |x[n]|; I add these up, take the sum, and divide by the number of samples to get the average magnitude. That I can use as a parameter instead of E1, and the mathematics becomes simpler. Then there is the short-time zero crossing. Let us come to it: suppose I do not yet know whether there is a speech signal in a stretch of time or not.
Now, for any speech signal, in this portion the signal crosses the zero line many times, while in this portion — this is the zero-amplitude line — the number of zero crossings within a particular window will be less compared to that. So, the short-time zero crossing count can be used to find out whether it is a voiced signal or a noisy signal; like short-time energy, short-time zero crossing can be used, so the number of zero crossings can be one time domain parameter to extract.
So, this is a time domain parameter I have to extract per frame: the number of zero crossings. Then I will come to short-time autocorrelation, and then the short-time average magnitude difference. Those are the parameters I can extract from the speech signal, and every parameter has its own purpose. Now let me come to them one by one, starting with the short-time energy mathematics.
So, the energy of a signal is E = Σ (n = −∞ to ∞) x²[n]. This is the whole-signal energy, the long-term definition.
Now, if I use the long-term definition — this is my signal and I take the energy of the whole thing at a time — it is of no use, because I want to find out where the voiced signal, the unvoiced signal, and the silence are. If I take the whole signal at a time and find the energy, I get a single value, the energy of that speech signal, but I cannot infer that information from it. So, there is no use in taking the whole signal at a time.
So, instead of taking the whole signal at a time, I will multiply the signal by a window and take the energy of that particular window.
So, I cut the signal at a particular window: there is an infinite-length signal and I place a window w[n] over it, spanning l = 0 to L−1. So, the short-time energy in the vicinity of sample n̂ is the sum of the squared values within the window, with w[n] = 1 if n lies within 0 to L−1 and 0 otherwise. In other words, from the long signal I cut out a window and take the energy.
So, suppose I say: find the short-time energy. First, record your name in the computer at, let us say, Fs = 16 kilohertz, 16 bit, mono.
So, put on the microphone connected to the computer and record your name. Let us say your name gives a 5 second long signal; that means, at 16 kilohertz, 16k samples per second, I get 16k × 5 = 80k samples.
So, I have 80k samples. I take a window of 20 milliseconds starting at sample number 0. If Fs is 16 kilohertz, how many samples are there in 20 milliseconds? In 1 second there are 16k samples, so in 1 millisecond 16 samples, and in 20 milliseconds 320 samples. So, the first frame is samples 0 to 319, and since I am multiplying by a window function with amplitude 1, I take every sample: E1 = Σ (k = 0 to 319) x²[k].
So, I get the first frame's energy. Then I shift the frame by 10 milliseconds; 10 milliseconds means 320/2 = 160 samples. So, I take another window starting from sample 160: the first window was samples 0 to 319, and this one runs from 160 to 160 + 319. I take those samples and find E2, and so on. With a 10 millisecond shift I get 100 frames per second, so in five seconds I get 500 frames: E1, E2, E3, ..., E500 — 500 short-time energy data points.
Now, let me draw this as a diagram: here is x[n] with sampling rate Fs, and for energy the operator is squaring. So, for short-time energy, this is my signal x[n] whose sampling frequency is Fs; I apply the squaring operator, so every sample is squared and the rate is still Fs. Then, once I pass it through the window w[n], after the window my rate is Fs/R, where R is the shift — if the shift is 10 milliseconds, then for every 10 milliseconds I get one value, you can say one data point.
So, here I get E[n]: with a 10 millisecond shift, 100 frames per second. Earlier the signal had Fs samples per second; instead of Fs samples I now get 100 values per second, E[n] for n = 1, 2, 3, ..., 100 for each second of signal. This is called the short-time energy.
Now, if I am interested, instead of energy I can replace the squaring box with just the magnitude; again the input rate is Fs, again I apply the window w[n], and I get M[n] at rate Fs/R. So, instead of squaring the signal, M[n] = Σ (k = 0 to L−1) |x[k]| · w[n − k], where n marks the start of the window: for the first window n = 0, for the second window n = 160 if the shift is 160 samples, for the third window n = 320, for the fourth 480, and so on. For every window I get one M[n].
So, I get M[n] as well as E[n]: E[n] is the short-time energy and M[n] is the average magnitude. It is a sum of magnitudes; if I want to make it a true average, I put a 1/L in front, where L is the number of samples, so I can normalize it. So, using these two — and this slide shows a plot of both parameters — you can extract E[n] after recording your name.
So, I give you the problem: record your name at 16 kilohertz, 16 bit, mono — you can use any tool, for example Cool Edit or Praat — then find the E[n] and M[n] values at a rate of 100 frames per second: take 20 milliseconds as the window length and 10 milliseconds as the frame shift, and find E[n] and M[n].
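A minimal sketch of this assignment might look as follows; the file name "my_name.wav" is a hypothetical placeholder for your own mono recording, and the function name is mine.

import numpy as np
from scipy.io import wavfile

def short_time_energy_and_magnitude(x, fs, win_ms=20.0, shift_ms=10.0):
    """Return (E, M): short-time energy and average magnitude, one value per frame."""
    x = x.astype(np.float64)
    win = int(fs * win_ms / 1000.0)
    hop = int(fs * shift_ms / 1000.0)
    E, M = [], []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win]
        E.append(np.sum(frame ** 2))        # E_n = sum of squared samples in the window
        M.append(np.mean(np.abs(frame)))    # M_n = average |x| in the window
    return np.array(E), np.array(M)

if __name__ == "__main__":
    fs, x = wavfile.read("my_name.wav")     # hypothetical 16 kHz, 16-bit, mono recording
    E, M = short_time_energy_and_magnitude(x, fs)
    print(len(E), "frames; highest-energy frame at index", int(np.argmax(E)))

Plotting E and M against the waveform, as asked below, should make the voiced regions stand out as the high-energy frames.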
Then plot the 3 signals: first your recorded signal, then E[n], then M[n]. For the recorded signal there will be a lot of samples, whereas for E[n] and M[n] you get 100 values per second. Now see whether you are able to find the voiced zone using E[n] and M[n] or not. If it is a voiced zone, the E[n] value will be high; if it is unvoiced or a sibilant, the amplitude may not be that high, but it may still rise somewhere; and it may also rise somewhat if there is a voice bar. So, from energy alone I may not know whether it is a voice bar or whether it is noise.
So, for that purpose we use another parameter; but at least using E[n] I can find out where the speech signal has high energy and where there is no energy. This can be used as voice detection. Do plot it and see how it behaves; then you will get real exposure. Now, another parameter is the zero crossing.
So, how many times does the signal cross the zero line? If it is a pure sine wave, then within one period I count 1, 2 — this last point belongs to the next period — so there are 2 zero crossings per cycle.
Now, sometimes you may find that you have recorded the signal and the sine wave here never crosses the zero-amplitude line; that means the sine wave is DC shifted — during the recording there is a DC offset, which is why the zero line appears shifted to here; the actual zero line is this line. If it is DC shifted, I can either remove the DC component with a filter or take the average and subtract it; so first correct the DC shift of the signal and then find the zero crossings.
Now let me come to the calculation and derive the equation for this parameter. If it is a pure sine wave there are 2 zero crossings per cycle. Now, if the sampling frequency of this sine wave is Fs and the fundamental frequency is F0 — that means the length of one period is 1/F0 — then there are Fs/F0 samples per cycle. For example, with an 8 kilohertz sampling frequency and a 200 hertz F0, there are 8k/200 = 40 samples per cycle.
So, that is the number of samples per cycle. Now, there are 2 zero crossings per cycle, so the zero crossing rate Z1 = (2 crossings per cycle) × (cycles per sample); cycles per sample is nothing but F0/Fs, so Z1 = 2·F0/Fs zero crossings per sample. If I want the number of zero crossings over, say, 80 samples — in general M samples — then Z = (2·F0/Fs) × M.
In the next class we will try to derive the generalized formula: how to find the number of zero crossings for a non-periodic signal — or for a speech signal, whose periodicity is quasi-periodic — or even for a completely non-periodic signal.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 20
Time Domain Method in Speech Processing (Contd.)
So, we were saying: suppose I have a simple sinusoid. Every cycle, a simple sinusoid crosses the zero line 2 times; and if it is DC shifted, let us first correct the DC shift and then count the zero crossings — 2 per cycle. So, for every cycle it crosses the zero line 2 times.
Now, if the sampling frequency is Fs and the fundamental frequency of this signal is F0, then for every cycle it crosses the zero line 2 times. How many samples are there per cycle? If Fs is the number of samples per second and F0 is the frequency, then there are Fs/F0 samples per cycle, which I have discussed already. So: 2 crossings per cycle, and Fs/F0 samples per cycle.
samples per cycle = Fs / F0
z = 2 / (Fs / F0) = 2 F0 / Fs   (zero crossings per sample)
So, if I convert this to the number of zero crossings per sample: 2 crossings per cycle divided by Fs/F0 samples per cycle, which is nothing but 2·F0/Fs zero crossings per sample. So, for a pure sinusoid the number of zero crossings per sample is 2·F0/Fs. Now, suppose I ask you to find the number of zero crossings of a sine wave over 80 samples, or in general over m samples: the number of zero crossings for m samples is nothing but (2·F0/Fs) × m. Is it clear?
z_m = 2 × (F0 / Fs) × m
So now suppose I have a signal: F0 is a pure tone of 1 kilohertz, sampled at Fs = 10 kilohertz. If I ask you to find the number of zero crossings over 400 samples — how many times does the signal cross the zero line in 400 samples? — then Z400 = 2 × (F0/Fs) × 400 = 2 × (10³ / (10 × 10³)) × 400 = 80. So, the number of zero crossings over 400 samples is 80.
Now, instead of 400 samples, suppose I ask: how many times does the signal cross the zero line in 30 milliseconds? In a 30 millisecond stretch of this signal, how many zero crossings are there?
F0 = 1 kHz, Fs = 10 kHz
Z_400 = (2 × 10³ / (10 × 10³)) × 400 = 80
So, we said Fs is 10 kilohertz; that means 1 second has 10k samples, so 1 millisecond has 10 samples and 30 milliseconds has 300 samples. So, instead of 400 I use 300: (2 × 1k / 10k) × 300 = 60. So, I can find out the number of zero crossings.
Z_300 = (2 × 1k / (10 × 1k)) × 300 = 60
Now, instead of a sine wave, suppose I have a speech signal. I cannot say 2 crossings per cycle; it is an arbitrary signal, and I do not know the number of cycles or the period — there is only the signal. I want to know how many times the signal crosses the zero line in a 40 millisecond segment, or in a 20 millisecond segment, for one window.
(Refer Slide Time: 05:23)
So, my problem is to find how many times the signal crosses the zero line within a 20 millisecond window; that is why it is called the short-time zero crossing rate. I could also calculate how many times the signal crosses zero over the entire signal, but that would be of no use, because the signal is time varying: this may be voiced, this may be silence, this may be noise — all kinds of segments are there. If I want to know which part is voiced and which part is noise, then instead of taking the whole signal at a time I have to take a small window, the same as in the calculation of energy; that is why it is called the short-time zero crossing rate.
So, once I take a short window, let us say 20 milliseconds, I want to find out how many times the signal crosses the zero line within it. Here my earlier formula will not work, because I do not know the number of cycles or F0; I know only Fs, the sampling frequency at which I sampled the signal. Now let x[n] be the signal; first I will explain it, then I will write the generalized formula. So, this is the first window and the signal is x[n], a 20 millisecond stretch. If Fs = 8 kilohertz, how many samples are there in 20 milliseconds? 10 milliseconds has 80 samples, so there will be 160 sample values.
Fs = 8 kHz
So, x[n] has 160 values; if it starts from 0 then the last sample is x[159]. Now I want to find out how many times the signal crosses the zero line. If I draw the samples in general, a zero crossing happens when the samples move from one side of the zero line to the other: this is a positive sample and this is a negative sample, so there is a zero crossing between them; similarly, there are negative samples and then a positive sample, so there is a zero crossing. So, a zero crossing happens only when the sample sign changes, from positive to negative or from negative to positive.
Every time the previous sample is positive and the next sample is negative, a zero crossing occurs; and when the previous sample is negative and the next is positive, a zero crossing occurs. A zero crossing can only occur between 2 samples, so within each pair of consecutive samples I check whether a zero crossing occurred or not. How do I check? I check whether the sample value changes sign between the 2 samples. So, let us take x[0] and x[1], and instead of the sample values let us write down only their signs, +1 or −1.
So, for these samples: this is +1, this is +1, this is −1, −1, −1, then +1. I have to check 2 samples at a time: if I subtract one sign from the other and both are +1, then 1 − 1 = 0. Only if the sign has changed do I want to count one crossing. So, I can write down a function; let me use a separate slide.
(Refer Slide Time: 11:37)
What does this function give each time? Take the magnitude of the difference of the signs. Consider this case: I take x[3] and x[2]. If x[3] is a negative sample its sign is −1, and x[2] is a positive sample with sign +1; take the magnitude of the difference and the value is 2. If instead the second sample is positive (+1) and the first sample is also positive (+1), then the magnitude of the difference is 0.
|1 − 1| = 0
So, if I take this function, check it for all samples, and take the sum, then every time the signal crosses the zero line I get a contribution of 2. But what I want is the number of zero crossings within the window, let us say 20 milliseconds; so it has to be normalized by 1/2, because each crossing should count as 1 while this function gives 2. If I also want to normalize with respect to the window length, then I normalize by L. This is for the first window, and the same thing happens for any window throughout the signal. So, the generalized formula for calculating the zero crossing rate is this one.
Z_n̂ = (1 / (2 L_eff)) Σ (m = n̂−L+1 to n̂) |sgn(x[m]) − sgn(x[m−1])| · w̃[n̂ − m]
where sgn(x[n]) = 1 if x[n] ≥ 0 and sgn(x[n]) = −1 if x[n] < 0.
And this is the block diagram: first the signal passes through the sign operation (+1 or −1), then I take the difference and its magnitude, then I pass it through the window and sum. So, for every window I get a Z value: for window number 1 I get Z1, for window number 2 I get Z2, for window number 3 Z3. If there are 100 frames per second, I get 100 Z values; so instead of Fs samples, 1 second is represented by 100 points — the rate comes down from the sampling frequency Fs to the frame rate of 100. Then, if I plot those 100 Z values: if the segment is a sibilant or a fricative, the number of zero crossings will be very high, so the corresponding window's Z value will be very high; if the segment is a voiced signal, the number of zero crossings will be less, so the Z value will be low.
(Refer Slide Time: 15:16)
So, depending on the Z value I can say whether the segment is a sibilant or silence. The problem is that even if the segment is not a sibilant — let us say it is silence — but some simple noise is present, then the zero crossing rate can also be very high; or if there is a 50 hertz line-frequency disturbance, then I get a zero crossing rate like that of a voiced signal. That is why the zero crossing rate is not a very rigid parameter for voiced/unvoiced detection, but yes, I can do voiced/unvoiced detection using the zero crossing rate.
So, the normalized zero crossing rate means it is normalized with respect to the window length. I consider the n-th window — n = 0, n = 1, n = 2, n = 3 — so I know which window it is, and for each window I calculate the normalized zero crossing rate with respect to 2L, where L is the length of the window.
Z_n̂ = (1 / (2 L_eff)) Σ (m = n̂−L+1 to n̂) |sgn(x[m]) − sgn(x[m−1])|
So, let us take this example: for a 1 kilohertz sine wave as input, using a 40 millisecond window length with various values of the sampling frequency, we get the following — we get the same value of Z_m. The window length is 320 samples (at 8 kilohertz). You know that Z1 = 2·F0/Fs, and multiplying by the number of samples in 40 milliseconds gives the Z value; if you check, the Z_m value is the same in each case.
Z_1 = 2 × F0 / Fs
But Z1 differs: the number of zero crossings per sample depends on the sampling frequency, whereas the number of zero crossings for a particular duration of time is the same. So, I can say Z_m is independent of the sampling frequency; that is called the normalized zero crossing rate.
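A minimal per-frame sketch of this normalized zero crossing rate (my own function name and parameter choices) might look like this; for a 1 kHz tone at 8 kHz it should come out near 2·F0/Fs = 0.25 for every frame.

import numpy as np

def short_time_zcr(x, fs, win_ms=20.0, shift_ms=10.0):
    """Normalized zero-crossing rate per frame: sum of |sign differences| over 2L."""
    win = int(fs * win_ms / 1000.0)
    hop = int(fs * shift_ms / 1000.0)
    s = np.where(x >= 0, 1.0, -1.0)          # sgn(x[n]) as in the formula above
    rates = []
    for start in range(0, len(x) - win + 1, hop):
        seg = s[start:start + win]
        rates.append(np.sum(np.abs(np.diff(seg))) / (2.0 * win))
    return np.array(rates)

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 1000 * t)      # 1 kHz tone: expected rate ~ 2*1000/8000 = 0.25
    print(short_time_zcr(tone, fs)[:5])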
(Refer Slide Time: 18:51)
Then there is another time domain parameter, called the autocorrelation coefficients. The details I will discuss during LPC analysis and F0 analysis, but let me give you a hint of the idea of autocorrelation. What is correlation? Suppose I want to find the correlation between two things: it is nothing but the similarity between the two objects. If two objects are similar, I have to measure the degree of similarity, so there is a requirement for some quantity with respect to which I can find that degree of similarity.
So, suppose there is a digital signal x[n] and another digital signal y[n], and I want to find the similarity between them. I do it sample by sample over the whole signal: for lag 0, r(0), I take each sample of one signal, multiply it by the corresponding sample of the other signal, and take the sum — it is nothing but the sum of x[n]·y[n]. If I want the similarity at lag 1, r(1), I shift one signal by one sample and do the same. That is correlation. If it is autocorrelation, then y[n] = x[n]: it is the similarity of the signal with itself. In that case the first coefficient, r(0), is nothing but the sum of x²[n].
r(0) = Σ x²[n] = E0
Each sample is squared and then summed, so it is nothing but the energy E0. Similarly, for r(1) the signal is multiplied by itself shifted by one sample: x[1] is multiplied by x[0], x[2] by x[1], and so on; the last sample is multiplied by 0, because outside the vicinity of the window the signal is taken as 0. Then I get r(2), r(3), just by shifting further. Think of it for a sine wave: these are the samples — 1, 2, 3, then again 1, 2, 3 for the next cycle. When I multiply the signal by itself with no shift and add, the similarity is maximum and I get E0.
Once I shift by one sample, I multiply this sample with that one, this with that, and take the sum; since the signals are now not in phase, the similarity is reduced. When the shift becomes one complete period, the similar samples are again multiplied with each other, so the summation increases again. So, if I calculate r(0), r(1), r(2), r(3) — the r values at different lags l — I again get a maximum at the lag l that corresponds to one complete period of the signal.
So, that l is nothing but the complete period T in samples. If you look at what the plot looks like — details I will cover later — the maxima occur at the red marks. What is the definition of the fundamental period? The signal repeats itself, and it repeats when the sample values are similar; so the signal repeats here, here and here. This lag corresponds to the fundamental period 1/F0, this is twice that, and this is three times. So, this can be used to calculate the fundamental frequency of the signal. That is the autocorrelation parameter for time domain processing.
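Here is a small sketch (my own, not the lecture's algorithm) of using short-time autocorrelation of one frame to estimate F0; the pitch search range 60-400 Hz is an arbitrary assumption.

import numpy as np

def f0_from_autocorrelation(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 of one frame: the autocorrelation peaks at a lag of one period."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # r[0], r[1], ...
    lo = int(fs / f0_max)            # smallest lag to consider
    hi = int(fs / f0_min)            # largest lag to consider
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

if __name__ == "__main__":
    fs = 16000
    t = np.arange(int(0.03 * fs)) / fs                  # one 30 ms frame
    frame = np.sin(2 * np.pi * 200 * t)                 # a 200 Hz "voiced" tone
    print(round(f0_from_autocorrelation(frame, fs), 1)) # close to 200.0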
(Refer Slide Time: 24:18)
The details I will discuss again during F0 extraction. The next parameter is called the average magnitude difference function, AMDF. Instead of taking the product of the signal with its shifted version, I just take the magnitude of the difference.
So, suppose I have a signal of N samples. I take a sample, take its difference with the shifted sample, take the magnitude, sum over the window, and divide by N.
D(k) = (1/N) Σ (n = 0 to N−1−k) |x[n] − x[n + k]|
So, see the philosophy. For a given lag k, I take each sample, subtract the sample k positions later, take the magnitude of the difference, and sum over the window: for k = 0 it is x[0] − x[0], x[1] − x[1], and so on; for k = 1 it is x[0] − x[1], x[1] − x[2], x[2] − x[3], summed up; then the same for k = 2, k = 3, and so on. So, when will the sum be minimum? When the shift k is such that every sample is subtracted from the matching sample of the next period — the same waveform subtracted from the same waveform, so the differences are nearly x[n] − x[n] = 0.
Then the difference will be a minimum, almost 0. Recall the definition of the period: if two samples are identical after a delay of N, then N is called the period — that is, x[n] = x[n + N]; if after N samples the same value appears, then N is the period of the signal. So, the difference will be a minimum when the signal repeats itself, and I will get a plot like this.
x[n] − x[n + N] = 0
So, the first minimum will occur at the lag corresponding to the fundamental period, the second minimum at twice that lag, and the third at three times. The details — how F0 is extracted this way and what the drawbacks of this algorithm are — I will discuss later. So, this is called the average magnitude difference function.
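A minimal sketch of the AMDF and of locating its first minimum (my own; the 100-400 Hz search range for this synthetic example is an assumption, not part of the lecture):

import numpy as np

def amdf(frame, max_lag):
    """D(k) = (1/N) * sum |x[n] - x[n+k]| for each lag k."""
    n = len(frame)
    d = np.empty(max_lag)
    for k in range(max_lag):
        d[k] = np.mean(np.abs(frame[: n - k] - frame[k:])) if k else 0.0
    return d

if __name__ == "__main__":
    fs = 8000
    t = np.arange(int(0.04 * fs)) / fs
    frame = np.sin(2 * np.pi * 200 * t)        # period = 40 samples at 8 kHz
    d = amdf(frame, max_lag=120)
    lo, hi = fs // 400, fs // 100              # restrict search to a 100-400 Hz pitch range
    lag = lo + int(np.argmin(d[lo:hi]))
    print(lag, "-> about", round(fs / lag), "Hz")   # lag 40 -> about 200 Hz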
So, all those time domain parameters can be used to detect speech versus non-speech. Why is that important? Suppose I want to develop a speaker identification system, and I want to find out where the speech event starts and where it ends. Let us say I stay quiet for some time after the recording starts and then speak; I want to find the time when I start speaking and when I stop — speech and non-speech.
So, the beginning of a speech interval and the end of a speech interval have to be detected. How do we detect them?
(Refer Slide Time: 30:00)
I have to accurately locate the beginning and end of the speech, possibly with a noisy background. What is the problem? Suppose there is some background noise, and I start the speech with a plosive consonant — pa, ta, ka. The closure period of 'pa' is nothing but silence, so I do not know where the 'pa' started. When voicing starts I know the voicing started here — that is the transition from 'pa' to the following voicing — but I do not know where the 'pa' begins, whether it begins here or here.
Similarly, if there is a fricative — a weak fricative like 'fa' — then detection of that onset is also very tough; the same holds for weak plosives, weak fricatives, even nasals. Suppose you start with a nasal consonant: the voicing builds up gradually, so where exactly did it start? It may coincide with the noise. Those are the problems in speech/non-speech detection. But using this energy — or, say, the average magnitude — and the zero crossing rate, I can detect the beginning and end points of the voicing.
So, what we will see is that log energy separates voiced from unvoiced and silence: this is the silence, this is the unvoiced, this is the voiced region — log energy in dB. And this is the zero crossing count for silence, unvoiced and voiced, computed per 10 millisecond interval; I take the window at 10 milliseconds. So, I can develop an algorithm like this: this is my energy, and this black line is plotted for every 10 milliseconds — every 10 milliseconds there is a point. Let me explain it better here.
So, suppose I have a signal and then silence. What I will do is, for every 10 milliseconds, find the average magnitude Am, or the energy Em, and the zero crossing rate. So, if it is a 1 second signal and 10 milliseconds is the frame shift, I get 100 frame values.
These coincide with the signal: this frame corresponds to this energy, this frame to this energy, and so on, so I get an energy plot like this. Now I have to define a threshold: if the value is within this threshold I call it silence, and if it is above the threshold I call it voiced. I normalize the amplitude, because if I record at a high volume the threshold value would change; so you normalize the speech signal with respect to some maximum level, take the threshold value from some previous study, and apply that threshold to do the voiced/un-voiced detection.
Similarly, for every 10 milliseconds I get a Zm value. If the segment is noisy, the Zm value will be very high, so the Zm plot will look like this, then go down, down, down. Again, if the Zm value is above this threshold I can say it is a sibilant, and below the threshold it is voiced or silence. Silence can also give a high zero crossing value, but if I combine Zm with the energy, then I can say whether it is silence or a sibilant: a sibilant, like voicing, will have at least some power, but in silence only the background noise is there, so the power is low. That way these parameters can be used for voiced/un-voiced and speech/non-speech detection.
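As an illustration of combining the two thresholds, here is a crude sketch (entirely my own; the threshold values and labels are arbitrary assumptions, not the lecture's algorithm):

import numpy as np

def label_frames(x, fs, win_ms=20.0, shift_ms=10.0, e_thresh=0.01, z_thresh=0.15):
    """Very crude per-frame labelling using energy and zero-crossing thresholds."""
    x = x / (np.max(np.abs(x)) + 1e-12)          # normalize amplitude, as suggested above
    win = int(fs * win_ms / 1000.0)
    hop = int(fs * shift_ms / 1000.0)
    labels = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        if energy > e_thresh and zcr < z_thresh:
            labels.append("voiced")
        elif energy > e_thresh:
            labels.append("unvoiced/sibilant")
        else:
            labels.append("silence")
    return labels

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = np.concatenate([np.zeros(fs // 2), np.sin(2 * np.pi * 150 * t[: fs // 2])])
    print(label_frames(x, fs)[:5], label_frames(x, fs)[-5:])   # silence first, voiced later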
For PDA and VDA — the pitch detection algorithm and the voicing detection algorithm — I will discuss the details later when discussing F0 extraction. The parameters above can act as a voicing detection algorithm; and every PDA is in effect also a VDA, because if I am able to find the pitch, pitch exists only in voiced segments, so pitch detection implies voicing detection. So, the time domain methods are complete; in the next class I will discuss LPC modelling.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 21
Introduction to Linear Prediction
So, let us start with that new chapter which is called Analysis Synthesis of Pole-Zero
model. Basically, here we are dealing with that LPC analysis and LPC synthesis.
Now, if you see in the speech production system, 𝐻(𝑍) is, as per our earlier discussion, the transfer function of the speech production system, which is nothing but 𝐺(𝑍), the glottal transfer function, multiplied by 𝑉(𝑍), the vocal tract transfer function, and 𝑅(𝑍), the radiation transfer function. So, these are the three transfer functions which have to be multiplied to get the overall 𝐻(𝑍); if you see in the figure, the overall 𝐻(𝑍) is this.
𝐻(𝑍) = 𝐺(𝑍)𝑉(𝑍)𝑅(𝑍)
So, if you see how speech is modelled in the digital domain: there is an impulse generator; then there is a glottal pulse model, the glottal transfer function; then there is a gain. If the speech is voiced, then it is connected here; if it is unvoiced, it is connected to the random noise generator. And that 𝑢𝐺[𝑛] passes through the vocal tract, giving 𝑢𝐿[𝑛], which is the output of the vocal tract, and through the lip radiation I get the speech.

So, this is the LPC synthesis model, or I can say this block diagram shows how the vocal tract transfer function can be digitized or implemented. So, 𝐻(𝑍) is nothing but 𝐺(𝑍), 𝑉(𝑍) and 𝑅(𝑍). Do you remember what 𝐺(𝑍) is?
(Refer Slide Time: 02:30)
$$G(z) = \frac{1}{\left(1 - e^{-cT} z^{-1}\right)^{2}}, \qquad e^{-cT} \approx 1$$

$$V(z) = \frac{G}{\prod_{k=1}^{N/2}\left(1 - 2 r_k \cos\theta_k\, z^{-1} + r_k^{2} z^{-2}\right)}$$

$$R(z) = R_0\left(1 - z^{-1}\right)$$

$$H(z) = \frac{G R_0 \left(1 - z^{-1}\right)}{\left(1 - e^{-cT} z^{-1}\right)^{2}\,\prod_{k=1}^{N/2}\left(1 - 2 r_k \cos\theta_k\, z^{-1} + r_k^{2} z^{-2}\right)} \;\approx\; \frac{\sigma}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
So, this is the basic point of why we do LPC analysis, linear predictive analysis, for speech. Since the vocal tract transfer function can be modelled using an all-pole model, linear prediction is possible for the vocal tract, or you can say for the speech production system. Now, let us go to the linear prediction system itself.
(Refer Slide Time: 03:32)
$$H(z) = \frac{S(z)}{U(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

$$S(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right] = A\,U(z)$$
So, I can say if you see the; this is the linear combination. So, I can say if suppose there is
a P sample is like this and what is the P sample; this is the P signal those are the sample.
Now, I can say does a sample number; this sample or this sample can be predicted or can
be generated from the previous P number of sample. So, if I get the previous P number of
sample and the value of a 1 to a P, then I can say I can generate the current signal which
is S[n] current sample S[n]. So, what is this? This is nothing but a linear prediction. So,
why it is called linear prediction? Suppose there is a line if this is the line; if I know this
point this point and this point, I can say I can predict this point by linear combination of
the previous point with some coefficient factors.
So, this is called linear prediction; so, I can say I can predict current sample from previous
p number of sample; with a corresponding factor 𝑎1 , 𝑎2 and 𝑎𝑝 . So, what I get into here.
So, if I know 𝑎1 , 𝑎2 and 𝑎𝑝 , those are the call coefficient. So, those are the linear prediction
coefficient I can say those are describing the properties of the speech signal and they will
be multiplied with the previous sample which can be implement easily by a delay. So, I
can say this is nothing, but a filter; those are the coefficient of the filter and those filter if
I design and all pole filter using this coefficient and if I pass the previous P number of
sample; then the presence P signal can be predicted.
So, this is called linear prediction. So, there is a two problem; one is that I can generate S
[n] or if I know 𝑎1 , 𝑎2 and 𝑎𝑝 . Suppose I do not know 𝑎1 , 𝑎2 and 𝑎𝑝 , then I can say yes if
I know the current sample and previous P number of sample; P number of sample then I
am able to predict the value of 𝑎1 , 𝑎2 and 𝑎𝑝 for which the prediction is 100 percent
correct; so, what is the prediction?
(Refer Slide Time: 09:26)
$$\hat{s}[n] = \sum_{k=1}^{p} \alpha_k\, s[n-k]$$
I can synthesis the current signal or if I know the current signal and previous P number of
sample; from the previous P number of sample I can predict the set of value of a for which
the error will be 0. So, one is called analysis; when you are deriving the value of 𝑎1 , 𝑎2
and 𝑎𝑝 is called LPC analysis; all pole analysis. When I know the value generating the
signal S[n]; this is called LPC synthesis; I synthesize the signal let us describe in a block
diagram in the error.
$$E(z) = S(z)\left[1 - \sum_{k=1}^{p}\alpha_k z^{-k}\right] = S(z)\,A(z)$$
I can say this is my signal, which is 𝑆[𝑛]; if I pass this signal through, let us call it, the estimator 𝑃(𝑧), I get the estimated signal, which is 𝑠̂[𝑛]. So, the previous samples of 𝑆[𝑛] are passed through the estimator

$$P(z) = \sum_{k=1}^{p}\alpha_k z^{-k}$$

and I get the estimated signal; now if I take the difference of those two signals, I get the error signal, which is e[n], or from the system diagram I can say from the beginning that 𝑠[𝑛] minus 𝑠̂[𝑛] is my error.
(Refer Slide Time: 13:56)
So, I can say that if I implement 1/A(z) and pass the excitation $A\,u_G[n]$ through this filter, I get the speech signal 𝑆[𝑛], where

$$A(z) = 1 - \sum_{k=1}^{p}\alpha_k z^{-k}$$

This we have done. Reversely, if I know A(z), if I implement A(z) and pass the signal S[n] through it, I can get e[n], because E(z) is nothing but S(z) times A(z). So, if I pass S[n] through the A(z) filter I get e[n]. So, I can get the error signal if I implement A(z) and pass the present speech signal through it; and if my estimation is correct, that error signal is nothing but the scaled excitation $A\,u_G[n]$, because if my estimation is correct, only the excitation remains as error. So, if I do this one, it is called analysis; if I do the other one, it is called synthesis: if I pass $A\,u_G[n]$ through 1/A(z), I can estimate S[n].
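A minimal sketch of these two directions, assuming the α values are already known and illustrative; the analysis filter A(z) produces the residual, and the synthesis filter 1/A(z) recovers the signal from it:

```python
import numpy as np
from scipy.signal import lfilter

alphas = np.array([1.2, -0.5, 0.1])        # illustrative LPC coefficients
s = np.random.randn(160)                   # stand-in for a speech frame

a_poly = np.concatenate(([1.0], -alphas))  # A(z) = 1 - sum(alpha_k z^-k)

# analysis: pass s[n] through A(z) to get the error (residual) e[n]
e = lfilter(a_poly, [1.0], s)

# synthesis: pass e[n] through 1/A(z) to get s[n] back
s_rec = lfilter([1.0], a_poly, e)
print(np.allclose(s, s_rec))               # True: the residual regenerates the frame
```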
(Refer Slide Time: 15:58)
Now, what does the principle boil down to? The principle is that I have to either estimate the values of 𝑎1, 𝑎2, ..., 𝑎𝑝, or, if I already know them, pass $A\,u_G[n]$ through 1/A(z) to synthesize the signal. So, how do I estimate these values? If you see, I have only one equation: for a pth order predictor this one equation contains p unknowns. So, if I have 2 unknowns, how many equations do I require to solve for them? Two equations. But here there are p unknowns and only a single equation. The trick is: how do I find a set of values for which this error is zero, or as small as possible? That means, if I know the previous p samples, some linear combination of those previous p samples will give me the current sample.
So, suppose I know previous three sample; I know some set of combination of this previous
three sample will provide me the current sample. So, some set of previous sample with
some multiplication, some kind of combination of this; we will provide with the current
sample. Now, I have to find out which combination which multiplication factor with
sample number 1, which multiplication factor with sample number 2, which multiplication
sector of for sample number 3.
So, there may be infinite set of solution; I do not know; 𝑎1 , 𝑎2 and 𝑎𝑝 can be with the set
can we take the any value in infinite plane infinite set. So, for any value I can get this
current sample but for particular some value the error will be 0. So, for a infinite set of
solution; I have to find out the optimum set of solution or I can say find out the set of
solution or find out the set of the way 𝑎1 , 𝑎2 and 𝑎𝑝 value for which the either is 0.
So, I have to minimize the error and find out the set of solution. So, I can say if my
estimation of 𝛼𝑘 is, let us I estimate that 𝛼𝑘 .
(Refer Slide Time: 20:00)
So, I can say e_n is the error; there are N samples of the speech signal and I consider the frame at position n. So, I know the error, and I have to find the set of 𝛼𝑘 for which this error is minimum. So, what is the procedure to minimize the error? One of the procedures is mean square error minimization. So, I can say: find out the mean square error. So, what is the mean square error?
$$E_n = \sum_{m} e_n^{2}[m] = \sum_{m}\left(s_n[m] - \sum_{k=1}^{p}\alpha_k s_n[m-k]\right)^{2}$$
Now, what do I have to do? I have to find the set of values of 𝛼𝑘 for which this error is minimum. How do I minimize the error? Take the usual function minimization approach: set the first order derivative to zero. So, I differentiate the function with respect to each 𝛼𝑖 and set it to zero; then I get the set of values of 𝛼𝑘 for which the function is minimum.
$$\frac{\partial E_n}{\partial \alpha_i} = 0$$
(Refer Slide Time: 23:23)
$$\frac{\partial E_n}{\partial \alpha_i} = \frac{\partial}{\partial \alpha_i}\sum_{m=-\infty}^{\infty}\left(s_n[m] - \sum_{k=1}^{p}\alpha_k s_n[m-k]\right)^{2}$$

$$= 2\sum_{m=-\infty}^{\infty}\left(s_n[m] - \sum_{k=1}^{p}\alpha_k s_n[m-k]\right)\frac{\partial}{\partial\alpha_i}\left(-\sum_{k=1}^{p}\alpha_k s_n[m-k]\right)$$

$$\frac{\partial}{\partial\alpha_i}\left(-\sum_{k=1}^{p}\alpha_k s_n[m-k]\right) = -s_n[m-i], \qquad\text{since } \alpha_k s_n[m-k] \text{ is constant with respect to } \alpha_i \text{ for } k \ne i$$
So, I can define a correlation term phi; if I bring $s_n[m-i]$ inside and sum from minus infinity to infinity, setting the derivative to zero gives

$$\sum_{m} s_n[m-i]\,s_n[m] = \sum_{k=1}^{p}\alpha_k\sum_{m} s_n[m-i]\,s_n[m-k], \qquad 1 \le i \le p$$
(Refer Slide Time: 26:06)
So, you can see this leads to a set of p equations in p unknowns that can be solved in an efficient manner. Now, let me write it in matrix form. Defining

$$\phi_n[i,k] = \sum_{m} s_n[m-i]\,s_n[m-k], \qquad 1 \le i \le p,\; 0 \le k \le p,$$

the normal equations become $\sum_{k=1}^{p}\alpha_k\,\phi_n[i,k] = \phi_n[i,0]$.
(Refer Slide Time: 28:58)
Now, if I able to solve this matrix I can get the value of α1, α2 and αP. Once I get the value
of α1, α2 and αP for which my error is minimum; then I can say; using this set of αvalue,
I can correctly estimate the signal; the difference between my present signal and the
estimated signal error will be minimum.
So, if the error is minimum that can I say; this set of αvalue actually representing the
production system or representing H[z]; for that time. Since the speech is time varying
signal; let us I take the time instant this instant I get the some speech value, for those
speech value if I extract this αvalue; then I can say these αvalue can represent this H[z] or
H[z] can be implemented using this set of α value; to generate the current signal.
So, both way I can synthesis or I can estimate. So, once I estimate this α value those set of
α value represent that time the production system. So, those can be parameters for those
speech events; those parameter is called LPC parameter; Linear Predicted Coefficient
parameter α1, α2, αP are could the LPC coefficient.
Now, to estimate the α values I have to solve this matrix. There are a number of methods to solve it; in the coming classes we will discuss the methods one by one. So, ultimately I have to solve this matrix, and the question is how efficiently I can solve it so that I can estimate the values; that is my target.

So, first we will discuss the autocorrelation method. There are three kinds of methods: the autocorrelation method, the covariance method, and another one called the lattice filter method. Using these three methods, we will try to estimate the values of α1, α2, ..., αp. So, next class we will do that.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 22
Autocorrelation Method of LPC Analysis
ERROR FUNCTION:
$$E_n = \sum_{m=-\infty}^{\infty} e_n^{2}[m]$$
(Refer Slide Time: 00:44)
Using equations (1) and (2), we have to find the set of values of 𝛼, or in other words solve that matrix.

If I take the whole speech signal at a time and try to estimate the values of 𝛼1, 𝛼2, ..., 𝛼𝑝, then my purpose is not served, because a speech signal is not a stationary signal. It can consist of different types of signal with respect to time; it may be voiced, unvoiced, or sibilant. So, if I take all of that (voiced, sibilant, silence) at a time and try to find one set of 𝛼1, 𝛼2, ..., 𝛼𝑝, there is no use, because the whole signal is mixed and we do not know at which time which excitation has to be supplied. Since speech is a time varying signal, I take the signal over a small window.
Within a small window, speech can be treated as a time invariant signal; that we have already discussed in the time domain methods. I take a small window from the long speech signal, and within that window I consider that the speech properties are not changing. If I do that, then the windowed speech segment is a finite-length sequence (zero outside the window), as shown below.
So, for the long speech signal, let this be the nth window. So, here m = 0 and here m = L-1. Suppose this is the first window; it runs from sample 0 up to sample 159, a 160 sample window.

With no overlap, the next window would start from sample 160. But if the frame rate is 10 milliseconds, I can use overlapped windows: the first window takes 160 samples, then I shift the window by 80 samples and take the window from sample 80 to 239. So, let this be my nth window, where I take m = 0 and my window length is L. Outside the window length I consider the signal to be 0; there is no speech signal outside the window.
So outside the window signal is 0 within the window signal is there. So, this is m = 0 and
this is m = L-1. Now, suppose I want to predict the first sample in here what is the previous
sample, all are 0. So, from a 0 sample, all s [n -1], s [n -2] and all s [n – p] all are 0. So, I
get all are 0, but make first sample value is not 0. So, let us its value is something x. So,
the total error is (x – 0), So, total sample itself is the error. So, I can say that at the beginning
of the prediction the prediction error will be maximum.
because, if I say here my window is started and outside the window signal if 0. So, because
those samples are not there, I forcefully make them 0 by windowing. So, once make them
0, I am estimating the first sample from the all 0 sample. So, I get the estimation 0. So,
error is the sample 1.
Take the next sample, I am estimating the next sample from previous p samples where
only 1 sample is there which is s n -1 is non-zero. So, my prediction will be 𝛼1 𝑠[𝑛 − 1].
then also error will be maximum because this is not linear combination of all sample. So,
once I move towards this inside the window my prediction error will be minimum.
when I want to exit from the window, I have (L-1) sample, but I have pth order.
(Refer Slide Time: 13:11)
So, this is my window: this is the first sample, second sample, third sample, and this is the last sample. When I predict the first one, all previous samples are 0, so predicting it from zeros gives the maximum error. When I predict the next one, only one previous sample is non-zero, so the error is slightly reduced; predicting the next from two non-zero samples, it is reduced further. So, as I go into the window my error decreases. Once I come past the end of the window, I am predicting 0: the current sample s[n] = 0, but a linear combination of the previous p non-zero samples will generally not be 0, so again there will be a large error. The errors may be positive or negative. So, if I plot the error, it starts at e[0] and runs up to e[L+p-1]; the order of the predictor adds p error samples at the end. This is the LPC error, and it is maximum at the beginning of the window as well as at the end of the window.
(Refer Slide Time: 15:46)
$$E_n = \sum_{m} e_n^{2}[m] = \sum_{m=0}^{L+p-1} e_n^{2}[m]$$

So, now the error needs to be summed only over $m = 0$ to $L-1+p$.
(Refer Slide Time: 16:30)
If you see the above graph, only the overlapping region of the two shifted signals contributes to the equation for ∅[𝑖, 𝑘]:

$$\phi_n[i,k] = \sum_{m=0}^{L-1+(i-k)} s_n[m-i]\,s_n[m+i-k]$$
Let $i - k = \tau$. Then

$$\phi_n[i,k] = R_n[i-k] = R_n[\tau], \qquad R_n[\tau] = \sum_{m=0}^{L-1-\tau} s_n[m]\,s_n[m+\tau]$$

If $\tau = 0$,

$$R_n[0] = \sum_{m=0}^{L-1} s_n^{2}[m]$$

So the normal equations and the minimum error become

$$\sum_{k=1}^{p}\alpha_k\,\phi_n[i,k] = \phi_n[i,0] \;\Rightarrow\; \sum_{k=1}^{p}\alpha_k\,R_n[\,|i-k|\,] = R_n[i], \qquad 1 \le i \le p$$

$$E_n = \phi_n[0,0] - \sum_{k=1}^{p}\alpha_k\,\phi_n[0,k] = R_n[0] - \sum_{k=1}^{p}\alpha_k\,R_n[k]$$
$$\begin{bmatrix} R_n[0] & R_n[1] & \cdots & R_n[p-1] \\ R_n[1] & R_n[0] & \cdots & R_n[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ R_n[p-1] & R_n[p-2] & \cdots & R_n[0] \end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\\ \vdots\\ \alpha_p\end{bmatrix} = \begin{bmatrix} R_n[1]\\ R_n[2]\\ \vdots\\ R_n[p]\end{bmatrix}$$

$$R\,\alpha = r, \qquad \alpha = R^{-1} r$$
We can apply the Levinson-Durbin algorithm to solve this matrix equation.
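A minimal sketch of the autocorrelation method under these definitions, assuming a windowed frame `s` and order `p`; it builds R_n[0..p] and solves the Toeplitz system with SciPy's Toeplitz solver (which internally uses a Levinson-type recursion):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(s, p):
    """Estimate alpha_1..alpha_p and the minimum error E_n for one frame."""
    L = len(s)
    R = np.array([np.dot(s[:L - k], s[k:]) for k in range(p + 1)])  # R_n[0..p]
    alphas = solve_toeplitz((R[:p], R[:p]), R[1:p + 1])             # R * alpha = r
    E = R[0] - np.dot(alphas, R[1:p + 1])                           # minimum prediction error
    return alphas, E

frame = np.hamming(320) * np.random.randn(320)   # stand-in for a windowed frame
a, E = lpc_autocorrelation(frame, p=12)
```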
(Refer Slide Time: 19:39)
(Refer Slide Time: 21:35)
(Refer Slide Time: 23:42)
(Refer Slide Time: 26:36)
$$R_n[0] = \sum_{k=0}^{L-1} x[k]\,x[k], \qquad R_n[1] = \sum_{k=0}^{L-1} x[k]\,x[k-1]$$
So, we can calculate those autocorrelation values R_n easily from the given signal, put them in this matrix, and using the Levinson method solve the matrix for the values of 𝛼1, 𝛼2, ..., up to 𝛼𝑝. So, instead of writing ∅[𝑖, 𝑘], I am now writing it in autocorrelation form; this method is called the autocorrelation method of LPC coefficient analysis.

So next class, I will try to derive the Levinson recursion equations by which this matrix can be solved using these autocorrelation values.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 23
Autocorrelation Method of LPC Analysis (Contd.)
Now, we have to solve the above equation based on the Levinson-Durbin method.

$$\sum_{k=1}^{p}\alpha_k\,R_n[i-k] - R_n[i] = 0, \qquad 1 \le i \le p$$
(Refer Slide Time: 00:59)
So, basically these equations can be solved to obtain the optimum predictor coefficients. I am looking for the set of α values that satisfies these two equations, including the minimum prediction error; that means, if my estimation is correct, it should produce the minimum prediction error, so the minimum-error equation should also be satisfied. If I put these two equations in matrix form, it will look like this.
Now, I want to solve this matrix using the Levinson recursion; that means, the ith order solution can be derived from the (i-1)th order solution. Suppose the order of the predictor is p. So, I can iteratively compute the α values, and the pth order, the pth iteration, gives me the optimum solution.

So, in the Levinson recursion, we obtain the ith order solution of this matrix equation from the (i-1)th order solution. So, what is the (i-1)th order solution?
$$R_n^{(i)}\,\alpha^{(i)} = E_n^{(i)}, \qquad R_n^{(i-1)}\,\alpha^{(i-1)} = E_n^{(i-1)}$$

Because the Toeplitz matrix has a special symmetry, we can also reverse the order of the equations.
Combine the two sets of equations with a multiplicative factor $k_i$:

$$R_n^{(i)}\left(\begin{bmatrix}1\\ -\alpha_1^{(i-1)}\\ \vdots\\ -\alpha_{i-1}^{(i-1)}\\ 0\end{bmatrix} - k_i\begin{bmatrix}0\\ -\alpha_{i-1}^{(i-1)}\\ \vdots\\ -\alpha_1^{(i-1)}\\ 1\end{bmatrix}\right) = \begin{bmatrix}E_n^{(i-1)}\\ 0\\ \vdots\\ 0\\ \gamma^{(i-1)}\end{bmatrix} - k_i\begin{bmatrix}\gamma^{(i-1)}\\ 0\\ \vdots\\ 0\\ E_n^{(i-1)}\end{bmatrix}$$

Choose $\gamma^{(i-1)}$ (i.e. $k_i$) so that the vector on the right has only a single non-zero entry, i.e.

$$\begin{bmatrix}1\\ -\alpha_1^{(i)}\\ -\alpha_2^{(i)}\\ \vdots\\ -\alpha_{i-1}^{(i)}\\ -\alpha_i^{(i)}\end{bmatrix} = \begin{bmatrix}1\\ -\alpha_1^{(i-1)}\\ -\alpha_2^{(i-1)}\\ \vdots\\ -\alpha_{i-1}^{(i-1)}\\ 0\end{bmatrix} - k_i\begin{bmatrix}0\\ -\alpha_{i-1}^{(i-1)}\\ -\alpha_{i-2}^{(i-1)}\\ \vdots\\ -\alpha_1^{(i-1)}\\ 1\end{bmatrix}$$
(Refer Slide Time: 03:32)
$$\alpha_i^{(i)} = k_i, \qquad \alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)} \;\; (1 \le j \le i-1), \qquad \alpha_j = \alpha_j^{(p)} \;\; (1 \le j \le p)$$
B. With prediction error:

$$E^{(i)} = \left(1 - k_i^{2}\right)E^{(i-1)}, \qquad v^{(i)} = \prod_{m=1}^{i}\left(1 - k_m^{2}\right), \qquad 0 \le v \le 1, \qquad -1 \le k_i \le 1$$
So, if I write down the above four equations and implement them in software, then I can calculate the α values and the prediction error.

Let us say the order of the predictor is P = 3, so i varies from 1 to p and j varies from 1 to p. How do I find k1?
$$k_i = \frac{R_n[i] - \sum_{j=1}^{i-1}\alpha_j^{(i-1)}\,R_n[i-j]}{E^{(i-1)}}$$
Now, here I take i = 1. So, $E^{(0)} = R_n[0]$, and

$$k_1 = \frac{R_n[1]}{R_n[0]}, \qquad \alpha_1^{(1)} = k_1$$
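A minimal sketch of the four-equation recursion itself, assuming the autocorrelation values R[0..p] are already computed; the function name and the index bookkeeping here are illustrative choices:

```python
import numpy as np

def levinson_durbin(R, p):
    """Solve the Toeplitz normal equations by the Levinson-Durbin recursion.
    R : autocorrelation values R[0..p].  Returns (alpha, k, E)."""
    alpha = np.zeros(p + 1)                 # alpha[1..i] at the current order
    k = np.zeros(p + 1)
    E = R[0]                                 # E^(0) = R[0]
    for i in range(1, p + 1):
        acc = R[i] - np.dot(alpha[1:i], R[i - 1:0:-1])    # R[i] - sum alpha_j R[i-j]
        k[i] = acc / E                                     # reflection coefficient k_i
        new_alpha = alpha.copy()
        new_alpha[i] = k[i]                                # alpha_i^(i) = k_i
        new_alpha[1:i] = alpha[1:i] - k[i] * alpha[i - 1:0:-1]   # alpha_j^(i) update
        alpha = new_alpha
        E = (1.0 - k[i] ** 2) * E                          # E^(i) = (1 - k_i^2) E^(i-1)
    return alpha[1:], k[1:], E
```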
So, you can write the program using these four equations; a minimal sketch of such an implementation is given above, and the same thing is shown on the slide. Now, sometimes you may come across what is called the normalized autocorrelation. If you see the autocorrelation, the values of the coefficients 𝛼𝑖 and the reflection coefficients 𝑘𝑖 both depend on the R values.
So, sometime if you find that value of its magnitude change of the speech signal, then R
value will be changed and autocorrelation coefficient may slightly change. So, what I
want? I want normalized autocorrelation that with respect to R [0]; I can normalize that
correlation value. So, suppose my correlation value is R (1); R (2), R (3). Now, I can
normalize R (1), R (2), and R (3) with respect to R (0); because R (0) is the energy of the
signal.
(Refer Slide Time: 13:21)
So, instead of R(1), I can use $r(1) = R(1)/R(0)$; similarly, instead of R(2) and R(3) we can write $r(2) = R(2)/R(0)$ and $r(3) = R(3)/R(0)$ respectively.

If I use 'r' to extract the values of $k_i$, $E_i$ and $\alpha_i$, then this is the normalized autocorrelation. If it is normalized, then the initial prediction error is $E^{(0)} = 1$, because $E^{(0)} = R[0]/R[0]$.
Then, here is a task: consider the sample sequence shown in the figure above, x[n] = [1, 2, 5, -1, 2], with P = 2. Find 𝛼1, 𝛼2 and the prediction error (E).
Now, how do you decide the order P? On which factors does the prediction order depend? Suppose I take a signal sampled at 8 kilohertz, 16 kilohertz, or 32 kilohertz; what order should I use to get the correct coefficient values? If I increase the order, the error will be less, but I cannot take an infinite order. So, what order is optimum?
So, if you see the tube model, roughly one formant occurs per kilohertz. If Fs is my sampling frequency, the baseband extends up to Fs/2, and two complex conjugate poles are required to realize one formant. So, I need about Fs/1000 complex poles for the formants.
So, if it is 8 kilo hertz; then 8k/1000=8; So, those are for formant for tube only. Next is
there is a glottis and there is a radiation loss; for glottis 2 poles and for radiation loss 2 to
4 poles. So, if I say radiation 2 pole and glottis 2 pole; then, 2 + 2 = 4 pole; 4 + 8 =12 pole,
if my signal is sampled at 8 kilo hertz.
If my signal is sampled by 16 kilo hertz; then I can say 16 + 4= 20; if glottis and radiation
loss is used by 4 pole. So, order of that LPC analysis; it is not arbitrary, it depends on the
sampling frequency. So, based on the sampling frequency; I can take the order and find
out the LPC analysis.
Now, there is another question about LPC analysis: if I have the speech signal and I extract the filter, LPC analysis gives me the alpha 1, alpha 2, alpha 3, ... values. Those actually characterize the vocal tract, and those coefficients correspond to the actual pole positions. So, if I take the frequency response of those coefficients, I should get the LPC spectrum, which will give me the resonant frequencies; it should show peaks at the formant frequencies.
Now, suppose the order of the analysis is reduced: I require, say, order 14 and I make it order 12. Then some of the formants will merge together and give a broad kind of structure. So, if you see this picture: when the LPC order is increased, the LPC spectrum more or less copies the actual (blue colour) spectrum, and in the red colour, where the LPC order is 10, fewer peaks are resolved. So, a low order roughly estimates the spectral envelope; if I increase the LPC order, the number of variations increases and it copies the spectral envelope more accurately.
So, depending on our requirement if I want the smooth spectrum; I do not want lot of
variation, then I can reduce the LPC order. If I want exactly copy the spectrum, then I can
increase the LPC order. So, using this I can also draw the spectrogram also this is the real-
life example of LPC spectrum.
If you see, there are 7 formants; each formant requires two complex poles, so a 7 × 2 = 14th order LPC analysis. This is called the LPC spectrum, and then there is the LPC spectrogram. How is it plotted?

Suppose you have a speech signal: take a window, find the LPC coefficients and draw the magnitude spectrum of the LPC filter. Then plot it for that time on the frequency scale; again shift the window and plot, and that way you get the LPC spectrogram.
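A minimal sketch of drawing one LPC spectrum slice this way, assuming the α values from one window are already known (the values and parameters below are illustrative); the LPC spectrogram is just this magnitude curve recomputed for each shifted window:

```python
import numpy as np
from scipy.signal import freqz

def lpc_spectrum(alphas, gain=1.0, fs=16000, nfft=512):
    """Magnitude (dB) of H(z) = G / (1 - sum alpha_k z^-k) on a frequency grid."""
    a_poly = np.concatenate(([1.0], -np.asarray(alphas)))
    w, h = freqz([gain], a_poly, worN=nfft, fs=fs)      # frequencies returned in Hz
    return w, 20 * np.log10(np.abs(h) + 1e-12)

freqs, mag_db = lpc_spectrum([1.2, -0.5, 0.1])
```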
So, consider a complex conjugate pole pair. Suppose this is the unit circle and there is a formant whose pole angle is theta. Since the system is real, there will be a complex conjugate pole at minus theta (−𝜃), and the pole radius is r. If r is close to the unit circle the bandwidth will be narrow; if it is close to 0 the bandwidth will be very broad. So, the formant frequency is determined by the angle theta and the formant bandwidth is determined by r, which was already explained in tube modelling.
$$\text{Frequency} = \frac{F_s}{2\pi}\,\theta$$

$$\text{Bandwidth} = -\log(r)\cdot\frac{F_s}{\pi} = -\log\!\left(\lvert z_k\rvert\right)\cdot\frac{F_s}{\pi}$$
So, if Fs is my sampling frequency, what is the normalized frequency of the digital signal? The full circle 2𝜋 corresponds to Fs, so the maximum value of theta is 2𝜋, which is equal to Fs. If theta is, let us say, x radians, then $\frac{F_s}{2\pi}x$ is the frequency; if Fs is in hertz, then $\frac{F_s}{2\pi}x$ hertz is the formant frequency.
So, suppose theta is equal to $\pi/3$ and Fs is equal to 16 kilohertz. What is the formant frequency? $\frac{F_s}{2\pi}\cdot\frac{\pi}{3} = \frac{16\,k}{6} \approx 2.67$ kilohertz is the formant frequency, and the formant bandwidth is $-\log(r)\cdot\frac{F_s}{\pi} = -\log(\lvert z_k\rvert)\cdot\frac{F_s}{\pi}$. So, if I know the sampling frequency and r, I get the formant bandwidth; or, vice versa, if I know the formant bandwidth I can find r, and if I know the formant frequency I can find theta, and then I can derive the transfer function.
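A minimal sketch of reading formant frequencies and bandwidths off the roots of A(z), following the two formulas above; the α values are assumed already estimated, and `np.log` here is the natural logarithm:

```python
import numpy as np

def formants_from_lpc(alphas, fs):
    """Formant frequency and bandwidth of each complex pole of 1/A(z)."""
    a_poly = np.concatenate(([1.0], -np.asarray(alphas)))
    roots = np.roots(a_poly)
    roots = roots[np.imag(roots) > 0]        # keep one pole of each conjugate pair
    theta = np.angle(roots)                  # pole angle
    r = np.abs(roots)                        # pole radius
    freq = theta * fs / (2 * np.pi)          # formant frequency in Hz
    bw = -np.log(r) * fs / np.pi             # formant bandwidth in Hz
    order = np.argsort(freq)
    return freq[order], bw[order]
```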
So, this is the autocorrelation method I have described. Then there is the covariance method for linear prediction; I can also find the LPC coefficients with the covariance method. We want the solution of the same matrix that we have already derived, but the difference is that in this method I do not assume that the signal outside the window is 0.

The key difference from the autocorrelation method is that the limits of the summation include terms before m = 0: if the order is P, then P samples before the window have to be considered.

So, the window shape does not matter much here, whatever window function I use, but P previous samples are required in this analysis. Then I have to solve this matrix equation.
$$\Phi\,\alpha = \Psi, \qquad \alpha = \Phi^{-1}\Psi$$

The method used to solve this matrix equation is called the Cholesky decomposition, or square root method:

$$\Phi = A\,D\,A^{t}$$

where A is a lower triangular matrix with 1's on the main diagonal and D is a diagonal matrix. Determine the elements of A and D by solving for the (i, j) elements of the matrix equation.
Now, I can solve the above matrix:

$$d_j = \phi_{jj} - \sum_{k=1}^{j-1} A_{jk}^{2}\,d_k, \qquad A_{i1} = \frac{\phi_{i1}}{d_1}$$

else, for $j > 1$,

$$A_{ij} = \frac{1}{d_j}\left(\phi_{ij} - \sum_{k=1}^{j-1} A_{ik}\,d_k\,A_{jk}\right)$$

So, we can compute the elements of A and D column by column.
Now, once I get the d and A values, I find

$$A\,D\,A^{t}\,\alpha = \Psi. \qquad \text{Let } A\,Y = \Psi:$$

$$\begin{bmatrix}1&0&0&0\\ A_{21}&1&0&0\\ A_{31}&A_{32}&1&0\\ A_{41}&A_{42}&A_{43}&1\end{bmatrix}\begin{bmatrix}Y_1\\ Y_2\\ Y_3\\ Y_4\end{bmatrix} = \begin{bmatrix}\Psi_1\\ \Psi_2\\ \Psi_3\\ \Psi_4\end{bmatrix}, \qquad Y_i = \Psi_i - \sum_{j=1}^{i-1} A_{ij}\,Y_j$$
$$D\,A^{t}\alpha = Y \;\Rightarrow\; A^{t}\alpha = D^{-1}Y$$

$$\begin{bmatrix}1&A_{21}&A_{31}&A_{41}\\ 0&1&A_{32}&A_{42}\\ 0&0&1&A_{43}\\ 0&0&0&1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\\ \alpha_3\\ \alpha_4\end{bmatrix} = \begin{bmatrix}1/d_1&0&0&0\\ 0&1/d_2&0&0\\ 0&0&1/d_3&0\\ 0&0&0&1/d_4\end{bmatrix}\begin{bmatrix}Y_1\\ Y_2\\ Y_3\\ Y_4\end{bmatrix}$$
We get:

$$\alpha_i = \frac{Y_i}{d_i} - \sum_{j=i+1}^{p} A_{ji}\,\alpha_j$$

so, for the 4th order example,

$$\alpha_4 = \frac{Y_4}{d_4}, \qquad \alpha_3 = \frac{Y_3}{d_3} - A_{43}\alpha_4, \qquad \alpha_2 = \frac{Y_2}{d_2} - A_{42}\alpha_4 - A_{32}\alpha_3, \qquad \alpha_1 = \frac{Y_1}{d_1} - A_{41}\alpha_4 - A_{31}\alpha_3 - A_{21}\alpha_2$$
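A minimal sketch of the covariance method under this setup, assuming `x` contains at least p samples of history before the L-sample analysis frame starting at index `n0`; the symmetric system Φα = Ψ is solved here with NumPy's generic solver, standing in for the explicit Cholesky (ADAᵗ) steps written out above:

```python
import numpy as np

def lpc_covariance(x, n0, L, p):
    """Covariance-method LPC: needs p samples of history before x[n0]."""
    phi = np.empty((p, p))
    psi = np.empty(p)
    for i in range(1, p + 1):
        psi[i - 1] = sum(x[n0 + m - i] * x[n0 + m] for m in range(L))
        for k in range(1, p + 1):
            phi[i - 1, k - 1] = sum(x[n0 + m - i] * x[n0 + m - k] for m in range(L))
    alphas = np.linalg.solve(phi, psi)       # solve phi * alpha = psi
    return alphas

x = np.random.randn(400)                     # stand-in for a longer speech signal
a = lpc_covariance(x, n0=50, L=320, p=12)
```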
So, in LPC synthesis, if I know the alpha values or the k values, I can design this filter; and for a voiced segment the excitation is nothing but an impulse train. So, I generate the impulse train based on the F0 value of the voiced signal, multiply by the gain G, and if I pass it through this filter I can generate the speech signal s[n].
So, if it is the simple LPC encoder and decoder. So, if I want to transmit this signal from
this point to this point; using simple LPC encoding and decoding, that means I want to
send that F0 value; Z value, alpha value or Ki value in the receiver end. And I can generate
the speech signal. So, F0 is extracted based on the speech signal, I can extract that Ki value
using the autocorrelation technique or covariance technique. Then next I have to know the
G value; what is the G value? Gain value. So, how do you calculate the G?
$$H(z) = \frac{G}{1 - \sum_{k=1}^{p}\alpha_k z^{-k}}$$

$$H(z) = \sum_{k=1}^{p} H(z)\,\alpha_k z^{-k} + G$$

$$\sum_{m} h[m]\,h[m] = R(0)$$
If you know the alpha values, the R(k) values and the R(0) value, you can calculate G; in fact the energy-matching condition gives $G^{2} = R(0) - \sum_{k=1}^{p}\alpha_k R(k)$, the minimum prediction error.
So, next class we will discuss the lattice formulation for LPC extraction, which is very important, and also LPC parameter extraction and the implementation of the pole-zero filter; all those kinds of things we will discuss.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 24
Lattice Formulations of Linear Prediction
So, now we are discussing about the lattice formulation of linear predictions. So, if you
see in whether it is auto-correlation or covariance method, we try to solve the matrix
We have calculated ∅. Phi is nothing but a correlation value. In case of covariance matrix,
we calculated the correlation value of phi (∅) and then, we solved the linear equation.
So, whether it is auto-correlation methods or covariance methods, there are two steps. First
step is to compute this phi (∅) using correlation for covariance method and using auto-
correlation for auto-correlation methods. Next try to solve p number of linear equation to
find out the value of alpha (𝛼).
The lattice method says: I do not want to compute the correlation and solve the linear equations in two separate steps. Can I combine these two steps? Yes, it is possible; the lattice formulation combines these two steps into a single step.
(Refer Slide Time: 02:41)
Now, first I have ‘m’ number of speech signal or sample. So, there are lot of sample in an
equation. So, this is S[m] and this sample is let S [m – i]. So, using this previous ‘i’ sample,
I want to predict S[m] sample. So, if I do that, then it is called forward prediction.
Similarly, I can predict S [m – i] sample, based on i number of sample right side. So, either
I can predict S[m] from S [m - i] side or I can predict S [m – i] from s[m] side. If I predict
s [m - i] from s[m] side, then this is called backward prediction.
If the system is linearly predictable, then the transfer function of the system is

$$H(z) = \frac{A}{1 - \sum_{k=1}^{p}\alpha_k z^{-k}}$$
(Refer Slide Time: 04:50)
So, in the z-domain, the forward prediction error is

$$E^{(i)}(z) = S(z) - \sum_{k=1}^{i}\alpha_k S(z)\,z^{-k} = S(z)\left[1 - \sum_{k=1}^{i}\alpha_k z^{-k}\right] = S(z)\,A^{(i)}(z)$$
Similarly, the backward prediction error in the z-domain is

$$B^{(i)}(z) = S(z)\,z^{-i}\left[1 - \sum_{k=1}^{i}\alpha_k z^{k}\right] = S(z)\,z^{-i}\left[1 - \sum_{k=1}^{i}\alpha_k (z^{-1})^{-k}\right] = z^{-i}\,S(z)\,A^{(i)}(z^{-1})$$
Levinson Recursion method

Step 1:
$$E^{(0)} = R(0), \qquad \alpha_0^{(0)} = 0$$

Step 2: weighting factor of the ith order pole model.
We know,

$$A^{(i)}(z) = 1 - \sum_{k=1}^{i}\alpha_k^{(i)} z^{-k} = 1 - \sum_{k=1}^{i-1}\alpha_k^{(i)} z^{-k} - k_i z^{-i}$$

$$= 1 - \sum_{k=1}^{i-1}\left[\alpha_k^{(i-1)} - k_i\,\alpha_{i-k}^{(i-1)}\right] z^{-k} - k_i z^{-i}$$

$$= \left[1 - \sum_{k=1}^{i-1}\alpha_k^{(i-1)} z^{-k}\right] + k_i\sum_{k=1}^{i-1}\alpha_{i-k}^{(i-1)} z^{-k} - k_i z^{-i}$$

Let $k' = i - k$:

$$A^{(i)}(z) = \left[1 - \sum_{k=1}^{i-1}\alpha_k^{(i-1)} z^{-k}\right] - k_i z^{-i}\left[1 - \sum_{k'=1}^{i-1}\alpha_{k'}^{(i-1)} z^{k'}\right] = A^{(i-1)}(z) - k_i\,z^{-i}\,A^{(i-1)}(z^{-1})$$
Forward prediction error in the time domain:

$$e^{(i)}[m] = e^{(i-1)}[m] - k_i\,b^{(i-1)}[m-1]$$

Since $B^{(i)}(z) = z^{-i} S(z) A^{(i)}(z^{-1}) = z^{-1}B^{(i-1)}(z) - k_i E^{(i-1)}(z)$, we get the backward prediction error in the time domain:

$$b^{(i)}[m] = b^{(i-1)}[m-1] - k_i\,e^{(i-1)}[m]$$

where $k_i$ is the partial reflection coefficient.
(Refer Slide Time: 20:07)
So, the forward prediction error can be expressed in terms of the previous-order forward prediction error and backward prediction error. Similarly, the backward prediction error can be expressed in terms of the previous-order backward prediction error and the previous-order forward prediction error.

So, next class we will discuss how to draw the signal flow diagram, that is, what the pictographic or signal flow diagram of these two equations should be.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 25
Lattice Formulations of Linear Prediction (Contd.)
If the order of the predictor is zero, it means we are not estimating anything (i.e. $\hat{s}[m] = 0$), so

$$e^{(0)}[m] = s[m], \qquad b^{(0)}[m] = s[m]$$
Now, try to draw the signal flow diagram or lattice filter structures using the above 3
equations. So, let this is the equation number 3.
We know 𝑠[𝑚]. So, which give the 𝑒 0 [𝑚] 𝑎𝑛𝑑 𝑏 0 [𝑚]. now let us find out the 𝑒 1 [𝑚]. From
the equation (1), we get
𝑏1 [𝑚] = 𝑏 𝑜 [𝑚 − 1] − 𝑘1 𝑒 𝑜 [𝑚]
Similarly, if we want to predict 𝑒 2 [𝑚] 𝑎𝑛𝑑 𝑏 2 [𝑚]. So, again we have to add one delay 𝑧 −1
and this signal should be added here. So, if you see the same structure is repeated here. So,
this is called lattice structure; single lattice, second lattice, third lattice and that way I can
get the value 𝑒 𝑝 [𝑚] 𝑎𝑛𝑑 𝑏 𝑝 [𝑚] of pth order prediction.
If the whole speech production system is linearly modelled, then applying the excitation $A\,U_g(z)$ gives the speech signal $S(z)$:

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p}\alpha_k z^{-k}}$$

Since $A(z) = 1 - \sum_{k=1}^{p}\alpha_k z^{-k}$,

$$H(z) = \frac{G}{A(z)}$$

and since G is just a constant gain, up to that gain

$$H(z) = \frac{1}{A(z)}$$
If I implement A(z) and apply S(z), I get the error E(z), which is the prediction error (forward, or equivalently backward; at the pth stage both carry the same information). If I apply s[m], the time domain signal, to this implementation of A(z), I will get the error signal.

Speech is nothing but an excitation passed through the transfer function. So, filtering S(z) by A(z) removes the 1/A(z) part, and what remains is the glottal excitation; the error is the gain times the glottal excitation.
So, we can implement easily A(z); if I know value of k1, k2 and kp then implement this
diagram and apply speech signal, we get the excitation signal. So, this is called prediction
lattice filter for A(z) or error filter implementation.
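A minimal sketch of running this error (analysis) lattice, assuming the reflection coefficients k₁…k_p are already known; it starts from e⁽⁰⁾[m] = b⁽⁰⁾[m] = s[m] and applies the two recursions stage by stage:

```python
import numpy as np

def lattice_analysis(s, k):
    """Forward/backward prediction errors through a p-stage lattice (A(z))."""
    e = np.array(s, dtype=float)        # e^(0)[m] = s[m]
    b = np.array(s, dtype=float)        # b^(0)[m] = s[m]
    for ki in k:
        b_delayed = np.concatenate(([0.0], b[:-1]))    # b^(i-1)[m-1]
        e_new = e - ki * b_delayed                     # e^(i) = e^(i-1) - k_i b^(i-1)[m-1]
        b_new = b_delayed - ki * e                     # b^(i) = b^(i-1)[m-1] - k_i e^(i-1)
        e, b = e_new, b_new
    return e, b                                        # e^(p)[m], b^(p)[m]
```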
Suppose I have a speech signal which has let us 8 kilohertz sampled signal.
and window of 20 millisecond and 8 kilohertz encoded into 8 bit. So, in 20 milliseconds;
160 samples will be there. Now, each sample is 1 byte; so, I can say this 20-millisecond
window can be stored or can be required 160-byte memory to store.
Now, suppose the problem is that I want to transmit this 20-millisecond speech from this
point to this point. I have to transmit 160 byte from this point to this point. Can I reduce it
using this LPC method? since it is 8 kilohertz; what should be the order of the prediction?
So, for Fs = 8 kilohertz;
$$p = \frac{F_s}{1000} + 2 + 2 = 8 + 2 + 2 = 12$$
That means we require 𝛼1 to 𝛼12; let us say each 𝛼 is encoded in 4 bytes, so the 12 coefficients take 12 × 4 = 48 bytes. If I know the alpha (𝛼) values, or the ki values, I can implement the filter 1/A(z), and if I apply the gain and the error signal (the error signal is nothing but the excitation): if it is voiced, the excitation is nothing but an impulse train; if it is unvoiced, the excitation is nothing but noise. So, we have to find out whether the segment is voiced or unvoiced, which requires 1 bit: if it is 1 then it is voiced, if it is 0 then it is unvoiced.

If it is voiced, then we have to generate the impulse train, so the value of 𝐹0 will also be transmitted. If the 𝐹0 value is transmitted using 1 byte, then 48 bytes plus 1 byte, a total of (48 + 1 =) 49 bytes, is required.
So, instead of 160 bytes, I can transmit about 49 bytes and recover the same segment at the receiver end; this is called LPC encoding. In LPC encoding, I only transmit the alpha (𝛼) values, whether the segment is voiced or unvoiced, and the excitation parameters, and at the receiver side I reconstruct. So, let us draw the decoder.

If it is voiced, there is an impulse generator. Then there is a gain; let the gain G also be transmitted using 1 byte, so the total becomes 50 bytes. This is the gain, and this is the noise source. There is a switch: if it is voiced, the switch is connected here; if it is unvoiced, it is connected there. And then I implement 1/A(z), which is supplied from the Ki values.

The A(z) implementation has already been explained; now I have to implement 1/A(z), which is nothing but 𝐻(𝑧). This is called the lattice formulation of the vocal tract transfer function.
Implementation of 1/A(z): I applied s[m] and got $e^{(p)}[m]$; now I want to apply $e^{(p)}[m]$, reverse the signal flow, and get s[m] back; that is 1/A(z). At the output of A(z), $e^{(p)}[m]$ is the error signal.
We know

$$e^{(i)}[m] = e^{(i-1)}[m] - k_i\,b^{(i-1)}[m-1] \;\Rightarrow\; e^{(i-1)}[m] = e^{(i)}[m] + k_i\,b^{(i-1)}[m-1]$$

Putting $i = p, p-1, \ldots, 1$ we work back down to

$$e^{(0)}[m] = e^{(1)}[m] + k_1\,b^{(0)}[m-1], \qquad e^{(0)}[m] = s[m]$$

This is the implementation of H(z); we can also refer to the picture below for more details.
(Refer Slide Time: 14:21)
You can go through these algorithms and implement them in a program; so, I give you a task. Record in your own voice the vowel /a/, and from the middle of the signal cut a 20 millisecond speech segment. Record the /a/ vowel with 16 kilohertz sampling frequency and 16 bit encoding, then cut 20 milliseconds of the speech signal; let us try only one frame. Then what should be the order of the LPC analysis?

$$p = \frac{16 \times 1000}{1000} + 2 + 2 = 20$$

We get P = 20; then implement A(z) and 1/A(z) in a program, in MATLAB or C, and calculate the Ki values.
The input is s[m], the 20 millisecond signal of /a/; at 16 kilohertz there will be 320 sample values. Apply these samples through A(z) and find the error signal. Now apply this error signal to the implementation of 1/A(z); the k values are the same in both filters.
$$E^{(i)}_{forward} = \sum_{m=0}^{L-1+i}\left[e^{(i)}[m]\right]^{2} = \sum_{m=0}^{L-1+i}\left[e^{(i-1)}[m] - k_i\,b^{(i-1)}[m-1]\right]^{2}$$

$$\frac{\partial E^{(i)}_{forward}}{\partial k_i} = 0 = -2\sum_{m=0}^{L-1+i}\left[e^{(i-1)}[m] - k_i\,b^{(i-1)}[m-1]\right] b^{(i-1)}[m-1]$$

$$\text{So,}\qquad k_i^{forward} = \frac{\sum_{m=0}^{L-1+i} e^{(i-1)}[m]\,b^{(i-1)}[m-1]}{\sum_{m=0}^{L-1+i}\left[b^{(i-1)}[m-1]\right]^{2}}$$
k is called PARCOR (partial reflection coefficient or partial correlation coefficient).
$$k_i^{PARCOR} = \sqrt{k_i^{forward}\,k_i^{backward}} = \frac{\sum_{m=0}^{L-1+i} e^{(i-1)}[m]\,b^{(i-1)}[m-1]}{\left[\sum_{m=0}^{L-1+i}\left(e^{(i-1)}[m]\right)^{2}\;\sum_{m=0}^{L-1+i}\left(b^{(i-1)}[m-1]\right)^{2}\right]^{1/2}}$$
Alternatively, you can use what is effectively a covariance-style approach, called the Burg method: minimize the sum of the forward and backward prediction errors over a fixed interval.
$$E^{(i)}_{Burg} = \sum_{m=0}^{L-1}\left\{\left[e^{(i)}[m]\right]^{2} + \left[b^{(i)}[m]\right]^{2}\right\}$$

$$= \sum_{m=0}^{L-1}\left[e^{(i-1)}[m] - k_i\,b^{(i-1)}[m-1]\right]^{2} + \sum_{m=0}^{L-1}\left[b^{(i-1)}[m-1] - k_i\,e^{(i-1)}[m]\right]^{2}$$

Setting $\partial E^{(i)}_{Burg}/\partial k_i = 0$, we get

$$k_i^{Burg} = \frac{2\sum_{m=0}^{L-1} e^{(i-1)}[m]\,b^{(i-1)}[m-1]}{\sum_{m=0}^{L-1}\left[e^{(i-1)}[m]\right]^{2} + \sum_{m=0}^{L-1}\left[b^{(i-1)}[m-1]\right]^{2}}$$

• $-1 \le k_i^{Burg} \le 1$ always
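A minimal sketch of estimating the reflection coefficients directly from the frame with the Burg formula above; the edge handling (dropping the first sample at each stage) is the usual Burg convention and is an assumption here, and the α values, if needed, can then be obtained from these k_i with the Levinson update:

```python
import numpy as np

def burg_reflection_coefficients(s, p):
    """Burg-method k_i: minimize forward + backward error at each stage."""
    e = np.array(s, dtype=float)            # forward error, starts as the frame
    b = np.array(s, dtype=float)            # backward error
    k = np.zeros(p)
    for i in range(p):
        e_cur = e[1:]                        # e^(i-1)[m]
        b_prev = b[:-1]                      # b^(i-1)[m-1]
        k[i] = 2 * np.dot(e_cur, b_prev) / (np.dot(e_cur, e_cur) + np.dot(b_prev, b_prev))
        e, b = e_cur - k[i] * b_prev, b_prev - k[i] * e_cur   # lattice error update
    return k
```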
(Refer Slide Time: 26:12)
Using these equations, you can implement A(z) or 1/A(z): record a vowel /a/, take a 20-millisecond window, pass it through A(z) to find e[m], and with the 1/A(z) implementation get the signal back.
Now, a comparison: the lattice method combines the two stages of the covariance or autocorrelation methods, computing the correlation matrix and solving the linear equations, into a single step. So, there are computational trade-offs involved; you can read about them from the slides.
If I know alpha 1, alpha 2, alpha 3 and all the LPC coefficients, those represent the transfer function at that position, and their roots correspond to the pole positions. So, each pole pair represents a formant frequency; the roots of the polynomial formed by alpha 1, alpha 2, alpha 3, ... should represent the formant frequencies.

Now, if I take fourteenth order, the roots which are not close to the unit circle do not give formant frequencies; the roots whose values are close to the unit circle give the formant values.
(Refer Slide Time: 29:47)
Then, I have already discussed in my tube-model class that LP analysis is related to the lossless tube model. If you remember, this is the lossless tube model; if I consider that all tube sections have the same length, then 𝐿 = 𝑁∆𝑋 and

$$F_s = \frac{cN}{2l}$$

where c is the velocity of sound, N is the number of tube sections, 𝐹𝑠 is the sampling frequency and 𝑙 is the total length of the vocal tract.
The reflection coefficients (𝑟𝑘) are related to the areas of the lossless tubes:

$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$$

If $r_G = 0$,

$$r_N = r_L = \frac{\rho c / A_N - Z_L}{\rho c / A_N + Z_L}$$
(Refer Slide Time: 29:58)
Now, I give you the problem; can we estimate the area function from the speech signal?
Suppose I produce /a/, I want to know how the area function is changing during the
production of /a/?
Let us this tube is model. So, if it is a pth order LPC analysis. So, I can say the tube is
divided into P number of sections.
$$r_i = -k_i, \qquad -1 \le r_i \le 1$$
Now, if I know the 𝑟𝑘 or 𝑟𝑖 values, then I can express 𝐴𝑘+1 in terms of 𝐴𝑘:

$$\frac{A_{k+1}}{A_k} = \frac{1 - k_i}{1 + k_i}$$

So, if I know the 𝑘𝑖 value, I can get the 𝑟𝑖 value; once I know the 𝑟𝑖 value, I can find $A_{k+1}/A_k$. So, if I know the previous tube's cross-sectional area, I can find what the cross-sectional area of the next tube should be.
So, if I know A1 and k1, I can find A2. Similarly, if I know k2 and A2, I can estimate A3. So, that way, if I fix the first cross-sectional area, I can say how the different cross-sectional areas were formed during the production of that vowel, from which I estimated the ki values.

So, for a given vowel I can draw the area function if I have the speech signal. And reversely, if I know the area function, I can generate the speech signal; that is tube synthesis. So, if I have the speech, I can find out what kind of constriction, what kind of cross-sectional areas, were made during the production of that sound.
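A minimal sketch of this area-function estimate, assuming the k_i values are already extracted from the vowel and taking the first section area A₁ = 1 (only the area ratios are determined by the k values):

```python
import numpy as np

def area_function(k, A1=1.0):
    """Tube cross-sectional areas from reflection coefficients,
    using A_{k+1} / A_k = (1 - k_i) / (1 + k_i)."""
    areas = [A1]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)

print(area_function([0.3, -0.1, 0.5]))    # illustrative reflection coefficients
```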
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 26
Over View Of Short – Time Fourier Transform (STFT)
Suppose this is my speech signal as shown in below. Now if I take the Fourier transform
of whole speech signal, then I get a spectrum or frequency domain representation which
consists of average of all the variation that means.
If you see along the time speech signal is not stationary. The speech signal is changing,
somewhere some different kind of voicing, noise silence. So, if I had a long speech signal
or I have a speech signal of a word or sentence. If I take whole word or sentence at a time
it may consist of several variation of the signal; that means, signal is not stationary along
the time.
So, if I take the whole signal and do the frequency analysis, I do not get any conclusion or
any information of different speech. During the pronunciation speech signal property is
different at different time. So, we want to analyze that property, how it varies across the
time?
So, I have to select a short segment of the signal and do the analysis. Again, I have to select
the next segment and do the analysis. So, instead of taking the whole signal, I am taking a
small portion of the signal and do the analysis.
Let us I have taken 1269 sample then I do the frequency analysis ok.
So, as you know in the recap; that what is frequency analysis we are doing? That if you
know that if x[n] is my time domain signal, then if I take the discrete Fourier transform or
Fourier transform, then signal domain it is discrete, but frequency domain it will be
continuous. So, if I take the Fourier transform, then I get 𝑋(𝜔). If it is discrete Fourier
transform, then I get 𝑋(𝑘) that I have already discussed in the Fourier transform view.
Once both domains are discrete, then I say it is the DFT, the discrete Fourier transform; the FFT is nothing but an algorithm for implementing the DFT. So, once omega (𝜔) is discrete, it is the DFT; that means here $\omega = \frac{2\pi}{N}k$, where N is the length of the DFT.
Now, 𝑋(𝑘) is a complex number, so it has two properties: one is |𝑋(𝑘)|, which is called the amplitude. If I take the spectrum of the magnitude only, it is called the magnitude spectrum; if I take the phase only, it is called the phase spectrum. So, if the x-axis is frequency and the y-axis is the magnitude of the discrete Fourier transform, I call this a magnitude spectrum. Since 𝑋(𝑘) = 𝑎 + 𝑗𝑏, the magnitude spectrum is $\sqrt{a^{2} + b^{2}}$ and the phase is $\theta = \tan^{-1}(b/a)$; if I plot theta (𝜃) against frequency, I call this the phase spectrum.
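A minimal sketch of computing both spectra for one frame with the FFT; the frame length and sampling rate here are illustrative:

```python
import numpy as np

fs = 16000
frame = np.random.randn(1024)                  # stand-in for one speech segment
X = np.fft.rfft(frame)                         # X(k) = a + jb for the baseband bins
freqs = np.fft.rfftfreq(len(frame), d=1.0/fs)  # bin frequencies in Hz

magnitude_spectrum = np.abs(X)                 # sqrt(a^2 + b^2)
phase_spectrum = np.angle(X)                   # theta = arctan(b / a)
```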
(Refer Slide Time: 06:25)
So in frequency analysis, we find out two kinds of spectra. One is called phase spectra;
another one is called magnitude spectra. If I analyze it if I analyze, you see above picture
x-axis is the frequency, if it is in linear view; that means, frequency scale is linear; if it is
not linear view, then I say the frequency scale is log scale and y-axis is the amplitude of
the particular component. So, these spectra are magnitude spectra.
Now, if you see magnitude spectra of each portion of the signal are different spectra. So, I
can say the speech is not a stationary signal. So, I have to analyze the speech signal with a
segment wise. So, instead of taking the whole signal, I have to take the signal part of a
signal.
So, let us I have recorded my voice of a sentence and it has sampled at 16 Kilo Hertz with
16 bit. So, each sample is encoded with 16 bit and sampling frequency is 16 Kilo Hertz.
Suppose I record a sentence which is 3 seconds long. So, if it is a sentence is 3-second-
long, how many samples will be there?
3 ∗ 16 = 48 so, 48 thousand samples will be there. If I take whole signal at a time do the
frequency analysis. So, during the sentence, speech is not same sometime because I said
different consonants/vowels at different time (Refer Time: 08:12). So, all kinds of
variation exist.
If I take the whole signal and draw the magnitude spectra that is nothing, but the average
spectra of whole signal. I cannot get the local variation. So, instead of taking the 48000
samples at a time, I divide the signal into some segment, which is called hundred frame
per second; that means, in one second, I will analyze hundred frames. I cut a segment of
the speech and do the frequency analysis.
So, once I cut a segment of the speech, then it is called is short time and do the Fourier
transform. So, this is called Short Time Fourier Transform.
So, I take frame by frame and analyze it and I have to get back the same frame again if I
take the inverse DFT. So, whatever modification I do in the spectral magnitude, then again,
take the inverse transform to get back the same frame. that is called synthesis.
So, once I get the signal back that is called synthesis and the analysis part is called analysis.
short segment of time is analyzed that is why it is called STFT (short time Fourier
transform).
If you see this frequency scale: what is the maximum frequency in the normalized discrete sense? The y-axis is the omega (𝜔) scale, the full scale is 2𝜋, and 𝜋 is the maximum baseband frequency. And I analyze, let us say, this time instant; so, this portion of the signal has been analyzed and gives the spectrum of that portion. Then again I analyze another time instant and get its spectrum; you can see this in the picture above.
So, now, to represent this mathematically: x[m] is the digital signal; m runs from minus infinity to infinity, or 0 to 48000 in the case of a real recording. n is the frame position, and w is the window function, evaluated at (n − m):

$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x[m]\,w[n-m]\,e^{-j\omega m}$$

Let this be my whole signal 𝑥[𝑚], and this is the nth instant. From the nth instant I want to cut L samples; so, this is an L-length segment. Once I cut it, it means I am multiplying by a rectangular window of length L: I time-reverse the window, position it at n, and multiply, so that only the samples under w[n − m] survive.

Now, if I make this omega (𝜔) discrete, then it is called the discrete Fourier transform.
Time origin tied to window:

Now, if the time origin is tied to the window, then I am shifting the signal rather than shifting the window with time:

$$X(n, \omega) = e^{-j\omega n}\sum_{m=-\infty}^{\infty} x[n+m]\,w[-m]\,e^{-j\omega m} = e^{-j\omega n}\,\mathrm{DTFT}\big(x[n+m]\,w[-m]\big)$$
(Refer Slide Time: 17:38)
Suppose I have a 3 second recording and 48000 samples. Let us say I take a 10 millisecond window, that means 160 samples: I do the DFT on 160 samples, then take another 160 samples and do the DFT. If I have a long signal of 48000 samples, I can take a window size of, say, 20 milliseconds and then shift the window by 10 milliseconds; that means 100 frames per second. So, first I take 320 samples from the 0th sample; for the next analysis window I shift by only 10 milliseconds, so the next window runs from sample 160 to 480, the one after that from 320 to 640, and so on. So, that way I shift the window with 50 percent overlap.
Now this is DTFT discrete time Fourier transform and omega (𝜔) is continuous. now I
want to make omega is discrete.
Let us say I have a frequency scale and I use the DFT (discrete Fourier transform); then the length of the DFT is involved. If N is the DFT length, that means the 0 to 2𝜋 frequency scale is divided into N samples, so each division is nothing but $\frac{2\pi}{N}$. So, if it is a 16 kilohertz signal and the length of the DFT is 1000, then each frequency bin is 16 Hz wide.
So, I can index this omega (𝜔) in terms of the discrete frequency, which is called k. So, instead of omega I write k, and we get

$$X(n, k) = \sum_{m=-\infty}^{\infty} x[m]\,w[n-m]\,e^{-j\frac{2\pi}{N}km}, \qquad \omega = \frac{2\pi}{N}k$$
Once I do that, then this process is called DFT (discrete Fourier transform).
$\frac{2\pi}{N}$ is the frequency resolution. So, if I have a 16 kilohertz sampling frequency and N = 1000, then every band is 16 Hertz wide: 0 to 16 Hertz, 16 to 32 Hertz, 32 to 48 Hertz, and so on. So, the number of bands is determined by N; if N = 1000, the number of bands is 1000. N is called the band number also.
If you see, the x-axis is time and the y-axis is frequency; the intensity represents the amplitude of the particular frequency. Suppose I want to write a program for this spectrogram analysis; how do I write it?

If you see the settings, a spectral resolution of 32 bands means that the whole frequency range 𝐹𝑠 is divided into 32 bands.
(Refer Slide Time: 26:20)
Let us say the y-axis is frequency and the x-axis is time. Now the frequency range Fs is divided into 32 bands, so N = 32; because it is implemented with the FFT, N has to be a power of 2, so it will be 32, 64, 128, and so on. For this time instant, I take this segment of the signal and compute the DFT; what will I get?

Let us say this time instant is 1. Then I get 𝑋(1, 𝑘); if I take the modulus of 𝑋(1, 𝑘) (i.e. |𝑋(1, 𝑘)|), I get the amplitude spectrum. This amplitude is encoded into a color scale. What is the color scale? If it is black the intensity is maximum, say one (1); if it is white the intensity is minimum.

So, 𝑋(1, 𝑘) takes some value, and based on that value I assign an appropriate color to this band; for k = 1 this is the first band, and similarly for the other bands. Again, if I shift in time to the next block, I color that block. So, if I plot that way, I get the spectrogram. So, depending on the intensity of the particular frequency component, the color will come.
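A minimal sketch of that recipe: window, DFT, magnitude, then stack the frames as columns of an image; the window length, hop and N below are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

def spectrogram(x, fs, win_ms=20, hop_ms=10, N=512):
    """|X(n, k)| for overlapping Hamming-windowed frames, as a 2-D array."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hamming(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    S = np.abs(np.fft.rfft(frames, n=N, axis=1)).T      # rows = frequency bins
    return S

x = np.random.randn(48000)                               # stand-in for a 3 s recording
S = spectrogram(x, fs=16000)
plt.imshow(20 * np.log10(S + 1e-9), origin='lower', aspect='auto', cmap='gray_r')
plt.xlabel('frame index'); plt.ylabel('frequency bin'); plt.show()
```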
(Refer Slide Time: 28:57)
So, this is the STFT analysis in the DFT view. In the DFT view: if I have a signal, I take a small portion of it and analyze the frequency using the DFT technique; once I do the IDFT, I get back this frame again. Now what happens if I just cut the portions separately? Once I put them back, I will have a problem at the junctions, because windowing in time corresponds to convolution in frequency: the response 𝑋(𝑛, 𝑘) I get is the convolution of the frequency response of the signal and the frequency response of the window. So, at the boundary there is a window effect.
So, if I do segment by segment and there will be a window effect, so instead of doing that,
what we will do? We take a window of let us 20 millisecond and then shifted the window
by 10 milliseconds; that means, 50 percent overlap. So, I analyzed for 50 percent overlap.
I will discuss how much amount of overlap we should allow? How much amount of
overlap we not allowed, so that we can get the signal back again, that we will discuss in
synthesis part.
So, if I recorded for 48000 sample. I take a window of 20 millisecond; that means, 320
sample, then I analyze the DFT and then I shifted the window for the next frame by 10
millisecond means; 160 sample. Then I get the next frame do the DFT, again, I shifted the
by 10 millisecond I analyze the window and do the DFT then I get the signal. Similarly,
for the spectrogram also, I can take a signal for 20 millisecond; and analyze for 20
millisecond and shifted the time shifting by10 millisecond or 5 millisecond.
If it is shifted by only single sample then single sample delay. So, see the computational
complexity for 48000 sample, I have to analyze 48000 times.
Later, I will say what should be the trade off? What is the redundancy is there? That we
will discuss then the for-synthesis purpose also. So, now, depending on the shifting you
get the resolution I will discuss.
So, next class we will discuss about what is the filtering view of STFT then we go for the
time frequency trade off and then we go for the synthesis part. We draw the analysis
diagram and then go for the synthesis.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 27
Short - Time Fourier Transform Analysis
I want to analyze the signal for a particular band. So, if I say this is a band pass filter, I analyze the signal for each band: if I have a signal x[n], I pass it through different band pass filters, and each bandwidth is $\frac{2\pi}{N}$.
let us I want to find out the output frequency response for a particular band or a particular
signal frequency 𝜔0 .
So, I have a signal x[n], and I want to find out whether x[n] contains 𝜔0 or not. What I do to get that output: I multiply the signal with the particular frequency component $e^{-j\omega_0 m}$ (that comes from the DFT equation), and then the result is convolved with the window; or, opening the bracket, I get the same thing written as a filtering operation.

If I have a system with impulse response h[n] and I pass a signal x[n] through it, the output is y[n] = x[n] * h[n]. Similarly, here the output is the convolution of the modulated signal with the window function; the multiplication by the exponential is called frequency shifting or modulation.
$$X(n, \omega_0) = \big(x[n]\,e^{-j\omega_0 n}\big) * w[n]$$

So, instead of n − m I write n, because n is now the running time index of the output. Let us draw the block diagram of this convolution.
Suppose the base band signal x[n] has a frequency response X(𝜔); it has to be modulated by $e^{-j\omega_0 n}$. Let the small band of interest be around 𝜔0. Once I do the modulation, the spectrum shifts so that the content at 𝜔0 moves to the origin. Once this modulation is done, let us say I have a filter whose impulse response is w[n] and whose frequency response is 𝑊(𝜔); the output of this filter is the portion I want.

Now, instead of shifting the frequency response of the signal, let us shift the frequency response of the window.
So, this is the frequency response of the window, 𝑊(𝜔). Let us modulate it to 𝜔0: I shift the window response to 𝜔0 and use it as a band pass filter which filters out only this portion of the signal.

So, once the window response is centred at 𝜔0, I pass the base band signal through this band pass filter and the band around 𝜔0 is selected. Since the output is still centred at 𝜔0, I have to shift it back to the origin, so I multiply the output by $e^{-j\omega_0 n}$.

So, at the output the filtered signal is shifted, or demodulated, back to the origin; then the block diagram changes.
(Refer Slide Time: 12:07)
First x[n] is passed through the band pass filter built from the window response shifted to 𝜔0, and then demodulated by $e^{-j\omega_0 n}$; then also I get 𝑋(𝑛, 𝜔0). Now, if I generalize this, instead of 𝜔0 I can write $\frac{2\pi}{N}k$.
let’s x[n] is the signal, I have framed the signal first, x[n] is a fixed length signal. this
signal consists of several frequency component and I design a several band pass filter,
which is a fixed frequency band and each band pass filter is shifted to a particular
frequency. each bandwidth is 2𝜋/𝑁.
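A minimal sketch of one such channel in the filtering view: demodulate x[n] by the band centre frequency and then convolve with the window, which acts as the low-pass prototype; the parameters here are illustrative:

```python
import numpy as np

def stft_channel(x, w, k, N):
    """X(n, 2*pi*k/N) for all n: demodulate by the band centre, then filter with w[n]."""
    n = np.arange(len(x))
    demod = x * np.exp(-2j * np.pi * k * n / N)     # shift band k down to the origin
    return np.convolve(demod, w, mode='same')       # low-pass by the window -> one channel

x = np.random.randn(1600)
w = np.hamming(320)
channel_3 = stft_channel(x, w, k=3, N=32)            # output of one analysis band
```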
So, this is shifted frequency. let us I write down the synthesis diagram.
So, this is my x[n]. This is the 0th channel, when 𝑘 = 0: the band filter is $w[n]\,e^{j\frac{2\pi}{N}\cdot 0\cdot n}$, and then we demodulate by $e^{-j\frac{2\pi}{N}\cdot 0\cdot n}$. When 𝑘 = 1, the band filter is $w[n]\,e^{j\frac{2\pi}{N}\cdot 1\cdot n}$ and again we demodulate by $e^{-j\frac{2\pi}{N}\cdot 1\cdot n}$. Similarly, the last channel in k form has the band filter $w[n]\,e^{j\frac{2\pi}{N}kn}$ and the demodulator $e^{-j\frac{2\pi}{N}kn}$. So, the outputs will be $X(n, 0)$, $X(n, 1)$, ..., $X(n, k)$ respectively.
Now, in the synthesis portion, what is my requirement? From these channel outputs I have to apply some process by which I can sum them up and apply the inverse transform, and I should get back x[n].
1. If x[n] has length N and w[n] has length M then X(n, 𝜔) has length N+M-1.
2. The bandwidth of the sequence X(n, 𝜔0) is less than or equal to the bandwidth of
w[n]
3. The sequence X(n, 𝜔0) has the spectrum centered at origin.
X(n, ω) = (1/2π) ∫_{−π}^{π} W(θ) e^{jθn} X(ω + θ) dθ
A fundamental problem of the STFT and other time-frequency analysis techniques is the
selection of the windows to achieve a good tradeoff between time and frequency
resolution.
(Refer Slide Time: 18:09)
If you see, in any DFT or STFT analysis there are two aspects: one is time, for which I have taken a small segment of the signal, and the other is frequency, for which I have applied a frequency transform with a number of bands, that is k.
Let us say I take a signal of 10 milliseconds, which contains 160 samples. The length N of the DFT is the nearest power of 2; otherwise I have to put in more zero padding and the amplitude will be diminished. So, the nearest one is N = 256. If the signal is sampled at 16 kilohertz, then 16 kHz/256 is my frequency resolution.
Now, if I want to increase the frequency resolution: instead of 10 milliseconds, if I take 20 milliseconds, that is 320 samples, then I can say N = 512; now the resolution will be 16 kHz/512. So, the resolution increases. What is the resolution? It is how finely I divide the frequency axis. If I want to increase the resolution, the division has to be of very small size. So, the value 2π/N decreasing means the resolution increases.
So, once I increase the length of the DFT, I require a larger amount of signal at a time. So, instead of a short segment, if I take a long segment then I am losing the time resolution. So, if I increase the frequency resolution, I decrease the time resolution. That is called the time-frequency tradeoff: if I increase the frequency resolution the time resolution decreases, and if I increase the time resolution the frequency resolution decreases. So, that is the time-frequency tradeoff in STFT analysis.
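The numbers just discussed can be checked directly; the little sketch below (an illustration, assuming a 16 kHz sampling rate) prints the time and frequency resolution for the 10 ms / N = 256 and 20 ms / N = 512 cases.

```python
fs = 16000                        # sampling rate in Hz (assumed)
for n_samples, nfft in [(160, 256), (320, 512)]:
    frame_ms = 1000 * n_samples / fs
    delta_f = fs / nfft            # spacing of the DFT frequency grid
    print(f"{frame_ms:.0f} ms frame, N = {nfft}: "
          f"time resolution ~{frame_ms:.0f} ms, "
          f"frequency resolution ~{delta_f:.2f} Hz")
# 10 ms frame -> 62.5 Hz bins; 20 ms frame -> 31.25 Hz bins:
# finer in frequency but coarser in time.
```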
Different windows have different effects. The output of the frequency analysis is X(n, ω), which is nothing but a convolution of the frequency response of the window with the frequency response of the signal.
(Refer Slide Time: 22:27)
So, I want to increase the main lobe; ideally the main lobe should be flat. If the main lobe is flat and the effect of the side lobes is negligible, then I can say the frequency response whatever I get is essentially the frequency response of the window. So, if you see, there are different kinds of windows, as shown in the figure below:
I have already discussed in the review of DSP lecture what the Blackman window, Hamming window, Hanning window and Kaiser window are. Different windows have different kinds of frequency responses. So, my intention is that the main lobe should be wide and the side lobe attenuation should be high, so that the side lobes are less disturbing; then I can say the frequency analysis of the signal is almost (Refer Time: 24:15).
So, I want to choose a window whose main lobe width is appropriate. If you see in the spectrogram, when I was discussing spectrogram analysis in this software, there is a window selection like Hamming window, triangular window, etc. So, depending on the window function, the frequency response will be different, because the effect of the window frequency response is transferred there.
Even when I do the spectral analysis by scan, there is a window selection: Hamming window, Hanning window, Blackman window, Kaiser window or Gaussian. There is also the FFT size. So, this is the choice of N, and by increasing it the frequency resolution can be made very high. So, the choice of window size is also there; the window size is tied to the length of your FFT. If the window length is 20 milliseconds, that means 320 samples, then N is fixed to almost 320, or in practice 512, because it is implemented with the FFT. So, the length of the window is tied to the time-frequency resolution, and the type of the window I choose depends on what kind of detailed analysis I want; in most cases in speech the Hamming window is used.
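To see the window trade-off numerically, the sketch below (my own illustration; window length and FFT size are arbitrary) measures a crude −3 dB main-lobe width for a few common windows from the magnitude of their DFTs.

```python
import numpy as np

fs, N, nfft = 16000, 320, 8192            # assumed 16 kHz, 20 ms window
windows = {
    "rectangular": np.ones(N),
    "hamming": np.hamming(N),
    "hanning": np.hanning(N),
    "blackman": np.blackman(N),
}
for name, w in windows.items():
    W = np.abs(np.fft.rfft(w, nfft))
    W_db = 20 * np.log10(W / W.max() + 1e-12)
    half = np.argmax(W_db < -3.0)          # first bin 3 dB below the peak
    print(f"{name:12s} -3 dB main-lobe half-width ~ {half * fs / nfft:6.1f} Hz")
# The rectangular window has the narrowest main lobe but the highest side
# lobes; Hamming/Blackman trade a wider main lobe for lower side lobes.
```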
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 28
Short – Time Fourier Transform Synthesis
So, from the last lecture we have the time-frequency sampling constraints: we said that L, the decimation in time, should be less than or equal to 2π divided by the bandwidth of the analysis window, and we said that 2π/Nw must be less than or equal to 2π/N.
So, basically, complete recovery of the signal is possible if I choose a window whose length Nw is less than or equal to the DFT length N, which defines the number of channels. If I say that I have a 16 kilohertz speech signal and I take a 20 millisecond window, the length of the windowed signal is nothing but 320 samples; if I take the DFT length N equal to 512, that means this is Nw and this is N. So, 2π/Nw must be less than or equal to 2π/N, and I said that L must be 2π/B.
So, how much decimation in time is possible, how much decimation in time is allowable, is defined by ωc, the bandwidth of the analysis window. If it is a Hamming window, then the bandwidth B is equal to 8π/Nw. So, L should be 2π/B, which is equal to Nw/4, that is, 75 percent overlap. So, L corresponds to up to 75 percent of overlap; this way we can find the time and frequency sampling. Sorry, I have made a mistake: this should be that 2π/Nw must be greater than or equal to 2π/N. It is given in the slide, so there may have been a mistake while I was writing on the board.
So, that is the time-frequency sampling. Now, suppose I want to write down the overlap-add method as an algorithm. If you go through this slide, you can construct the algorithm; it is given there, and in this way you can write a program for your synthesis (a small sketch is given after the summary below). Now I come to the application of this.
So, in summary, for the OLA method: there is no time aliasing if the window length Nw is such that 2π/N ≤ 2π/Nw, and no frequency-domain aliasing occurs if ωc ≤ 2π/L; if zeros are allowed in W(ω), then condition 2 can be relaxed.
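As a rough illustration of the OLA idea (not the exact program from the slides), the sketch below windows a signal with a Hamming window, takes the DFT and inverse DFT of each frame, overlap-adds the frames, and normalizes by the accumulated window; the frame length, hop and FFT size are assumed values.

```python
import numpy as np

def ola_reconstruct(x, Nw=320, L=80, nfft=512):
    """Analyze x with a Hamming window (hop L) and resynthesize by overlap-add."""
    w = np.hamming(Nw)
    y = np.zeros(len(x) + nfft)
    norm = np.zeros(len(x) + nfft)
    for start in range(0, len(x) - Nw, L):
        frame = x[start:start + Nw] * w
        X = np.fft.fft(frame, nfft)           # analysis (DFT of one windowed frame)
        f = np.real(np.fft.ifft(X))[:Nw]      # synthesis (inverse DFT)
        y[start:start + Nw] += f              # overlap-add
        norm[start:start + Nw] += w           # running sum of shifted windows
    norm[norm == 0] = 1.0
    return (y / norm)[:len(x)]

x = np.random.randn(16000)
y = ola_reconstruct(x)
print(np.max(np.abs(x[320:-640] - y[320:-640])))   # ~1e-15 away from the edges
```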
Now, in summary, for the FBS method the basic condition is Nw ≤ N; if Nw is greater than N, then the condition can still be relaxed provided zeros are allowed in w[n] at multiples of N (w[rN] = 0 for r ≠ 0).
So, now I come to the short time Fourier transform magnitude. So, before that we should
explain it.
So, what I said is that X(n, k) is nothing but a complex number: every Fourier transform component has 2 parts, one is called the magnitude and another is called the angle, or phase. So, if one of the components is written as the complex number a + jb, then the magnitude is nothing but √(a² + b²) and the phase θ is nothing but tan⁻¹(b/a).
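In NumPy terms (a tiny illustration, not from the slides), the magnitude and phase parts of each complex Fourier component are obtained as follows:

```python
import numpy as np

x = np.random.randn(512)
X = np.fft.rfft(x)
magnitude = np.abs(X)      # sqrt(a^2 + b^2) for each complex bin a + jb
phase = np.angle(X)        # atan2(b, a), the phase spectrum
# The STFTM keeps only `magnitude`, frame by frame, and discards `phase`.
```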
So, if I draw the magnitude with respect to frequency, then I call it the magnitude spectrum, which I have already discussed in the beginning class; if I plot frequency versus θ, then I call it the phase spectrum. So, STFTM is the short-time Fourier transform magnitude. Only the magnitude part is taken; we discard the phase part and do not use it, and in many cases this is used for time scale modification and speech enhancement. I will not go into the detailed mathematics of STFTM synthesis. STFTM analysis means that I do the same analysis and calculate only the magnitude part.
So, the analysis part is fine; for the synthesis part there is detailed mathematics available, which I am not going into; you can go through the books and find out the details. It is said that if I maintain certain constraints, then signal recovery is possible from the magnitude part alone; from the magnitude part itself I can recover the signal. So, if I take only the magnitude part, the analysis is called STFTM, short-time Fourier transform magnitude, and the synthesis is STFTM synthesis. I am not going into the details of the mathematics.
So, there is a constraint on the analysis window and the signal; if that constraint is satisfied, then it can be shown that from the STFTM also I can recover the signal.
So, I am not reading the slides; you can read them yourself. Now, come to the applications. Why do we do STFT analysis and synthesis and all these things? One application of the STFT is signal estimation and modification, using either the STFT or the STFTM.
Suppose I have a signal x[n] that contains, let us say, a 2 kilohertz to 5 kilohertz component, and I want to reduce the amplitude of this portion. So, I analyze x[n] in the frequency domain, make whatever modification I want in the spectrum, and then take the synthesis; I will get back x[n] as the modified signal.
So, I can do it in both ways: I can calculate X(n, ω) and do the modification on X(n, ω), or I can calculate X(n, ω), take the magnitude part and modify the magnitude part; then also recovery is possible. So, the STFT or the STFTM is used for signal estimation: I may want to estimate the power spectral density of the signal, or I can modify some part of the spectrum. Then another application is called time scale modification: I can do time scale modification of the speech signal, and also noise reduction. So, what is the goal of time scale modification?
I can give you an example: I want to either speed up or slow down the speech signal while maintaining approximately the same speech. If you remember, earlier there was the tape recorder, or the gramophone; on the top of a gramophone disk the rpm was mentioned: it is a 60 rpm record, a 40 rpm record, a 30 rpm record; the recording speed is mentioned there. So, at play time, only if I play at the same speed do I get the exact recorded signal. If I increase the speed, what will happen?
Now, suppose I increase the rpm. If a track is 3 seconds, once I increase the rpm the time will be reduced, it speeds up. So, instead of a 3 second song I can get a 2.5 second song, or let us say a 2 second song if I modify more. What is happening? Once I increase the speed, this is nothing but resampling, and the spectral characteristics are also changing.
So, the fundamental frequency of the speech is also changing. That is why, when you play a 60 rpm record at 40 rpm, or quickly play a slower one, the fundamental frequency may shift: a male voice can become like a female voice and a female voice like a male voice. This happens with cassettes also: if your tape recorder is very old and the motor speed changes, then the quality of the sound will change.
So, sometimes I want time scale modification. Suppose you have experience with movies: first the acting is done and then the voice is synchronized with the acting. Suppose when you dub it, you find that the recorded speech is slower than the movement of the lips during the acting. So, what I want is to make the recorded speech a little bit faster. I can play it faster by resampling it, or I can cut out some portions. That is called time scale modification; in time scale modification there is cut and paste.
Suppose this is my voiced segment: let us say I say something and there is a voiced segment, and this segment, from here to here, represents, let us say, one second. If I want to match it with the lip movement in the video, then instead of 1 second I have to make it, let us say, 0.8 second.
So, how can I do it? I can cut some portion of the signal, and after cutting I can paste the remaining parts together. If I cut at an arbitrary position, what will happen? There may be a pitch mismatch: a half period is cut on this side and a half period on that side. There is a pitch mismatch also if I want to expand it, say from one second to 1.2 seconds: I will cut some signal from here and paste it here, and at that boundary there may be a pitch mismatch with the adjacent period. So, that is one method, but it is not a good method unless you cut precisely; if you cut precisely, it is a good method. I can show you in the next class, in Cool Edit, that I can cut exactly one period of voicing and paste it; I can show you how it should be cut.
So, if we cut at an arbitrary position there will be a problem; then how do we solve it? If you see any speech signal, and if you remember the closing and opening of the vocal folds, then if I look at the vowel “a” it will look like this: one complete period here, and again it will look like this.
Now, suppose I cut here and cut here and take that piece out; what will happen is that this is half a period and this is also half a period, so there will be a pitch mismatch. So, to avoid the pitch
mismatch, the next method is called PSOLA, the pitch synchronous overlap-add method; the details I will discuss during the speech synthesis part (PSOLA, TD-PSOLA, ESNOLA will all be discussed). In the pitch synchronous overlap-add method, if I am able to find the epoch points, the closing and opening points of the vocal folds, from the signal, then I can synchronize with the pitch. Then I can cut exactly one period, and there is no problem: if I cut one period and paste one period here, there is no problem. So, that is called pitch synchronous.
Now, getting the exact pitch period from the voiced signal automatically is very difficult. So, estimation of the pitch marks, or achieving pitch synchronization, is very difficult, and that is why this process is also difficult. The other approach is STFTM synthesis: to avoid the pitch synchronization problem, use only the magnitude spectra of the frames; from the magnitude spectra it is possible to synthesize the speech signal. So, compute X(rL, ω) at an appropriate frame interval or decimation rate L, with an appropriate window length and an appropriate DFT length; then modify the decimation rate to a new rate M = L/2 to speed up the signal, and then take the inverse transform and estimate the signal.
So, I can use the STFTM for time scale modification of the speech signal, as you see here.
The analysis decimation in time is L: the frames are at L, 2L, 3L, 4L, 5L, 6L. Now, at synthesis time I make the decimation M = L/2, so the frames go to M, 2M, 3M, 4M, 5M, 6M; I squeeze the number of samples and the signal speeds up. If I want to slow down the signal, then instead of half I can take 2L, so it is doubled. This way I can do the time scale modification of the speech signal, which is an application of the STFTM.
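The sketch below is a deliberately crude illustration of this idea under assumed frame sizes: frames taken with analysis hop La are overlap-added at a new synthesis hop Ls, so Ls = La/2 roughly halves the duration. The lecture's STFTM method would additionally re-estimate a consistent phase from the magnitude spectra; that iterative step is omitted here, so audible artifacts remain.

```python
import numpy as np

def naive_tsm(x, Nw=320, La=160, Ls=80):
    """Overlap-add the analysis frames at a new hop Ls (Ls < La speeds up,
    Ls > La slows down). Normalization uses the accumulated window."""
    w = np.hamming(Nw)
    starts = list(range(0, len(x) - Nw, La))
    y = np.zeros(Ls * len(starts) + Nw)
    norm = np.zeros_like(y)
    for i, sa in enumerate(starts):
        ss = i * Ls                            # new synthesis position
        y[ss:ss + Nw] += x[sa:sa + Nw] * w
        norm[ss:ss + Nw] += w
    norm[norm == 0] = 1.0
    return y / norm

x = np.random.randn(48000)                     # e.g. 3 s at 16 kHz
print(len(x), len(naive_tsm(x)))               # output is roughly half as long
```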
So, this is the block diagram ok.
Next one is the noise reduction, you know that in cool edit there is a also a button called
noise reduction.
So, suppose I have recorded a speech signal: this part is silence, where after I start recording I have not spoken anything, and then I start speaking. The noise present in the silence region is also there during the speech; the noise is spread everywhere. I want to remove the noise. If I consider the noise to be a stationary signal, meaning the noise is not changing over the signal, then I can estimate the noise power from this silence zone, subtract it from the spectral information of the speech, and on resynthesis I can remove the noise. So, what I will do: I first estimate the noise power by analyzing this silence portion, taking windows over this portion, and I estimate the noise spectrum, let us call it Sb. Then I subtract this noise estimate from the spectral estimate of the signal, and when I resynthesize it I can get back the signal with the noise removed.
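A minimal spectral-subtraction sketch of what was just described is given below (my own illustration; frame sizes, the flooring at zero and keeping the noisy phase are assumed design choices, not the exact algorithm in Cool Edit):

```python
import numpy as np

def spectral_subtract(x, noise, Nw=320, L=160, nfft=512):
    """Estimate the noise magnitude spectrum from a noise-only (silence)
    segment and subtract it from the magnitude of every speech frame,
    keeping the noisy phase, then resynthesize by overlap-add."""
    w = np.hamming(Nw)
    noise_mags = [np.abs(np.fft.rfft(noise[s:s + Nw] * w, nfft))
                  for s in range(0, len(noise) - Nw, L)]
    N_mag = np.mean(noise_mags, axis=0)            # average noise spectrum Sb
    y = np.zeros(len(x) + nfft)
    norm = np.zeros_like(y)
    for s in range(0, len(x) - Nw, L):
        X = np.fft.rfft(x[s:s + Nw] * w, nfft)
        mag = np.maximum(np.abs(X) - N_mag, 0.0)   # subtract, floor at zero
        Y = mag * np.exp(1j * np.angle(X))         # keep the original phase
        y[s:s + Nw] += np.fft.irfft(Y, nfft)[:Nw]
        norm[s:s + Nw] += w
    norm[norm == 0] = 1.0
    return (y / norm)[:len(x)]
```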
So, in this way there are a lot of algorithms you can develop. There may be more complex procedures: I can estimate the noise in different ways, or subtract not the full estimate. The problem is that if the speech signal contains a sibilant sound, which is also noise-like, then when I subtract the noise the sibilant sound may also go; this is the problem. So, the estimation can be made smarter: if this is a sibilant sound, then do not subtract there.
So, I can write those kinds of algorithms and remove the noise from the speech signal. The STFTM or the STFT can be used for noise reduction, time scale modification, speech enhancement, all those things; in all of them the synthesis is done by the OLA (overlap-add) method or the FBS method. The OLA method is recommended; you can write an OLA algorithm to reconstruct, because I have to get back the signal again, and reconstruction is only possible if I suitably choose the window and also the shifting of the window.
So, that is why, if you see, in any parameter extraction we do this kind of frequency analysis of the speech signal.
So, suppose I have a long speech signal; let us say it is a 3 second signal at a 16 kilohertz sampling rate, which is 48,000 samples. We do not take all the samples at a time and do the frequency analysis. What I will do is divide this signal using a frame rate and a window length. So, we choose an analysis window length Nw equal to, let us say, 20 milliseconds.
So, Nw is equal to 20 milliseconds. Then which kind of window will you use: Hamming window, Hanning window or rectangular window? Any kind of window can be used, and depending on the window, that defines the bandwidth ωc for me. So, L is equal to 2π/B; that defines what kind of shifting is possible. Then I shift the window that way and take another frame from here to here. If the shift is 10 milliseconds, this is 10 milliseconds and this is 20 milliseconds. So, the first window is 0 to 20 milliseconds: the first frame, frame 1, is 0 to 20 milliseconds, the second frame is 10 to 30 milliseconds, the third frame is 20 to 40 milliseconds.
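In code, the framing just described (20 ms frames shifted by 10 ms, i.e. 100 frames per second at 16 kHz) can be sketched as follows; the function name and defaults are my own:

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=20, shift_ms=10):
    """Split a signal into overlapping analysis frames
    (20 ms window, 10 ms shift -> 100 frames per second)."""
    Nw = int(fs * frame_ms / 1000)       # 320 samples at 16 kHz
    L = int(fs * shift_ms / 1000)        # 160 samples at 16 kHz
    starts = range(0, len(x) - Nw + 1, L)
    return np.stack([x[s:s + Nw] for s in starts])

x = np.random.randn(3 * 16000)           # 3 seconds at 16 kHz
frames = frame_signal(x)
print(frames.shape)                       # about (299, 320) ~ 100 frames/second
```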
Now, instead of milliseconds I can express the same thing in numbers of samples. So, if I say 100 frames per second, that means the shifting of the window is 10 milliseconds; if I say I want 200 frames per second, you can easily understand the shifting of the window is 5 milliseconds. That is why we do speech processing at a frame rate; then I can do the STFT and use the OLA method to get back the signal. So, I analyze the signal this way; there is a picture I have shown you, and you can go through this slide as well.
So, I take it frame by frame; I can make a 75 percent overlap also. I analyze this frame, this frame and this frame, and then take the inverse Fourier transform of each. So, I get back the frames, then I add all those frames and I get back the signal.
So, this is the overall STFT analysis. The STFTM synthesis I have not described; if you want it, you can raise in the forum that STFTM is also required, or you can go to the book I have referred to and go through the mathematics. If you want, I can take another half-hour class at the end, because time will not permit going into the details of the STFTM here. It is nothing but mathematics; all the derivations are there. So, just go through the book, and if you do not understand the derivations then I will come back to it.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 29
Lattice Formulations of Linear Prediction
We have already discussed about the STFT analysis. Now we will discuss about the STFT
synthesis.
In STFT analysis, we have taken one segment of the long speech signal and for that segment we have done the STFT. If x[n] is my signal and I apply the discrete Fourier transform on x[n], I will get X[k]; once I apply the IDFT, I should get back x[n]. So, this is the normal Fourier transform property.
Now, the STFT also follows the normal Fourier transform property. So, if I have done the STFT, can we get back the signal again? Some analysis I want to do in the frequency domain, and then again I want to get back the time domain signal. Suppose I have a speech signal and I want some processing in the frequency domain, some enhancement of the signal, and then again I want to get back the speech signal. The point is that we have not taken the whole speech signal at a time; we have taken parts of the speech signal. For example, suppose I have a 16 kilohertz sampled signal and I have recorded 3 seconds of my speech; then I have 48,000 samples in 3 seconds. If I had taken all the samples at a time and done the Fourier transform, then getting back the signal would be very easy; but speech is a non-stationary signal, meaning that along the time axis the speech signal changes its properties. So, instead of analyzing the whole signal at a time, we take a small segment. If I take 20 millisecond windows, then I get 320 samples per window. So, if I do the STFT analysis using the Fourier transform, the nearest power of 2 is a 512-point DFT (N = 512). So, I have (Refer Time: 3:37) 320 samples plus some zeros to get 512 samples. Once I have done the reverse transform, I will get a signal which is not exactly 320 samples, it is 512 samples, and I cannot get back x[n] itself directly because I have multiplied the signal with a window function; it is a multiplication of the window function and the signal. So, my intention is that if I want to do STFT synthesis, I have to get back x[n] exactly.
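A tiny sketch of this point (assumed values: 320-sample segment, Hamming window, 512-point DFT): the inverse transform returns 512 samples, and the original segment is recovered only after dividing out the window, which works here because the Hamming window is non-zero everywhere.

```python
import numpy as np

fs, Nw, nfft = 16000, 320, 512
x_seg = np.random.randn(Nw)              # one 20 ms segment
w = np.hamming(Nw)
X = np.fft.fft(x_seg * w, nfft)           # 320 windowed samples zero-padded to 512
f = np.real(np.fft.ifft(X))               # inverse transform: 512 samples back
# the first 320 samples are x_seg * w, the last 192 are the appended zeros,
# so recovering x_seg itself still needs a division by the window:
x_rec = f[:Nw] / w
print(np.allclose(x_rec, x_seg))          # True (Hamming is nonzero everywhere)
```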
X(n, ω) → X(n, k)
(DTFT)      (DFT)

x[m] = f_n[m] / w[n − m]

Taking the IDFT and setting m = n:

x[n] = f_n[n] / w[0]

x[n] = (1/(2π·w[0])) ∫_{−π}^{π} X(n, ω) e^{jωn} dω
So, this is the synthesis equation of the STFT; now, for every m = n, I have to evaluate the value of x[n].
So, I can say that recovery of x[m] is possible with 2 conditions: one is that the recovery is a sample-by-sample process, for every m = n I have to do the recovery; and the other is that I have to know X(n, ω) for every ω value. So, complete recovery is possible if the sample-by-sample recovery process is done and X(n, ω) is known for all ω.
Suppose I have some portion of a long signal and I do the STFT; what is required is that for every sample I have to recover it. If I have 48,000 sample points, then synthesis is possible only if I shift the analysis window for every sample, which is computationally not feasible, because the FFT calculation is time consuming and there is a lot of complexity. So, sample-by-sample recovery is not a practical solution. The second condition is that I have to know X(n, ω) for every ω. So, what kind of shifting is allowed which makes complete recovery possible without processing sample by sample?
I have done the STFT of the signal for the nth window, which is nothing but X(n, ω). Let us say I have taken the DFT length equal to N. So, the whole frequency scale from 0 to 2π has been divided with a resolution of 2π/N; I can say the whole frequency scale is divided into N channels, or N filters. When I do the STFT, I multiply the signal with a window function.
So, let us say the bandwidth of the window is B. Now, in STFT analysis, the frequency response of the window function is convolved with the original frequency response of the signal, so there will be a band of width B at every modulation frequency, as we already explained in the DFT filter view, and the distance between adjacent bands is 2π/N.
If 2π/N > B, that means there is a gap between 2 adjacent filters, and that portion of frequency does not pass through any filter. Then I cannot say that X(n, ω) is known for all ω, so complete recovery of x[m] is not possible, due to the spectral loss. The required condition is 2π/N ≤ B; then I can say no frequency is left out, all frequencies will pass. Pictorially, across the range from 0 to 2π there is a band of width B, then another band of width B, and so on; if the channel spacing is at most the bandwidth B, there is no gap, and that is why I know X(n, ω) for all ω. So, that is the first condition. The second condition concerns the sample-by-sample recovery: I have been shifting the window by one sample, which may not be necessary; the only thing is that the decimation factor is L, so the STFT is applied for every L samples. Now, w[n] is non-zero over Nw samples, where Nw is the length of the window; the problem arises if L > Nw.
If you consider this slide: let us say this is a triangular window, and suppose I have a signal. I said that the STFT allows complete recovery if I shift the window one sample at a time.
So, the length of the window is Nw. If the decimation factor, meaning the shift of the window, is greater than Nw (i.e. L > Nw), then the next window may land here, and I do not know this portion of the signal; complete recovery is not possible. Complete recovery is possible only when L ≤ Nw.
So, x[n] is invertible if the temporal decimation factor L is less than or equal to the size of the analysis window (Nw), and the frequency sampling interval satisfies 2π/N ≤ 2π/Nw; the distance between 2 channels, 2π/N, must be less than the bandwidth of the analysis window. If these 2 conditions are satisfied, then it is a completely invertible process.
(Refer Slide Time: 19:55)
There are two methods for STFT synthesis:
• The traditional short-time synthesis method is commonly referred to as Filter Bank Summation (FBS).
• FBS is best described in terms of the filtering interpretation of the discrete STFT.
▪ The discrete STFT is considered to be the set of outputs of a bank of filters.
▪ The output of each filter is modulated with a complex exponential.
▪ The modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence.
(Refer Slide Time: 20:45)
So, if you remember the STFT analysis block diagram: we pass the signal through the window and then we modulate, or you can say demodulate, the signal, shifting it with e^{−j(2π/N)kn}. Now comes the synthesis block portion: after the demodulation in the analysis, I have to modulate each X(n, k) back again, take the sum over k, scale by 1/(N·w[0]), and get back the signal y[n].
So, y[n] = x[n], provided w[0] ≠ 0.
Let us go to the detailed mathematics. The frequency domain representation of the nth window is X(n, k); now, if I take the IDFT, I get the time domain signal. So, let the time domain signal be y[n]; then y[n] is equal to the inverse Fourier transform of X(n, k):
y[n] = (1/(N·w[0])) Σ_{k=0}^{N−1} X(n, k) e^{j(2π/N)nk}

X(n, k) = Σ_{m=−∞}^{∞} x[m] w[n − m] e^{−j(2π/N)mk}

So, y[n] = (1/(N·w[0])) Σ_{k=0}^{N−1} [ Σ_{m=−∞}^{∞} x[m] w[n − m] e^{−j(2π/N)mk} ] e^{j(2π/N)nk}

y[n] = (1/(N·w[0])) x[n] ∗ ( w[n] Σ_{k=0}^{N−1} e^{j(2π/N)nk} )

The finite sum over the complex exponentials reduces to an impulse train with period N, Σ_{k=0}^{N−1} e^{j(2π/N)nk} = N Σ_{r=−∞}^{∞} δ[n − rN], so

y[n] = (1/w[0]) x[n] ∗ ( w[n] Σ_{r=−∞}^{∞} δ[n − rN] )
y[n] is the output of the convolution of x[n] with the product of the analysis window and a periodic impulse sequence.
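The derivation can be checked numerically; the sketch below (my own illustration, with Nw = 320 < N = 512 so that w[rN] = 0 for r ≠ 0) evaluates the discrete STFT at one fixed time n and verifies that the FBS sum recovers x[n].

```python
import numpy as np

N, Nw = 512, 320                     # DFT length and window length, Nw <= N
w = np.hamming(Nw)
x = np.random.randn(2000)

def X_nk(n):
    """Discrete STFT at fixed time n: X(n,k) = sum_m x[m] w[n-m] e^{-j2pi mk/N}."""
    m = np.arange(max(0, n - Nw + 1), n + 1)     # samples where w[n-m] != 0
    win = w[n - m]
    k = np.arange(N)
    return (x[m] * win) @ np.exp(-2j * np.pi * np.outer(m, k) / N)

n = 1000
y_n = np.real(np.sum(X_nk(n) * np.exp(2j * np.pi * n * np.arange(N) / N))) / (N * w[0])
print(np.isclose(y_n, x[n]))         # True: FBS recovers x[n] sample by sample
```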
(Refer Slide Time: 24:20)
So, I can say y[n] is the output of the convolution of x[n] with the product of the analysis window and the impulse train.
So, what is the meaning of the product of the window function and the impulse train?
(Refer Slide Time: 27:48)
y[n] = x[n]  when  w[rN] = 0 for r ≠ 0

Σ_{k=0}^{N−1} W(ω − (2π/N)k) = N·w[0]
This expression states that the frequency responses of the analysis filters should sum to a constant across the entire bandwidth. Since N·w[0] is a constant, the summation of the frequency responses of all the analysis filters must be a constant. In the next class I will describe this pictorially.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 30
Lattice Formulations of Linear Prediction (Contd.)
FBS CONSTRAINT

Σ_{k=0}^{N−1} W(ω − (2π/N)k) = N·w[0]
Suppose this is my window w[n], and its length is Nw, and I have taken a DFT length of N. Nw is less than N, so the FBS constraint is satisfied: the summation of the frequency responses of the analysis filters is a constant, which is nothing but N·w[0].
x[n] = (1/2π) ∫_{−π}^{π} [ Σ_{r=−∞}^{∞} f(n, n − r) X(r, ω) ] e^{jωn} dω
• It can be shown that any f(n, m) that fulfils the condition below makes the synthesis equation above valid:

Σ_{m=−∞}^{∞} f[n − m] w[m] = 1
• Basic FBS method can be obtained by setting the synthesis filter to be a non-
smoothing filter:
𝑓[𝑛, 𝑚] = 𝛿[𝑚]
Now consider a discrete STFT with a temporal decimation factor L: I have a signal and I shift the frame by L.

y[n] = (L/N) Σ_{r=−∞}^{∞} Σ_{k=0}^{N−1} f[n, n − rL] X(rL, k) e^{j(2π/N)nk}

Then, for a time-invariant synthesis filter f[n, m] = f[m], this becomes

y[n] = (L/N) Σ_{r=−∞}^{∞} Σ_{k=0}^{N−1} f[n − rL] X(rL, k) e^{j(2π/N)nk}
This equation holds when the following constraint is satisfied by the analysis and synthesis filters, as well as by the temporal decimation and frequency sampling factors.
For f[m] = δ[m] and L = 1 this method reduces to the basic FBS method.
(Refer Slide Time: 06:05)
If L > 1, then f[n] is an interpolating filter.
FBS says that the analysis window size (Nw) must be less than or equal to the length of the DFT (i.e. Nw ≤ N). So, if I use a 20 millisecond window, which is 320 samples, the DFT length should be at least 320; if N = 512, then complete recovery is possible.
In the overlap-add method, take the inverse DFT for each fixed time in the discrete STFT; instead of dividing out the analysis window from each of the resulting short-time sections, perform an overlap-add operation between the short sections. So, I take the window, analyze it and take the inverse transform; if I wanted to get back the signal exactly that way, I would require x[n] = f_n[n]/w[0], but instead of the division I take the sections and overlap and add them.
x[n] = (1/(2π W(0))) ∫_{−π}^{π} X(n, ω) e^{jωn} dω
If x[n] is averaged over many short-time segments and normalized by W (0) then
x[n] = (1/(2π W(0))) ∫_{−π}^{π} Σ_{p=−∞}^{∞} X(p, ω) e^{jωn} dω

y[n] = (1/W(0)) Σ_{p=−∞}^{∞} { (1/N) Σ_{k=0}^{N−1} X(p, k) e^{j(2π/N)kn} }

where the term in braces is the IDFT of frame p: f_p[n] = x[n] w[p − n].
(Refer Slide Time: 12:20)
y[n] = (1/W(0)) Σ_{p=−∞}^{∞} x[n] w[p − n]

y[n] = (x[n]/W(0)) Σ_{p=−∞}^{∞} w[p − n]

Σ_{p=−∞}^{∞} w[p − n] = W(0)

For decimation in time by a factor L:  Σ_{p=−∞}^{∞} w[pL − n] = W(0)/L
y[n] = (L/W(0)) Σ_{p=−∞}^{∞} [ (1/N) Σ_{k=0}^{N−1} X(pL, k) e^{j(2π/N)nk} ]
The above equation depicts the general constraint imposed by the OLA method. It requires the sum of all the analysis windows (obtained by sliding w[n] with L-point increments) to add up to a constant.
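The sketch below (an illustration with an assumed 320-sample Hamming window) sums the shifted windows Σ_p w[pL − n] over a region away from the signal edges and compares it with W(0)/L for a few hop sizes; small hops keep the sum essentially constant, while a large hop does not.

```python
import numpy as np

Nw = 320
w = np.hamming(Nw)
W0 = np.sum(w)                               # W(0), the DC value of the window

for L in (80, 160, 240):                     # candidate decimation factors
    n = np.arange(5 * Nw, 6 * Nw)            # sample indices away from the edges
    total = np.zeros(len(n))
    for p in range(0, 40):                   # enough shifts to cover this region
        shift = p * L - n
        mask = (shift >= 0) & (shift < Nw)
        total[mask] += w[shift[mask]]
    dev = np.max(np.abs(total - W0 / L)) / (W0 / L)
    print(f"L = {L:3d}: max relative deviation from W(0)/L = {dev:.3f}")
# hops of Nw/4 and Nw/2 keep the sum nearly constant; a hop of 3*Nw/4 does not
```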
(Refer Slide Time: 13:57)
So, I can say that the sum of all the analysis windows, each shifted by L samples, must give a constant. Suppose these are the sample indices n: the first window is shifted by L, the second window is shifted by 2L, and so on. So, the sum of all the analysis windows, shifted by the time decimation factor, must give me a constant, which is given by

Σ_{p=−∞}^{∞} w[pL − n] = W(0)/L

In the OLA method it is shown that this constraint is satisfied for all finite-bandwidth analysis windows whose maximum frequency is 2π/L; this is possible when ωc ≤ 2π/L.
So, in the FBS method, when Nw ≥ N there is a special constraint, that w[n] must vanish at n = ±N, ±2N, ...; in that way the relaxation is possible. Here also a relaxation is possible, but in that case the window response must vanish at the multiples of 2π/L:

W(ω) = 0 at ω = 2πk/L, k ≠ 0

If ωc > 2π/L, then I have to ensure that at every ω = 2πk/L it is 0.
(Refer Slide Time: 22:05)
Time-Frequency Sampling:
A summary of the sampling issues for those two methods gives motivation for our earlier statement that sufficient but not necessary conditions for invertibility of the discrete STFT are:
So, time-frequency sampling is very important. Why are we doing the FBS method and the OLA method? Suppose I have a signal; I cannot take the whole signal at a time, analyze it in the frequency domain, do some modification there, and take the inverse to get the time domain signal, because the signal is non-stationary. So, I want to take part of the signal, analyze it in the frequency domain, do some modification, and take the inverse transform to get the signal back.
So, I want to know how much shifting of this sliding window is possible. Shifting by one sample is always possible, but it is very time consuming, with a lot of time complexity, because if I have 48,000 samples then 48,000 times I have to compute the STFT. So, I want the maximum allowable shifting for which I can still recover the signal from the inverse transform, and also what kind of window I should use so that I know X(n, ω) for every ω.
• Consider window /short-time signal:
➢ 𝑓𝑛 [𝑚] = 𝑤[𝑚]𝑥[𝑛 − 𝑚], and
➢ 𝑋(𝑛, 𝜔) – Fourier transform of 𝑓𝑛 [𝑚]
➢ Analysis window duration of Nw
• From the Fourier transform point of view:
➢ Reconstruction of f_n[m] from X(n, k) requires a frequency sampling interval of 2π/Nw or finer, and the temporal decimation L can be at most 2π/ωc:

L ≤ 2π/ωc

2π/N ≤ 2π/Nw
Conditions for signal reconstruction are:
To avoid aliasing:
So, suppose we take the example of a rectangular window of length Nw.
Thank you.
Digital Speech Processing
Prof. S. K Das Mandal
Center for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 31
Segmental and Supra-Segmental features of speech signal
So, let us start the new week; this week we will discuss speech feature extraction. We already know about the speech production system; now, this week, we try to find out what kinds of features there are and why features are required.
So, the title of this week is extraction of speech features. If you look at a speech signal, I will show you in the Cool Edit software, this is a speech signal.
If I want to see the whole sentence, let us say this is the whole sentence of this speech signal. If you see, along the time line there are different speech events: this is a speech event, this may be another speech event, and so on. So, I can say that along the timeline the speech, or the characteristics of the speech signal, is different. You may say the digital speech samples themselves are features; yes, once I do the sampling, each sample can be a feature.
So, we said that the speech samples can be features. Now, if you consider a speech segment of one second, how many samples will be there? If it is recorded at 16 kilohertz, then 16 thousand samples will be there. Even if I take a window of, let us say, 20 milliseconds, 320 samples will be there; and if I compare sample by sample, then the feature vector dimension is 320. Moreover, if I say “a” this time and say “a” again the next time, the 2 signals sound like “a”, but if I record those 2 signals and try to compare them, I find they are different sample-wise. So, taking the samples themselves as the features is not that good.
So, what should the speech features be, and why are they required? If I say that, then I can start from here.
What are features, or parameters? What are speech features? I can say: take the speech signal and find some parameters of lower dimension; or, there is a speech signal, I take a signal segment and represent it in such a lower dimension that it still represents the speech signal itself. So, I can say a feature is a measure of a property of the speech signal. Some properties of the speech signal will be there, and they may be natural or non-natural, and those properties actually represent the speech signal.
So, one reason for feature extraction is redundancy: if I compare sample by sample, there may be many samples which are not required to represent the signal. For example, think of the salient features by which you can remember me. Actually, you are not remembering each and every detail; suppose I show you my image, you are not remembering each and every pixel of my body. We are finding some features which represent my body, but they are not exactly my body. So, the redundant information which exists in my face you can discard, and find the salient points which actually represent my face.
So, feature extraction, or representing the speech in the feature domain, is nothing but reducing, or removing, the redundant information. If there is a 320-dimensional vector, then any classifier or any kind of comparison I do is computationally complex: every member I have to compare, that is 320 members, even with a simple Euclidean distance.
So, as the dimension increases, the computational complexity becomes very large. The same applies to modeling: suppose I want to build a model using those features; if the feature dimension is very high, the modeling is very difficult. So, feature extraction actually means that I want to convert this segment in such a way that I can represent this speech segment with a vector which actually represents the signal, without redundancy; that means all the redundant information is deleted and all the key information is there, forming a vector which represents that speech event. That is called a speech feature.
So, there are many kinds of features. Some may be natural features, some may be non-natural speech features. For natural speech features, look at this speech from here.
There, if you know acoustic phonetics, you see that natural speech features are the phonemes: different times consist of different phonemes, so those can be natural speech features. If you see the movement of F0, the fundamental frequency, during the whole sentence, it is not constant; so the movement of F0, or F0 itself, can be a feature, and the phoneme can be a feature. If you look at duration: this is the duration of the sound, this is the duration of the consonant-to-vowel transition, this is the vowel duration, this the vowel-to-consonant transition.
So, all such durations in speech can be features. Then I can say the articulation, the place of articulation, of a speech event is a feature: ‘k’ is a velar consonant, ‘p’ is a bilabial consonant; that can be a speech feature. So, some features are natural and some features are non-natural: the non-natural ones are abstract representations of the speech event, but the natural features have a meaning. Phoneme, articulation, duration all have a real meaning in the speech. Similarly formant frequency: if you remember, we plotted the F1-F2 plot of all the vowels, the formant plot of all the vowels.
So now, the representation using F1, F2, F3, all the formants, can be a feature by which I can identify the speech signal; or, on the other hand, I can say those formant frequencies represent the speech signal. So, the formants may be natural features. There are also some non-natural features, like the cepstrum and the linear predictive coefficients; they represent the signal, but literally they do not have a meaning. So, if I say the LPC coefficients a1, a2, ..., ap all represent that speech frame, they are parameters or features, but they are non-natural features. Similarly the cepstrum, then MFCC, vocal tract area functions, all such features are there which are non-natural.
So, we will discuss the extraction of some natural features and non-natural features. Now, the requirement in feature extraction is that whatever the features are, the extraction procedure must be robust. What does robust mean? It means that if this is the speech signal and I extract the features this time, then the next time I extract the features I will get the same features for the same speech signal.
And they should exactly represent the segment. If sometimes a feature value comes out one way and the next time I extract it the value is very different, then there is a problem. So, the main point is that I should choose those speech features whose extraction procedure is robust and also computationally less complex. If it is computationally very complex, then the feature extraction may require much more time, and I cannot develop a system which runs in real time. Any kind of application, if you say speech recognition, speech synthesis, speaker identification, LPC coding or any speech coding, all are nothing but extracting the speech features; using those features I can apply any kind of classification, or coding, or synthesis algorithm so that I can recover the signal.
So, we can say the speech features are actually a representation of the original speech signal. Using those features I can develop different technologies, which can be called speech applications. So, during this whole week we discuss that feature extraction. Now I come to this: if you see the slides, or if you look along this speech signal, you see the spectrogram is different in different segments. I can show you here.
If you see, the properties of the speech segments, or the signal properties, are different along the time axis. This portion is completely silence, this one is vocalic, this one is noise-like, then vocalic, noise-like, silence, burst. So, all kinds of speech events are there, and at different times the speech segment signals are different.
So, if I extract the features for a small segment, then I can say those are called segmental features, as you understand: if I extract the features for a segment, then I call them segmental features.
There may be some features which vary across segments, like the F0 movement or the duration of a speech event; they vary across segments, so those are called supra-segmental features. So, in feature extraction there are 2 types of features: one is called segmental features and the other is called supra-segmental features. Segmental features relate to a particular segment: a consonant-to-vowel transition is a segmental feature, the occlusion period is a segmental feature, the burst is segmental information.
So, if the features are extracted segment by segment, or for a particular segment of the speech signal, then I call them segmental speech features; if the features are extracted across segments, then I call them supra-segmental speech features. Segmental speech features include all the speech events: a phoneme, or even part of a phoneme, the occlusion period, the burst, the consonant-to-vowel transition; all are segmental features. Supra-segmental features include the movement of the fundamental frequency, the change of duration of phonemes or syllables, and the duration profile and amplitude profile; the change of amplitude across the whole utterance is also supra-segmental.
So, speech features are basically of 2 types: one is called segmental features and the other is called supra-segmental features. F0, duration and the amplitude profile are supra-segmental features; segmental features are, I can say, phonemes or parts of a phoneme, like the occlusion period and the burst; if I am able to extract the burst, the occlusion period, the VOT, all are segmental information.
Now, how do we extract them? I will come to the supra-segmental features later: intonation, pause, duration, stress are all called supra-segmental features, and I will discuss how they are extracted. So, we will discuss 2 kinds of methodology: one is segmental feature extraction and the other is supra-segmental feature extraction. The supra-segmental features are used to model speech prosody. Now consider the segmental features.
If I show a segment of this speech signal, this segment, and if I analyze its frequency content, this is the frequency spectrum of that segment. So, the speech of this segment is represented in the frequency domain.
Now, if you see, there is a movement of the formants: if I choose a small segment, there is a movement of the formants. You can say the peaks denote the frequency components of the signal. If you see this slide, this is the spectrum information, the red line is the envelope of the spectrum, and the peaks of the envelope denote the formants. So, if I say the formants are my features, then I have to extract those formants; how do I find them? I have to find an algorithm by which I can robustly extract the formant frequencies.
So, I require an algorithm by which I can extract all the formant frequencies F1, F2, F3, F4. One of the methods you know: if there are LPC coefficients a1 to ap, and I take the frequency transform of those LPC coefficients, they actually represent the formant positions of the signal. So, formant frequencies are extracted using the LPC spectrum analysis as well. So, I want an algorithm by which I can extract the formant frequencies; that is called formant frequency extraction, and it gives the formant frequencies for a particular speech segment.
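A rough sketch of this LPC-based route is given below (my own illustration: the autocorrelation method for the predictor, then the angles of the complex roots of A(z) as candidate formants; real extractors add pre-emphasis, bandwidth thresholds and tracking, and the random frame here is only a placeholder for a real voiced frame).

```python
import numpy as np

def formants_from_lpc(frame, fs=16000, order=12):
    """Estimate formant frequencies from the roots of the LPC polynomial A(z)."""
    frame = frame * np.hamming(len(frame))
    # autocorrelation method for the predictor coefficients a1..ap
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])          # predictor coefficients
    poly = np.concatenate(([1.0], -a))              # A(z) = 1 - sum a_k z^-k
    roots = np.roots(poly)
    roots = roots[np.imag(roots) > 0]               # keep one of each conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 90]                        # drop near-DC roots

frame = np.random.randn(320)                        # placeholder for a speech frame
print(formants_from_lpc(frame))
```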
Now, in the parameter domain there are a lot of speech parameters or features; one is called the filter bank. There are lots of feature or parameter extraction methods: LPC, for example, we have already discussed as a parameter extraction method. There are frequency domain parameters and there may be time domain parameters; features can be extracted from the time domain representation of the speech signal.
So, if this is the speech signal, then if I extract the features from the time domain representation, I call them time domain features; if I extract the features from the frequency domain representation, I call them frequency domain parameters. Based on this, we can say there are 3 types of parameter extraction methods: frequency domain parameter extraction methods, time domain parameter extraction methods, and combined time-frequency parameter extraction methods; the extracted values are called parameters.
So, if the parameters are extracted using LPC analysis, then all the coefficients a1, a2, a3, ... are called LPC parameters, and they represent the speech signal for a particular speech segment. The features are extracted segment-wise; why, I will come to later. The frequency domain parameters include filter bank analysis, short-term spectrum analysis, spectral analysis, formant parameters, MFCC, delta MFCC; all such parameters are called frequency domain parameters. Time domain parameters include LPC and shape parameters, and combined time-frequency parameters include perceptual linear prediction and wavelet analysis.
So, let us try to explain all the parameter extraction methodologies one by one, and learn how to extract all kinds of speech parameters. Now, if this is the whole speech signal, I say the parameters are extracted segment-wise. Now, suppose I say the phoneme is a segment: how do we define the phoneme boundary in continuous speech? Manually I can listen and find that from here to here it is an ‘s’-like signal. But think about it: I want an extraction method by which I can automatically extract what kind of phoneme it is. That is very difficult. So, instead of extracting what kind of phoneme it is, I want some kind of representation of the signal which is unique in a certain domain.
So, I can say that if I do this kind of parameter extraction, those parameters may belong to a particular phoneme. Although the phoneme, or part of a phoneme, is a meaningful, natural feature of the speech signal, detecting the boundary of a phoneme is very difficult in continuous speech. So, how can I find a phoneme |f|, how can I find a phoneme |s|?
So, instead of finding the phoneme |s| directly, let us blindly analyze the speech signal frame by frame and assign each frame to a particular phoneme; or I can say the extracted parameters represent the frame, and later on I try to classify whether this frame belongs to |s|, |f| or |p|; that is my job. So, the features are extracted segment-wise, and they are a representation of that segment, either in the time domain or in the frequency domain. If the features are extracted from the frequency representation of that segment, then I call them frequency domain parameters; if they are extracted from the time domain representation of the signal, then I call them time domain parameters.
So, natural speech features like the phoneme we do not really extract directly. What we are doing is designing an algorithm by which the speech signal is framed, the speech signal is analyzed frame by frame, and either time domain or frequency domain parameters are extracted which represent exactly that frame of the speech signal.
So, that is my job. Now let us first discuss the filter bank analysis. What is filter bank analysis? Come here; here we discuss the filter bank analysis.
So, suppose, forgetting the slides, s[n] is my recorded speech signal. Let me give a real example: let us say I recorded my name, and my name consists of a signal of, let us say, 1.5 seconds, recorded at 16 kilohertz. So, in one second there are 16k samples; in 1.5 seconds there will be 16k + 8k = 24k samples. If I record my name, different speech events have different kinds of speech signal.
So, I am not analyzing the whole speech signal at a time and representing it by a single vector; if I did that, I would completely lose the time resolution, as we discussed in the frequency analysis. So, what I will do instead: this is my whole signal and I frame the signal, or I can say I window the signal. Let us take a 20 millisecond window and shift the window by 10 milliseconds. So, in 1 second I will get 100 frames, 100 such frames at 10 millisecond, 10 millisecond, 10 millisecond shifts, and from each frame I get a representative feature vector; that is my job. So, in 1.5 seconds I can get about 150 frames, and each frame is represented by some feature vector.
Let us say those features are extracted using filter bank analysis. So, what is filter bank analysis? Say the whole speech signal s[n] is there, and let this be the nth frame of the speech signal. This should be passed through some parallel filters; I can pass the signal through a set of parallel filters, and each filter has a passband. So, let us say this one is 0-100 hertz; the next one is again 100 hertz wide. So, there are non-overlapping filters: let us say non-overlapping 100 hertz to 200 hertz, then 200 hertz to 300 hertz. Each filter has a particular passband, and from each passband I will get the power, if that particular frequency component is there in the speech signal.
So, if there is any component in 0 to 100 hertz, the output of that filter is nothing but the 0 to 100 hertz representation of the signal; the next is the 100 to 200 hertz signal, then the 200 to 300 hertz signal. Now, whatever frequency component is there, let us sum its power and take that as a parameter a1; similarly the next sum as a2, the next as a3, and let us say there are m filters, so up to am. So, a1, a2, a3, ..., am represent the feature vector which is extracted using the filter bank analysis. Am I clear?
Now, how to design those filters? Each is nothing but a band pass filter, because only a particular band is passed when the whole signal passes through that filter. So, the 0 to 100 hertz band will be passed, and what do I do with the collected passband signal? I find out its energy, the energy of the 0 to 100 hertz signal: whatever energy is there I can sum, or I can take the average energy of the 0 to 100 hertz band. That energy represents one parameter. So, there are m parameters, which are produced by the filter bank analysis; is that clear? Now, the design of those filters has different methodologies: I can design them overlapping, I can design them non-overlapping, mirror filters; all kinds of designs of those filters can be done.
And whatever I design, the output is called the filter bank analysis of the speech signal.
And whatever I design the output is call filter bank analysis of the speech signal.
So, this is the complete filter bank analysis model: the signal passes through bandpass filters; this is the first band, the second band, the third band, up to q bands. If I say I have a signal with a 16 kilohertz sampling frequency, then what is the baseband frequency? The baseband of the signal is 8 kilohertz, so my speech signal has components up to 8 kilohertz. If I want to design 100 hertz wide filters, how many filters will there be, with every filter 100 hertz wide in the non-overlapping condition? 8000 divided by 100 hertz, so there will be 80 filters; q is equal to 80, 80 bandpass filters. Each filter output can then be passed through a nonlinearity and a low pass filter, then sampling rate reduction, and then amplitude compression, and I get a parameter x1. Each filter output, if the frame contains 0 to n samples, will also be 0 to n samples; that output I can quantize, or I can find the average energy, and that is my parameter. So, this is the complete filter bank analysis.
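A simple version of this analysis can be sketched by summing FFT power in each band instead of building 80 separate time-domain filters (an assumed implementation choice, not the exact model in the slide):

```python
import numpy as np

def filterbank_energies(frame, fs=16000, band_hz=100, nfft=512):
    """Uniform, non-overlapping filter bank by summing FFT power in each
    100 Hz band: 80 bands cover 0-8 kHz, giving one 80-dimensional
    feature vector (a1 ... a80) per frame."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    n_bands = int((fs / 2) // band_hz)              # 80 bands for 16 kHz
    edges = np.arange(n_bands + 1) * band_hz
    feats = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return feats

frame = np.random.randn(320)                         # one 20 ms frame at 16 kHz
print(filterbank_energies(frame).shape)              # (80,)
```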
Now, how to design those filters? I can design a uniform filter bank: every filter is 100 hertz wide and there is no overlapping region. Or I can say every filter bandwidth will be determined on a log frequency scale. Or I can say every filter bandwidth will be determined by the mel scale, which is the scale on which human beings perceive frequency. So, I can design the filter bank bandwidths on a mel scale. Or I can say the filters are overlapping: the filters are designed using 50 percent overlap, so if my first band is 0 to 100 hertz, the next filter will be 50 hertz to 150 hertz.
So, then I can say it is a 50 percent overlap filter bank. Or I can say the bandwidth of the filters is not uniform: not 100 hertz everywhere, but 100 hertz in the low frequencies while at high frequencies the bandwidth may be larger, and that bandwidth may be determined by the mel scale. So, all kinds of things we can use to design the filters and find the filter bank parameters.
So, like this there is the uniform filter bank, then I can go for the non-linear log frequency filter bank, then the mel scale filter bank. The shape of the filter can also vary: I may not want it flat, I may want a triangular filter; instead of a rectangular filter I can use a triangular filter and find the energy within that triangle. Whatever I design here is called filter bank analysis. So, we stop here; in the next lecture we will go for other methods of parameter extraction.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 32
Cepstral Transform Co-efficients(CC) Parameters extraction
So, we discussed the filter bank analysis. The output of a filter is nothing but a frequency-domain parameter; that is why filter bank analysis is called a frequency domain parameter extraction method.
Now, designing those kinds of filters has a small problem: they are not that robust, because designing an ideal bandpass filter with a pass band of, say, 0 to 100 Hz is very difficult; you know from DSP how hard it is to design such a filter. So, instead of doing that, let us try to represent things another way.
So, if you look at the slide, I am interested only in this envelope, the red line, because you know that speech is produced by the human vocal tract. Let the vocal tract be h[n], and let there be an excitation e[n]. The glottal excitation passes through the vocal tract and produces the different speech events, and the result is nothing but the speech signal s[n]. Who is responsible for producing the different speech events? It is h[n]; the excitation e[n] may be present or not. If the excitation is present, we call it a voiced signal; if the excitation is not present and there is random noise, we call it an unvoiced signal.
So, let us take a voiced signal. The excitation is present, and that excitation is the same for all voiced signals; the only difference is that the shape of the vocal tract is different, and so different kinds of speech signals are produced. Our ultimate aim is to find a representation of h[n] from s[n]. I have recorded the speech signal s[n], which is nothing but h[n] convolved with e[n]. Now I want to extract h[n] and eliminate e[n]. In the time domain it is a convolution; if I take the frequency transform of the signal, then S(k) is nothing but the multiplication of H(k) and E(k).
So, if I take the DFT here, the discrete Fourier transform, or if I represent the speech signal in the frequency domain, then it is nothing but the product of the spectral representation of the vocal tract with the spectral representation of the excitation source. If I say the excitation source is nothing but an impulse train, then I want a signal processing methodology by which I can separate H(k) from the product H(k) E(k). What can I do? If it is a product and I take the log, then log S(k) is nothing but log H(k) + log E(k); I can take the log of the absolute value, or the complex log. So, in the log domain the product becomes additive. And if it is additive, then can I separate H(k) from E(k), from the sum of log H(k) and log E(k)?
So, if you see, this is the log spectrum: after the DFT analysis I take the log. I have taken a signal s[n], I take the DFT, I get S(k), and then I take the log of the magnitude of S(k). If I plot it, it will look like this; here the amplitude axis is on a log scale, as I can show you.
If you see, this is the linear view, where the scale is linear. Here, instead of taking the log, I find the modulus of S(k) and plot |S(k)| against frequency; that is the spectrum, and this axis is the frequency axis. The modulus is the square root of a^2 + b^2, because S(k) is complex. So, I get this kind of response, which is the linear view. Now, instead of plotting this, if I plot the log of it against frequency, then instead of the linear view I get the log view, and I get this one.
Now, I am interested in this envelope. In this slide the envelope represents h[n], or H(k), and the fine variation represents E(k). To extract it, let us consider this log plot itself as a signal. If you remember, suppose there is a pure sine wave and there is a high frequency signal.
If I superimpose them, what I get is the high frequency signal riding on the sine wave. Now, I want to extract the smooth part of the signal, the sine wave. The sine wave is the low frequency component and the other is the high frequency component. If this is my composite signal, then if I pass it through a low pass filter, the low pass output will be the smooth variation of the signal, and the high pass output will be the fast variation. So, I can say: if I treat this spectrum as a signal and pass it through a low pass filter, then the output of the low pass filter actually gives me the envelope representation; that is my extraction. Since I want to extract only H(k), if after taking the log I pass this through a low pass filter, then I can eliminate E(k); that is my target, and I obtain log H(k). This kind of signal processing has a special name: it is called homomorphic signal processing.
So, I am not explaining this slide again. I want a methodology by which I can de-convolve the convolved signal, and this type of signal processing is represented as homomorphic signal processing.
Now, what is an LTI system? Linear systems, or conventional linear systems, support the superposition principle: L[x1[n] + x2[n]] = L[x1[n]] + L[x2[n]]; applying every input separately and adding the outputs gives the same result. So, if the signals fall in non-overlapping frequency bands, then they are separable. Suppose I have a signal x[n] = x1[n] + x2[n], an addition of two signals. If x1[n] consists of frequencies from 0 to π and x2[n] consists of frequencies from π to 2π, then I can easily separate them with a linear filter. But if x1[n] is convolved with x2[n], then it is very difficult to separate them; and if the spectrum of x2 overlaps with that of x1, it is also difficult to separate them.
So, what do we want from the principle of homomorphic signal processing? The importance of homomorphic systems in speech processing lies in their capability of transforming non-linearly combined signals into additively combined signals. The basic purpose is that the convolved signal should be represented as an additive signal. What kind of transformation should I apply so that the convolution becomes a simple addition? If you see the slide, this figure represents the homomorphic system for convolution: x[n] goes through a transformation D*, and instead of the convolution x1[n] * x2[n] I get $\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$; that is what I want.
Then the result can pass through a linear filter, and applying the inverse transform of D* gives me y[n]; that completes the de-convolution. The linear filter is there because, suppose I want to extract only one component: I extract it, apply the inverse transform, and get x1. So, this is the purpose of homomorphic de-convolution.
So, if you see this picture, there are three parts: 1, 2, 3. The first part of the system takes inputs combined by convolution and transforms them into an additive output. The second part is a conventional linear system, for example a linear filter. The third part, the inverse of the first system, takes the additive input and transforms it back into a convolutional output. So, whatever modification I do in the linear time-invariant part L is done on the additive signal. This is called the canonic representation of a homomorphic system. Just as this diagram is drawn for convolution, I can write a similar diagram for multiplication also.
So, if I want the convolution to be represented in this form, then the system transformation should be like this.
So, at the end, instead of the convolution I want $\hat{x}[n]$, which is nothing but $\hat{x}_1[n] + \hat{x}_2[n]$; that is my target, and this is my input. I apply this input and take the z-transform, in the z-domain. What do I get at the output? I get X(z), and X(z) is nothing but X1(z) multiplied by X2(z); that is the z-transform or frequency-domain representation, or I can say it is a DFT, so X(k) is nothing but X1(k) multiplied by X2(k). Then I take the log; if I take the log, I get log X1(k) + log X2(k), where X1(k) and X2(k) are both complex. All of this is in the k domain; now I apply the inverse z-transform or IDFT, and I get $\hat{x}_1[n] + \hat{x}_2[n]$, which is nothing but $\hat{x}[n]$. So, this whole system can be represented that way.
So, if you see the slides: x[n], z-transform, so the convolution becomes a product, then the product becomes an addition, the addition stays an addition, and I take the inverse transform back to the time domain. The same thing is there in the slide: x1[n] convolved with x2[n] becomes a product in the z-domain; I take the log, take the inverse transform, and get $\hat{x}_1[n] + \hat{x}_2[n]$. This is called the cepstral domain. If you see, it is not exactly x[n]: I am treating this log spectrum as a time-domain signal and taking the inverse DFT, so this is not exactly x[n]; this domain is called the cepstral domain. Now, since it is an additive signal, it can be passed through a low pass filter; suppose I want to discard $\hat{x}_2$ and find $\hat{x}_1$, so that $\hat{x}_1$ actually represents what I want.
So, let x1[n] = h[n] and x2[n] = e[n]; then I can say I have removed e[n], and $\hat{x}_1[n]$ represents the envelope part of the cepstrum. So, $\hat{x}_1[n]$ is the cepstrum of that envelope; x1[n] actually represents the time-varying vocal tract, so those $\hat{x}_1[n]$ values can be used as parameters which actually represent the envelope portion of the spectrum. This kind of homomorphic signal processing is used to extract the cepstral parameters, also called CC parameters; I will come to that.
So, this is the computational consideration. The D* system is: DFT, then either the complex log or the log magnitude. If it is the complex log, then I call the result the complex cepstrum; if it is the log magnitude only, then I call it the real cepstrum. So, if you see the slides, with only the log magnitude this is the real cepstrum; with the complex log it is called the complex cepstrum. The inverse system D*-inverse undoes this: take the DFT, apply the complex exponential to undo the log, and take the inverse DFT.
So, in cepstral analysis there are two kinds of cepstrum: one is called the real cepstrum, represented by c[n], and the other is the complex cepstrum, $\hat{x}[n]$. If I take only the log magnitude part, it is the real cepstrum; if I keep the whole complex log, it is the complex cepstrum. The complex cepstrum is

$\hat{x}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{X}(e^{j\omega})\, e^{j\omega n}\, d\omega$, where $\hat{X}(e^{j\omega}) = \log X(e^{j\omega})$.

Have you understood? The complex cepstrum contains both phase and magnitude, but the real cepstrum contains only the magnitude part; the phase part is not there. Next I will state the relationship between the complex cepstrum and the real cepstrum, namely what happens if the signal x[n] is real.
So, let the signal x[n] be real. Then the magnitude of its frequency response, $|X(\omega)|$, is a real and even function, and therefore $\log|X(\omega)|$ is also real and even. If I take the Fourier transform of the signal, $X(\omega)$ is a complex quantity; like any complex number a + jb, it has a magnitude $|X(\omega)|$ and an angle $\arg X(\omega)$.
So, if x[n] is real, the magnitude part is real and even; and since it is even, when I take the log of this part and compute the cepstrum c[n], the result is the even part of the signal. The angle part, which is the complex part, gives the odd part of the signal. So, if $\hat{x}[n]$ is my complex cepstrum, then the real cepstrum is nothing but $c[n] = (\hat{x}[n] + \hat{x}[-n])/2$; this proof I will do in the next class. So, this can be proved.
Now, homomorphic filtering: what I said is that this has to be passed through a filter to eliminate x2[n]. If I say this plot is in the cepstral domain, then I treat this signal as a time signal, while the other axis behaves like the frequency domain of that time signal; but it is not actually the frequency domain, nor actually the time domain, so this axis is called quefrency. Low quefrency means the slowly varying component; high quefrency means the fast varying component. So, the removal of the unwanted component, the filtering, can be attempted in the cepstral domain itself, and that filtering is called liftering. Instead of filtering we say liftering, because this is not the filtering of a time domain signal; hence the name liftering.
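To make the idea concrete, here is a minimal sketch (my own illustration, not from the slides) of liftering: the log magnitude spectrum of a frame is treated as a signal, transformed to the quefrency domain, and only the low-quefrency coefficients are kept to recover the smooth envelope. The cutoff of 30 quefrency bins is an arbitrary example value.

import numpy as np

def spectral_envelope_by_liftering(frame, n_keep=30):
    """Estimate the log-spectral envelope of one frame by low-quefrency liftering."""
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)   # log |S(k)|
    cepstrum = np.fft.irfft(log_spec)                        # real cepstrum (quefrency domain)
    lifter = np.zeros_like(cepstrum)
    lifter[:n_keep] = 1.0                                    # keep the slowly varying part
    lifter[-n_keep + 1:] = 1.0                               # and its mirrored negative quefrencies
    envelope = np.fft.rfft(cepstrum * lifter).real           # back to the log-spectral domain
    return envelope                                           # smooth estimate of log |H(k)|

frame = np.hamming(400) * np.random.randn(400)               # one example frame
print(spectral_envelope_by_liftering(frame).shape)           # (201,)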
Then we can discuss homomorphic filtering: viewing log X(ω) as a time signal, I can pass it through a low pass filter and find out the slowly varying component. I am not going into the details of that filtering technique here.
So, what are the steps to find out the cepstral coefficients? This is the basic processing step for frequency-domain parameters; x[n] is my signal. Then, on this speech signal, I do pre-emphasis to emphasize the high frequency components. Why is it required? If you see the speech signal, the low frequencies are emphasized, but the high frequencies are not that emphasized.
So, as the first step I get the speech signal, let us say s[n], and I do the pre-emphasis. Writing x[n] = s[n], the pre-emphasized signal is y[n] = x[n] - a x[n-1], where the value of a (or alpha) is typically around 0.95 to 0.97. Then I do the framing. What kind of framing? I have a speech signal, I put a window of one window length (red mark to red mark is one window), then I shift the window; if the window is 20 milliseconds and the shift is 10 milliseconds, then there is 50 percent overlap. So, I get frame one, frame two, frame three, and each frame is then passed through a window function.
So, first I am only cutting the signal into frames, not yet windowing. After I cut red to red, I multiply by w[n]; after I cut green to green, I multiply by w[n], the window signal. The window can be Hamming, Hanning, cosine or rectangular (if I do not multiply by anything, that means I am multiplying by one, which is nothing but a rectangular window). Then I do the DFT. So, up to this point the process converts s[n] to S(k); that conversion has to be done.
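A minimal sketch of these steps (pre-emphasis, framing, windowing, DFT), assuming a 16 kHz sampling rate, 20 ms frames with a 10 ms shift, a = 0.97 and a 512-point DFT; all of these are example values rather than values prescribed by the lecture:

import numpy as np

def frames_to_spectra(s, fs=16000, frame_ms=20, shift_ms=10, a=0.97, nfft=512):
    """Pre-emphasis -> framing -> Hamming window -> DFT for each frame."""
    y = np.append(s[0], s[1:] - a * s[:-1])           # y[n] = s[n] - a*s[n-1]
    flen = int(fs * frame_ms / 1000)                   # 320 samples per frame
    shift = int(fs * shift_ms / 1000)                  # 160 samples per shift
    win = np.hamming(flen)
    spectra = []
    for start in range(0, len(y) - flen + 1, shift):
        frame = y[start:start + flen] * win            # windowed frame
        spectra.append(np.fft.rfft(frame, n=nfft))     # S(k), k = 0 .. nfft/2
    return np.array(spectra)

S = frames_to_spectra(np.random.randn(16000))          # 1 second of a synthetic signal
print(S.shape)                                          # (99, 257)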
Then, for the power cepstrum analysis, this S(k) goes to the log and then to the IDFT. If I do that with the complex log, the output is called the complex cepstrum. If I want the real cepstrum, then instead of S(k) I put another block here which is nothing but the modulus of S(k), and then take the log. So, if the phase is kept it is the complex cepstrum; if I take only the modulus, it is called the real cepstrum. The beginning portion of the result can then be kept, as if passed through a low pass lifter; that beginning portion is the slowly varying component. Those values are called the cepstrum coefficients, or cepstral transform coefficients, CC, the cepstral coefficients.
So, after the IDFT, let us say I take an N-length IDFT; the index varies from 0 to N - 1, and the beginning components represent the slowly varying part. I can take, say, c0 to c13, or c0 to c20; those actually represent the envelope, and they are called the cepstrum or CC, the cepstral coefficients. Basically, going all the way to N - 1 is not required, because the DFT analysis has a symmetry property: the point N/2 corresponds to fs/2. So, taking the range from 0 up to N/2 is sufficient, and in practice I take 0 up to some small length which represents the envelope cepstral coefficients. So, this is the extraction of the cepstral coefficients. What we have done is pass the spectrum through one kind of homomorphic signal processing, by which we can extract the envelope of the spectrum; that is called the cepstrum.
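Putting the previous blocks together, here is a minimal sketch (my own, assuming the same example frame setup as above) of extracting real cepstral coefficients from one already-windowed frame: log of the magnitude spectrum, inverse DFT, and keeping the first few coefficients.

import numpy as np

def real_cepstral_coeffs(frame, nfft=512, n_coeffs=13):
    """Real cepstrum of one windowed frame; keep the first n_coeffs values."""
    spectrum = np.abs(np.fft.rfft(frame, n=nfft))     # |S(k)|
    log_spec = np.log(spectrum + 1e-10)                # log |S(k)|
    cepstrum = np.fft.irfft(log_spec)                  # c[n], length nfft
    return cepstrum[:n_coeffs]                          # c0 .. c12: the envelope part

frame = np.hamming(320) * np.random.randn(320)
print(real_cepstral_coeffs(frame))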
Thank you. So, in the next class we will discuss the MFCC.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 33
Mel Frequency Cepstral Coefficients
So, let us start. In the last class we said that the real cepstrum c[n] is nothing but the complex cepstrum at n plus the complex cepstrum at -n, divided by 2:

$c[n] = \frac{\hat{x}[n] + \hat{x}[-n]}{2}$,

where $\hat{x}[n]$ is the complex cepstrum and c[n] is the real cepstrum.
Now we will go for the proof of it. Can we prove it? Yes, we can, so now I am going to prove it. First, think about the real cepstrum: what is c[n]? The real cepstrum is the inverse Fourier transform of the log magnitude spectrum,

$c[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right|\, e^{j\omega n}\, d\omega$.
While computing this, $e^{j\omega n}$ can be replaced by $\cos\omega n + j\sin\omega n$. If the input signal is real, then we said the real cepstrum c[n] is an even function. If it is an even function, only the cosine term can exist; the sine term cannot, because

$\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right| \sin(\omega n)\, d\omega = 0$

($\log|X(e^{j\omega})|$ is even in $\omega$, so the integrand is odd and integrates to zero over $-\pi$ to $\pi$). So,

$c[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right| \cos(\omega n)\, d\omega$,

and this is established.
Now, let me write down this c[n] in one corner, so that we can use it later:

$c[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right| \cos(\omega n)\, d\omega$.

Now, what is the complex cepstrum? It is the inverse transform of $\log X(e^{j\omega})$, which is a complex function. Any complex number a + jb can be expressed as a magnitude $\sqrt{a^2 + b^2}$ times $e^{j\theta}$, so $X(e^{j\omega}) = |X(e^{j\omega})|\, e^{j\arg X(e^{j\omega})}$, and taking the log,

$\log X(e^{j\omega}) = \log\left|X(e^{j\omega})\right| + j\,\arg X(e^{j\omega})$.

So, I can replace $\log X(e^{j\omega})$ inside the complex cepstrum integral:

$\hat{x}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Big[\log\left|X(e^{j\omega})\right| + j\,\arg X(e^{j\omega})\Big]\, e^{j\omega n}\, d\omega$.
Now, expand $e^{j\omega n}$ as $\cos\omega n + j\sin\omega n$. There are two terms in the bracket and two in the exponential, so the product gives four terms. Again, since x[n] is real, over $-\pi$ to $\pi$ the phase $\arg X(e^{j\omega})$ is odd (only the odd component exists), while $\log|X(e^{j\omega})|$ is even. So, $\arg X(e^{j\omega})$ multiplied by $\cos\omega n$ integrates to zero, and similarly $\log|X(e^{j\omega})|$ multiplied by $\sin\omega n$ integrates to zero. Once we see that, two of the four terms vanish, and (using $j \cdot j = -1$) what remains is

$\hat{x}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right|\cos(\omega n)\, d\omega \; - \; \frac{1}{2\pi}\int_{-\pi}^{\pi} \arg X(e^{j\omega})\,\sin(\omega n)\, d\omega$.

Now, what is $\hat{x}[-n]$? Replacing n by -n only flips the sign of the sine term, so

$\hat{x}[-n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right|\cos(\omega n)\, d\omega \; + \; \frac{1}{2\pi}\int_{-\pi}^{\pi} \arg X(e^{j\omega})\,\sin(\omega n)\, d\omega$.

If I add these two, the sine terms cancel and only the cosine terms remain:

$\hat{x}[n] + \hat{x}[-n] = 2 \cdot \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X(e^{j\omega})\right|\cos(\omega n)\, d\omega = 2\, c[n]$.

Dividing by 2, I can say $(\hat{x}[n] + \hat{x}[-n])/2 = c[n]$.
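As a quick numerical sanity check of this relation (my own illustration, not part of the lecture; the short decaying test sequence is chosen so that the principal-value phase behaves nicely and no unwrapping is needed):

import numpy as np

# Check c[n] = (x_hat[n] + x_hat[-n]) / 2 for a real signal.
x = 0.5 ** np.arange(8)                                            # simple real test signal
X = np.fft.fft(x)

c = np.fft.ifft(np.log(np.abs(X))).real                            # real cepstrum c[n]
x_hat = np.fft.ifft(np.log(np.abs(X)) + 1j * np.angle(X)).real     # complex cepstrum x_hat[n]
x_hat_neg = np.roll(x_hat[::-1], 1)                                # x_hat[-n], circular indexing

print(np.allclose(c, (x_hat + x_hat_neg) / 2))                     # True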
So, I can say the real cepstrum is nothing but the sum of the complex cepstrum at n and at -n, divided by 2; that is proved. The proof is there in the slide also, so you can refer to the slide. Now, come to another one, which is called the LPC cepstrum. What is the LPC cepstrum? We said that if I have a signal x and a1, a2, ..., ap are my LPC coefficients of p-th order, then from those coefficients I can also obtain cepstral coefficients; that is the LPC cepstrum. If instead I use the signal x[n] directly and take the DFT, I get X(k), the signal spectrum, and from that I get the ordinary signal cepstrum; from the LPC model I get the LPC cepstrum.
So, how do I do that? We said that c0 is nothing but log G^2. What is G? G is the model gain, the LPC gain; during the LPC analysis I have already discussed how to calculate G for a given signal. So, if I know G, then c0 = log G^2. Then

$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p$,

so as m varies from 1 to p I get c1, c2, ..., cp using this equation. Then it is said that

$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p$ (with $a_j = 0$ for $j > p$),

and using this formula I get $c_{p+1}, c_{p+2}, \ldots, c_{n-1}$. So, I get c0, c1, c2, ..., c_{n-1} this way. Similarly, the reverse is also possible: if I know the cepstral coefficients c0, c1, c2, ..., c_{n-1}, then I can calculate the p-th order LPC coefficients a1, ..., ap; in that case the gain is nothing but $G = e^{c_0/2}$. Using these formulas I get the cepstral coefficients from the LPC model, and from the cepstral coefficients I can calculate the LPC coefficients.
So, these are all speech parameters: the cepstral coefficients are speech parameters, the LPC coefficients are speech parameters, and LPC-to-cepstral and cepstral-to-LPC conversion is possible.
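A minimal sketch of the LPC-to-cepstrum direction (my own implementation of the recursion above, following the sign convention stated in the lecture; the example a-values and gain are arbitrary):

import numpy as np

def lpc_to_cepstrum(a, gain, n_ceps):
    """Convert LPC coefficients a[1..p] and gain G into cepstral coefficients c[0..n_ceps-1].

    Uses c0 = log(G^2) and c_m = a_m + sum_{k=1..m-1} (k/m) c_k a_{m-k},
    with a_j treated as 0 for j > p.
    """
    p = len(a)
    a_ext = np.concatenate((np.asarray(a, float), np.zeros(max(0, n_ceps - p))))
    c = np.zeros(n_ceps)
    c[0] = np.log(gain ** 2)
    for m in range(1, n_ceps):
        acc = a_ext[m - 1]                                  # a_m (1-based indexing)
        for k in range(1, m):
            acc += (k / m) * c[k] * a_ext[m - k - 1]        # (k/m) * c_k * a_{m-k}
        c[m] = acc
    return c

print(lpc_to_cepstrum([0.9, -0.4, 0.1], gain=1.5, n_ceps=8))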
Now, we go for the next topic, which is called the MFCC parameter: mel frequency cepstral coefficients. There is a lot of information in the slides; I am not going into the details, I will just give you the gist. Mel frequency cepstral coefficients, popularly known as MFCC, are the most used parameters in speech applications; in most cases MFCC is used. There are three simple steps in MFCC: compute the effective power spectrum of the speech signal, apply the mel filter bank, and then compute the DCT. Why are these steps required?
So, we said that if I take the spectrum of a speech signal, how do I draw it? You know that if x[n] is my time domain signal and I apply the DFT, I will get X(k); I am not explaining the framing and windowing again. That is also included in the speech processing: instead of taking the whole signal at a time, I pre-emphasize the whole signal, cut the signal into frames with a window, and shift the window frame by frame. So, all those framing steps are done, and after that, if I take the DFT, I will get X(k).
Now, if I calculate the modulus of X(k) and plot it against frequency, then I get the spectrum of the signal x[n]. What is the spectrum? This axis is the k-axis, the discrete frequency k, and this axis is the amplitude, the modulus of X(k). So, at k = 0 I get this point, and I get this kind of speech spectrum. Note that I am drawing it smooth; if you draw it for real, you get this kind of jagged speech spectrum, as I explained in the last class.
So, if you see, I take a signal, this is my signal, and if I select this portion and analyze it, this is the spectrum, the plot of the spectrum of the selected part of the signal. This axis is the frequency axis, and this axis is the amplitude axis of the modulus of X(k). Instead of the discrete frequency k, I can also write f; the relation is $\omega = \frac{2\pi}{N}k$, where $\frac{2\pi}{N}$ is the resolution and N is the length of the DFT I have taken. Now, what is my interest? If I want to extract speech parameters from the speech signal, I am interested in finding these resonance frequencies, and those resonance frequencies actually define the speech event produced by the vocal tract.
So, if I want to find out those resonance frequencies, what can I do? Instead of taking the whole spectrum, I can take some filters. This sketch is not drawn perfectly; I will just shift this one here. I can design some filters, let us say triangular filters with 50 percent overlap, or uniform bandwidth filters. If I design them, then I am assuming that if this is the peak of my envelope, the average power in that filter will be very high. So, I am actually trying to capture the formant frequencies by spacing some filters along the spectrum; that is possible. Now, what is the baseband? If the sampling frequency is fs, then the maximum baseband frequency is fs/2. And up to what k should I take? Because the DFT is symmetric, if the DFT length is N, then taking up to k = N/2 is sufficient for designing the filters.
So, suppose I have fs = 8 kHz; then I know the maximum speech frequency is 4 kHz. Suppose I take 100 Hz filters, that is, linear filters each with 100 Hz bandwidth. If they do not overlap, how many filters are required? 4000/100, so 40 filters are required. Now, if there is 50 percent overlap, I can easily find out how many filters are required: 50 percent overlap means that instead of shifting by 100 Hz I am shifting by 50 Hz, so about 80 filters are required. Once I know how many filters are required, then instead of using all k I design these filters, let us say around 80 filters with 50 percent overlap.
So, instead of taking the whole spectrum, I am taking about 80 points of the spectrum, and to find the cepstral coefficients I can treat that as a signal, pass it through the inverse DFT, and the cepstral coefficients will be generated. So, the information is reduced. Now, the problem is that a linear filter bank does not match human frequency perception. If you remember my lecture on the human perception of frequency and amplitude in speech perception, we said the frequency perceived by a human being is not on a linear scale, it is on the mel scale. So, instead of taking linear filters, I now set the filter bandwidths as per the mel scale. What the mel scale is you know; using this equation we have already derived the mel scale. So, the bandwidth of each filter depends on the mel scale, and instead of uniform bandwidth filters, I take mel scale filters.
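The lecture refers to the mel scale equation derived earlier; one commonly used form of that mapping (an assumption here, since the exact constants are not reproduced in this transcript) is sketched below, along with example centre frequencies equally spaced on the mel scale.

import numpy as np

def hz_to_mel(f_hz):
    """Map physical frequency (Hz) to mel; a commonly used form of the mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, float) / 2595.0) - 1.0)

# Example: 20 filter centre frequencies equally spaced on the mel scale up to 4 kHz
centres_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 22)[1:-1])
print(np.round(centres_hz))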
So, what is the human perception of frequency? At low frequencies our resolution is very high; that means at low frequencies we perceive almost linearly along the physical frequency. But at higher frequencies we have lower resolution; that means the band of frequencies we perceive as the same frequency is larger. So, in the high frequency region the bandwidth of the filter will be large, because we do not require that much resolution there; a coarse estimate is sufficient at high frequency. That is why we take the bandwidth of the filter to be larger, with the bandwidth defined by the mel scale. If I take mel scale filters covering up to 4 kHz, I think 20 filters will be sufficient to cover the entire frequency range. Each filter gives me a single output point, so I can get m1, m2, up to m20; I find 20 mel points.
Since the filters are designed on the mel scale, we can say we have warped the frequency spectrum to the mel scale; instead of hertz I want the spectrum on the mel scale. So, instead of hertz I get mel values here, and instead of X(k) I get $\hat{X}(l)$, taking the average energy in each filter. So, instead of the ordinary spectrum I get a mel scale spectrum, and once I get the mel scale spectrum, if I compute cepstral coefficients from it, they are called mel scale cepstral coefficients. Since my spectrum is frequency-warped to the mel scale, they are called mel frequency cepstral coefficients. What am I actually doing mathematically? I am designing the filters; I will explain that in the next class. First let me describe the equation.
So, I calculate $\hat{S}(l)$, where l indexes the filters: using the mel filter bank I calculate the mel spectrum, and once the cepstral coefficients are computed from this mel-shaped spectrum, they are called mel frequency cepstral coefficients, MFCC. Have you understood MFCC? That is why it is called MFCC, mel frequency cepstral coefficient.
Now, how do you design those things? What should be the block diagram? I am describing the complete block diagram of the mel frequency cepstral coefficient computation.
So, if you see the basic signal processing blocks, they remain the same: pre-emphasis, windowing, DFT. After I get the DFT, what do I do? If s[n] is my signal, I pre-emphasize it (I am not writing the pre-emphasis and windowing part here), and after the DFT I get S(k). S(k) is on the normal frequency scale. From S(k), if I calculate the modulus of S(k) and plot it against frequency, |S(k)| is nothing but a frequency versus power plot, the spectrum on the normal frequency scale. Now I want the frequency scale to be the mel scale, so I pass that through the mel filter bank. When I pass it through the mel filter bank, what do I get? I get a spectrum, but on the mel scale: instead of S(k) I get $\hat{S}(l)$, where l is the index of the analysis filter. If there are 40 filters, or 20 filters, then l varies from 0 to 19, say; so, I get one value per filter.
So, once I pass the spectrum through the mel filter bank I get the mel spectrum, and once I have the mel spectrum I can pass it through the IDFT to get the mel cepstrum, which I can write as MFCC; the delta and double delta come later on. So, I get MFCC, mel frequency cepstral coefficients. The mathematics is this, and instead of the IDFT, the IDFT can be replaced by a DCT, the discrete cosine transform. The DCT equation is

$c_i = \sum_{m=1}^{L} \log \hat{S}(m)\, \cos\!\left(\frac{\pi\, i\, (m - 0.5)}{L}\right), \qquad i = 0, 1, \ldots, C-1$,

where L is the number of mel filters and C is the number of cepstral coefficients.
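A minimal sketch of this last step (my own illustration): the log mel-filter energies of one frame are turned into cepstral coefficients with the DCT written above. It assumes a vector of mel filter energies has already been computed for the frame.

import numpy as np

def mfcc_from_mel_energies(mel_energies, n_ceps=13):
    """DCT of the log mel-filter energies of one frame -> MFCC."""
    L = len(mel_energies)
    log_e = np.log(np.asarray(mel_energies, float) + 1e-10)
    m = np.arange(1, L + 1)                                   # filter index 1..L
    return np.array([np.sum(log_e * np.cos(np.pi * i * (m - 0.5) / L))
                     for i in range(n_ceps)])

print(mfcc_from_mel_energies(np.random.rand(20) + 0.1))       # 13 coefficients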
So, in the next class I will discuss how we implement it, with a more detailed discussion of the implementation issues of MFCC; then we discuss the delta and double delta, then PLP and then RASTA. With that the feature extraction will be completed and we will go for F0 extraction.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 34
MFCC Feature Vector
So, what we said is that if the cepstral coefficients are extracted from the mel spectrum, then we get the MFCC. How do we do that? Let us go through the implementation in a little more detail.
So, you know that I have a speech signal s[n], sampled at 16 kHz, and I take a window of, let us say, 20 milliseconds. How many samples do I get? 20 milliseconds is 320 samples. What should the DFT size be? 256 is too small, so N = 512 is the DFT I apply. What is the resolution? 16 kHz divided by N, that is 16000/512 = 31.25 Hz, so approximately 32 Hz per bin; that is the resolution I am assuming.
So, if I take the DFT with N = 512, I get S(k). Once I get S(k) I can take the modulus of S(k), which is nothing but the square root of the real part squared plus the imaginary part squared. Once I take it, I want to plot it: this axis is my k axis and this axis is my |S(k)|. So, this is k = 0, the next is k = 1; k = 1 means about 32 Hz, k = 2 means about 64 Hz, and so on, going on up to N/2.
Now, on these values I want to implement the mel filters. Based on the mel scale, let us say the initial filter bandwidth is 100 Hz, with the centre frequency at 50 Hz; so this point is 50 Hz and this is 100 Hz. If the filter spans 100 Hz, how many k values fall inside it? k = 2 is about 64 Hz and k = 3 is about 96 Hz, so roughly up to k = 3. So, what is the power? The value at k = 0 is multiplied by the filter weight function; M_l(k) is the filter weight function. This is the first filter I am designing, and it requires k = 0 to k = 3: the values of S(k) are multiplied by M_l(k). Here, the capital S(k) in the equation on the slide represents the modulus of S(k), the original spectrum; if it is the original spectrum, I can write it as capital S. So, the original spectrum values at k = 0, k = 1, k = 2 and k = 3 are each multiplied by the weight function M; M is called the filter weight function. That weight function multiplied by the individual power values, summed up, gives me one point, this point. Next I design another filter with 50 percent overlap, then another one, and that way I can design all the filters.
So, the equation becomes

$\hat{S}(l) = \sum_{k=LL_l}^{UL_l} |S(k)|\, M_l(k)$,

where S(k) is the spectrum, M_l(k) is the filter weight function, and the sum runs from the lower bin LL to the upper bin UL of filter l. For the first filter LL = 0 and UL = 3. Similarly, the next filter starts at 50 Hz, which is roughly k = 1 or k = 2, so for the second filter I find LL = 1 and UL = 4. That way I can design the filters and find $\hat{S}(l)$, which is the mel scale spectrum. Then I take the log of $\hat{S}(l)$, because the cepstral coefficients have to be calculated using homomorphic decomposition, which requires the log. So, I take log $\hat{S}(l)$, and then instead of the IDFT I take the DCT; that gives the complete block diagram and the MFCC equations. This block diagram contains an IDFT, and that IDFT can easily be replaced by a DCT, the discrete cosine transform.
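A minimal sketch of this filter bank step (my own illustration, assuming triangular weight functions with centres equally spaced on the commonly used mel mapping sketched earlier; the frame setup and filter count are example values):

import numpy as np

def mel_filterbank_energies(power_spec, fs=16000, nfft=512, n_filters=20):
    """Apply triangular mel filters M_l(k) to one frame's power spectrum."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edge frequencies equally spaced on the mel scale, converted to FFT bins
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * edges_hz / fs).astype(int)

    energies = np.zeros(n_filters)
    for l in range(n_filters):
        lo, centre, hi = bins[l], bins[l + 1], bins[l + 2]
        weights = np.zeros(nfft // 2 + 1)
        weights[lo:centre] = (np.arange(lo, centre) - lo) / max(centre - lo, 1)  # rising edge
        weights[centre:hi] = (hi - np.arange(centre, hi)) / max(hi - centre, 1)  # falling edge
        energies[l] = np.sum(power_spec * weights)        # sum_k |S(k)|^2 * M_l(k)
    return energies

frame = np.hamming(320) * np.random.randn(320)
power_spec = np.abs(np.fft.rfft(frame, n=512)) ** 2
print(mel_filterbank_energies(power_spec).shape)           # (20,)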
Why do we use the DCT? The reason is that the DCT de-correlates the extracted filter bank outputs, so the resulting cepstral coefficients are approximately uncorrelated. The DCT also acts much like an FFT, but no complex arithmetic is required, so the implementation is easier at this stage; that is why instead of an FFT we use the DCT, the discrete cosine transform.
Now, I go for the delta coefficients. Suppose I calculate the MFCC coefficients m1, m2, ..., up to m20; these are the MFCC coefficients for a single frame, frame 1. I can calculate m1 to m20 for frame 2, for frame 3, and so on. It is said that out of the 20 coefficients, taking the first 12 or 13 is sufficient, so instead of 20 I can take, let us say, 12 per frame.
So, I have the MFCC for frame 1, frame 2, frame 3. Now, if you see the spectrogram of this signal and look at this cursor, this is the second formant. The second formant is moving, and even the first formant is moving. So, there is formant movement, or I can say the dynamics of the parameters are changing across the segment. Those dynamics can also be good features, so I want to capture them. This represents frame 1, this frame 2, this frame 3. What are the dynamics? The difference between frame 2 and frame 1, and the difference between frame 3 and frame 2. So, the first order dynamics is f2 - f1, the frame 2 MFCC parameters minus the frame 1 MFCC parameters; next f3 - f2, then f4 - f3, and so on. If I want the second order dynamics, the difference between the first order dynamics is nothing but the second order dynamics. The first order dynamics is called the delta coefficient, and the double delta is nothing but the difference of differences: (f3 - f2) - (f2 - f1), which is f3 - 2 f2 + f1. So, I calculate the first order difference, then repeat the same process on it to get the double delta coefficients. The delta coefficient is the first order dynamics: if the formant is moving, the first order dynamics is like the velocity of the formant, and the second order dynamics gives the acceleration of the formant.
So, how the formants are moving is the velocity, and how quickly that movement changes is the acceleration; delta and double delta can be speech parameters. Now, if you just compute frame 2 minus frame 1 and frame 3 minus frame 2, and the parameter varies very randomly from frame to frame (like this one, this one, this one), the difference is very noisy. If I want a smooth variation, what should I do? Instead of taking just the difference between two consecutive frames, I can take an averaged difference over a few frames; that is smoother. So, instead of taking the plain difference, I calculate the delta or double delta coefficient across a few frames instead of a single frame pair: instead of taking m1 of frame 2 minus m1 of frame 1, I take m1 over a few frames and compute the delta coefficient from them.
So, this is the equation, used to get the smooth movement of the formants. The delta coefficient is not just the difference between adjacent frames; it takes several frames, calculates a weighted average difference, and moves along like a smoothing filter, so that I get a smooth variation of the trajectory. That is called the delta coefficient, and applying the same operation to the deltas gives the double delta coefficients.
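A minimal sketch of this smoothed delta computation (my own; the window half-length N = 2 and the standard regression-style formula are assumptions, since the exact equation on the slide is not reproduced in this transcript):

import numpy as np

def delta(features, N=2):
    """Smoothed delta over a window of +/- N frames.

    features: array of shape (num_frames, num_coeffs), e.g. 12 MFCCs per frame.
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    """
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")   # repeat the edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out

mfcc = np.random.randn(50, 12)          # 50 frames of 12 MFCCs (example)
d = delta(mfcc)                          # velocity
dd = delta(d)                            # acceleration
features = np.hstack([mfcc, d, dd])      # (50, 36); adding energy terms would give 39
print(features.shape)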
So now, if you remember the first slides, I make the MFCC feature vector. What did I say? I take 12 MFCC parameters, then 12 delta MFCC parameters, then 12 double delta MFCC parameters. Now, if you remember the last discussion, when I applied the first filter I multiplied the k = 0 coefficient of the spectrum by a weight function that is 0 there, so I have not really captured the overall energy of the signal. So, what I do is take the energy of the signal, the average energy, which can be computed directly, and add it as a feature. Once I take the energy I can also take the delta energy and the double delta energy. So, that gives me 12 + 12 + 12 and 1 + 1 + 1, that is 36 + 3 = 39: a 39-dimensional feature vector. This is called the MFCC feature vector.
Next is PLP, the perceptual linear prediction parameter. What is it? Linear prediction, LP, you already know. So, what is perceptual linear prediction? As already said, I modify my signal as per human perception. The spectrum of the signal is not on the scale of human perception, so I modify the spectrum of the signal as per human perceptual behaviour. Humans have two relevant perceptions: the perception of frequency and the perception of loudness; we do not perceive all frequencies as equally loud. This matters because the LPC analysis is amplitude dependent, being a time domain analysis.
So, I want to normalize it across frequency: I normalize the frequency parameters with respect to human perception. If you see the block diagram (I am not discussing it in detail; the detailed slides are there): take the speech signal, take the Fourier transform, then calculate the spectrum. Modify the spectrum as per the critical bands, the mel frequency scale or the Bark scale. Then modify the loudness, the amplitude of the spectrum, as per the equal loudness curve. Then apply the power law of hearing. What is the power law? If you remember the relation between loudness and intensity, I think it is about 445 I^0.333.
So, what is that? It is the cube root of intensity: the intensity-loudness relationship is a cube root, with a constant in front. So, modify the loudness using the cube root of intensity. Then take the inverse Fourier transform, and from the result I can calculate the coefficients a1, a2, a3, a4, as I have already discussed; those are called the PLP coefficients, perceptual linear prediction.
Then there is another speech parameter which is called RASTA, relative spectra; there is another slide on RASTA also. It is based on relative spectral amplitude modulation: a modulation filter is applied to the spectrum, equivalently a modulation filter applied to the log spectrum. So, those are the modifications of the frequency parameters. This is all about parameter extraction from the speech signal.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 35
Fundamental Frequency Detection of Speech Signal
So, we have discussed the frequency domain parameter extraction, and the LPC parameter extraction is already covered. There is one more important parameter which has to be extracted from the speech signal: the fundamental frequency, or F0. Sometimes it is referred to as pitch extraction, but pitch is not a physical parameter, it is a perceptual parameter; that is why we call it F0 extraction, fundamental frequency extraction. Now, think back to my tube modelling lectures and LPC lectures.
you remember that my tube modelling lectures and LPC lectures.
So, what we said that while human being produces the speech. So, there is a vocal tract,
that vocal tract function which is h[n]. And it is modulated input is call gotal excitation.
So, gotal excitation, either vocal code is closed or vocal code is open. If vocal code is
closed then there is will be a vocalic sound or voice sound. So, if it is a voice sound that
can be modeled as a impulse strain modulated by a gotal transfer function, and then
multiplied by the gain.
So, when we produce the speech, either speech has a voice segment or unvoiced
segment. If it is voice speech then it will be connected to here. If it is a unvoiced speech
then there will be noise source and it will be connected to here. Then I get the speech
signal S[n]. So now, what is f0? So, f part of the vocal code oscillation. So, if it is vocal
code oscillation. So, f0 we will be here the impulse by which we are simulating the vocal
519
code vibration has a period. So, that period is responsible for fundamental frequency of
the speech signal.
Now, think about the reverse. I want to extract F0 from s[n]. I have the recorded voice signal, and I want to know, if it is voiced and was generated using an impulse train, what the fundamental frequency of that impulse train is. If you remember the LPC synthesis, from the speech signal we extracted the gain, the fundamental frequency and the LPC parameters. The fundamental frequency is required to generate the impulse train, which goes through the glottal filter and the vocal tract modification to produce the source and the speech. So, the fundamental frequency is part of the glottal excitation; it is not part of the vocal tract transfer function, understand? So, from the speech signal, how do I find out the fundamental frequency? That is called F0 extraction, and F0 extraction has many applications.
Mainly, when I discuss the speech applications next week, there is one called speech synthesis; there F0 is the most important part for producing the melody of the speech. When I speak, it is not a robotic voice: if I speak a sentence, the fundamental frequency keeps changing, it is not fixed throughout the sentence. So, suppose I have a spoken sentence.
If it is a 5 second signal, F0 is not constant throughout the 5 seconds; F0 is moving, and that is called fundamental frequency control. Now, if I remove this F0 movement, the speech becomes robotic. So, F0 is one of the main factors behind the melody of speech communication. In singing also, think about sa re ga ma pa dha ni sa. Sa is the base frequency and the next sa is one octave above. So, if my base sa is 200 Hz, then sa to sa is an octave and the next sa is 400 Hz. Then this range is divided into several notes; sa re ga ma pa dha ni sa gives 7 notes, and the division can be uniform or non-uniform; if it is uniform, you can call it an equal-tempered scale. Now, when people talk about the raga of a song, different ragas like Bhairavi, they make comments such as: you are not touching the upper note, you are missing the note.
You have seen those kinds of comments in the song competitions on the TV channels, where the judges remark: your upper tones are not reached, or you are not touching the upper notes. What does that mean? It means that the movement of F0 has to be controlled during production. When I sing a song I am controlling the movement of F0: if it is sa, then ni, then sa to ni, ni dha pa ma. So, the movement of F0 is actually defined by that sa re ga ma pa dha ni sa. Maybe while moving from a lower note to an upper note, the singer does not quite touch the upper note and falls back down, because, you know, sa to sa is a fixed interval.
So, if my first sa is 200 Hz, the next upper sa must be 400 Hz. When I sing the upper sa, I have to change my fundamental frequency from the lower sa to the upper sa. So, the fundamental frequency is actually responsible for the melody of a song, the sa re ga ma that defines any song. Also in voice communication, when I speak, the fundamental frequency is not always the same; the fundamental frequency moves to provide smooth communication between human beings, otherwise it would be a robotic voice. Initially people thought that the fundamental frequency has no use in speech recognition, or that whatever LPC parameters we extracted are sufficient.
But in modern speech recognition systems people are now asking how the fundamental frequency can be used effectively to improve the recognition rate. A new dimension is coming which is called spoken language processing; I will discuss some of those issues in the speech application lectures. Whatever we are doing as ASR is not the whole story, because spoken language is different from written language; I will discuss this in next week's lectures. Then speech coding: yes, we have already said that even in basic LPC coding the F0 has to be provided at the receiver end, because I have to know what the F0 of the input impulse train should be in order to regenerate the voiced signal.
So, F0 is a very important part of speech coding. Then voice conversion, which is very important; this is a new technology, voice conversion and accent conversion. We will discuss the details in the speech application classes, but suppose, and this is my dream, that I can develop a technology where somebody is speaking in American English and I am listening to it in the kind of English I myself produce. I am not saying speech-to-speech translation, I am not saying the English is converted into my mother tongue; I am saying I am converting the same English speech spoken by an American speaker into my own English, because my English is not American English. Think about a Japanese professor giving a lecture: if I cannot understand Japanese-accented English clearly, my level of understanding will go down.
Now, if I am able to build a device such that the Japanese professor speaks in his own accented English but I listen in my own accented English, then I think the problem is solved. That is the technology: voice conversion and accent conversion. Then spoken language learning, and even singing. Suppose I want computer-based learning of singing. Somebody, say a guru, sings the song and touches notes 1, 2, 3, 4; all the note movements are shown. Now, as a disciple, I sing the same song, and the computer shows me: you started here, but your transition from this note to this note took this much time, and you are not touching this note. So, next time I sing it again and gradually I reach that next note.
So, that can be used as a tool for learning, and similarly for language learning. Many languages have tone as a phonemic feature, or stress as an important parameter for speech clarity; in those cases F0 is very important. So, there are numerous applications for F0, and extraction of the F0, the fundamental frequency, is an important issue. Not only in speech but in music signals also: a lot of music research is going on, and there too the extraction of the fundamental frequency is very important, to find out the movement of the notes. A raga is defined by nothing but the movement of the notes, so you have to find how the notes move from the signal itself. So, there are a number of applications, and there are also a number of problems.
There are also a number of characteristics of the fundamental frequency. If it is produced by a human being, then you know the F0 can take values from about 50 Hz to 400 Hz; that is the variation of F0, and child speech can have frequencies around 400 Hz. Now, if my sampling frequency fs is 8 kHz and the F0 is 50 Hz, then how many samples will there be in one F0 period? F0 is 50 Hz and fs is 8 kHz, so it is nothing but 8000/50 = 160 samples in one F0 period. And 160 samples at 8 kHz is 20 milliseconds, the period of a 50 Hz F0.
So, a 20 millisecond window of signal may contain only a single F0 period. Similarly, if my base frequency is 50 Hz, my F0 can vary up to 100 Hz within a single sentence, because an octave is a 2:1 ratio. Next, speech is called quasi-periodic: if you see the speech signal, it is not exactly periodic.
Unlike when I generate the speech signal using a unit impulse train, in real speech the F0 of two consecutive periods is not the same; there are small differences, and that small difference is called jitter. So, when I extract an F0, I cannot say that all the F0 values will be the same over a segment. The F0 is continuously moving, and there are small forward and backward differences; it is not a smooth F0 contour, it can look jittery, one period slightly shorter, the next slightly longer, varying randomly. That is why it is called a quasi-periodic signal, not an exactly periodic signal. Then, F0 is influenced by many factors. I cannot say my F0 will be the same all the time: this time I say one sentence, next time I say the same sentence, and the F0 may not be the same. It may be different because it depends on my emotion; those who work on emotion analysis or speaker characteristics know that F0 plays a part there.
Take the phrase "the old man and woman": does it mean only the man is old and the woman is not, or are they grouped together so that both the man and the woman are old? In spoken language, if you record it and look at the F0 movement, you find that the F0 helps define the grouping, and the phrase boundary will fall in a particular place; this is called the spoken phrase boundary. The details I will discuss when I cover prosody analysis and the use of prosody information in speech processing, in the 8th week lectures.
Also, I may not get a purely voiced signal; some noise may be included, the signal is quasi-periodic in nature, and it is difficult to estimate F0 in speech with low energy. A high F0 can also be affected by a low first formant, and this is a very important issue: if you remember, the first formant frequency is around 500 Hz, but it can reduce and go down to around 360 Hz; so if the first formant is at 360 Hz and the F0 is around 300 Hz, they come close together.
So, F0 can be confused with the first formant frequency; that is a real issue. The next challenging issue is band limitation. If you remember, we said telephone speech is limited to roughly 300 Hz to 3.5 or 4 kHz, so below 300 Hz there is no signal component. Suppose I am speaking over the telephone and my fundamental frequency is 180 Hz; the F0 component is not present in the received signal, but we still perceive the fundamental frequency. That is because of the harmonic structure: the spectral information repeats, so if you plot it there is a component at F0, the next at twice F0, the next at three times F0, and so on. So, the F0 itself is not there, but the spectral repetition tells me what the F0 of the signal is. Sometimes this is important: we should look not only for F0 itself, which may not be there, but for 2 F0 and 3 F0.
So, from the repetition of the spectral information I can find out, and indeed perceive, the F0. But when this kind of band limited signal comes in, it becomes quite difficult to find F0. There are a lot of techniques for F0 extraction; some are time domain methods and some are frequency domain methods. Let us start with some time domain methods. If you remember, during the time domain speech processing class I talked about zero crossings.
We said that if it is a pure sine wave, then within a single period the signal crosses the zero line 2 times. From that idea I can find the F0 of the signal: there are fs/f0 samples per period, and the number of zero crossings per second is about twice the F0. So, I can count the number of zero crossings per second and find the period t0 within which the signal crosses the zero line 2 times; that is called zero-crossing-based F0 detection. But what is the problem? Speech is not a pure tone; there are high frequency components. Now, look at the number of zero crossings: this is the true fundamental period, but the number of zero crossings changes within one period, as I can show you in a picture.
In that case I fail to detect the F0. So, zero crossing rate based F0 detection is not a very good solution or methodology for finding F0. On the other hand, counting zero crossings is the easiest algorithm, so the time complexity of computing F0 this way is very low; but it is very sensitive to noise and to high frequency content. So, I can say that the zero crossing based F0 detection method is not that useful, or you can say it is not reliable.
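For completeness, here is a minimal sketch of this simple (and, as noted, unreliable) zero-crossing estimate; the 16 kHz rate and the test tone are example values of my own.

import numpy as np

def f0_by_zero_crossings(frame, fs=16000):
    """Crude F0 estimate: a pure tone crosses zero twice per period,
    so f0 is roughly (zero crossings per second) / 2."""
    signs = np.sign(frame)
    crossings = np.sum(np.abs(np.diff(signs)) > 0)        # number of sign changes
    duration = len(frame) / fs                             # frame length in seconds
    return crossings / (2.0 * duration)

t = np.arange(0, 0.02, 1 / 16000)                          # 20 ms of a 200 Hz tone
print(f0_by_zero_crossings(np.sin(2 * np.pi * 200 * t)))   # close to 200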
The next technique is called the autocorrelation technique, which is the most used technique for F0 detection. I am not reading the slides; you can read them. If you remember, what is autocorrelation? Suppose this is my signal; think about a pure sine wave. Autocorrelation means correlating the signal with itself. Let me draw the same signal again. The autocorrelation equation, writing it in the digital domain and using the lag k instead of tau, is

$r[k] = \sum_{n=0}^{N-1} x[n]\, x[n+k]$.
Now for r[0]: the first coefficient is nothing but the sum of the products of each sample with itself, this sample multiplied by this sample, and so on, so I get the maximum value. Next, I shift the signal by one sample; the shifted signal starts from here, and zeros are padded to keep the length. So, I get an autocorrelation coefficient r[1] which is less than r[0], and r[2] is also less than r[0]. When the shift is large enough that the signal comes back in phase, reaching here, the 0th sample is effectively multiplied by a sample one period away, and I again get a maximum r value, at a lag of about fs/f0 samples, because it is a pure sine wave. If it is not a pure sine wave, then I will still get a maximum when the shifted signal lines up exactly with its period. So, let us look at that plot.
So this is my signal, and these are the r values: r1, r2, r3, r4, r5, r6 and so on. Now if you see, at around 45 to 50 samples a maximum comes again; the next maximum comes at 45 plus 45 samples, and so on, at the period and its multiples. So I can say that the distance from here to here, about 45 samples, is t0. So what is f0? It is nothing but fs/t0. Suppose this signal has a sampling frequency of 16 kilohertz; with t0 of about 50 samples, f0 is 16000/50 = 320 hertz, so around 300 hertz. That is how I can calculate the f0. Now, how do you process a whole utterance with this procedure?
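A minimal sketch of this calculation for a single voiced frame, assuming NumPy, a 16 kHz sampling rate and a 50-400 Hz search range (these numbers are my choices for the example, not from the slides):

    import numpy as np

    def f0_autocorr(frame, fs=16000, fmin=50.0, fmax=400.0):
        frame = frame - np.mean(frame)
        # r[k] = sum_n x[n] * x[n+k], keeping only non-negative lags
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo = int(fs / fmax)                 # shortest period (in samples) to consider
        hi = int(fs / fmin)                 # longest period to consider
        lag = lo + np.argmax(r[lo:hi])      # first strong peak away from lag 0 = t0 in samples
        return fs / lag                     # f0 = fs / t0

For example, with fs = 16000 and a peak at lag 50 this returns 320 Hz, matching the rough calculation above.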
If you remember, suppose I have a signal: I have recorded my name at 16 kilohertz, 16 bit. My name is long, so the recording may be, let us say, 2.5 seconds rather than 3.5. How many samples are there? 16 k plus 16 k plus 8 k, so around 40,000 samples. Now, when I say my name the fundamental frequency is not the same throughout the signal, it is moving; and the fundamental frequency is present only where the signal is voiced.
Suppose I say Kolkata. If you remember the manner of articulation, the initial "ko" is unvoiced: there will be a burst, then a consonant-to-vowel transition, then the steady-state vowel o; then o to "l", which is a voiced consonant; then again a stop consonant and again a burst, then again a vocalic transition; then again a stop consonant, again a burst, and a final vocalic transition before the end. So in every vocalic part, from here up to here, I can get an f0 value; from here to here I can get an f0 value; and in this vocalic part I get an f0 value. Now how do I know where the vocalic parts are? This is a very important point. Instead of finding that out explicitly in the signal, what I will do is this: let us say Kolkata is recorded in 0.5 second.
That means there are 8 k samples. Now I frame the signal at 100 frames per second, so for Kolkata I will get 50 frames. Each frame has a duration of 10 milliseconds, analysed with maybe a 20 millisecond window, and each frame will give me an average fundamental frequency value. If I take a 20 millisecond window and shift that window by 10 milliseconds, I get 100 frames per second, one frame for every 10 milliseconds. So if I draw Kolkata as f0 frames: the 0.5 second signal is divided into 50 frames, one every 10 milliseconds, and for every 10 milliseconds I can calculate an average f0 value, then the next 10 milliseconds, and so on. So I get an f0 contour whose points are 10 milliseconds apart. For each 20 millisecond piece of signal I calculate the r(k) values, find the first maximum after lag zero, take that lag as the period, and one divided by that period gives me the f0 value. So this is autocorrelation based f0 extraction: very easy, but it is time consuming. Now, if you look at the autocorrelation of this signal, the r(k) plot is decaying.
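Putting the framing described here together with the single-frame function sketched above, the contour extraction looks roughly like this (20 ms window, 10 ms hop; the energy-based voiced/unvoiced check is my own simplification, not part of the lecture):

    import numpy as np

    def f0_contour(x, fs=16000, win=0.020, hop=0.010, energy_floor=1e-4):
        wlen, hlen = int(win * fs), int(hop * fs)
        contour = []
        for start in range(0, len(x) - wlen, hlen):
            frame = x[start:start + wlen]
            if np.mean(frame ** 2) < energy_floor:       # crude silence/unvoiced test
                contour.append(0.0)                      # no f0 reported for unvoiced frames
            else:
                contour.append(f0_autocorr(frame, fs))   # f0_autocorr as sketched earlier
        return contour                                   # one value every 10 ms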
It goes like this, this way, this way, and then it comes down to almost zero. So the first peak and the second peak are not equal; the later peaks are lower, and detection of the peak becomes difficult and sometimes erroneous. How do I improve this result? I want the decay not to be there. The version without the decay is called the modified, or normalized, autocorrelation. The formula changes: instead of calculating the plain autocorrelation r(k) = sum over n = 0 to N-1 of x(n)·x(n+k), a normalized form is used (the formula is on the slide).
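For reference, one standard way to remove this decay (a textbook definition, not transcribed from the slide) is to normalize each lag by the energy of the two segments being compared:

    r'(k) = sum_n x(n)·x(n+k) / sqrt( sum_n x(n)^2 · sum_n x(n+k)^2 )

so that r'(0) = 1 and, for a perfectly periodic signal, the peaks at multiples of the period stay near 1 instead of decaying with lag.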
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 36
Frequency Domain Fundamental
Frequency Detection Algorithms
So we have said that cross correlation is one of the methods to enhance the result of the autocorrelation technique. The other method is called center clipping. If I have a speech signal like this, I want to keep only the part of it that matters for periodicity. For f0 extraction you can also low-pass filter the signal, since f0 up to about 500 hertz is sufficient. So I apply center clipping to the signal so that the fine amplitude variation is suppressed, because that variation gives the autocorrelation function extra structure; the clipped signal is then used to compute the autocorrelation and calculate the f0. The algorithm is very simple. How do you find the clipping level C_L? Generally it is a fraction of the peak. How do you do that? Suppose I have taken a 20 millisecond window of signal.
Within that 20 milliseconds I can find the maximum peak. Let the 20 millisecond signal be represented by x(n). If I find the maximum amplitude within the 20 milliseconds, then C_L is k times that maximum amplitude, where k is typically between 0.6 and 0.8. Now, if the signal has a lot of variation in amplitude, then instead of taking a single maximum I can divide the 20 millisecond window into, say, 4 segments of 5 milliseconds each, find the maximum in each segment, and take the mean of those maxima to compute the C_L value; that is the formula. So this is called center clipping, used to improve autocorrelation based f0 extraction.
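As a sketch, the standard center-clipping rule used with autocorrelation pitch detectors zeroes everything between -C_L and +C_L and shifts the rest toward zero; the value k = 0.7 below is just an assumed choice within the 0.6-0.8 range mentioned here.

    import numpy as np

    def center_clip(frame, k=0.7):
        cl = k * np.max(np.abs(frame))               # clipping level C_L from the frame peak
        out = np.zeros_like(frame, dtype=float)
        out[frame >  cl] = frame[frame >  cl] - cl   # keep only the excursions above +C_L
        out[frame < -cl] = frame[frame < -cl] + cl   # ...and below -C_L
        return out                                   # autocorrelation is then run on this signal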
Another procedure: instead of taking the speech signal directly, remember the LPC analysis. If this is my vocal tract transfer function h(n), and I apply the excitation A·u(n), then I get the speech signal s(n). Now, in linear prediction I predict the signal s(n), and if I subtract the predicted signal, the remainder is nothing but A·u(n), which is nothing but the error e(n); that error signal is nothing but the excitation signal. This is inverse filtering: I design the inverse filter 1/h(n), pass the signal s(n) through it, and I get A·u(n), the LPC prediction error. Now, if you remember, f0 is a property of the excitation, not of the vocal tract, so the f0 information is here, in the excitation, not in h(n). So I can extract the f0 from the e(n) signal. Instead of using s(n), the speech signal, which carries a lot of spectral structure introduced by h(n), I pass s(n) through the inverse filter to get A·u(n), the error signal of the LPC analysis, and from the LPC error signal I find the f0 value using autocorrelation.
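A sketch of this inverse-filtering route, assuming librosa (for librosa.lpc) and SciPy are available and reusing the f0_autocorr sketch from earlier; the LPC order of 12 is a typical assumption for 16 kHz speech, not a value from the lecture.

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    def f0_from_lpc_residual(frame, fs=16000, order=12):
        a = librosa.lpc(frame.astype(float), order=order)  # A(z), the prediction-error (inverse) filter
        residual = lfilter(a, [1.0], frame)                # e(n): excitation-like LPC error signal
        return f0_autocorr(residual, fs)                   # f0 lives in the excitation, not in h(n)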
Next: the autocorrelation is not very efficient, because correlation is an expensive operation; that is one of the costs of the autocorrelation technique. The computational efficiency can be improved if I use the FFT algorithm to implement the calculation of the correlation function; you can read this from the slide. Also, autocorrelation requires sample-by-sample multiplication: for a 16 bit signal, each 16 bit sample is multiplied by another 16 bit sample, which in the worst case requires a lot of memory, and multiplication itself is computationally expensive.
So for hardware implementation, or when the computation facility is not enough, you can use the average magnitude difference function, which is called AMDF. What is the theory, what is the method here? Suppose you have a signal x(n) which is periodic; some fundamental frequency is there. Now if I plot x(n) against x(n+k) for some delay k, I get a scatter plot. If a sample matches exactly with the delayed sample, the point falls on the 45 degree line; so along the 45 degree line lie all the samples for which x(n) equals x(n+k). Since speech is not exactly periodic, the points do not sit exactly on the 45 degree line, so what I calculate instead is the deviation from the 45 degree line: I find the average value of that deviation for each delay, and plot the average deviation against the delay. So the function is D(k) = (1/N) times the sum over n = 0 to N-1-k of |x(n) - x(n+k)|. For k = 0 the differences are x0-x0, x1-x1 and so on; for k = 1 they are x0-x1, x1-x2, x2-x3, and so on; for every delay I compute all the differences, add their magnitudes and take the average. The deviation will be a minimum when the delay matches the period. So if I plot the deviation, the plot will look like this.
The minimum deviation happens at the period, that is, at the lag corresponding to f0; this is t0, this is twice t0, this is thrice t0. This technique is used to find the fundamental frequency when the computation power is limited. What are the pros and cons? The computational complexity is low and it suits hardware implementation, but the performance is poorer and it does not cater for variation in the energy of the speech signal. The deviation depends on the energy: suppose this time I speak with low energy and next time with high energy, then the deviation values change; or in a long vocalic region, where the amplitude is changing, there will be extra variation in the estimate as well. So that kind of amplitude or energy problem is there. So autocorrelation is the widely used method, and AMDF is used when a rough average estimate of f0 is sufficient but the computational complexity must be very low.
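A minimal AMDF sketch in the same style (the search range and names are my assumptions):

    import numpy as np

    def f0_amdf(frame, fs=16000, fmin=50.0, fmax=400.0):
        n = len(frame)
        lo, hi = int(fs / fmax), int(fs / fmin)
        # D(k) = (1/N) * sum_n |x(n) - x(n+k)|; only subtractions, no multiplications
        d = np.array([np.mean(np.abs(frame[:n - k] - frame[k:])) for k in range(lo, hi)])
        lag = lo + np.argmin(d)        # deviation is minimum where the lag equals the period
        return fs / lag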
There are other algorithms also: waveform maximum detection, adaptive filtering, the circular average magnitude difference function, the cumulative mean normalized difference function; all these other methods are also available.
Then there are the frequency domain PDAs, pitch (f0) detection algorithms that operate on the speech spectrum. The distance between the harmonics is the fundamental frequency, the inverse of the pitch period. So what is the key idea behind a frequency domain PDA? If a signal has a fundamental f0, then harmonics will appear at f0, 2f0, 3f0, 4f0, and from there I can calculate the f0. The main drawback of frequency domain PDAs is the high computational complexity: transforming any signal into the frequency domain is computationally expensive. The two kinds of PDA that are mostly used are harmonic peak detection and spectrum similarity.
First, detect all the harmonic peaks. Suppose this is my spectrum, something like this. If this is f0, this is 2f0, this is 3f0; all the harmonics repeat. From there I can find the repetition distance, and that gives me the f0 value. How do you do that? It is called comb filtering; if you remember, a comb has teeth and gaps. So if this is f0, and then 2f0, 3f0, 4f0, I can find the average spacing between successive peaks, and that average spacing gives me the value of f0. That is the idea.
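One crude way to realize this harmonic-spacing idea in code is to pick spectral peaks and average their spacing; the thresholds below are arbitrary assumptions, and a practical system would be considerably more careful.

    import numpy as np

    def f0_from_harmonic_spacing(frame, fs=16000, nfft=4096):
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft))
        freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
        band = freqs < 2000                                  # harmonics below 2 kHz are enough for f0
        s, f = spec[band], freqs[band]
        peaks = [f[i] for i in range(1, len(s) - 1)
                 if s[i] > s[i - 1] and s[i] > s[i + 1] and s[i] > 0.1 * s.max()]
        if len(peaks) < 2:
            return 0.0                                       # no harmonic structure found
        return float(np.mean(np.diff(peaks)))                # average spacing of the peaks ~ f0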
In spectral similarity, I take the measured spectrum and build a template: this is my candidate f0, this is twice f0, this is thrice f0; then I compute the correlation to see whether the template matches the original spectrum or not. Once it matches, I say this is my f0. That is spectral-similarity based f0 detection. There are time-frequency domain methods as well; a lot of algorithms exist for time-frequency based f0 detection. But most of the time today there is state-of-the-art software, available free and open source, with which you can extract the f0 reliably. One such software is Praat; with Praat you can also export the f0, and Praat uses an autocorrelation based technique to extract f0.
So using Praat also you can extract the f0 value; this covers f0 extraction.
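From Python, the same Praat autocorrelation pitch tracker can be driven through the parselmouth package; this assumes parselmouth is installed, and the file name is only a placeholder.

    import parselmouth                          # Python interface to Praat

    snd = parselmouth.Sound("my_name.wav")      # placeholder recording
    pitch = snd.to_pitch(time_step=0.01)        # one f0 estimate every 10 ms
    f0 = pitch.selected_array['frequency']      # 0.0 wherever Praat marks the frame unvoiced
    times = pitch.xs()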
Now, if I give you a speech signal, you can extract the f0 and plot it against time: for every 10 milliseconds I get an f0 value, and plotting them gives the f0 contour. During prosodic modelling I will discuss the f0 contour in detail, how it is useful for modelling prosody, and how the f0 contour can even be useful in ASR. I will show those things in week 8, on prosodic modelling; the speech applications come in this coming week, week 7.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 37
Text To Speech Synthesis
So let us start the next week, the new week, in which we talk about speech processing applications. Mainly I have chosen three applications, but there are many more. If time permits, in the last week I may also cover speaker recognition and identification; a little bit of GMM and vector quantization I will cover in that topic. But today let us talk about applications like speech synthesis and ASR. I will not cover ASR in detail because it can itself take several classes. So first I cover the speech synthesis application: how we develop a speech synthesis application and what the processing blocks are; then ASR and the recent trends in ASR; and then I talk about accent conversion, which is my dream research area. A lot of research work is going on in accent conversion and related speech conversion kinds of things. We will discuss all those things during this week. So let us start talking about speech synthesis, popularly known as TTS - Text to Speech Synthesis System.
(Refer Slide Time: 01:28)
Now, look at the name text to speech. I can say that I require a machine, or a system, that takes text as input and produces speech as output. If you remember the speech production mechanism, how speech is produced: message planning, then coding, then execution by the vocal tract to produce the speech. So this conversion from input text to speech involves not only that production model but also text processing models, because when we say something it is not just segmental information; it contains the suprasegmental information also. If I look at the basic block diagram of speech synthesis, I can say the text may contain several kinds of information: linguistic information, non-linguistic information, and paralinguistic information. All that information is processed in our mind and executed by the vocal tract system to produce the speech. So the speech signal contains all three: linguistic, non-linguistic and paralinguistic information. Now, if I want to develop a system, a machine, or an algorithm that takes input text and converts it to speech, then all kinds of text processing and signal processing are required for the development of TTS.
(Refer Slide Time: 03:44)
If you see the definition in the slides: text to speech software is used to convert words or sentences from a computer document into audible speech spoken through the computer speakers. So TTS is nothing but a system whose input is text and whose output is speech. As for applications: if I have a good TTS, think about it, it provides hands-free operation and it helps remove the digital divide. There are a lot of people who cannot read text but can speak, because speech, the spoken language, is what they use for communication. A person who knows nothing about the written text can still speak; if you go to village areas, many people cannot read, but they can speak. Now suppose I have information stored as text in a computer, in many forms: documents, newspapers, all textual information. If I want to disseminate that information to people who cannot read, then TTS is a must; the machine can read the text and the machine can talk, and because those users understand the spoken version of the language, they are comfortable with it.
(Refer Slide Time: 05:18)
Now, for a typical text to speech synthesis system, the first block is called text normalization. Open any page, whether it is English, Bengali, Hindi or Marathi; if you read any text, a lot of normalization is required. Suppose the text is running along and there is a number: today's temperature is written as 35° C. When we speak we never read out the raw symbols; we say "thirty-five degrees centigrade". So that token, the numeral 35 with the degree sign, has to be converted to the words "thirty-five degrees centigrade". Similarly, suppose I have a phone number such as 94320: as a phone number we read the digits one by one, we never read it as a quantity; but if it is an amount in rupees then we do read it as a quantity. So a lot of text conversion, or text normalization, is required. Likewise abbreviations: if I write "Dr." then it has to be pronounced as "doctor". There are a lot of short forms which we pronounce in their complete form. So a lot of text normalization is required; otherwise the TTS does not know how to produce this, and the text has to be converted into a normal form first. That is called text normalization. So if I draw the TTS, the input block is text.
(Refer Slide Time: 07:42)
The text may contain abbreviations and numbers; suppose it contains 2/3, then it has to be pronounced as "two-thirds". All kinds of numbers and abbreviations will be there in the text, and that text has to be normalized first; in the TTS this block is called text normalization. Once the text is normalized, it goes to the text processing unit. How is it processed? You know the text contains words, and there is a difference between the written form of a word and its pronunciation form. If you write "psychology", the written form and the pronunciation form are different. So there is a requirement for text processing which can do this conversion; it is called grapheme to phoneme conversion, sometimes written G to P. What is a grapheme? If I take the alphabet, a, b, c, d, all are graphemes. So the word, written in the form of graphemes, has to be converted into a pronunciation string, an IPA string. I said in the first week that I can convert my name into an IPA string that depends on its pronunciation.
So if I write, let us say, Kolkata, its grapheme information has to be converted to its pronunciation information, the way Kolkata is actually spoken; this is called grapheme to phoneme conversion. If I write "psychology", spelled with p-s-y, it has to be rendered in pronunciation form as the sequence of phonemes that I actually pronounce. The first phoneme is the "s" sound; if it is the palatal sibilant I write that symbol, if it is the dental one I write that symbol, and in that way the whole pronunciation string is represented in IPA form. So I get the pronunciation form of the written word; that is called grapheme to phoneme conversion, and the details will come later. Then there is prosody. If I take a text like "tomorrow I will go to Calcutta", prosody is involved. Prosody carries the melody of the spoken form; I never say the words flatly one by one, I say "I will go to Calcutta" with a melody.
So there is a lot of variation in the suprasegmental parameters. Those suprasegmental parameters are not arbitrary; they depend on the syntactic structure of the input text. Even parameters like pause duration, f0 and intensity all vary depending on the syntactic structure of the language. In the last week I will deal with prosody modelling. For prosody modelling I have to know the syntactic structure of the text, and it can even go to the pragmatic structure. So I have to process the text using NLP, natural language processing; you know there is a field called NLP. I have to use some kind of language processing so that I can model the prosody which exists in the spoken form. There is a beautiful example: if I write "I will go to Calcutta", in the written language you see gaps between the words, but in the spoken form there are no gaps, no explicit word boundaries; it is a continuous spoken form, yet we perceive some boundaries based on the suprasegmental parameters. Those are called prosodic word boundaries; I will discuss them in the prosody modelling part. So some kind of text processing is required to model the speech prosody. Therefore the text processing block has two parts: one part does the G to P, grapheme to phoneme conversion; the second part extracts syntactic, semantic and even pragmatic information from the text, with which I can model the speech prosody. Once I have that suprasegmental information together with the segmental information from G to P, I can use a signal processing algorithm, which is nothing but the synthesizer, to produce the speech, the acoustic waveform.
Let us discuss abbreviations; I have already discussed that you can read a lot about text normalization, conversion and abbreviations. And, if you remember, there is a W3C (World Wide Web Consortium) standard, developed under the Voice Browser activity group, called SSML. SSML (Speech Synthesis Markup Language) is nothing but an XML structure which is used to support this kind of job: text normalization, grapheme to phoneme conversion and prosody marking. When you develop a website, if all the written text is tagged using SSML, then it is very easy to synthesize that text when it is passed to the synthesizer. SSML is mainly designed to tag your input text in such a way that a synthesis engine can take it and process it; because the text is structured, the engine can produce better synthesized speech. SSML has a lot of tag sets; the specification is available, you can go through the W3C SSML documents and see what tag sets are there. For text normalization they use the say-as mechanism: suppose I write 35° C; if I treat this as a word, then I can specify that this grapheme string should be said as "thirty-five degrees centigrade". Then there is PLS, I will come to that, and abbreviations as well: if the text is "Dr." then I can say its grapheme information is "Dr." and define how it should be spoken. So, for a piece of text, I can state its grapheme information; let me write "gh" for the grapheme information.
(Refer Slide Time: 15:54)
So I can say the grapheme information is "Dr.", end of grapheme, and then say that it should be spoken as "doctor". I can define those things, and if I define them SSML takes care of it; or I can develop a dictionary based approach where the full form of every abbreviation is stored, and once I encounter an abbreviation I look it up in the dictionary and get the full form. For numbers, text normalization is more difficult, because you do not know in advance what kind of number reading is required. Sometimes, if I write 9432091556 and it is a phone number, then it has to be read digit by digit; we never read a phone number as a quantity. But if I write Rs. 315, then we never read it digit by digit; we say "three hundred fifteen rupees". So depending on the context you have to decide what kind of normalization to apply; that is called text normalization.
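A toy illustration of this context-dependent normalization (the abbreviation table, the regular expressions and the "phone" context rule are invented for the example; a real normalizer is far richer):

    import re

    ABBREVIATIONS = {"Dr.": "doctor", "Rs.": "rupees"}          # toy dictionary of full forms
    DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                   "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

    def read_digits(token):
        # phone-number style: read each digit one by one
        return " ".join(DIGIT_WORDS[d] for d in token if d.isdigit())

    def normalize(text, number_context="phone"):
        for short, full in ABBREVIATIONS.items():
            text = text.replace(short, full)                    # expand abbreviations first
        if number_context == "phone":
            text = re.sub(r"\d{5,}", lambda m: read_digits(m.group()), text)
        # an amount such as "Rs. 315" would instead need a quantity reader ("three hundred fifteen")
        return text

    # normalize("Call Dr. Sen at 9432091556") gives
    # "Call doctor Sen at nine four three two zero nine one five five six"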
Now let me discuss G to P, grapheme to phoneme conversion, in more detail. For text to phoneme, or grapheme to phoneme, conversion you will find a lot of papers; for every language there are several kinds of G to P engines available, systems that take text (grapheme information) as input and convert it to pronunciation information. For Bangla, if you search the net for grapheme to phoneme conversion for Bangla, you will find my paper, published I think in 2008 or 2009; it is available, and we have used that grapheme to phoneme conversion in developing a TTS engine. But sometimes it is very difficult to develop the rule base, and there are several approaches. One is the dictionary based approach: I have a table with grapheme information on one side and pronunciation information on the other; there is a word W1 and the pronunciation of W1, a word W2 and the pronunciation of W2, and so on. This is called the dictionary based approach, and it requires a pronunciation dictionary of all the words, because I do not know which text will come as my input. So a pronunciation dictionary of all words is required, but you know the list of words is effectively infinite once you allow compound words. You can intelligently develop a root word dictionary and form compounds on the fly as required, but combining also has its problems.
For instance, in Bengali, when two words are combined, the vowel at the junction may change. Take Rajputra: raj is a word and putra is a word, and when they combine the middle vowel is deleted, so the pronunciation of the compound does not simply follow the pronunciation of each word. But sometimes the vowel is not deleted, and in other compounds an extra vowel o is even added; those kinds of complexities are there. Still, the simple approach is to develop a pronunciation dictionary, what in computer programming you would call a look-up table: I have a pronunciation dictionary, and once I get the grapheme form of a word I can pick up its pronunciation from the table.
But there is also a W3C standard called PLS, the Pronunciation Lexicon Specification. I have studied the details of PLS, and we pointed out some problems with the present PLS. It is required by both speech recognition and speech synthesis, as I will come to. There are a lot of tag sets in the PLS standard: a lexicon element starts the document, then there is metadata, then lexeme, then grapheme, then phoneme, then alias, and then example.
The standard says that if a word has multiple pronunciations, the prefer attribute is used to choose the correct one, but that is not always enough. Sometimes the same word has more than one pronunciation, maybe more than two, and all of them are valid pronunciations in that language; it is not dialectal variation. I have written about this in that paper; it is called the homograph problem. Homographs are words with the same orthography but different meaning and different pronunciation; for our purposes the meaning is not important, the pronunciation is. Homograph means the orthography, the graphemes, are all the same: the graphemic representation, let us say W1, maps to two pronunciations, or even three, depending on the context in which the word is used. That is the homograph problem: the grapheme string is the same but the pronunciation is different. In Bengali I can give you an example: if you know the Bengali script, the word [FL] may have two pronunciations, either [FL] or [FL]; if it is a verb it is [FL], and if it is an adjective meaning "simple" it is [FL]. All such homograph variation has to be resolved. It was found that parts of speech information for the word somewhat solves the homograph problem, but not always. Sometimes the same word is pronounced one way when it is honorific in nature and a different way when it is non-honorific. That kind of variation is not purely syntactic; some extra semantic information is needed, for example that if it is honorific then the pronunciation is something like the Bangla [FL].
So those kinds of pronunciation variation give the homograph problem, and parts of speech may partly solve it; you can go through the document, I have already explained it. Then there is the opposite problem, called the homophone problem: different orthographies have the same pronunciation. The orthographic representations are different, maybe W1 and W2, but the pronunciation is the same; like "right" and "write" in English, both are pronounced the same, and only from the context can I say which one is intended. Those problems exist in PLS; we proposed some solutions and submitted them in that paper to the PLS effort. You can go through the details; the paper is available on the net.
There are a lot of detailed examples and analysis in that work: honorifics, dictionary issues, and the fact that morphological information is important. Many countries also raised this issue; morphological information is very important for Korean, for example, and likewise in the case of Bangla we said that finiteness and honorificity information may be required. So some kind of morphological analysis is suggested to solve the pronunciation problem in Bangla. G to P is therefore not that simple. For many languages it may be simple; in the case of Indian languages it is somewhat simple, but Bangla is not. For Hindi we found it a little simpler compared to other languages because, with a syllabic script, whatever we write we almost pronounce the same way; but even that is not always true, there may be some variation. In Bengali there is a lot of variation between the written script and the pronunciation, and in English there is also a good deal of variation.
Fortunately, an English pronunciation dictionary is available on the net; but unfortunately, for the Indian languages pronunciation dictionaries are very rare, and you can hardly find one for other languages. So it is a very tough challenge to develop TTS for Indian languages. Still, we have done something for Bangla, and other groups are working on Hindi, Tamil, Telugu; TTS for all these languages exists right now, but the biggest problem is G to P, grapheme to phoneme conversion, which is a very important block of the TTS engine. Around the W3C PLS specification we have even started community building: how do we build a pronunciation dictionary as a community? We have developed a web based engine where you can build a pronunciation dictionary in the W3C PLS standard. That website is hosted at IIT Kharagpur; I think the domain name is something like "pedagogy", I do not remember the exact name. At the end I will give you the web address, where you can use that website to develop your own pronunciation lexicon in the W3C format.
There are other kinds of G to P approaches as well. There is the rule based approach: depending on the grapheme and its position we can develop rules by which the grapheme to pronunciation conversion is done. The most commonly used approach is the hybrid approach: we develop some simple rules and then use a dictionary for all the exceptional cases. That is called the hybrid approach. So that covers text normalization and G to P.
I am not discussing here the text processing required for prosody modelling, because prosody modelling I will cover in the last week, where I will discuss in detail the prosody modelling of different languages. For English there are a lot of prosody models available: the Fujisaki command-response model, the ToBI model, all kinds of prosody models. Prosody modelling is an important issue and we will discuss it later. If we set aside the text processing for prosody modelling, G to P is sufficient for now. So I can say: the input text comes in and is normalized; once normalization is done it goes to G to P; G to P together with the text processing for prosody is collectively called language processing; and then the result goes to speech synthesis.
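In outline, the pipeline just described chains the blocks one after another; every function below is only a stub standing in for the corresponding block, so this is a shape, not an implementation.

    # Placeholder stubs so the outline runs; each stands for an entire block of a real system.
    normalize     = lambda text: text                                  # text normalization
    g2p           = lambda word: list(word.lower())                    # grapheme-to-phoneme
    model_prosody = lambda text: {"f0": [], "durations": [], "pauses": []}
    synthesize    = lambda phonemes, prosody: b""                      # back end -> audio samples

    def text_to_speech(text):
        text = normalize(text)                        # normalize numbers and abbreviations
        phonemes = [g2p(w) for w in text.split()]     # G to P on each word
        prosody = model_prosody(text)                 # language processing for prosody (later weeks)
        return synthesize(phonemes, prosody)          # signal-processing synthesis produces the waveform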
For the synthesizer there are a lot of methods, and we will discuss the synthesis in detail because this is a speech processing class; the synthesis part is what produces the output speech. Now one more practical point about the text: I think many of you know Unicode. If I write text in English it is already covered by the ASCII code system, but if I write in Bengali, Hindi, Tamil, Malayalam or any other Indian language, those characters come as Unicode.
So the input text is in the form of Unicode, and you have to take care of Unicode processing. It is the same idea, but every grapheme has a unique Unicode code point, so there is no real problem in processing; the dictionary, the normalization and the G to P can all be developed based on the Unicode of the text. Then comes synthesis, for which there are many algorithms: articulatory synthesis, parametric synthesis and concatenative synthesis. In the next class I will discuss each of these models in detail, articulatory synthesis, parametric synthesis and concatenative synthesis, which form the main part of the TTS engine; there are a lot of other approaches also. Then I will show you some systems that are available: I can show you, and let you listen to, the voices and all those things.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 38
Text To Speech Synthesis (Contd.)
So last class we were discussing the different synthesis techniques. If you see the slides, the different kinds of synthesis techniques are listed there. If I am able to mathematically model the human articulators, then by exciting that articulator model with an excitation, which corresponds to the vocal cords, I can generate the speech signal. But what is the complexity in articulatory modelling? The complexity is that the model parameters change with respect to time; the model is not fixed. If I say Kolkata, the model is not the same for all the sounds: as time goes on the model changes, the constriction area changes, the airflow dynamics change, so the whole model configuration is changing. This means I require a very elaborate mathematical model which can capture this kind of variation so that I can generate articulatory speech. But it is a very good thing if I am able to build this mathematical model as it evolves with time, because then I can excite that model with either an impulse train or noise to generate the speech. That is called articulatory synthesis.
A lot of research is going on in articulatory synthesis, because it not only gives me speech synthesis but also lets me simulate the human vocal tract, if I am able to construct it mathematically; or instead of that, if I am able to build an instrument which can simulate it physically, that is fantastic. So people are doing articulatory modelling; a lot of research is going on there, but it is very tough because of the time variation. Suppose I calculate the cross-sectional area for the vowel O in one frame; in the next frame the configuration has changed, so the mathematical model changes. So there is a heavy computational complexity problem, but research is going on; that is called articulatory synthesis. Then there is parametric synthesis. If you remember, we said that a speech event, say whether it is O or something else, can be determined by the formant frequencies f1, f2, f3 and the formant bandwidths. If I know the formant frequencies and the formant bandwidths, I can generate O; that is called parametric synthesis. So if I know the formant parameters and formant bandwidths for a particular speech event, I can generate that speech event. You also know the LPC model: if I know the LPC feature vector a1, a2, ..., ap, then by designing a filter I can generate that event.
But the problem is that when we speak, the formants are not static, with one fixed value in this frame and another in that frame; they are continuously moving. So the speech formants move continuously, and because of the co-articulatory effect, how the formants move while going from one phoneme to another is very important. I have to know that formant dynamics, and if I am able to apply it, then I can generate the speech using these parameters; that is why it is called parametric synthesis.
So formant synthesis is a parametric synthesis, built from a set of parallel filters: if I know the formant frequencies I can design those filters, and based on the gain of each formant I can pass impulses through them and generate the speech signal. That is called formant synthesis. Then there is HMM based synthesis using a vocoder; we have developed an HMM based speech synthesis, which I will talk about in detail at the end. It is called HTS, the hidden Markov model based speech synthesis system, and it is the most widely used TTS approach currently. Then there is concatenative synthesis; before HMM based TTS came, concatenative synthesis was the most heavily used technique. Initially diphone based synthesis was developed. Then we developed a synthesis technique based on sub-word elements: during 2003 to 2008 I was at C-DAC, and at that time, with A. K. Dutta, we developed the ESNOLA technique of speech synthesis, which uses a speech element, the partneme, as the unit; at that time diphone based synthesis already existed. Then came unit selection with prosody modification. And the last one is HMM based synthesis, which is of 2 types: one, HTS, is parametric, and the other is non-parametric, where only the choice of the segments is decided by the HMM model; that form of HMM based speech synthesis is also used within concatenative synthesis.
Now, I will not repeat what has already been discussed. Formant synthesis does not use any human speech sample at runtime; instead the output synthesized speech is created using an acoustic model, and parameters such as frequency and amplitude are varied over time to create an artificial speech waveform. Then concatenative synthesis: this is a very simple mode of speech synthesis. What does it require? Any concatenative synthesis requires a pre-recorded database.
Say the machine has to speak with a female voice; then I require a pre-recorded female database. If I have that pre-recorded database, then for the text which comes as input I can generate the speech from that database. A very simple example will make concatenative synthesis easy to understand. If you go to a train station there are a lot of announcements from the IVR system: "the train will arrive at platform number 2", "the train is coming on platform number 7", that kind of announcement keeps happening. Now notice what happens if a station adds an extra platform. Suppose the IVR system has been in that station for 10 years, and after 10 years another platform, say platform number 5, is added. The original voice is no longer available after 10 years, so "platform number 5" is recorded from a different speaker. Then you find that "the train is coming on" is in one voice and "platform number 5" is in another. So what do they have? They have pre-recorded speech for all the pieces of information: train number, station name, platform number, train name, all are pre-recorded in the database, and depending on which train is coming they look up the pieces, join them together and play the result. That is nothing but concatenative synthesis. This is easy when the number of trains is small and the number of platforms is finite, so I can record a finite number of words; but think about the language as a whole, where I have an effectively unlimited vocabulary and the words can combine in any way.
So how do we record all possible speech? It is very tough to record all of it, and even if I record it, matching is very tough, because the same stretch of text will be spoken one way in one sentence and another way in the next; all kinds of problems are generated. So, instead of recording words and sentences, initially people thought: what if I record only the diphones? For a particular language I know the phonemes; how many phonemes are there in English, how many in Bengali? Suppose Bengali has 33 consonants and 7 vowels; then I can work out how many possible diphones there are (with about 40 phonemes, that is on the order of 40 × 40 = 1600 diphone combinations). What is a diphone? A diphone is defined from the middle of one phone to the middle of the next, because the middle of a phone is stable and therefore suitable as a cutting edge; it spans parts of both phones, which is why it is called a diphone. So how is it recorded? If you remember, or let me show you here, suppose I open a waveform; let us say this is the waveform.
If I cut the waveform at an arbitrary place, an erroneous result may come. If you remember, I said the steady state portions of vowels look nearly the same; if I zoom into this portion you see the vowel periods look alike. So if I cut each vowel at one of the epochs in its steady state, from the middle of this phone to the middle of the next, then I can call that a diphone, and when I join different diphones together the error at the joins will be minimal. The steady state is nothing but the steady portion of the vocalic region, where there are no dynamics. The consonant-to-vowel transition is important because that part also contains the consonant information, so I should not cut inside the transitory part; I cut within the steady state part. So, cutting from steady state to steady state, I can build a dictionary: for all possible diphones of a particular language, I cut from the middle of one steady state to the middle of the next and prepare a signal dictionary which contains all the diphones.
Now, any text, any sentence that comes in, consists of some combination of those diphones. So I can pick up the required diphones, join them together, and produce the speech; that is called the diphone synthesis technique. Why do I cut in the middle? Because that part is the steady state of the vowel; I cannot cut inside the transitory portion, which also belongs to the consonant. That is why I take the diphone from middle to middle, and that is why it is called a diphone. So this gives the diphone database, from which the units are chosen. What is required is that I have to record all the diphones. How do I record them? I cannot record a single diphone by pronouncing it in isolation. So I can generate some nonsense words which together contain all the diphones, pronounce those words, and cut the diphones out of the word signals; or I can record natural speech.
Or I can have a large text corpus. I can analyse that corpus to find out how many sentences are required to cover all the diphones I need; I search the corpus, find those sentences, record them, and cut the diphones from the recordings. But the problem is that if they are natural sentences, then every sentence has its own prosody depending on its structure. So if I cut a diphone from the beginning of one sentence and the same diphone from the end of another, the two diphones will be different because the prosody is changing. If I do not want that, because I simply want to synthesize the segmental information and discard the prosody, then I record the target words inside a neutral carrier sentence, or as nonsense words, and cut the signal from there. That is how the diphone database is built.
Now, to keep the recording consistent, a diphone should come from the middle of a word, not from the beginning, because at the beginning of a word the voicing is just starting. So I do not want word-initial tokens; the diphone should be taken from the middle of a word. Around the same time we were also developing a technique called the ESNOLA system, an epoch synchronous non-overlapping add method based TTS, in which the text analysis is all the same; only, instead of the diphone, we use a unit we call the partneme. I am not explaining the prosody part here, only the signal part, which involves the partneme dictionary. What does a partneme contain? Let me go to this example.
If this is the recording of a consonant-vowel-consonant signal, say ka, you know this is the occlusion period, this is the burst, this is the VOT, and this is the transitory part. The portion from the burst up to the steady state vowel is important because that is the dynamic part of the signal; I call it the consonant-to-vowel transitory part, the CV segment. Then, from the beginning of the occlusion period to the end of the burst, I can call it the consonant C; then there is the steady state vowel V; and then again the transitory part from the steady state vowel back to a consonant, VC. So my dictionary contains C, CV, V (the V may not even be required, since I can generate the V portion depending on the duration, as I will come to later), then VC, and again C. For all the consonants and vowels I generate those C, CV, V, VC signals. Then suppose I want to produce "bharat": it is nothing but the consonant bha, which is occlusion plus burst, then the bha-to-a transitory part, then the steady state a vowel, then the a-to-ra transitory part, then the consonant ra, then the ra-to-a transitory part and the steady state vowel, then the a-to-ta transitory part and ta; and so I can generate "bharat". Any word that comes, I can divide like that and generate. Here is a Hindi example, "bhajan": bha, the bha-to-a transition, the steady vowel, the a-to-ja transition, the consonant ja, and so on up to the final n, and "bhajan" will be pronounced. The synthesis rules are simple: for a CVC sequence I generate C, CV, V, VC, C, with a fade-out applied to the V; if the segment sequence is VCV I generate it one way, if it is CV another way. Those rules form the synthesis part.
So this is called the ESNOLA based synthesis technique; I can show you the quality for Bangla. Prosody is not there, so if you listen it is flat; all the partnemes have been collected and simply concatenated together to produce the whole sentence.
Now, I told you that the steady state vowel need not be stored. Look at the steady state vowel in the signal: the vowel periods are almost identical. If this is the beginning of a vowel period, then up to here is one vowel period; this point is called the epoch, the beginning of the glottal impulse, and if you analyse the signal it will show up here. So from here to here is one vowel period. Now, for the steady state vowel I do not know in advance how long it needs to be, so depending on the duration requirement I can repeat this vowel period to generate it. I can show you here; suppose I cut this vowel. I do not know whether you are able to listen, whether it is audible; I have to select the playback device, I think the device has to be loaded.
If you are able to listen; I think you are not able to listen, I will see what kind of device is there. So, here is a sentence. Now, say I want to increase the vowel duration in the steady state. This is the beginning of the steady state, and up to here all the periods look the same. Let me cut one period here and show you: this is one period of the vowel, up to here. Then I copy it and paste it here again; I paste it again, and again, and again. You see the vowel length increases but the quality is the same. The length of the vowel increases just by repeating a single period. So I can change the steady state vowel length by repeating the same period. In this example the steady state vowels are very short, the steady state vowel durations are very small; if I play this version, only the steady state vowel portions have been lengthened.
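The period-repetition trick demonstrated here is easy to sketch with NumPy; the epoch positions are assumed to be already known (in ESNOLA they come from epoch detection).

    import numpy as np

    def lengthen_steady_vowel(signal, period_start, period_end, extra_periods=4):
        # Copy one pitch period from the steady state and paste it back several times.
        period = signal[period_start:period_end]
        return np.concatenate([signal[:period_end],
                               np.tile(period, extra_periods),
                               signal[period_end:]])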
This technique is used in the Bengali TTS. The TTS engine can take any word you give it and pronounce it with the same clarity, but the main problem is that prosody is not there; prosody has to be incorporated, and I will discuss that later on. So this is the Bangla TTS, which we call ESNOLA, the epoch synchronous non-overlapping add method, and this is the block diagram of the whole TTS engine. These are the rules we have written down, and this is how the dictionary is created. How is the signal dictionary created? We recorded all the consonants with vowels in nonsense words. For example, for ka: I record "ka ka ka ka" four times, take the most appropriate token, usually a middle one, and then cut out the occlusion.
The occlusion period plus the burst is my consonant; then from the end of the burst to the steady state vowel is my consonant-to-vowel transition. I do not need to store the steady state vowel itself, because I can repeat its last period as many times as required; then there is the vowel-to-consonant transition, so a VC portion will be there; and then again there is a consonant, which is occlusion plus burst if it is a stop consonant. If it is a voiced consonant the consonant portion runs up to the beginning of the VOT, and from the beginning of the VOT to the steady state is the consonant-to-vowel transition. So ka, kha, and all the consonant-vowel combinations are recorded and cut to create the dictionary, and once the dictionary is created, any text that comes goes first through the rules and the engine can generate it.
This is what is used in ESNOLA synthesis. Then the unit selection method comes. All the diphone based TTS and the partneme based TTS have a problem with natural prosody, because prosody is not there; I have to model the prosody, and modelling prosody is not that simple. So people said: instead of selecting diphones, can I select a larger unit? Today memory is not a problem, computer memory has become cheap, and computational power is also there. So let us record a large amount of speech data from a single voice (if it were mixed, male and female voices would get mixed in the output). For that single voice I record a large amount of speech data and label it. Then, when the input text comes, I search over that large amount of labelled speech data, find the best possible matching pieces corresponding to the input text, and produce the output from them; that is my synthesized speech. That is called the unit selection based speech synthesis technique.
The only real problem in unit selection is: how do I select the units? Given a big database, for each segment that we want to synthesize we must find the unit in the database that is best for synthesizing that target segment. How is "best" measured? By the target cost, the closest match to the target description in terms of phonetic context, f0, stress and phrase position, and by the joining cost, how well the unit joins to its neighbouring units: when I join the end of one segment to the beginning of another, how smoothly do they join, and how well do the segments fit together. So in unit selection the prosody is built in; I cannot easily modify the synthesized speech according to language rules afterwards. The prosody that comes out of the unit selection method is whatever is available in the database: if the given text is available in the database in the same context, then I get the best prosodic match; but if the input text has no instance in the database and only scattered small pieces are found here and there, then when I stitch them together I cannot minimise the target cost, and the prosody that comes out does not follow the structure of the language, it follows whatever prosodic structure exists in the available labelled data. So the prosody is driven by the data: if the database is big my synthesis output will be good, if the data is small the output quality will not be that good.
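In sketch form, unit selection is a search for the unit sequence minimizing the summed target and join costs; a tiny Viterbi-style version follows, where the two cost functions are left as parameters because the real ones use phonetic context, f0, stress, phrase position and spectral continuity.

    def select_units(targets, candidates, target_cost, join_cost):
        # candidates[i] is the list of database units usable for target segment i (units must be hashable)
        best = [{u: (target_cost(targets[0], u), [u]) for u in candidates[0]}]
        for i in range(1, len(targets)):
            layer = {}
            for u in candidates[i]:
                tc = target_cost(targets[i], u)
                prev, (c, path) = min(best[-1].items(),
                                      key=lambda kv: kv[1][0] + join_cost(kv[0], u))
                layer[u] = (c + join_cost(prev, u) + tc, path + [u])
            best.append(layer)
        return min(best[-1].values(), key=lambda cp: cp[0])[1]   # cheapest unit sequence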
So, based on these two cost functions, the target cost and the joining cost, we select the units. Carnegie Mellon University has a unit selection TTS, the Festival engine; you may have heard about the Festival engine. That is nothing but a unit selection based TTS system. I am not describing the target cost in detail; it is available in the slides, you can go through it. Then another kind of TTS came: HMM based synthesis.
This is HMM based synthesis; we have developed HMM based synthesis for Bangla also. It is of two types; one is called parametric: if I use a vocoder then it is parametric synthesis. What do we do? There is a speech database and we extract the excitation parameters. The speech signal consists of two parts, segmental and supra-segmental, or I can say it consists of excitation parameters and spectral parameters; if the vocal tract is excited by the vocal cords, then I can speak of excitation parameters and vocal tract parameters.
So, one set is called spectral parameters and the other is called excitation parameters. From a large speech database we separate the excitation parameters and the spectral parameters, and based on these two sets of parameters we develop the HMM parameter estimation. Now, once the input text comes, text analysis is done, contextual labelling is done, and then depending on that text we estimate the parameters: which spectral parameters best match the input text and which excitation parameters best match the input text. Once I get the excitation parameters and spectral parameters, we can use the synthesis filter or vocoder to generate the speech. That is the HMM based speech synthesis technique, which is named HTS, the HMM based speech synthesis toolkit.
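To make the idea of excitation plus spectral parameters driving a vocoder concrete, here is a very small source-filter resynthesis sketch in Python. It uses a pulse-train excitation for voiced frames, noise for unvoiced frames, and an all-pole (LPC) filter per frame; the frame rate, filter order and parameter names are assumptions for illustration, not the actual HTS pipeline.

import numpy as np
from scipy.signal import lfilter

def synthesize(frames, fs=16000, frame_len=0.01):
    """Resynthesize speech from per-frame (f0, lpc, gain) parameters.

    frames: list of dicts {'f0': Hz (0 for unvoiced), 'lpc': array, 'gain': float}
    Toy vocoder: impulse train or noise excitation, shaped by 1/A(z) per frame.
    """
    n = int(frame_len * fs)
    out, phase = [], 0.0
    for fr in frames:
        if fr["f0"] > 0:                       # voiced: impulse-train excitation
            exc = np.zeros(n)
            period = fs / fr["f0"]
            while phase < n:
                exc[int(phase)] = 1.0
                phase += period
            phase -= n
        else:                                  # unvoiced: white-noise excitation
            exc = np.random.randn(n) * 0.1
        a = np.concatenate(([1.0], fr["lpc"]))  # all-pole filter A(z)
        out.append(fr["gain"] * lfilter([1.0], a, exc))
    return np.concatenate(out)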
So, in the next class I will discuss ASR, some portion of ASR, and then we will talk about accent conversion and such things. And the last week is on prosody.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 39
Automatic Speech Recognition
So, as we said, we are discussing some speech applications. My idea is not to go into the speech interface, the ASR, in detail, because if I went into the speech interface in detail then it would take another 4 to 6 lectures. So, I am not going into detail, and the same goes for ASR, automatic speech recognition.
What I will do is basically describe what ASR is, what technology is involved, and the challenges in ASR, so that you can take up some project; as a student you can do some project on those challenges, or as a teacher you can give such projects to your students to pursue in that area.
So, my idea is not to discuss ASR in detail, but rather the issues involved in ASR and the scope for projects in the current scenario. So, like TTS, let me define ASR, automatic speech recognition.
What does automatic speech recognition mean? I have a machine and I have a microphone. If I talk to the machine, the machine should, not understand, I am not saying understand, the machine should decode what linguistic message is there in the speech signal; that means, suppose I utter a sentence, the machine should recognize each and every word of the sentence. I am not saying it understands perfectly.
Understanding is a different issue; I will come to that. So, ASR means I provide an input which is nothing but an acoustic signal, and from the acoustic signal it should recover the linguistic information, only the linguistic information, like the words contained in the utterance or sentence I have spoken. I may speak a single word or a sentence, or several words together, w1 w2 w3 w4, which may form a sentence. I can say "open this folder", or I can just say "open". So, it may be a single word or it may be a stream of words; that should be recognized by the machine. That is ASR.
Now, there are several applications of ASR. The most common application you see is in your mobile phone: there is an ASR, and if you say "home" then it will dial home. If you have stored a number as "friend" and you say "friend", the mobile dials the friend. So, that may be single-word command recognition. Or, speech is available in the Google search engine: if you say "search" and a restaurant name, then Google will find the restaurant and show you the name, the place and other things also.
So, it provides hands-free operation with the machine, or man-machine communication: a man communicating with the machine. That may be a command, it may be a dialogue. There is also another motivation: the main drawback for Indian language content on the internet is that typing an Indian language is not so easy, because it uses a complex script. So, if I develop ASR, I just speak in front of the machine and the machine types what I am speaking; then it is a fast operation. So, man-machine communication; even somebody who does not know how to write can speak, because spoken language is the natural language for human communication.
So, even if I do not know how to write the language, I can speak it. If I speak in front of the computer and the computer understands what I want, then there is nothing like that application. So, different kinds of ASR are required for different kinds of applications, but the basic technology of how ASR is developed is the same: from the acoustic signal I have to find out the words I am saying. Somehow I have to do that. Now, the basic idea of speech recognition is that there is a speech signal and I have to find out the word, say w1.
So, it is nothing but this: I should know which word I have spoken, so the machine should learn the association between words and acoustic signals. From the acoustic signals of the training data I train the machine, which is called model generation; the w1 model is generated, or the machine has learned. Then, if I say that word, based on what the machine has learned, it should match the pattern and say: you have said this word.
So, it is nothing but a pattern matching algorithm; the language model and all those things come later on. The machine learns from acoustic signals much the way a child develops its recognition system; this is speech perception, not speech production. When a child learns some vocabulary, you see he listens to it again and again and then memorizes it; he learns one word at a time, and once he has learned it, whatever I say, he can recognize. Similarly, the machine also has to learn the words, and then, whatever sentence or phone you say, the machine has to recognize it from the input speech signal.
Now, what is the problem? Let us think about only a simple word recognition problem. Suppose I speak a word, let's say "speech".
Then there will be an acoustic waveform of the word "speech". Now let the machine learn this, or let the machine store this; then when I say "speech", that acoustic output has to be matched with the acoustic signal the machine has already stored, and if the signals match, then I say that I said "speech". So, the word "speech" is stored, and then comes the question of how I match the signals; that is pattern matching, which comes later. How I match the signals is a matter of pattern matching and artificial intelligence: classification, HMMs, neural networks, all kinds of pattern matching techniques are there.
I will not go into pattern matching directly. Let us say there is a speech signal, the spoken form of "speech", that is, the acoustic signal of "speech"; and when I say "speech", it is compared with the acoustic signal the machine has stored, and somehow some algorithm matches these two things and gives the output "speech".
So, the purpose is served, but what is the problem? The problem is that if I say "speech" this time and say "speech" again next time, the two waveforms are not identical, because neither a machine nor a man can produce two identical waveforms. Also, this time I may say "speech" quickly and next time slowly, so the length of the word is also a problem. That kind of variability is the problem. Now, suppose I stored only "speech" for speaker 1; if speaker 2 says "speech", then there is also a problem. So, there is uncertainty: how do you compare the input with the learned pattern? How do you solve this kind of problem? The second thing is large-vocabulary pattern matching: how many words have I stored? Any language has a huge number of words, and if I store many words, then the probability of words sounding alike, phonetically similar words, increases.
Suppose, as in the problem I gave you in, I think, the week one lecture: based on the manner of articulation, identify the word. But if there are similar kinds of words, like "baba" and "kaka", then the difference between the two words is only that one has "ba" and the other has "ka"; but "ba" is a stop consonant and "ka" is also a stop consonant. So, you have to distinguish between these stop consonants only by finer cues, only by voicing and place of articulation: one is a voiced bilabial stop and the other is an unvoiced velar stop. So, how do you identify the place? The difference between the two signals is very small. So, as my vocabulary size increases, the similarity between words increases and the recognition accuracy falls.
So, the complexity of pattern matching increases. Then there is the absence of word boundaries; this is very important. Suppose I develop a word model, but then I say "I will go to Calcutta".
The speech signal is very complex. In writing, the word boundaries are definite; word boundaries are there. But in speech, when I speak this sentence, if I open the acoustic signal, those boundaries do not exist; because of co-articulation effects the signal is very complex.
So, there is no definite word boundary in spoken language. How do I identify that "speech" is a word within a spoken sentence? If it is an isolated word, I know only one word is spoken, but in continuous speech word boundaries are absent. Even in the case of isolated words, as I said, a single speaker cannot reproduce exactly the same word twice. Then there are pairs like "ice cream" and "I scream": both pronunciations are the same, so which one was intended I do not know; acoustically they are similar.
So, those are the problems in speech recognition. There are other problems also: human physiology, physiological change. Suppose today I say some words, then I get a cold and a cough and say the same sentence; the acoustic signal will differ, because a physiological change has happened. Similarly, the same word spoken by a male voice and a female voice is a problem, because the male F0 may be on the lower side and the female F0 on the higher side. So, all kinds of variation will be there. Then speaking style: every person has a different speaking style; the same word "speech" comes out differently from different speakers, the signals are different.
Speaking rate: I may say it fast, somebody else may say it very slowly. Then emotion: if I am happy my signal will be different, if I am sad my signal will be different. Emphasis: English is a stress language, I will come to that later; English has contrastive stress, while Bengali is a bound-stress language. So, when I speak English, my pronunciation does not exactly follow the English stress pattern. Similarly there is emphasis, and there are accent, dialect and foreign-word variations. Dialectal variation is a very complex issue: for the same word, with dialectal variation the pronunciation is different, the accent is different.
Foreign words: suppose there is a Bengali person and he has to pronounce "zoo"; how does he pronounce it?
If it is spoken by, you can say, a convent-educated person, then he pronounces "zoo", but a Bengali person whose first language is Bengali says "joo"; although it is a foreign word, he pronounces it in his own style, not in the way it is said in that foreign language. So, all those variations will be there. The same thing happens if I develop an English recognizer based on one kind of dialectal English. Let us say I train it using the TIMIT database; you know the TIMIT database. I train an ASR on TIMIT and then I try to recognize English spoken by a Bengali person whose first language is Bengali.
Then there is a lot of misrecognition; the word error rate increases. So, there are accent, dialect and foreign-word variations, and environmental and background noise: suppose I record speech in a studio environment and train the machine on that, and then I test it with speech recorded on my mobile phone, or speech recorded in a factory environment. There is a lot of background noise; how will I eliminate the background noise? So, all kinds of problems are there in ASR; that is why, you see, there is a long history of ASR development. The history, if you look at it, is this.
The trend started with feature based methods, then template matching methods, then rule based methods; today statistical methods are the most widely used, and right now there is a research trend towards deep neural network based ASR systems.
People are trying to develop ASR systems using deep neural network methods. Every method has its pros and cons: statistical methods, rule based methods, template matching methods.
So, I will just go through one or two methods, so that you understand the different issues of ASR. Let me describe the template based method. What is it? A template means, suppose I have a rectangle; that is a template.
Now I produce a similar rectangle, and then I say this is a rectangle because it matches that template. So, the template is stored in the computer or machine, and I am given a test shape; the stored template is mapped against the given shape, and if they match well, then the given shape is declared to be the same thing, because it has matched.
So, I have a machine with a rectangular template, a circular template, a triangular template, and so on. Then, if I provide a circular test shape, we try to match it with each template; it will match the circle, so we say it is a circle, because the matching template is the circle. Similarly in ASR: there are some words, and template based matching is used at the word level, like what you do in a mobile phone. When you record a number and pronounce "home", that "home" template is stored in the mobile phone.
Now, when I say "home" into the microphone and give this acoustic signal to the machine, the machine tries to match it against the stored templates, including the one for "home".
So, there may be "home", there may be "train", there may be "school", all kinds of words. Once I get the test template, I try to match it with the templates of all words, and the best matching template declares: you have said this one. That matching between the two things is pattern recognition. There are different kinds of matching algorithms: initially vector quantization and dynamic time warping were used, then hidden Markov models, then artificial neural networks, and today deep learning; a recent topic is using deep learning for template matching, or any kind of classification technique, for example support vector machines.
So, with any classification technique I can find out whether two templates match or not. Vector quantization, in particular, was initially used very much in speaker identification.
In the last week I will take one class on speaker identification; there I will describe the problem and how vector quantization is used in it. Basically, vector quantization is nothing but creating a representative from the training vectors. Suppose I say "speech" ten times; every time I produce a slightly different "speech".
Then, should I store all ten training instances of "speech" in the machine, or can I derive, from all the training instances, one representative vector for "speech"? Vector quantization quantizes vectors; the issue is the same as in scalar quantization, which you may remember from my digital quantization class.
What did I say there? I divide, let us say, plus 5 volts to minus 5 volts into 8 bits, that is 256 levels, each level changing by a single bit; if it is signed, then roughly 127 levels on the positive side and 127 on the negative side. And I said that all the voltage values within one step, one delta, are represented by only a single representative value, that one line. So, whatever voltage falls within that limit, even if it is a little above or below, is represented by that line. So, I quantize the continuous plus-minus 5 volt range into a set of levels. The same thing happens here. Suppose I have a feature vector x1, x2 up to xn; I have collected all the training feature vectors, and I see that x1 varies, let us say, from 0 to 10.
I say: generate five representative values within the range 0 to 10, so 0 to 10 is divided into 5 levels. Or let us say the variation of x1 is from -10 to +10, so the range is 20; I divide that range equally into 5, and there will be 5 divisions. That is vector quantization; if you read this slide you can understand the codebook, and I will discuss vector quantization in detail later on in the speaker identification problem.
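A codebook in this sense can be built with any clustering method; a common choice is k-means (or its LBG variant). The following sketch, with assumed feature arrays, shows the idea of replacing many training vectors by a few representative code vectors; it is an illustration, not the exact procedure used in the speaker identification work mentioned here.

import numpy as np

def train_codebook(vectors, k=5, iters=20):
    """Toy k-means vector quantizer: 'vectors' is an (N, d) array of
    training feature vectors; returns k representative code vectors."""
    rng = np.random.default_rng(0)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # assign every training vector to its nearest code vector
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each code vector to the mean of the vectors assigned to it
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = vectors[labels == j].mean(axis=0)
    return codebook

def quantize(vector, codebook):
    """Return the index of the nearest code vector (the level the vector falls in)."""
    return int(np.linalg.norm(codebook - vector, axis=1).argmin())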
586
An important issue, and the point where speech recognition really starts: dynamic time warping, which is very important. It leads to, you can say, dynamic programming; it started from here. So, what is the problem?
The problem: there are two utterances, A and B. Say the first utterance is the speech of some word, with an unvoiced region, then voiced, then unvoiced, then voiced, like this. The same word is uttered twice, and the lengths differ; there are two different lengths. Now, if one is stored in my machine and I want to match the other against it directly, they cannot match, because I cannot match the whole word against the whole word by comparing, say, 0 to 10 milliseconds here with 0 to 10 milliseconds there; if I compare this portion with that portion and the distance is high, I would wrongly say this cannot be the word. So, that is not correct matching; do you see?
Forget about that example; suppose I have the word "speech".
The length of this signal is, let us say, one second, and the next time I pronounce "speech" the length is 1.5 seconds. If I frame it, whatever framing and parameter extraction I do, say LPC parameters at 100 frames per second, the first one gives me 100 frames and the second gives me 150 frames. Now I try to match 100 frames with 150 frames, so there is a time-alignment problem; there is a mismatch in time alignment. I may end up comparing this portion of one signal with the wrong portion of the other, whereas this portion should really be compared only with its corresponding portion. So, what I want is that, although the two sequences have different numbers of frames, the frames that are compared should be of similar type.
So, I want either the 150 frames to be warped to 100 frames, or the 100-frame representation to be warped to 150 frames. Either one I have to do, so that corresponding frames become comparable.
I can say frame 1 of one utterance is compared with frame 1 of the other, frame 2 with frame 2, and frame 3 with frame 3. But then frame 4 of the longer utterance may also be compared with frame 3, and frame 5 again with frame 3, unless the distance between the two frames becomes very high. That way the 100 frames of one utterance, frame 1 to frame 100, are aligned against frame 1 to frame 150 of the other, without jumping.
So, that is dynamic time warping: the time axis of one signal is warped with respect to the other over the whole data. With time axis A here and time axis B there, I want the best possible matching path, the one that minimizes the accumulated distortion between the two sequences. If the local distance stays within a limit, no frame jump happens; outside the limit, a frame jump happens. That way I can time-warp; the details you can go through. So, this is dynamic time warping: the template is matched with the test with time alignment. It can cater to the variation of speech rate, but I do not know whether it can cater to speaker variation or not.
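A bare-bones dynamic time warping distance can be written in a few lines; the frame features here are assumed to be simple vectors (for example LPC or MFCC frames), and no path constraints or slope weights are applied, so this is only a sketch of the idea, not a production implementation.

import numpy as np

def dtw_distance(seq_a, seq_b):
    """Align two frame sequences of different lengths and return the
    minimum accumulated frame-to-frame distance (dynamic time warping)."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local frame distance
            # a frame may repeat, be skipped, or advance together with the other
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

# Example: 100 frames vs 150 frames of 12-dimensional features
a = np.random.randn(100, 12)
b = np.random.randn(150, 12)
print(dtw_distance(a, b))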
So, I have to use other kinds of technology as well. A lot of ASR techniques have been developed, a lot of sophisticated pattern matching techniques: SVMs, support vector machines, neural networks, and today all research is moving towards deep neural networks, DNNs. All kinds of things are happening; I am not reading out the slide. In the next class I will discuss the state-of-the-art speech recognition system: what its problems are, what the research issues are, the issue of low-resource languages, what data this kind of statistical model requires, what the problems are for such languages, what the alternative solutions are, and what research is going on. That I will discuss in the next lecture.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 40
Statistical Modeling of Automatic Speech Recognition
This is the state-of-the-art statistical model; this is the block diagram of a state-of-the-art statistical model. So, what do we want here? The problem in ASR is this.
Let us say I have spoken a sentence, and this is the acoustic signal; its measurement is A. There is an acoustic signal which is framed, parameters are extracted, and that measurement is called the acoustical measurement. I have recorded the acoustic signal and extracted the parameters frame-wise, as I discussed, at either 100 frames per second or 50 frames per second, as you wish. That is the acoustical measurement. Then what is ASR? From the acoustical measurement I have to predict which word I have spoken. So, given the acoustical measurement, I have to find out which word was spoken; I can say the problem is the probability of W given the acoustical measurement. I want to maximize the probability of recognizing the correct word for a given acoustical measurement.
So, if this ASR is designed for the set of words w1, w2, w3, w4, w5 and so on, then I ask: for a given acoustical measurement, what is the probability of each word in this list? Now I can apply Bayes' theorem: this is equal to the probability of A given W, times the probability of W, divided by the probability of A.
So, if I want to maximize this probability, the problem can be restated as maximizing this expression. Now, the probability of A, the probability of the acoustical measurement, is the same for every candidate word, so I can ignore it; it does not affect the maximization. I am not saying the acoustical measurement is not there; it is given, and with it given I have to recognize the word.
So, I can omit the probability of A, or treat it as 1. Now the maximization problem boils down to maximizing the probability of A for a given W, times the probability of W. What does that mean? For a given acoustical measurement, find the word that maximizes this quantity, the argmax: to which word is this acoustical measurement best mapped? That is converted to: for the given set of words, maximize the probability of this acoustical observation for each word, multiplied by the probability of W.
So, this is the acoustical measurement; one term is called the Acoustical Model and the other is called the Language Model. This is the statistical model of ASR. So, we develop an acoustical model. What do I have to do? Let us say w1, w2, and so on are given. Whatever I have measured as the acoustical measurement A, with the extracted parameters, I find the maximum of the probability of A given w1, w2, w3, ..., wn, that is, the probability of A for a given W. So, I take the argmax of the probability of A given W times the probability of W; the product has to be maximized. One factor is called the language model and the other is called the acoustical model.
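Written out with symbols, the decoding rule just described is (with W-hat denoting the recognized word string, and the constant P(A) dropping out of the maximization):

\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \, \underbrace{P(W)}_{\text{language model}}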
So, the acoustical model is built using HMMs, and the language model using unigram, bigram or trigram models. This part is the language model, bigram or trigram, and this part is the HMM acoustical model. I have not discussed in detail how the HMM model is built. So, this is nothing but a maximization of probability. Now I come to the issues: what are the issues here?
So, what I said is that the acoustic signal is related to a set of words W. There is a source: a speaker has spoken the words; from the production of those words, from the produced signal, I measure the acoustic parameters, and from the acoustic parameters I want to detect this W.
So, I am estimating this W; this is the ASR system. Now, I am assuming the acoustic signal is related to the written text W; that means I am assuming that W-hat, the estimated output, is a word string of the written language, and the input message must also be expressible in written language. But suppose I am speaking a sentence spontaneously. How does a speaker generate the message when he says a sentence? There is message planning, then production. My ASR model tells me that whatever the produced signal is, it is nothing but w1 w2 w3 w4, some set of words which can be explicitly expressed in written form.
But there is a catch: spoken words and written words are not identical. When we listen to the words we can identify what was spoken.
What I am saying is this: if I prompted the speaker that you have to say w1 w2 w3 ... wn, that is, a text is given and the speaker reads it out, then for that signal this ASR model is valid. But if it is spontaneous speech, when I speak I do not care about what is written. Some words are omitted, many things change. Try it: record yourself reading one paragraph continuously, not a first-time reading; practise it ten times and then read it, and you will find that in many cases many syllables, even whole words, are dropped. Or, when human beings communicate, we never say the complete message word by word; even in continuous speech all words are not explicitly realized. In continuous speech the amplitude always goes down at the end of the sentence; I have seen many cases where the final syllable is simply not there, the acoustic signal is not there. So, this ASR model does not model what is in the speaker's mind; what the speaker wants to say is represented only by the prompting input text.
But if it is not prompted, all kinds of variation of spoken language will be there, and my model does not hold for that. So, this is the challenge; I will go into details. The next challenge is understanding the input message, because what the speaker does is understand the message and then speak. So, the speech production output is not the written W; it is spoken language. Whatever I estimate, I want to estimate this W only, but the acoustic signal contains only spoken information, not written information.
How they are different I will come to. Similarly, there is a problem in the language model also.
What I said is that P(W) is nothing but a language model, which can be developed using bigram or trigram probabilities. What is the bigram probability? In the bigram model, if I have a sentence w1 w2 w3 w4, I approximate its probability as P(w1) times P(w2 given w1),
times the probability of w3 given w2, then the probability of w4 given w3, and so on; that is the bigram model. Similarly, in the trigram model, we use the probability of w3 given w1 and w2, and so on. I can go up to the trigram model; if I go beyond the trigram model the complexity increases.
The required training data size becomes very huge; even to develop the trigram model the required data size is very large. So, for a less-resourced language, whose computerized resources are very limited, the development of trigram models and acoustic models is very difficult. With statistical methods it is true that if I have enough data then my statistical inference may be very good; if I have limited data then my statistical inference is very limited.
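As a small illustration of how such an n-gram language model is estimated from text, here is a sketch of a bigram model with simple maximum-likelihood counts; the toy corpus, the whitespace tokenization and the absence of smoothing are all simplifications made only for illustration.

from collections import Counter

def train_bigram(sentences):
    """Estimate P(w2 | w1) by counting word pairs in the training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

def sentence_probability(sentence, model):
    """P(w1 w2 ... wn) approximated as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for pair in zip(words[:-1], words[1:]):
        p *= model.get(pair, 1e-9)   # unseen pairs get a tiny floor (no smoothing)
    return p

model = train_bigram(["open this folder", "open the file", "close this folder"])
print(sentence_probability("open this file", model))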
So, the limitations of the statistical model: first, it does not make a distinction between spoken language and written language, as I have described; we have defined ASR for a given written W, but spoken and written language differ.
Second, the noisy communication channel model does not approximate the real situation; there is ambiguity involved, and I will discuss what that ambiguity means. For example, hearing "ice cream" I do not know whether it is "ice cream" or "I scream". Third, it is impossible to use a higher-order statistical language model: beyond the trigram I cannot go, but the trigram model does not give me enough language-model information.
So, what do I do? This leads to an open challenge, which is called spoken language recognition.
So, I have to develop a system for spoken language recognition. Spoken language is different from written language. Speech and text are physical signals; this slide is taken from Fujisaki (Refer Time: 14:33).
He said, regarding spoken language versus written language, that speech and text are not only physical signals but also two different code systems; the information they carry is not identical. "Spoken language" refers to both the signal and the system, and the same applies to written language.
Now, let us discuss the difference in ambiguity. Something may be ambiguous in spoken language but not in written language, and something may be ambiguous in written language but not in spoken language. That is homophony and homography. Ambiguous in spoken language but not in written language: "flower" and "flour"; the pronunciations are the same,
but the spellings are different, so in written language they are not the same. Similarly, homographs are written the same but pronounced differently, so they are ambiguous in written language but not in spoken language.
Now, if I come to prosody, this is an important issue in spoken language. Spoken language does not only contain the words; as I said, words are not the only linguistic information, there is also information carried by prosody. Prosody concerns how the message is planned; rhythm is important. I said that the word boundaries of written language do not exist in the spoken form, but there are boundaries in spoken language, spoken-language boundaries, not written-language boundaries. In English there are well-known examples.
"Old man and woman": if I say "old man and woman" with one phrasing, then both the man and the woman are old; with a different phrasing, the man is old but the woman is not.
Similarly, in Bengali I will give you an interesting example [FL]: I can say [FL] one way, or I can say [FL] another way; the meanings are different, and if you look at the word boundary positions, the positions of the boundaries are different although the words [FL] are the same. So, there is prosodic ambiguity, but the prosodic information is very important.
Now, for spoken language, spoken-word identification is very important: word boundary marking, and deciding what kind of word the spoken word corresponds to, as in the "ice cream" versus "I scream" spoken-language ambiguity. I am not going into further details of written language versus spoken language.
So, our team is doing some research on spoken-word, or prosodic-word, based continuous speech recognition. We said that written-language words are different from spoken-language words. So, instead of training the system with written-language words, can I train the system with prosodic words, that is, spoken-language words?
So, we developed a system which is trained on spoken words, for Bengali. Now, identification of the spoken words is important; the spoken words are identified from a spoken-language database. I collect continuous speech, identify the spoken word positions based on the prosodic information, create a dictionary from those words, and tell the recognizer that these are the spoken words, not the written words W. So, I can say my statistical model estimates only S-hat from S, not W-hat from W, because W has been converted to S.
So, what I am saying is that I have S, I estimate S-hat, and then I try to convert it to W using a language model.
So, there is a lot of research in this area; this is the proposed model. Prosodic word boundaries are identified; there are theses on this, and my own research is also in this direction: prosodic word boundary identification, manner based labelling, pseudo-word formation. If you want the details, you can contact me and I can share all the detailed information with you; I am not discussing it in detail here. So, this completes the speech recognition part of speech technology. In the next class let us discuss something on speech based learning, or my dream research, accent conversion; we may spend 10 to 20 minutes on it in the next class.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 41
Speech Based Technology Development For E-Learning
So, there are other kinds of applications; let us discuss them. We have discussed ASR and we have discussed speech synthesis. I am not going into the details of ASR; I have not discussed the HMM model or the Viterbi algorithm in detail, because I think that requires another course. But I have given you a glance at ASR and the research issues in ASR.
There is a lot of research going on beyond the conventional statistical model: today you will find a lot of work on deep neural network based ASR development, and on manner based, or speech attribute based, ASR development. This is because the ASR problem has not yet been solved by the statistical model; it serves the purpose to a certain extent, but it is not solved.
The human-level speech recognition problem, you can say, has not been solved to the extent that we can use it everywhere; for limited applications it is good, but when the application is scaled up there are problems. There are also resource constraints: for languages which have fewer resources we cannot develop a statistical ASR system, because the word error rate will be very high, and ASR performance is measured by word error rate.
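Since the lecture mentions that ASR performance is measured by word error rate, here is a small sketch of how WER is usually computed, with an edit-distance alignment between the reference and the hypothesis; the example strings are made up for illustration.

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("open this folder", "open the folder"))  # 1 substitution / 3 words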
(Refer Slide Time: 02:05)
So, I am not going into details on that. Another area of research is spoken language acquisition: suppose I want to teach English, that is, second language acquisition or spoken language acquisition. Let me describe the problem. The government of West Bengal decided that English will be taught from the primary level; earlier it started from class 5 or 6, and now it has been decided that English can be taught from class 2 or 3.
Now, if we visit many schools in village areas, you see the pronunciation is different; the teachers' pronunciation is also not acceptable. So, if we want to spread spoken English, or spread English literacy, in village areas, I want to use the computer, because English is a second language there; English is not their mother tongue. Can we develop e-learning tools, an e-learning platform, by which we can teach all the school children English pronunciation? Because if the pronunciation is not correct, then spelling mistakes follow. So, spelling and pronunciation, and orality: orality comes first; if you read history, before the written script came, orality existed.
So, orality is the oldest form of communication; first we do it orally, then we go for writing. So, if I want to teach the spoken language, spoken English, second language acquisition in English:
can we develop a system where there is a reference speech, an expert speech, recorded in the computer? In front of the computer the child speaks into a microphone and pronounces some English word, and it is compared with the expert speech. Two kinds of parameters, segmental and supra-segmental, are compared with the expert speech, and the system tells the child which area to work on, which area to improve, or it may simply say OK.
Let's say the target pronunciation of a word is "zoo"; the system says you are not pronouncing "zoo", you are pronouncing "joo", so change "joo" to "zoo". I can listen many times and try to pronounce it, and once it matches the computer says yes, you have achieved it. That way I can train the students: they match their pronunciation with the expert speech and the computer tells them where it differs: no, no, it should sound like this. It is like acquiring music from a guru: you sing a song and the guru says no, no, you are not reaching that upper note; that is the guru's perceptual judgment.
Now I am replacing that guru by computer judgment. If I initially pronounce "joo", the computer says no, no, it should be "zoo"; I try again, and it says it is still not "zoo"; and once I finally say "zoo" properly, I can say "zoo" is achieved.
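One simple way to sketch such a comparison between a learner's utterance and the expert's recording is to extract MFCCs (segmental) and F0 (supra-segmental) from both and compare them after DTW alignment. The use of the librosa library, the file names, the thresholds and the scoring below are all assumptions for illustration, not the system described in the lecture.

import numpy as np
import librosa

def compare_with_expert(learner_wav, expert_wav, sr=16000):
    """Toy pronunciation comparison: DTW distance over MFCCs (segmental)
    plus a rough difference of mean F0 (supra-segmental)."""
    y1, _ = librosa.load(learner_wav, sr=sr)
    y2, _ = librosa.load(expert_wav, sr=sr)
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13)
    cost, _ = librosa.sequence.dtw(X=m1, Y=m2)          # segmental mismatch
    segmental = cost[-1, -1] / (m1.shape[1] + m2.shape[1])
    f0_1, _, _ = librosa.pyin(y1, fmin=75, fmax=400, sr=sr)
    f0_2, _, _ = librosa.pyin(y2, fmin=75, fmax=400, sr=sr)
    supra = abs(np.nanmean(f0_1) - np.nanmean(f0_2))    # crude prosodic mismatch
    return segmental, supra

# seg, pros = compare_with_expert("child_zoo.wav", "expert_zoo.wav")  # hypothetical files
# print("try again" if seg > 5.0 else "well done")                    # threshold is made up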
So, that is one of the finest applications you can develop; you can take up a project, let us say starting with spoken language: teaching spoken language through the computer. You can take the help of language pedagogy to decide which kinds of words have to be taught at which level; the pedagogy is there in the language background, in language pedagogy. The second problem I want to describe is accent conversion.
If you see this slide: suppose a Japanese professor is taking a lecture; I am not saying that I am the Japanese professor taking the lecture.
Many of us, since we are not accustomed to Japanese-accented English, if we sit in that class, may not understand that accented English; the clarity of that accented English will be very poor for us. I am not clear what the professor wants to communicate to me. What will happen? I will not understand his lecture completely; I will understand, but not completely, because I am not accustomed to that accent. After some time I will switch off; I will not even try to understand. But if the same lecture is listened to by a Japanese student, he is accustomed to the Japanese accent and he enjoys the lecture.
Similarly, if I am taking a lecture and it is listened to by a Japanese student, he is not accustomed to my English accent, so he will not enjoy my teaching, because he does not understand 100 percent and is not 100 percent attentive to my lecture. Now, suppose I put a device in the middle: whatever I say is my accented English; I am not converting the language, I am speaking my accented English, and when it is heard by a Japanese student, he hears Japanese-accented English and understands. I am speaking with a Bengali accent because my first language is Bengali; so it is Bengali-accented English I am speaking.
Everybody's spectacles are personalized; that means the spectacles are prescribed as per the power or condition of my eyes, so that I can see what I want to see. Now think about speech spectacles: you are speaking to me, but your accent is different, so I cannot understand; let me put on my device, which converts your accented English to my accent, and oh, I understand fully. So, the ultimate aim is to develop a speech spectacle. Suppose an American professor is giving a lecture, English is my second language and I cannot understand what he is saying; then I wear my speech spectacle, I put my device there, it converts to my accent, and oh, I understand.
So, this is the ultimate dream: can I develop this speech spectacle? It is an open-ended research problem; a lot of people are working in this area. Any student can take it up, and teachers can assign this project, because it is possible: speech has two kinds of parameters, segmental and supra-segmental. That means, roughly, prosodic modification for the accent and contextual segmental modification. So, a lot of study is required, both segmental study and supra-segmental study. There is a thesis in this area on the difference between native American English, meaning speakers whose L1 is American English, and English spoken by Bengali people whose L1 is Bengali.
So, there is a thesis on that: what segmental and supra-segmental differences exist, whether they are acquirable or not, and how they can be converted; all these things are discussed in that thesis. If you are interested you can search for it. Some of this work was done under my guidance; one student has done it. He has compared, or you can say found out, the segmental and supra-segmental differences between L1 American English and L2 Bengali English, that is, speakers whose L1 is Bengali but whose L2 is English. There is also the Ishka project; there is a lot of study going on, since English is, you can say, a world language. So, there is a lot of study on the English of different countries.
Malayalam speakers speaking English, or, you can say, Thai people speaking English; all kinds of such studies are going on. But this is my ultimate dream: one day I want to develop a system which will do this conversion, say for English, and it can be used for any pair of languages. Suppose somebody is speaking Bengali as a second language, suppose Hindi people are speaking Bengali; oh, it sounds strange. So, what do I do? I put a converter there, and I hear it the way I want to listen. So, this can be applied to any pair of languages; that is the open-ended challenge. I can say I want to develop a speech spectacle, personalized to how I want to listen.
(Refer Slide Time: 13:03)
So, that is there. I am not going into the details of orality; there is a history. Now, how is it possible? Yes, it is possible: when you produce speech there are segmental and supra-segmental aspects of the speech, and there is linguistic, paralinguistic and non-linguistic information; driven by this information, the segmental and supra-segmental aspects are produced, and on the other side there is processing for information extraction.
So, here the whole speech, the segmental and supra-segmental acoustic signal, is provided. Then there are physical constraints and speech sound analysis; physiological constraints, motor commands and excitation; prosody analysis, utterance analysis, rules of grammar, message analysis; and the output is the linguistic, paralinguistic and non-linguistic information.
So, all of this I can do from the speech parameters as well. That is open-ended research; we are still working on it. There are some studies, taken from the net, on accent conversion; research directions include accent conversion through concatenative synthesis, accent conversion through articulatory synthesis, and accent conversion through cross-speaker articulatory synthesis. Here we are proposing a technique, which is still at the research stage: since speech is a combination of segmental and supra-segmental features, we try to change the segmental and supra-segmental features using a deep belief network or deep neural network kind of approach. Can we do that? It is still at the research stage; we are working on it.
So, this is the end of the speech applications I have discussed. I can discuss more applications next week, such as prosody modeling, speaker identification and speaker verification; I can discuss the preliminaries, so that you have a rough idea of what speaker identification and speaker verification are. And I will give you some idea about prosody modeling, which we are doing in our lab, speech prosody modeling for the Bengali language; we have been working in this area for a long time. There are HMM based approaches, which I will discuss; there are a lot of papers also. So, I will discuss some research issues on those things. I am not discussing the soft computing side in detail; you can read about soft computing and try to correlate it with these speech applications.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 42
Prosody Modeling
So, let us start with speech prosody. We said that in the 8th week I will discuss speech prosody. More or less all of this discussion will be related to the Fujisaki model, and also to some work that has been done for Bangla in my lab; that I will also discuss. So, what is speech prosody? First, people will ask what prosody is, what we mean by prosody. If you remember, in speech synthesis I said that whatever synthesis technique I use, whether diphone based or unit selection based, prosody is an important parameter, because if you want the speech to be natural communication, then without prosody the synthesis becomes unnatural.
So, I want to implement speech prosody in synthesized speech. On the other hand, prosody can be used in speech recognition also, because the speech signal carries not only the segmental information but also prosodic information, which may be meaningful for improving the speech recognition output; and prosody may be used in speaker and voice identification as well. So, let us discuss speech prosody first, and then you can think about what kind of information processing should be used for what kind of application.
Now, what is speech prosody? Prosody, as defined by Fujisaki, is the systematic organization of individual linguistic units into an utterance, or a coherent group of utterances, in the process of speech production. The meaning is that when I read a paragraph or say a sentence, it is not an arbitrary rhythm I follow; I follow some rhythm when I pronounce the sentence, which is dictated by the syntactic structure of that sentence. You may say that people who do not know text processing, or who do not know the alphabet, also speak with prosody; yes, because this is an acquired phenomenon for a human being, but if you analyze it you will find there is a definite relationship between the structure of the sentence and the prosody.
So, there is that relationship with speech prosody. Prosody conveys some information which may not be linguistic; it also conveys non-linguistic and paralinguistic information. Take emotion recognition: a lot of people today are working on recognizing emotion from speech. Emotional information lies mainly in the speech prosody, because it is the prosodic parameters that change; that is why at different times I produce different kinds of emotional speech. If you remember, the content, which phonemes I pronounce, is the segmental feature; all the phonemes come together with the melody and then we get complete speech, but how this melody, how this speech, varies across the sentence depends on the speech prosody. So, prosody is defined as the systematic organization of individual linguistic units into an utterance or a coherent group of utterances in the process of speech production.
So, prosody operates across segmental boundaries; that is why all the prosodic parameters are called supra-segmental parameters. What are the parameters which control prosody? Mainly these four. I am not going into voice quality, which is sometimes also referred to as a prosodic parameter. The first is the pause: pause means where in the sentence I have stopped.
If you look within continuous speech, between words there is often no boundary, because there is no silence; we never say one word, then pause, then say the next word. But a pause in the appropriate position has an important role in conveying information: if I put a pause in an arbitrary position, the speech becomes unnatural and the meaning of the speech may also change. So, pause is an important parameter in speech communication; pause means where I break the utterance.
As a definition, it is very important to know the difference between a sentence and an utterance: in written language we say "sentence", in spoken language we say "utterance". It is not true that one sentence always corresponds to one utterance. An utterance is whatever I produce in a single stretch without resetting the whole system; after each and every utterance the system is reset, meaning all prosodic parameters start afresh. So, that prosodic break defines the utterance.
Generally, if there is a long gap, a silence of more than 300 milliseconds, you find that the prosodic parameters are reset; then I can say that is an utterance boundary. So, suppose I record a long sentence; it may contain 3 or 4 utterances, because after one stretch there may be a pause of more than 300 to 400 milliseconds, and then that stretch is a single utterance although it is part of the same sentence. So, in spoken language the utterance is the chunk; in written language we take the sentence as the chunk. That is the sentence level and the utterance level, you can say. Within the utterance there are pauses inside; those pauses carry important information, and the placement of pauses in spoken language is not arbitrary: depending on the language structure of the utterance, the pause positions are determined by the speaker. Why do we say that? If we put pauses in arbitrary positions, the speech is unnatural and the meaning may change. Then there is intonation, or F0 modeling: across an utterance, F0 is not constant; the fundamental frequency of the utterance is not constant.
Suppose I have a speech signal of one second, and I extract F0 every 10 milliseconds; then I find there is a movement of the fundamental frequency, and the fundamental frequency contour is not continuous, because in the utterance there may be unvoiced regions. If the speech is unvoiced there is no F0, because F0 comes from the vocal cord vibration; if the vocal cord vibration does not exist, there is no F0. But if you look at the F0 control throughout the sentence, you find a continuous movement of F0 in a particular pattern. That pattern is called the intonation of the utterance, or the F0 control of that utterance.
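As a sketch of extracting an F0 contour every 10 milliseconds, the following uses librosa's pYIN tracker; the frequency range, hop size and file name are assumptions for illustration, and unvoiced frames come out as NaN, which is exactly the discontinuity mentioned above.

import numpy as np
import librosa

def f0_contour(wav_path, sr=16000, hop_ms=10):
    """Return times (s) and F0 (Hz) sampled every hop_ms; NaN marks unvoiced frames."""
    y, _ = librosa.load(wav_path, sr=sr)
    hop = int(sr * hop_ms / 1000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    times = np.arange(len(f0)) * hop / sr
    return times, f0

# t, f0 = f0_contour("utterance.wav")   # "utterance.wav" is a placeholder file name
# print(np.nanmin(f0), np.nanmax(f0))   # range of the intonation contour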
So, F0 is not flat over the whole signal; it moves along the signal, and if I interpolate the contour and make it continuous, I find that F0 moves with a particular rhythm or pattern; that is called intonation. Next, let me show you duration; suppose this is a sentence.
If I measure the duration of each phoneme, or of each syllable instead of each phoneme, I find that the durations of the syllables are not equal throughout the whole sentence.
That means the duration of the segmental units changes along the utterance; I say utterance instead of sentence. So, along the utterance the duration profile changes; that is the duration of the phoneme or syllable, whichever you use; you can model duration in terms of syllables as well. So, syllable duration changes across time, and duration is an important parameter which conveys the speech prosody: if you copy the segments and paste them back with a different duration pattern, you will find that the prosody changes.
So, duration modeling: duration is also an important parameter which controls the speech prosody. Loudness, or amplitude: over the whole utterance the amplitude is not fixed; if you take the average amplitude every 10 milliseconds and plot it, ignoring the unvoiced regions, over the voiced regions you find the amplitude moving; sometimes it falls, or falls and rises again if there is emphasis on the last words. So, there is amplitude movement also, but people say that amplitude is not that important for prosody, because even if the amplitude is kept constant, if you are able to model the duration, F0 and pauses, then you can convey the prosodic information of the speech. So, the prosodic parameters of speech are pause, intonation or F0, duration, and loudness or amplitude.
Now, let us discuss them one after another in terms of modeling. We have done some work on pause modeling. Think about reading a paragraph aloud: I am reading a story written on a page, something written in one language; it may be English, Bengali or Hindi. If I read that text, you find this kind of structure: the whole text has a larger discourse, and every paragraph has a discourse segment. Depending on that discourse, my voice quality changes. Suppose in some paragraph somebody dies in the story; in the next paragraph, when I describe that, the intonation changes, my prosody changes, because of that pragmatic information.
So, discourse is an upper level; the spoken discourse depends on the whole story, the whole paragraph, the whole text. Then there is the discourse segment: each paragraph has its own discourse, and after every paragraph a pause is mandatory.
Now, a paragraph consists of a certain number of sentences, let us say 10 sentences; after every sentence, you know, there is a pause in speech. Each sentence consists of certain clauses, and after every clause a pause is mandatory; if this pause is more than 300 milliseconds, then I treat the clause as an utterance, and if it is less than 300 milliseconds then they are not separate utterances, but there is still a pause between the clauses. So, after every clause a pause is mandatory, and a clause may consist of several phrases. I am working down through the written language structure and trying to correlate it with the spoken language. So, every clause has phrases, and every phrase can be modelled; I will show you in one prosody model.
So, suppose there are phrases: phrase 1, phrase 2, phrase 3. It is often noticed that even though in the written language there is a phrase 1 and a phrase 2, in the spoken language phrase 1 and phrase 2 may be merged together into a single phrase, with phrase 3 remaining another one. We call these prosodic phrases; I will come to them later on, and similarly we have prosodic clauses and prosodic sentences. A prosodic clause may correspond to an utterance: I call it a separate utterance if the pause is more than 300 milliseconds; if not, then clause 1 and clause 2 together make a single prosodic utterance. Then, below the phrase level, every prosodic phrase may consist of several words.
So, there will be the written words W1, W2, W3. Similarly, suppose there are linguistic phrases PP1, PP2, PP3 obtained from the linguistic analysis of the written text; if you analyse the spoken version, you may find that PP1 and PP2 merge together to form a single spoken phrase, which is called a prosodic phrase.
Now, every phrase may consist of several words, W1, W2, W3, W4. Again, for the written words, in the spoken form you may find that W1 and W2 are pronounced together (I will come later to what I mean by pronounced together) and form a prosodic word. So, on the spoken side we start from the utterance: in text it is a sentence, in speech it is an utterance. The utterance consists of prosodic clauses, a prosodic clause consists of prosodic phrases, and a prosodic phrase consists of prosodic words; similarly, on the text side, the sentence consists of clauses, phrases and words in text processing. Later on I will explain this in one diagram.
Now, the problem is this: the occurrence of a pause after every clause is mandatory, but after a phrase a pause may or may not be there.
So, I have to build a model which can predict after which phrase a pause is necessary, and of how much duration. That is, I have to predict the occurrence of a pause after a phrase and, if there is a pause, its duration. What do we mean by a pause? Let us look at a sentence. If you analyse a sentence, you find that after some stretch of the utterance there is a pause, a region of silence. So, that region of silence can define a pause. But there may not be a region of silence; what happens instead is that you see a large movement of all the prosodic parameters.
So, maybe there is a dip: the F0 contour goes down and then goes up again. There is a kind of resetting of the prosodic parameters, duration and F0. That can indicate that there is a break in the speech chunk; that chunk is a prosodic phrase. So, regarding the occurrence of a physical pause: there may be a physical pause, or there may not be a physical pause, but there may be a break of co-articulation or a break in the prosodic parameters. Such a break of prosodic parameters or of co-articulation can also be defined as a pause, a pause with zero duration; the co-articulation break, when I play the speech, tells me that this is one chunk of the speech.
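As a rough illustration of the first, physical notion of a pause, here is a small sketch that marks low-energy frames as silence and keeps only silent stretches longer than a minimum length; the 300 ms figure comes from the lecture, while the energy threshold and the 10 ms frame size are assumptions made purely for illustration.

```python
import numpy as np

def find_pauses(signal, sample_rate, frame_ms=10, energy_thresh=1e-4, min_pause_ms=300):
    """Return (start_s, end_s) of silent stretches at least min_pause_ms long."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    silent = np.mean(frames ** 2, axis=1) < energy_thresh   # True = silence

    pauses, start = [], None
    for i, s in enumerate(silent):
        if s and start is None:
            start = i                                        # silence begins
        elif not s and start is not None:
            if (i - start) * frame_ms >= min_pause_ms:
                pauses.append((start * frame_ms / 1000.0, i * frame_ms / 1000.0))
            start = None
    if start is not None and (n_frames - start) * frame_ms >= min_pause_ms:
        pauses.append((start * frame_ms / 1000.0, n_frames * frame_ms / 1000.0))
    return pauses
```

A stretch returned by this function would mark a clause or utterance boundary in the sense described above; pauses realised only as a prosodic reset, with zero silent duration, obviously cannot be caught by energy alone.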
So, after studying these things, we looked at the factors affecting the occurrence and duration of the sentence-medial pause. We call it the sentence-medial pause because at the end of a sentence a pause is mandatory and at the end of a clause a pause is mandatory; but within a clause, deciding after which phrase I should put a pause, and with how much duration, is what we call sentence-medial pause modelling. We take these parameters: the type of the phrase, its length, and the distance between the current phrase and its dependent counterpart. This is what we have done for Bengali; it may or may not hold for other languages, I do not know, we have not tried it.
So, suppose I have a sentence with four phrases P1, P2, P3, P4; each may be an NP, a VP and so on. We found that the type of the phrase is an important parameter; then the length of P1, which we measure in number of syllables; and also whether P1 is dependent on P2 or not. I will show you: we draw a kind of dependency structure from which we can say whether one phrase depends on another and how far away the head is. Suppose this phrase depends on the next phrase, and another phrase depends on a phrase two positions away; then in the first case the distance is one and in the second case the distance is two. As the distance increases, the probability of a pause increases.
So, we take these three kinds of parameters and then we analyse them. I can give an example like this; it is a Bengali example, you can see it here.
[FL]. So, it is a Bengali example. These are the words W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, and we have marked which words are directly related to [FL]: this word is related to [FL], this one is related to [FL], and this one is related to [FL]. Now, where a word is related to a word that is far away, the distance is high, so there is a high probability of a pause at that boundary; where a word is very closely related to the next word, the probability of a pause there will be very low. Then this word is related to that word, so the probability of a pause here is high, and the probability of a pause there is lower. That is how we have defined it and done these things; you can read the paper, the paper on Bangla pause and duration modelling is available.
So, depending on those three parameters, we analysed the occurrence of pauses in read-out Bangla text. For Bangla we collected read-out sentences, because we want the textual information to correlate with the spoken information. That is, we took some written text, some sentences, had those sentences read by some speakers, and then we analysed the pauses.
And then we tried to find the relation between those pauses and these three parameters, and after analysing we tried to develop a linear model using these three factors so that we can predict the pause.
One part is the occurrence probability prediction: whether after this word, or after this phrase, there will be a pause or not. We calculated that probability and did a linear regression analysis; it is a linear model, and it could perhaps be improved by non-linear modelling, we have not done that, but you can try non-linear modelling as well. The other part is the duration prediction: finding the pause duration, again with a linear model. We calculated the cumulative percentage of prediction error, both for pause occurrence prediction and for pause duration prediction, and reported it in the papers.
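A minimal sketch of this kind of modelling is given below, assuming we already have, for each phrase boundary, the three features the lecture mentions (phrase type, phrase length in syllables, dependency distance) together with observed pause labels and durations. The feature encoding and the toy data are illustrative assumptions, and the logistic model for the occurrence probability is a stand-in, not the published Bangla model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# toy data: one row per phrase boundary
# columns: phrase type (0 = NP, 1 = VP, ...), length in syllables, dependency distance
X = np.array([[0, 3, 1], [1, 5, 2], [0, 2, 1], [1, 7, 3], [0, 4, 2], [1, 6, 1]])
pause_present  = np.array([0, 1, 0, 1, 1, 0])                   # observed: pause or not
pause_duration = np.array([0.0, 0.22, 0.0, 0.35, 0.18, 0.0])    # seconds

# occurrence model: probability of a pause after the phrase
occ_model = LogisticRegression().fit(X, pause_present)

# duration model: linear regression on boundaries where a pause occurred
has_pause = pause_present == 1
dur_model = LinearRegression().fit(X[has_pause], pause_duration[has_pause])

new_boundary = np.array([[1, 6, 2]])
print("P(pause) =", occ_model.predict_proba(new_boundary)[0, 1])
print("predicted duration (s) =", dur_model.predict(new_boundary)[0])
```

Replacing the two regressors with non-linear models is exactly the extension the lecture suggests as future work.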
Now, there is an example here as well. If you listen to it, you will find two synthesized versions of the sentence: one uses the pause modelling and one is just simple concatenative synthesis. You will find that using the pause model improves the clarity. So, I can say that my pause model somewhat improves the synthesized speech. But if you think about today's TTS, the TTS built with HTS, the HMM-based TTS systems, this kind of explicit modelling is not required, because we train the system with a lot of original data; that means the occurrence of pauses within the corpus is already taken care of by the HMM model.
I will show you that in HMM-based synthesis the pauses and durations are captured inside the trained model, and F0 is also modelled to some extent; but F0 is not modelled correctly, and I will show you what problem these systems have and how it can be overcome.
So, that was pause. Now let us come to F0, which is a very important parameter, because even beyond pause and duration, if you model F0 itself, it can increase the clarity of the synthesized speech. So, F0 modelling is an important part of prosody modelling.
I will discuss the available F0 models and then, in detail, the Fujisaki F0 model. So, in the next lecture I start the F0 modelling.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture – 43
Fundamental Frequency Contour Modeling
So, let us start with F0 modelling. What kind of information does F0 carry? In many languages the temporal change of F0 is phonemic; those are called tonal languages: the variation of the tone can change the meaning. So, in tonal languages the temporal variation of F0 can change the meaning of a word. Similarly, the main parameter defining stress is F0: if I emphasise a word, you will find that I have increased the F0 there. So, the change of F0 defines the tone and defines whether a word is emphasised or not. Also, English is a contrastive stress language, so the stress depends on the context; if I put arbitrary stress on the words, then my pronunciation of English will not sound native-like.
There is also this point: take Bengali, or Indian languages in general; most of the Indian languages are bound stress languages, which means the stress is assigned at the beginning of every prosodic word, not of the linguistic word or the written word; for every prosodic word the stress falls at the beginning, which is why it is called a bound stress language. English, on the other hand, is a contrastive stress language. So, F0 conveys linguistic information: like tone it can be phonemic, meaning that variation of the tone changes the word meaning, and F0 also conveys the prosodic meaning of the sentence; if I put the stress in different locations, the meaning may change. Take "the old man and women": said one way, it means the man is old but the women are not.
But said the other way, "the old man and women" means the man and the women are all old. So, F0 carries linguistic information, and not only linguistic information: F0 also carries para-linguistic and non-linguistic information. Whether the speaker is male or female can be judged from the F0 range: for a male speaker the average F0 usually does not go above about 180 Hz, but for a female speaker it starts from around 200 Hz and can go up to 300 Hz, and for child speech it can start around 250 Hz and go up to 350 Hz.
So, age and all kinds of non-linguistic and para-linguistic information are carried by F0; attitude, emotion and speaking style can all be found in the F0 contour. There is a lot of work on F0 modelling, and these slides are taken from Fujisaki's F0 modelling material, which is why Fujisaki's copyright is indicated. There are three approaches to the description and representation of the F0 contour. As I said at the beginning, every utterance has an F0 movement.
So, there is an F0 movement, and this movement of F0 can be described in three ways: labelling (the ToBI approach), stylization, and modelling.
That is, I can build a model that generates the contour, or I can assign labels: it is rising, then flat, then falling; rising, flat, falling, soft rising, slow rising, slow falling, sharp falling; all kinds of labels can be assigned, and the F0 described that way. That is the ToBI approach.
Similarly, I can do a stylization, which means I linearly approximate the F0 contour. The whole F0 contour is then nothing but a sum of linear segments, within each of which the slope is fixed; this is called piecewise linear approximation. Or I can build a mathematical model such that, by changing its parameters, I can generate the contour; that is the Fujisaki command-response model, which we will discuss in detail. So, the F0 of a whole sentence can be handled with these three approaches: ToBI labelling, stylization, and modelling.
ToBI is nothing but assigning labels on the F0 contour, perhaps syllable-wise: this is rising, this is flat, this is falling; then the pattern of the rise, sharp rising or gentle rising; those kinds of labels are used in the ToBI model. In stylization I approximate the whole F0 contour by a set of linear segments and then model those segments. Or I can build a mathematical model which has some parameters, and I can vary those parameters to generate the F0 contour; that is called the generation-process model, the Fujisaki generation-process model. We will discuss the generation-process modelling in detail.
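To give a feel for the stylization idea, here is a small sketch that approximates an F0 track by a piecewise linear curve with a fixed number of segments; the use of equally spaced breakpoints and the synthetic contour are my simplifications for illustration, real stylization methods place the breakpoints more carefully.

```python
import numpy as np

def stylize_f0(times, f0, n_segments=5):
    """Approximate an F0 track by n_segments linear pieces joined at equally spaced breakpoints."""
    knots_t = np.linspace(times[0], times[-1], n_segments + 1)
    # take the F0 value closest in time to each breakpoint as the knot value
    knots_f = np.array([f0[np.argmin(np.abs(times - t))] for t in knots_t])
    return np.interp(times, knots_t, knots_f)   # piecewise linear reconstruction

# hypothetical example: a slowly falling contour with one local rise
t = np.linspace(0.0, 1.0, 200)
f0 = 180 - 40 * t + 15 * np.exp(-((t - 0.5) ** 2) / 0.005)
approx = stylize_f0(t, f0, n_segments=6)
print("mean absolute error (Hz):", np.mean(np.abs(f0 - approx)))
```

The labelling and generation-process approaches described next operate on the same contour but represent it symbolically or parametrically instead of geometrically.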
So, this is the Fujisaki generation-process model of the F0 contour. If you remember, we said F0 has two kinds of variation: a local variation and a global variation. Here the green line is the global variation and the red lines are the local variations. The local variations are called accents and the global variation is called the phrase. That is why it is called a command-response model: it is driven by a phrase control mechanism and an accent control mechanism. The phrase command controls the phrase part, the accent command controls the accent part, and there is a baseline; the Fujisaki model is the mathematical formulation of all this.
So, let us try to derive this mathematical model, or describe it as it appears in Fujisaki's papers; I will come to that.
So, here you see a section showing where the vocal cords are. The vocal cords are here, and if you remember, the vocal folds are attached at one end and open at the other; this is the closing and the opening. They are housed on muscles, and those muscles control the closing and the opening; these are the muscle structures.
Now, the vocal cords are housed in a framework which is moved by these muscles: the muscles move the framework, that changes the tension of the vocal cords, and that creates different F0 values. I will describe in detail how that creates different F0.
So, let us first look at the stress-strain relationship of skeletal muscle; that has already been studied and the relationship is known. This is the physical property of skeletal muscle: the framework which houses the vocal cords is connected and held by muscles, and those muscles have the properties of skeletal muscle. In the first graph the x axis is the tension T and the y axis is the incremental tension dT/dx, and the relationship is dT/dx = b(T + a), where a and b are constants (a is related to the stiffness at zero tension). The second graph is muscle elongation x versus tension T: how much the muscle has to be elongated to produce a certain tension. Starting from zero tension, the relationship is T = a(e^(bx) − 1); so if I elongate the muscle, the tension created follows this equation.
Now, starting again from the stress-strain relation dT/dx = b(T + a), I can write the variable as l, the length of the vocal cord, with T the tension; then the relation is dT/dl = b(T + a). If I integrate both sides, I get T = (T0 + a) e^(b(l − l0)) − a, where T0 is the static tension and l0 the static length, so that at l = l0 we have T = T0.
Then l − l0 is nothing but the change of length, which I define as x. So T = (T0 + a) e^(bx) − a, and if a is much, much smaller than T0, this reduces to T ≈ T0 e^(bx), which is equation 3 on the slide.
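Collecting the steps, the derivation sketched above can be written compactly in LaTeX form (a summary of the same steps, not additional material from the slides):

\[
\frac{dT}{dl} = b\,(T + a)
\;\Longrightarrow\;
T = \bigl(T_0 + a\bigr)\,e^{\,b(l - l_0)} - a ,
\qquad x = l - l_0 ,
\]
\[
T = \bigl(T_0 + a\bigr)\,e^{\,bx} - a
\;\approx\; T_0\,e^{\,bx}
\quad\text{when } a \ll T_0 .
\]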
Now, if T = T0 e^(bx), then the frequency of vibration of an elastic membrane is F0 = C0 (T/σ)^(1/2), where σ is the density per unit area; that is the relationship for the vibration of an elastic membrane.
Now, take the logarithm: log_e F0 = log_e [C0 (T/σ)^(1/2)]. Putting in T = T0 e^(bx), I can write
log_e F0 = log_e [C0 (T0/σ)^(1/2)] + (1/2) b x.
So, C0 (T0/σ)^(1/2) is nothing but a constant; let it be defined as Fb. Then log_e F0 is nothing but log_e Fb + (1/2) b x.
So, x is nothing but l − l0, and if x is time-dependent I can write it as x(t). So, the F0 contour, or rather the log F0 contour plotted on a logarithmic scale as a function of time, can be expressed as the sum of a constant baseline term and a time-varying term: a constant baseline plus a term that changes with time.
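As a tiny numeric illustration of this relation, the sketch below converts an assumed elongation trajectory x(t) into an F0 contour via F0(t) = Fb · exp(b·x(t)/2); the values of Fb and b and the shape of x(t) are made-up numbers, only the formula comes from the derivation above.

```python
import numpy as np

Fb = 100.0          # assumed baseline frequency in Hz
b = 20.0            # assumed muscle constant (per unit elongation)

t = np.linspace(0.0, 1.0, 100)             # one second of "speech"
x_t = 0.05 * np.sin(2 * np.pi * 1.5 * t)   # assumed elongation trajectory x(t)

log_f0 = np.log(Fb) + 0.5 * b * x_t         # log F0 = log Fb + (1/2) b x(t)
f0 = np.exp(log_f0)
print("F0 range: %.1f Hz to %.1f Hz" % (f0.min(), f0.max()))
```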
So, here the constant baseline is Fb, and then there is the time-varying term, which is (1/2) b x(t). How is this time-varying term generated? It is generated by the muscle movement: the vocal cords are housed in a cartilage frame connected by muscles, and the movement of that frame is what defines x(t). What kind of movement is it?
It is of two kinds; one is rotation around the cricothyroid joint, as you can see here.
This kind of housing is there. So, there are two kinds of movement: one is rotation and the other is translation. Looking at the vocal cords, both movements affect them: translation changes x = l − l0, and rotation does as well. Suppose I have a rope; if I twist one end, the tension on the rope increases, and if I pull it lengthwise, the tension also increases. These two kinds of movement increase the tension, and once the tension increases, that changes the F0; that changes the F0 contour, which is the time-dependent part x(t).
So, this kind of rotational and translational movement changes the fundamental frequency. If you see somebody, say a man, mimicking a female voice, how does he do it? The vocal cord mass is fixed, but the tension can be changed: when he talks normally his F0 may be around 100 or 120 Hz, but when he is mimicking, he can shift it towards the female voice range.
How does he do it? By practising exactly this kind of movement, rotation and translation; he practises it and he can move the F0 to that range. In general, roughly one octave of F0 movement is achievable: if my base F0 is 80 Hz, I can change it up to about 160 Hz, because that range is practised. When you are singing, what are you doing? You are practising the movement of the F0; you are actually practising the rotations and translations.
So, this rotation and translation can change the F0. Then how is this used for prosody, how is this continuous movement of F0 exploited? That is the important question: how is it used?
If you see, rotation provides a component x2(t) and translation provides a component x1(t). So, the total change is x1(t) + x2(t), and it appears in the log F0 contour; it is additive in the log F0 contour. So, x1(t) and x2(t) are added, and that changes the log F0 additively.
Now, relating this to language: we have phrasing and accentuation, a global variation and a local variation. If I say the global variation is the phrasing and the local variation is the accentuation, the local accent, then the accent is controlled by the rotation around the cricothyroid joint and the phrasing is controlled by the translation. So, the translation mechanism controls the global variation: translation changes the F0 slowly, roughly, over time, while rotation produces the local variation. The accent component and the phrase component are superimposed on the baseline F0, and that is how the F0 is shaped. The details of translation and rotation are given there.
So, you can see this kind of system: there is a membrane, and it can rotate. If the CT, the cricothyroid, is contracted, if this portion is contracted, the movement is rotatory; and if this other portion is stretched, it is a translatory movement.
Say this is my membrane; I am not drawing it exactly. If I move it like this, it is a rotation; if I move it to this position, this angle is nothing but θ. So, this is rotation; and if I translate it, the length becomes l + x, if I move in this direction it is l + x. Here l is the length of the vocal cord, x is the elongation of the vocal cord, and θ is the angular displacement of the thyroid.
So, if I make an angular displacement there will also be a length change, and if I make a translation there will also be a change of length; that change, l − l0, is what is represented by x(t), and this x(t) is the sum of the two changes in the logarithmic domain, one from rotation and one from translation. And this rotation and translation are nothing but spring-mass movements; and for a spring-mass movement there is a second-order differential equation.
If you remember the spring-mass movement of a body, mechanical vibration, you know that with a mass, a stiffness and a resistance we get a second-order system. So, I can write I d²θ/dt² + r dθ/dt + Kθ equal to the driving torque, where K is the stiffness, r is the mechanical resistance and I is the inertia of the mass, the mass of the muscle. From the mechanical resistance, mass and stiffness I can find the time variation of θ. If I solve this second-order equation, the solution is nothing but an exponential response. So, θ(t), and hence the accent response, can be expressed as a constant multiplied by an exponential curve, represented as the minimum of 1 − (1 + βt)e^(−βt) and γ.
This was experimentally verified by Professor Fujisaki, who found the values of β and γ for the Japanese language, and we verified for the Bangla language that the same β and γ work fine. So, if rotation gives the accent component, then Ga(t) represents that accent component, multiplied by a constant which controls its amplitude.
So, if the rotation angle θ(t) varies like this, the elongation follows with some delay, and the logarithmic F0 also varies in the same way. If you look at the accent component on top of the global component, it varies like this: even if I apply the command here, it takes time to reach its peak, and then it comes down again because of the elasticity. Then if I apply another command here, it rises again somewhere else, because it requires some rise time.
So, this is Fujisaki's slide that I have taken. The translation obeys the same kind of second-order differential equation, and its response is denoted Gp(t).
So, the total equation is of this form: a sum over i = 1 to capital I, where capital I is nothing but the number of phrase commands, plus a sum over the accent commands; in the standard form, ln F0(t) = ln Fb + Σ_{i=1..I} Api Gp(t − T0i) + Σ_{j=1..J} Aaj [Ga(t − T1j) − Ga(t − T2j)].
So, suppose I have a sentence; here is a Bengali sentence. The dotted black marks are the original F0 extracted from the speech; this F0 contour is aligned with the original speech of the Bangla sentence, and the Bengali sentence is written there.
The black dots are the original. The red line is the generated F0 contour, and the blue curve is the phrase contour produced by the phrase commands; the accent commands are added after the phrase commands are placed, and then the red line is generated. You can see that the red line follows the black dots almost exactly, so it is almost possible to model the F0 contour completely using the phrase commands and accent commands.
Now, about the number of phrase commands: i runs from 1 to capital I. How many phrase commands are there here? 1, 2, 3, and a fourth one with negative amplitude, so it acts in the negative direction; so there are four phrase commands, i = 1 to 4, and the Api are their amplitudes. Then the accent commands Aaj, with j from 1 to capital J, the number of accent commands; how many are there? 1, 2, 3, 4, 5, 6, 7, so there are seven accent commands, and Aaj is the accent command amplitude. Since the red line follows the dotted contour so closely, I can say it is possible to reproduce the original F0 contour essentially completely using these phrase commands and accent commands.
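Here is a minimal sketch of the generation-process model described above: phrase commands drive the second-order phrase response Gp(t) = α²·t·e^(−αt), accent commands drive the accent response Ga(t) = min[1 − (1 + βt)e^(−βt), γ], and their amplitude-weighted responses are added to ln Fb. The command timings, amplitudes and the constants α, β, γ below are made-up illustrative values, not the fitted Bangla parameters.

```python
import numpy as np

ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9   # assumed model constants

def Gp(t):
    """Phrase control response (critically damped second-order system)."""
    t = np.maximum(t, 0.0)
    return ALPHA ** 2 * t * np.exp(-ALPHA * t)

def Ga(t):
    """Accent control response, clipped at the ceiling gamma."""
    t = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + BETA * t) * np.exp(-BETA * t), GAMMA)

def fujisaki_f0(t, Fb, phrase_cmds, accent_cmds):
    """phrase_cmds: list of (T0, Ap); accent_cmds: list of (T1, T2, Aa)."""
    log_f0 = np.full_like(t, np.log(Fb))
    for T0, Ap in phrase_cmds:
        log_f0 += Ap * Gp(t - T0)
    for T1, T2, Aa in accent_cmds:
        log_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(log_f0)

# hypothetical utterance: two phrase commands, three accent commands
t = np.linspace(0.0, 2.5, 500)
f0 = fujisaki_f0(t, Fb=90.0,
                 phrase_cmds=[(0.0, 0.5), (1.3, 0.3)],
                 accent_cmds=[(0.2, 0.5, 0.4), (0.8, 1.1, 0.3), (1.6, 2.0, 0.35)])
print("F0 min/max (Hz): %.1f / %.1f" % (f0.min(), f0.max()))
```

Plotting exp of the phrase sum alone reproduces the slowly varying blue phrase contour of the slide, and adding the accent terms gives the red generated contour.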
Let me stop this lecture here today. Tomorrow I will play the voice recordings; let me arrange the playback so that I can show you how close the original speech and the synthesized speech are.
Thank you.
Digital Speech Processing
Prof. S. K. Das Mandal
Centre for Educational Technology
Indian Institute of Technology, Kharagpur
Lecture - 44
Fundamental Frequency Contour Modeling (Contd.)
So, continuing with prosodic modelling, the Fujisaki model: we tried it for the Bangla language long back, in 2010; we published a paper on Bengali F0 contour modelling based on the Fujisaki model in 2010, and based on that I present one slide to show you. This is the original speech; if you listen, and if you do not know Bengali, just try to judge the naturalness. [FL].
So, there is a Bengali sentence, which contains [FL]; that is the original sentence, and I am playing the original sentence. Then what we did was to check whether our generated contour, the analysis-by-synthesis model, is correct or not. We extracted the F0 contour, which is represented by the black marks here.
Those represent the original F0 contour. The blue line appears once I put in the phrase commands; that is the phrase component. The straight black line is the baseline, and after putting in the accent commands, the red line is generated. This red line is the synthesized F0 contour. If the synthesized F0 contour is imposed back on the original speech, the synthesized speech sounds like this [FL].
So, there is hardly any difference between the synthesized contour and the original contour. Quality-wise, it is possible to analyse and then resynthesize using this F0 model. We tried this for Bengali in many cases, and then we developed a Bengali TTS system based on the HTS framework and tried to do the F0 modification using the prosodic model. I will describe those things in this lecture, because that may open up some research initiatives on your own language.
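The analysis-by-synthesis step mentioned here amounts to adjusting the command parameters until the model contour matches the extracted F0. A minimal sketch of that idea is below, fitting only the command amplitudes by least squares while keeping the timings fixed; the response functions, constants and initial guesses are the same illustrative assumptions as in the earlier sketch, not the procedure used in the published Bengali work.

```python
import numpy as np
from scipy.optimize import least_squares

ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9   # assumed constants, as before

def Gp(t):
    t = np.maximum(t, 0.0)
    return ALPHA ** 2 * t * np.exp(-ALPHA * t)

def Ga(t):
    t = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + BETA * t) * np.exp(-BETA * t), GAMMA)

def model_logf0(amps, t, Fb, phrase_times, accent_times):
    """Log-F0 contour for given phrase/accent amplitudes at fixed command times."""
    ap, aa = amps[:len(phrase_times)], amps[len(phrase_times):]
    y = np.full_like(t, np.log(Fb))
    for A, T0 in zip(ap, phrase_times):
        y += A * Gp(t - T0)
    for A, (T1, T2) in zip(aa, accent_times):
        y += A * (Ga(t - T1) - Ga(t - T2))
    return y

def fit_amplitudes(t, f0_obs, Fb, phrase_times, accent_times):
    """Least-squares fit of phrase/accent amplitudes to an observed F0 track."""
    x0 = np.full(len(phrase_times) + len(accent_times), 0.3)   # initial guess
    res = least_squares(
        lambda a: model_logf0(a, t, Fb, phrase_times, accent_times) - np.log(f0_obs),
        x0)
    return res.x

# synthetic check: recover amplitudes from a contour generated by the model itself
t = np.linspace(0.0, 2.0, 400)
obs = np.exp(model_logf0(np.array([0.5, 0.4]), t, 90.0, [0.0], [(0.3, 0.8)]))
print(fit_amplitudes(t, obs, 90.0, [0.0], [(0.3, 0.8)]))   # ~ [0.5, 0.4]
```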
Now, based on that prosodic model, Fujisaki has defined certain prosodic units.
These definitions are basically taken from Fujisaki. He says the written word and the spoken word are different, because, as I said already, in spontaneous speech, in the spoken language, there is no explicit word boundary. If there is a sentence, say "I will go to Calcutta tomorrow", then in the written form there is "I", then a gap, then "will", then a gap; in the written language these gaps identify the word boundaries.
But in the spoken language, if I say "I will", there is no gap between "I" and "will"; there is a co-articulation effect between them. So, I cannot say that this boundary exists in the spoken language. Yet certain word-like boundaries do exist in the spoken language, and they are identifiable by perception. If you listen to the original speech of this Bengali sound, [FL], some boundaries are clearly indicated, but they are not the lexical word boundaries. So, Fujisaki defined the prosodic units. The first one is the prosodic word, defined as a part or the whole of an utterance that forms one accent type.
If you see here, [FL] forms one accent type; up to the end of this [FL] is one word, which we call prosodic word 1, PW1. Then the second accent type, the third accent type: [FL] is PW2, [FL] is one accent type, then [FL] is one accent type, and [FL] is another accent type. So, he defines the prosodic word as a part or the whole of an utterance that forms one accent type.
In Bengali we found that every accent, every prosodic word, begins with an accent: we said that Bengali is a bound stress language, which means that at the beginning of every prosodic word the F0 contour rises; if you look at the F0 contour in the black line, at the beginning of every prosodic word it is rising. Note that a prosodic word does not necessarily consist of a single written word; several written words may together form one prosodic word. So, the prosodic word is defined in the spoken language, not according to the written language.
So, to summarise, one accent type defines a prosodic word. Then he defines the prosodic phrase as the interval between two successive prosodic phrase commands.
If you see this blue line, up to here is a single phrase; this is phrase 1, the first prosodic phrase; then the second part of the blue line is prosodic phrase 2, and the last one is prosodic phrase 3. Each phrase command marks the beginning of a phrase, so a phrase command is the end of one phrase and the beginning of the next. If you look at the actual boundary, it may be leading: remember that this is a muscle-controlled mechanism, so the command is executed and the effect comes after a certain time interval. That is why, if the actual F0 rise is here, the command may have been issued a little earlier; it takes some time for the effect to appear, so there may be some lead time. So, that is the prosodic phrase; then he defines the prosodic clause and the prosodic utterance. Now, with these definitions, let me draw the prosodic structure of an utterance.
Suppose this is a prosodic utterance, [FL]. I can draw the syntactic tree of this sentence, which essentially depends on how far apart the related words are. Then you see the prosodic words: PW is defined like this, [FL] is a prosodic word. There is a small mistake on the slide: this PW, [FL], should be joined; these should be connected, so this is a single prosodic word [FL] and the gap shown there should not be there. Then [FL] is the second prosodic word, [FL] the third, [FL] the fourth, and [FL] the fifth; again there is a mistake here, these together form one prosodic word. Then come the prosodic phrases; a prosodic phrase may coincide with a prosodic word, but it is a separate level, defined again by the phrase commands, and then the prosodic clause. So, this is the prosodic structure.
It may differ from, and need not coincide one-to-one with, the syntactic structure. Based on this idea we developed the HTS engine; I am not discussing HTS itself in this slide, we have already covered it.
So, now, about developing our Bengali HTS-based TTS: what we did first was build the Bengali HTS system. In this figure the blue line indicates the original F0, and the red line indicates the HMM-synthesized F0, that is, without any F0 modification: I trained the HTS system and synthesized the sentence, and the red line is the F0 contour of that synthesized sentence. The solid vertical lines are the original word boundaries and the dotted lines are the word boundaries as placed by the HMM.
Those are the word boundaries; I am not showing the phoneme alignment, every phoneme alignment could be shown, but I am not showing it here. If you listen to the original speech [FL], and then to the sentence generated by the HTS just after training [FL], then we thought: why not train this same HTS system with the prosodic structure, instead of labelling at the syntactic, orthographic word boundaries? So, the spoken corpus is labelled with the prosodic structure instead of the syntactic structure.
So, all the boundaries are marked based on the prosodic word, the prosodic phrase and the prosodic clause, and then we saw that the word error rate is reduced.
The results are in the paper; there is a journal paper on HMM-based Bengali speech synthesis, you can search for that paper, or you can look for our paper on the Bengali HTS-based TTS when it comes out and read it.
Our proposal is like this: we take the input text, which is labelled, and the speech, which is labelled based on the prosodic structure instead of the syntactic structure, and we train the HTS part on that. Then, since the HTS engine is essentially a vocoder driven by two streams, one carrying the segmental information and the other the supra-segmental information, which is nothing but F0, what we said is: extract the F0 contour from the HMM TTS and modify it as per the requirements of the language. The F0 contour generated by the trained HTS system is simply the best-matching contour according to the HMM modelling.
But that may not coincide with the required, targeted F0 contour, which is determined by the structure, the syntactic structure, of the sentence. So, what we did is develop an F0 contour model based on the Fujisaki model, and we also tried a DBN, a deep-belief-network / DNN method. Based on those we modify the F0: the F0 comes from the HTS engine, from the input text we analyse the prosodic words and prosodic phrases, and then we modify the F0 contour accordingly and supply it to the vocoder.
So, the F0 is modified, while the duration information and the segmental information come from the HTS engine as they are; we are not doing anything to the duration or the spectrum. The duration is whatever we get from the HMM modelling trained on the prosodic structure, but the F0 is modified based on the two models. We have done this work, and I can show you the prosodic structure, although I am not going into a detailed discussion.
So, there are two kinds of modelling we have done: one is the Fujisaki F0 generation-process model and the other is a deep-belief-network-based model. The issue with the Fujisaki model is this: it is a successful model, and it is a generation-process model, so it is very good; but it requires a lot of labelled data to find the rules for the generation process. If this kind of sentence comes, with this structure, then from the rules I have to decide what the heights of the accent commands and phrase commands should be, and all those things, and that requires labelled data for training. A DBN, on the other hand, can learn in an unsupervised fashion, so not that much labelled data is required. We tried that as well, and I will show you the results at the end.
First, the Fujisaki generation-process F0 model. As I have already explained, the spectral information and the duration information come from the trained HTS engine; we extract the F0 contour, modify it according to the accent commands and phrase commands given by the Fujisaki model, and then we synthesize the speech.
So, in the Fujisaki generation-process model, rules for the accent component and the phrase component have to be generated. You know the Fujisaki equation: this is the phrase component and this is the accent component. We need the phrase command magnitude and the phrase lead time; if the original phrase boundary is here, the command is issued a little earlier, so that lead time is required. Then the accent command magnitude, and the accent command lead and lag times, that is, where the accent command begins and where it ends. If you look here, there is a phrase command with its timing of occurrence, and an accent command with its beginning and its end. We generated some rules for these, and a paper has been published on that; using those rules in the Fujisaki model we generate the F0 contour.
Then we synthesize the speech; I will show you the results later on.
For the DBN approach, instead of the Fujisaki generation-process model, we tried to model the F0 contour using deep-neural-network training. I am not going into the details of the DBN, because deep neural networks are a topic in themselves; I am not covering the training and all those things here, you can go through the paper, and it is not really required for this course. So, we did that, and here is the result.
This is the original sentence [FL]. This is the baseline, where only the HTS is trained, based on the prosodic structure rather than the syntactic structure [FL]. Now I modify the F0 based on the sentence structure using the Fujisaki model [FL], and now based on the DBN [FL]. In both cases it is shown that the Fujisaki model and the DBN give almost the same result; but the Fujisaki model is a generalized model, so even when the sentence type is completely unknown to the DBN training, the success rate stays high for the Fujisaki model because the rules exist. In that sense it is better, but both models are comparable, and this is already a published paper, so you can try it on your own language using the Fujisaki model.
So, to summarise, we have done two things here: one is training the HTS engine with the prosodic structure, and the other is modifying the F0 contour based on the Fujisaki model or the DNN model. If you remember, I explained another application earlier, accent conversion. My proposal is: can we not modify the F0 contour and the segmental information towards the accent of a target language, using some deep neural network or some rule-based method? We have not done that yet, we have only started that work. It should be possible, because after all HTS synthesis is essentially a vocoder, so using the same principle we can do the accent conversion; we are pursuing research in this area.
So, this is the end of the prosodic modelling using the Fujisaki model. If you have a specific query, or if you are interested in pursuing research on your own language, then contact me; whatever help is possible from my side, I will give you. I think for most of the Indian languages there are TTS systems available, but there is a problem in prosody modelling, because prosody modelling requires not only the generation of the F0 model but also language analysis, which is a very important component. If you look at our research, it also includes parts-of-speech tagging and those kinds of things, because, in the end, the only input to a TTS system is the text.
So, I do not know the F0 contour in advance; from the text I have to generate the F0 contour. Yes, I can train on some sentences, but the appropriate prosody depends on the syntactic structure of the sentence, so that kind of analysis is required if I am to parameterize the synthesized, or you can say the target, F0 contour well, based on some correlation between the linguistic parameters and the acoustic parameters, which has to be trained. That requires the extraction of the linguistic parameters, which is very important: I have to extract those linguistic parameters and find out which of them are correlated with the F0 contour. It is well established that the syntactic structure of the sentence is important. And regarding pause modelling: although in the HTS engine pause modelling is not a separate module, it is inside the training, we can definitely train the system better if we know when a pause will occur and how the pause is related to the syntactic structure.
So, that information can be incorporated in the HTS training, and that can produce better results. This is the last lecture on prosody modelling, and it will also be my last lecture of this course. Reviewing all the lectures I have given: up to the 6th or 7th week, the material is related to signal processing. Only one thing I have missed, which is GMM, Gaussian mixture modelling; I have not covered it, but I may upload one lecture on the basics of GMM. Vector quantization I have touched upon, but not in detail. As I said at the beginning, this course does not contain the AI or soft-computing part; I have concentrated purely on the signal-processing part, and in the last two weeks I have said something about modern applications in speech research which you can take up to pursue speech research and develop technology. That is why I have treated prosody modelling as an important topic today.
So, you know that there is an international conference devoted only to prosody, Speech Prosody. Within speech prosody there is prosody modelling; accent conversion and emotion are nothing but parts of prosody, so emotion recognition and emotion synthesis are essentially prosody modelling. If you know prosody modelling, then working in the emotion area will be very easy; that is why I have covered the prosody modelling part. Yes, I may include one or two lectures on GMM, so that you get an overview of what a Gaussian mixture model is and why we use it, and on HMM; I have said that HMM is essentially an AI algorithm and you can find a lot of lectures on HMM modelling. If you have a particular query, you can directly contact me, and I will explain whatever I know.