NLP Unit IV
1. PHONETICS
Phonetics is a branch of linguistics that studies how humans produce and perceive sounds.
Phoneticians—linguists who specialize in phonetics—study the physical properties of speech.
The field of phonetics is traditionally divided into three sub-disciplines, based on the research questions involved:
    1. how humans plan and execute movements to produce speech (articulatory phonetics)
    2. how various movements affect the properties of the resulting sound (acoustic phonetics)
    3. how humans convert sound waves to linguistic information (auditory phonetics).
Traditionally, the minimal linguistic unit of phonetics is the phone—a speech sound in a language.
A phoneme is the smallest unit of sound in speech; for example, the word ‘hat’ has three phonemes: ‘h’, ‘a’, and ‘t’.
English has approximately 44 phonemes. Phonemes are classified into vowels, glides, semivowels, and
consonants.
A grapheme is a letter or a group of letters that represents a sound in our speech. So, a grapheme is the letter or
letters that represent a phoneme. English has a complex written code, and in our code a grapheme can be 1, 2, 3 or 4
letters. For example, the phoneme /f/ can be written with the graphemes f (fan), ff (cliff), ph (phone), or gh (laugh).
    2. ARTICULATORY PHONETICS
Articulatory phonetics is the study of how phones are produced as the various organs in the
mouth, throat, and nose modify the airflow from the lungs.
Articulation is a process resulting in the production of speech sounds. It consists of a series of movements by a set of
organs of speech called the articulators. The articulators that move during the process of articulation are called active
articulators. Organs of speech which remain relatively motionless are called passive articulators. The point at which
an active articulator moves towards, or comes into contact with, another organ is the place of articulation. The
type or nature of movement made by the articulator is called the manner of articulation.
2.1 Vocal Organs – Sound is produced by the rapid movement of air. Humans produce most sounds in spoken languages
by expelling air from the lungs through the windpipe (technically, the trachea) and then out the mouth or nose. As it
passes through the trachea, the air passes through the larynx, commonly known as the Adam’s apple or voice box.
The larynx contains two small folds of muscle, the vocal folds or vocal cords, which can be moved together or apart.
The space between these two folds is called the glottis. If the folds are close together (but not tightly closed), they will
vibrate as air passes through them; if they are far apart, they won’t vibrate.
Sounds made with the vocal folds together and vibrating are called voiced; sounds made without this cord vibration are
called unvoiced or voiceless. Voiced sounds include [b], [d], [g], [v], [z], and all the English vowels, among others.
Unvoiced sounds include [p], [t], [k], [f], [s], and others.
The area above the trachea is called the vocal tract; it consists of the oral tract and the nasal tract. After the air leaves
the trachea, it can exit the body through the mouth or the nose. Most sounds are made by air passing through the mouth.
Sounds made by air passing through the nose are called nasal sounds; nasal sounds (like English [m], [n], and [ng]) use
both the oral and nasal tracts as resonating cavities.
Phones are divided into two main classes: consonants and vowels. Both kinds of sounds are formed by the motion of
air through the mouth, throat or nose.
Consonants are made by restriction or blocking of the airflow in some way, and can be voiced or unvoiced.
Vowels have less obstruction, are usually voiced, and are generally louder and longer-lasting than consonants. The
technical use of these terms is much like the common usage; [p], [b], [t], [d], [k], [g], [f], [v], [s], [z], [r], [l], etc., are
consonants; [aa], [ae], [ao], [ih], [aw], [ow], [uw], etc., are vowels.
Semivowels (such as [y] and [w]) have some of the properties of both; they are voiced like vowels, but they are short
and less syllabic, like consonants.
Glides can be defined as vowel-like sounds that differ from vowels in lacking the continuant characteristic: glides
are produced during a fast, dynamic change of the articulators. The first sounds of the words “you” and
“what” provide examples of glides.
Consonants can be classified on the basis of the manner of articulation as stops, nasals, fricatives, trills, flaps, laterals,
affricates, continuants, etc.
Labial: Consonants whose main restriction is formed by the two lips coming together have a bilabial place of
articulation. In English these include [p] as in possum, [b] as in bear, and [m] as in marmot. The English labiodental
consonants [v] and [f] are made by pressing the bottom lip against the upper row of teeth and letting the air flow through
the space in the upper teeth.
Dental: Sounds that are made by placing the tongue against the teeth are dentals. The main dentals in English are the
[th] of thing and the [dh] of though, which are made by placing the tongue behind the teeth with the tip slightly between
the teeth.
Alveolar: The alveolar ridge is the portion of the roof of the mouth just behind the upper teeth. Most speakers of
American English make the phones [s], [z], [t], and [d] by placing the tip of the tongue against the alveolar ridge. The
word coronal is often used to refer to both dental and alveolar.
Palatal: The roof of the mouth (the palate) rises sharply from the back of the alveolar ridge. The palato-alveolar
sounds [sh] (shrimp), [ch] (china), [zh] (Asian), and [jh] (jar) are made with the blade of the tongue against the rising
back of the alveolar ridge. The palatal sound [y] of yak is made by placing the front of the tongue up close to the palate.
Velar: The velum, or soft palate, is a movable muscular flap at the very back of the roof of the mouth. The sounds [k]
(cuckoo), [g] (goose), and [ng] (kingfisher) are made by pressing the back of the tongue up against the velum.
Glottal: The glottal stop [q] is made by closing the glottis (by bringing the vocal folds together).
Consonants are also distinguished by how the restriction in airflow is made, for example, by a complete stoppage of air
or by a partial blockage. This feature is called the manner of articulation of a consonant.
The combination of place and manner of articulation is usually sufficient to uniquely identify a consonant. Following
are the major manners of articulation for English consonants:
Stop: A stop is a consonant in which airflow is completely blocked for a short time. This blockage is followed by an
explosive sound as the air is released. The period of blockage is called the closure, and the explosion is called the release.
English has voiced stops like [b], [d], and [g] as well as unvoiced stops like [p], [t], and [k]. Stops are also called plosives.
Nasal: The nasal sounds [n], [m], and [ng] are made by lowering the velum and allowing air to pass into the nasal cavity.
Fricatives : In fricatives, airflow is constricted but not cut off completely. The turbulent airflow that results from the
constriction produces a characteristic “hissing” sound. The English labiodental fricatives [f] and [v] are produced by
pressing the lower lip against the upper teeth, allowing a restricted airflow between the upper teeth. The dental fricatives
[th] and [dh] allow air to flow around the tongue between the teeth. The alveolar fricatives [s] and [z] are produced with
the tongue against the alveolar ridge, forcing air over the edge of the teeth.
Sibilants: The higher-pitched fricatives (in English [s], [z], [sh] and [zh]) are called sibilants.
Affricates: Stops that are followed immediately by fricatives are called affricates; these include English [ch] (chicken)
and [jh] (giraffe).
Approximants: In approximants, the two articulators are close together but not close enough to cause turbulent airflow.
In English [y] (yellow), the tongue moves close to the roof of the mouth but not close enough to cause the turbulence
that would characterize a fricative. In English [w] (wood), the back of the tongue comes close to the velum.
Tap: A tap or flap [dx] is a quick motion of the tongue against the alveolar ridge. The consonant in the middle of the
word lotus ([l ow dx ax s]) is a tap in most dialects of American English.
    2.4 Vowels
Like consonants, vowels can be characterized by the position of the articulators as they are made. The three most relevant
parameters for vowels are
        a. vowel height, which correlates roughly with the height of the highest part of the tongue;
        b. vowel frontness or backness, indicating whether this high point is toward the front or back of the oral tract;
        c. lip rounding, indicating whether the shape of the lips is rounded or not.
Front Vowels: Vowels in which the tongue is raised toward the front are called front vowels. Both [ih] and [eh] are
front vowels; the tongue is higher for [ih] than for [eh].
Back Vowels: Vowels in which the tongue is raised toward the back are called back vowels. Eg. [ow], [aw].
High vowels: Vowels in which the highest point of the tongue is comparatively high are called high vowels.
Eg. [uw], [iy]
Low vowels: Vowels with mid or low values of maximum tongue height are called mid vowels or low vowels. Eg. [aa],
[ae].
Diphthong: A vowel in which the tongue position changes markedly during the production of the vowel is a diphthong.
A diphthong is a sound made by combining two vowels (di means double; the Greek word diphthongos means
“having two sounds”); specifically, it starts as one vowel sound and glides to another, like the oy sound in oil, or the
vowel sounds in chair, fear, and pout.
    2.5 Syllables
Consonants and vowels combine to make a syllable. A syllable is a vowel-like sound together with some of the
surrounding consonants that are most closely associated with it. The word dog has one syllable, [d aa g]; the word
catnip has two syllables, [k ae t] and [n ih p].
The vowel at the core of a syllable is called the nucleus. Initial consonants, if any, are called the onset. Onsets with
more than one consonant (as in strike [s t r ay k]) are called complex onsets. The coda is the optional consonant or
sequence of consonants following the nucleus. Thus [d] is the onset of dog, and [g] is the coda. The rime, or
rhyme, is the nucleus plus coda.
The figure below shows some sample syllable structures.
                Fig: Syllable structure of ham, green, eggs. σ = syllable.
Syllabification: The task of automatically breaking up a word into syllables is called syllabification. Syllable structure
is also closely related to the phonotactics of a language.
Phonotactics: The term phonotactics means the constraints on which phones can follow each other in a language. For
example, English has strong constraints on what kinds of consonants can appear together in an onset; the sequence
[zdr], for example, cannot be a legal English syllable onset. Phonotactics can be represented by a language model or
finite-state model of phone sequences.
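To make this concrete, here is a minimal Python sketch that checks candidate onsets against a small, purely illustrative set of legal English onset clusters (the phone symbols and the onset inventory are assumptions for the example, not a complete description of English phonotactics):

    # Toy phonotactic check: is a given consonant sequence a legal syllable onset?
    LEGAL_ONSETS = {
        ("s", "t", "r"),                  # as in "strike"
        ("p", "r"), ("t", "r"),           # as in "price", "tree"
        ("s",), ("t",), ("k",), ("b",),   # single-consonant onsets
    }

    def is_legal_onset(phones):
        """Return True if the phone sequence is in our toy onset inventory."""
        return tuple(phones) in LEGAL_ONSETS

    print(is_legal_onset(["s", "t", "r"]))   # True
    print(is_legal_onset(["z", "d", "r"]))   # False: [zdr] is not a legal English onset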
    3. ACOUSTIC PHONETICS
    3.1 Waves - Acoustic analysis is based on the sine and cosine functions. The sine wave can be described using the
        function y = a*sin(bx), where a controls the amplitude of the wave and b controls its periodicity (and hence its frequency).
The amplitude dictates the magnitude of the swings, whereas the periodicity dictates how often the swings
happen.
The frequency is the number of times per second that the wave repeats itself, that is, the number of cycles per second.
Frequency is usually measured in cycles per second, called hertz (Hz). The amplitude A of a periodic sine wave
is the maximum value on the y-axis. The period T of the wave is the time it takes for one cycle to complete,
defined as
                                                  T = 1/f
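As a small numerical sketch of these definitions (with b chosen as 2πf so that f is the frequency in Hz; the amplitude, frequency, and sampling rate below are illustrative values), the following Python snippet generates a sine wave and reports its period:

    import numpy as np

    A = 1.0       # amplitude: the maximum value on the y-axis
    f = 100.0     # frequency in Hz (cycles per second)
    sr = 8000     # how many sample points per second we evaluate

    t = np.arange(0, 0.02, 1.0 / sr)       # 20 ms of time points
    y = A * np.sin(2 * np.pi * f * t)      # the sine wave itself

    T = 1.0 / f                            # period: time for one full cycle
    print(f"period T = {T * 1000:.1f} ms; {f:.0f} cycles fit into one second")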
    3.2 Speech Sound Waves
The input to a speech recognizer, like the input to the human ear, is a complex series of changes in air
pressure. These changes in air pressure obviously originate with the speaker. Sound waves are represented
by plotting the change in air pressure over time; the graph measures the amount of compression or
rarefaction (uncompression) of the air molecules at a fixed point in space (for example, at a microphone).
The first step in digitizing a sound wave is to convert the analog representation into
a digital signal. This analog-to-digital conversion has two steps: sampling and quantization.
To sample a signal, we measure its amplitude at a particular time; the sampling rate is the number of samples
taken per second. To accurately measure a wave, we must have at least two samples in each cycle: one
measuring the positive part of the wave and one measuring the negative part. More than two samples per cycle
increases the amplitude accuracy, but fewer than two samples causes the frequency of the wave to be
completely missed. Thus, the maximum frequency wave that can be measured is one whose frequency is half
the sample rate (since every cycle needs two samples). This maximum frequency for a given sampling rate is
called the Nyquist frequency. Most information in human speech is in frequencies below
10,000 Hz; thus, a 20,000 Hz sampling rate would be necessary for complete accuracy.
But telephone speech is filtered by the switching network, and only frequencies less than 4,000 Hz are
transmitted by telephones. Thus, an 8,000 Hz sampling rate is sufficient for telephone-bandwidth speech like
the Switchboard corpus, while 16,000 Hz sampling is often used for microphone speech.
Even an 8,000 Hz sampling rate requires 8000 amplitude measurements for each second of speech, so it is
important to store amplitude measurements efficiently. They are usually stored as integers, either 8 bit (values
from -128 to 127) or 16 bit (values from -32768 to 32767). This process of representing real-valued numbers as
integers is called quantization because the difference between two integers acts as a minimum granularity (a
quantum size) and all values that are closer together than this quantum size are represented identically.
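The sketch below illustrates sampling and 16-bit quantization on a synthetic signal; the sampling rate, test frequency, and scaling are assumptions chosen only for illustration:

    import numpy as np

    sr = 8000                     # sampling rate: 8,000 samples per second
    nyquist = sr / 2              # highest representable frequency: 4,000 Hz

    t = np.arange(0, 1.0, 1.0 / sr)              # one second of sample times
    x = 0.5 * np.sin(2 * np.pi * 440 * t)        # "analog" signal, values in [-1, 1]

    # Quantization: map real values in [-1, 1] onto 16-bit integers (-32768..32767).
    x16 = np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

    print("Nyquist frequency:", nyquist, "Hz")
    print("first quantized samples:", x16[:5])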
Once data is quantized, it is stored in various formats. One parameter of these formats is the sample rate and
sample size as discussed earlier. Another parameter is the number of channels. For stereo data or for two-
party conversations, we can store both channels in the same file or we can store them in separate files. A final
parameter is how individual samples are stored: linearly or compressed. One common compression format used for
telephone speech is µ-law. The equation for compressing a linear PCM (Pulse Code Modulation) sample value
x (scaled so that -1 ≤ x ≤ 1) to 8-bit µ-law, where µ = 255 for 8 bits, is
                                  F(x) = sgn(x) * ln(1 + µ|x|) / ln(1 + µ)
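A short sketch of this companding step, implementing the standard µ-law formula above in Python (the input sample values are made up for illustration):

    import numpy as np

    def mulaw_compress(x, mu=255):
        """mu-law companding of samples x in [-1, 1]; mu = 255 for 8-bit telephone speech."""
        x = np.clip(x, -1.0, 1.0)
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    samples = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
    print(mulaw_compress(samples))   # small values get relatively more resolution than large ones

The point of µ-law is visible in the output: quiet samples are spread over proportionally more of the 8-bit range than loud ones.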
There are a number of standard file formats for storing the resulting digitized wavefile, such as Microsoft’s
.wav and Apple’s AIFF, all of which have special headers; simple headerless “raw” files are also used. For
example, the .wav format is a subset of Microsoft’s RIFF format for multimedia files; RIFF is a general
format that can represent a series of nested chunks of data and control information. The figure below shows the
Microsoft wavefile header format, assuming a simple file with one chunk. Following this 44-byte header would
be the data chunk.
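As a sketch, the header fields of such a file can be read with Python's standard-library wave module ("speech.wav" is a placeholder file name):

    import wave

    # "speech.wav" is a placeholder path used only for illustration.
    with wave.open("speech.wav", "rb") as w:
        print("channels:     ", w.getnchannels())           # 1 = mono, 2 = stereo
        print("sample width: ", w.getsampwidth(), "bytes")  # 2 bytes = 16-bit samples
        print("sample rate:  ", w.getframerate(), "Hz")
        print("num samples:  ", w.getnframes())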
    3.3 Pitch and Loudness
Voiced sounds are caused by regular openings and closings of the vocal folds. When the vocal folds are open, air
pushes up from the lungs, creating a region of high pressure. When the folds are closed, there is no pressure
from the lungs. Thus, when the vocal folds are vibrating, there are regular peaks in amplitude, with each
major peak corresponding to an opening of the vocal folds. The frequency of the vocal fold vibration, or the
frequency of the complex wave, is called the fundamental frequency of the waveform, often abbreviated
F0. We can plot F0 over time in a pitch track. The figure below shows the pitch track of a short question, “Three
o’clock?”, represented below the waveform. Note the rise in F0 at the end of the question.
                    Figure: Pitch track of the question “Three o’clock?”, shown below the wavefile (0 to about 0.54 s).
                    Note the rise in F0 at the end of the question, and the lack of pitch trace during the very quiet part
                    (the “o’” of “o’clock”); automatic pitch tracking is based on counting the pulses in the voiced regions,
                    and doesn’t work if there is no voicing (or insufficient sound).
We also often need to know the average amplitude over some time range. For this we generally
use the RMS (root-mean-square) amplitude, which squares each sample before averaging (making the values
positive) and then takes the square root of the mean.
The power of the signal is related to the square of the amplitude. If the number of samples of a sound is N, the
power is
                                            Power = (1/N) Σ (i=1 to N) x_i^2
The intensity of sound normalizes the power to the human auditory threshold and is measured in dB. If P0 is
the auditory threshold pressure = 2×10−5 Pa, then intensity is defined as follows:
                                      Intensity = 10 log10( (1/(N*P0)) Σ (i=1 to N) x_i^2 )
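A small sketch of these three quantities computed over an array of samples (the synthetic signal below is an assumption for illustration; P0 is the auditory threshold defined above):

    import numpy as np

    sr = 16000
    x = 0.02 * np.sin(2 * np.pi * 220 * np.arange(0, 0.1, 1 / sr))   # synthetic samples
    N = len(x)
    P0 = 2e-5                                  # auditory threshold pressure in Pa

    power = np.sum(x ** 2) / N                 # Power = (1/N) * sum of squared samples
    rms = np.sqrt(power)                       # RMS amplitude
    intensity = 10 * np.log10(np.sum(x ** 2) / (N * P0))   # intensity in dB, per the formula above

    print(f"power = {power:.6f}, rms = {rms:.4f}, intensity = {intensity:.1f} dB")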
Two important perceptual properties, pitch and loudness, are related to frequency and intensity. The pitch of
a sound is the mental sensation, or perceptual correlate, of fundamental frequency; in general, if a sound has
a higher fundamental frequency we perceive it as having a higher pitch.
Roughly speaking, human pitch perception is most accurate between 100 Hz and 1000 Hz and in this range
pitch correlates linearly with frequency. Human hearing represents frequencies above 1000 Hz less accurately,
and above this range, pitch correlates logarithmically with frequency. Logarithmic representation means
that the differences between high frequencies are compressed and hence not as accurately perceived. There
are various psychoacoustic models of pitch perception scales. One common model is the mel scale (Stevens
et al. 1937, Stevens and Volkmann 1940). A mel is a unit of pitch defined such that pairs of sounds which are
perceptually equidistant in pitch are separated by an equal number of mels. The mel frequency m can be
computed from the raw acoustic frequency as follows:
                                            m = 1127 ln(1 + (f / 700))
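A direct implementation of this mapping, together with its inverse (the inverse is simply the same formula solved for f):

    import numpy as np

    def hz_to_mel(f):
        """m = 1127 * ln(1 + f / 700), with f in Hz."""
        return 1127.0 * np.log(1.0 + f / 700.0)

    def mel_to_hz(m):
        """Inverse of the mapping above."""
        return 700.0 * (np.exp(m / 1127.0) - 1.0)

    print(hz_to_mel(1000.0))              # roughly 1000 mels near 1 kHz
    print(mel_to_hz(hz_to_mel(440.0)))    # round-trips back to 440.0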
The loudness of a sound is the perceptual correlate of the power. So sounds with higher amplitudes are
perceived as louder, but again the relationship is not linear.
First of all, humans have greater resolution in the low-power range; the ear is more sensitive to small power
differences. Second, it turns out that there is a complex relationship between power, frequency, and perceived
loudness; sounds in certain frequency ranges are perceived as being louder than those in other frequency
ranges.
Various algorithms and toolkits exist for automatically extracting F0.
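One classic family of pitch extractors is based on autocorrelation: a voiced frame correlates strongly with itself shifted by one pitch period. The sketch below is a bare-bones version of that idea; the frame length, search range, and test signal are assumptions, and real pitch trackers add voicing decisions and smoothing across frames.

    import numpy as np

    def estimate_f0(frame, sr, fmin=75.0, fmax=500.0):
        """Rough F0 estimate for one voiced frame via autocorrelation."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags >= 0
        lo, hi = int(sr / fmax), int(sr / fmin)     # lag range for plausible pitch periods
        lag = lo + np.argmax(ac[lo:hi])             # lag of the strongest self-similarity
        return sr / lag

    sr = 16000
    t = np.arange(0, 0.04, 1 / sr)                  # one 40 ms frame
    frame = np.sin(2 * np.pi * 120 * t)             # synthetic "voiced" frame at 120 Hz
    print(f"estimated F0: {estimate_f0(frame, sr):.1f} Hz")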
                        Figure: Waveform of the utterance “she just had a baby”, labeled with the phones
                        [sh iy j ax s h ae d ax b ey b iy] (time axis 0 to about 1.06 s).
For a stop consonant, which consists of a closure followed by a release, we can often see a period of silence
or near silence followed by a slight burst of amplitude. We can see this for both of the [b]’s in baby in the figure above.
Another phone that is often quite recognizable in a waveform is a fricative. Recall that fricatives, especially
very strident fricatives like [sh], are made when a narrow channel for airflow causes noisy, turbulent air. The
resulting hissy sounds have a noisy, irregular waveform. This can be seen somewhat in the figure above; it is even
clearer in the figure below, where we have magnified just the first word she.
 Fig: A more detailed view of the first word “she” (about 0.26 s) extracted from the wavefile in the figure above.
 Notice the difference between the random noise of the fricative [sh] and the regular voicing of the vowel [iy].
We can represent the component frequencies of a wave with a spectrum. The spectrum of a signal is a
representation of each of its frequency components and their amplitudes. The figure below shows the spectrum of
a simple wave built from two components: frequency in Hz is on the x-axis and amplitude on the y-axis. Note the two
spikes in the figure, one at 10 Hz and one at 100 Hz. Thus, the spectrum is an alternative representation of the original
waveform, and we use the spectrum as a tool to study the component frequencies of a sound wave at a particular time
point.
                [Figure: spectrum of the wave, with frequency (Hz, roughly 1 to 200 on a log scale) on the x-axis
                and amplitude on the y-axis; spikes appear at 10 Hz and 100 Hz.]
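A spectrum like this can be computed with the discrete Fourier transform. The sketch below builds a synthetic wave from a 10 Hz and a 100 Hz component (mirroring the example above) and recovers the two frequencies with numpy's FFT:

    import numpy as np

    sr = 1000                                   # sampling rate for the synthetic example
    t = np.arange(0, 1.0, 1 / sr)
    wave = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 100 * t)

    spectrum = np.abs(np.fft.rfft(wave))        # magnitude of each frequency component
    freqs = np.fft.rfftfreq(len(wave), d=1 / sr)

    peaks = freqs[np.argsort(spectrum)[-2:]]    # frequencies of the two strongest components
    print(sorted(peaks))                        # -> [10.0, 100.0]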
Fig: The waveform of part of the vowel [ae] from the word had, cut out of the wavefile above (about 0.0427 s).
Note that there is a complex wave that repeats about ten times in the figure; but there is also a smaller repeated
wave that repeats four times for every larger pattern (notice the four small peaks inside each repeated wave).
The complex wave has a frequency of about 234 Hz (we can figure this out since it repeats roughly 10 times
in .0427 seconds, and 10 cycles/.0427 seconds = 234 Hz).
The smaller wave then should have a frequency of roughly four times the frequency of the larger wave, or
roughly 936 Hz. Then, if you look carefully, you can see two little waves on the peak of many of the 936 Hz
waves. The frequency of this tiniest wave must be roughly twice that of the 936 Hz wave, hence 1872 Hz.
While a spectrum shows the frequency components of a wave at one point in time, a spectrogram is a way of
envisioning how the different frequencies that make up a waveform change over time. The x-axis shows time,
as it did for the waveform, but the y-axis now shows frequencies in hertz. The darkness of a point on a
spectrogram corresponds to the amplitude of the frequency component. Very dark points have high amplitude,
light points have low amplitude. Thus, the spectrogram is a useful way of visualizing the three dimensions
(time x frequency x amplitude).
The figure below shows spectrograms of three English vowels, [ih], [ae], and [ah]. Note that each vowel has a set
of dark bars at various frequency bands, slightly different bands for each vowel. Each of these represents the
same kind of spectral peak that we saw in the spectrum above.
                         [Figure: spectrograms of the vowels [ih], [ae], and [ah]; y-axis: Frequency (Hz)]
Each dark bar (or spectral peak) is called a formant. As we discuss below, a formant is a frequency band that
is particularly amplified by the vocal tract. Since different vowels are produced with the vocal tract in different
positions, they will produce different kinds of amplifications, or resonances.
It turns out, however, that the vocal tract acts as a kind of filter or amplifier; indeed any cavity, such as a tube,
causes waves of certain frequencies to be amplified and others to be damped. This amplification process is
caused by the shape of the cavity; a given shape will cause sounds of a certain frequency to resonate and hence
be amplified. Thus, by changing the shape of the cavity, we can cause different frequencies to be amplified.
When we produce particular vowels, we are essentially changing the shape of the vocal tract cavity by placing
the tongue and the other articulators in particular positions. The result is that different vowels cause different
harmonics to be amplified. So a wave of the same fundamental frequency passed through different vocal tract
positions will result in different harmonics being amplified.
We can see the result of this amplification by looking at the relationship between the shape of the vocal tract
and the corresponding spectrum. Below Figure shows the vocal tract position for three vowels and a typical
resulting spectrum. The formants are places in the spectrum where the vocal tract happens to amplify particular
harmonic frequencies.
 Figure: Visualizing the vocal tract position as a filter: the tongue positions for three English vowels (including
 [iy] as in tea and [ae] as in cat) and the resulting smoothed spectra showing F1 and F2.
    4. SHORT-TIME FOURIER TRANSFORM (STFT)
Speech is not a stationary signal, i.e., it has properties that change with time. Thus a single representation
based on all the samples of a speech utterance, for the most part, has no meaning. Instead, we define a
time-dependent Fourier transform (TDFT or STFT) of speech that changes periodically as the speech
properties change over time.
The Short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal
frequency and phase content of local sections of a signal as it changes over time. Generally, the procedure for
computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the
Fourier transform separately on each shorter segment. This reveals the Fourier spectrum of each shorter
segment.
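A minimal numpy sketch of this frame-and-transform procedure follows; the frame length, hop size, and window are illustrative choices (library routines, e.g. in scipy, provide the same computation with more options):

    import numpy as np

    def stft(x, frame_len=400, hop=160):
        """Short-time Fourier transform: window each frame, then take its FFT."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop: i * hop + frame_len] * window
                           for i in range(n_frames)])
        return np.fft.rfft(frames, axis=1)       # rows = time frames, columns = frequency bins

    sr = 16000
    t = np.arange(0, 1.0, 1 / sr)
    x = np.sin(2 * np.pi * 300 * t)              # synthetic test signal
    S = stft(x)
    print(S.shape)                               # (number of frames, frame_len // 2 + 1)

The magnitude |S|, plotted with time on the x-axis and frequency on the y-axis, is exactly the spectrogram described above.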
It is a powerful general-purpose tool for audio signal processing. It defines a particularly useful class of time-
frequency distributions which specify complex amplitude versus time and frequency for any signal. STFT
parameters are tuned for the following applications:
    1. Approximating the time-frequency analysis performed by the ear for purposes of spectral display.
    2. Measuring model parameters in a short-time spectrum.
In the first case, applications of audio spectral display go beyond merely looking at the spectrum. They also
provide a basis for audio signal processing tasks intended to imitate human perception, such as auditory scene
recognition or automatic transcription of music.
Examples of the second case include estimating the decay-time-versus-frequency for vibrating strings and
body resonances, or measuring as precisely as possible the fundamental frequency of a periodic signal based
on tracking its many harmonics in the STFT.
The STFT is a function of two variables: the time index n, which is discrete, and the frequency variable ω,
which is continuous.
   5. FILTER BANKS
In signal processing, a filter bank (or filterbank) is an array of bandpass filters that separates the input
signal into multiple components, each one carrying a single frequency sub-band of the original signal. One
application of a filter bank is a graphic equalizer, which can attenuate the components differently and
recombine them into a modified version of the original signal.
In other words, a class of systems that generates scaling and wavelet functions is known as a filter bank. The
figure below shows the simple structure of a filter bank: a set of bandpass filters, with each filter centered
at a different frequency.
A digital filter bank is a collection of filters having a common input or output. In digital signal processing, the
term filter bank is also commonly applied to a bank of receivers. The difference is that receivers also down-
convert the subbands to a low center frequency that can be re-sampled at a reduced rate. The same result can
sometimes be achieved by undersampling the bandpass subbands.
Filter banks play an important role in modern signal and image processing applications such as audio and
image coding.
How does a Filter Bank work?
A filter bank separates or splits the input signal into multiple components. The process of separating the
input signals into multiple components is known as analysis. The output of the analysis is referred to as a
sub-band of the original signal.
Now, the filter bank can attenuate the components of the signal differently and recombine them into a
modified version of the original signal. This reconstruction process is known as synthesis.
For example, if we have an input audio signal x(n), then a filter bank separates this input audio signal into a
set of analysis signals x1(n), x2(n), x3(n), etc., each of which corresponds to a
different region in the spectrum of the input signal x(n).
This set of analysis signals x1(n), x2(n), x3(n)… can be obtained by filter banks with bandwidths BW1,
BW2, BW3… and centre frequencies fc1, fc2, fc3… respectively.
The figure below shows the frequency response of a three-band filter bank in which the bands
do not overlap but are lined up one after the other, with adjacent band edges touching. These
three bands span the frequency range from fcl1 = 0 Hz to fch3 = fmax.
The analysis filter bank decomposes the input signal into sub-bands with different frequency
spectra, and the synthesis filter bank recombines the sub-band signals to generate a modified
version of the original signal.
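A sketch of a three-band analysis filter bank using Butterworth bandpass filters from scipy (the band edges, filter order, and test signal are arbitrary illustrative choices):

    import numpy as np
    from scipy.signal import butter, sosfilt

    def make_band(low, high, sr, order=4):
        """Bandpass filter for one sub-band, in second-order-section form."""
        return butter(order, [low, high], btype="bandpass", fs=sr, output="sos")

    sr = 16000
    bands = [(100, 1000), (1000, 4000), (4000, 7000)]     # illustrative band edges in Hz
    t = np.arange(0, 0.5, 1 / sr)
    x = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 5000 * t)   # input signal x(n)

    # Analysis: split x(n) into sub-band signals x1(n), x2(n), x3(n).
    subbands = [sosfilt(make_band(lo, hi, sr), x) for lo, hi in bands]

    # A trivial "synthesis": summing the sub-bands gives back an approximation of x(n).
    reconstructed = np.sum(subbands, axis=0)
    print(len(subbands), reconstructed.shape)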
Types of Filter Banks
   • DCT Filter Banks
   • Polyphase Filter Banks
   • Gabor Filter Banks
   • Mel Filter Banks
   • Filter Bank Multicarrier (FBMC)
   • DFT Filter Banks
   • Uniform DFT Filter Bank
    6. LINEAR PREDICTIVE CODING (LPC)
    •   LPC starts from the view that a speech signal is produced by a buzzer at the end of a tube (voiced
        sounds), with occasional added hissing and popping sounds. The glottis (the space between the vocal
        folds) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The
        vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances; these
        resonances give rise to formants, or enhanced frequency bands in the sound produced. Hisses and
        pops are generated by the action of the tongue, lips and throat during sibilants and plosives.
   •   LPC analyzes the speech signal by estimating the formants, removing their effects from the speech
       signal, and estimating the intensity and frequency of the remaining buzz. The process of removing
       the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered
       modeled signal is called the residue.
   •   The numbers which describe the intensity and frequency of the buzz, the formants, and the residue
       signal, can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing
       the process: use the buzz parameters and the residue to create a source signal, use the formants to
       create a filter (which represents the tube), and run the source through the filter, resulting in speech.
    •   Because speech signals vary with time, this process is done on short chunks of the speech signal,
        which are called frames; generally, 30 to 50 frames per second give intelligible speech with good
        compression.
    •   LPC methods provide extremely accurate estimates of speech parameters, and do so extremely
        efficiently.
    • Basic idea of Linear Prediction: current speech sample can be closely approximated as a linear
        combination of past samples
                s(n) ≈ Σ (k=1 to p) α_k s(n-k),   for some order p and coefficients α_k
    •   In LP, the predictor coefficients (the α_k's) are determined (computed) by minimizing the sum of
        squared differences (over a finite interval) between the actual speech samples and the linearly
        predicted ones.
    •   LP is based on speech production and synthesis models:
            o speech can be modeled as the output of a linear, time-varying system, excited by either
              quasi-periodic pulses or noise;
            o we assume that the model parameters remain constant over the speech analysis interval.
    •   LP provides a robust, reliable and accurate method for
            o estimating the parameters of the linear system (the combined vocal tract, glottal pulse,
              and radiation characteristic for voiced speech).
    •   LP methods have been used in control and information theory, where they are called methods of
        system estimation and system identification. They are used extensively in speech processing under a
        group of names including (a short sketch of the autocorrelation method appears after this list):
            1. covariance method
            2. autocorrelation method
            3. lattice method
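Below is a compact sketch of the autocorrelation method: compute the frame's autocorrelation and solve for the predictor coefficients with the Levinson-Durbin recursion (the order and the synthetic test frame are assumptions for illustration):

    import numpy as np

    def lpc_autocorr(frame, order):
        """Predictor coefficients alpha_k for s(n) ~ sum_k alpha_k * s(n-k),
        estimated with the autocorrelation method (Levinson-Durbin recursion)."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1: len(frame) + order]
        alpha = np.zeros(order)
        err = r[0]
        for i in range(order):
            k = (r[i + 1] - np.dot(alpha[:i], r[i:0:-1])) / err   # reflection coefficient
            alpha[:i] = alpha[:i] - k * alpha[:i][::-1]           # update earlier coefficients
            alpha[i] = k
            err *= 1.0 - k * k                                    # remaining prediction error
        return alpha, err

    sr = 8000
    t = np.arange(0, 0.04, 1 / sr)                 # one 40 ms analysis frame
    frame = np.sin(2 * np.pi * 150 * t)            # synthetic "voiced" frame
    alpha, err = lpc_autocorr(frame, order=2)
    print(alpha)   # for a pure sine, the coefficients approach [2*cos(2*pi*150/sr), -1]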