NLP Unit IV

UNIT 4

1. PHONETICS
Phonetics is a branch of linguistics that studies how humans produce and perceive sounds.
Phoneticians—linguists who specialize in phonetics—study the physical properties of speech.
The field of phonetics is traditionally divided into three sub-disciplines based on the research questions involved:
1. how humans plan and execute movements to produce speech (articulatory phonetics)
2. how various movements affect the properties of the resulting sound (acoustic phonetics)
3. how humans convert sound waves to linguistic information (auditory phonetics).
Traditionally, the minimal linguistic unit of phonetics is the phone—a speech sound in a language.
A phoneme is the smallest unit of sound in speech. For example, the word 'hat' has 3 phonemes: 'h', 'a' and 't'.
English has approximately 44 phonemes, which are classified into vowels, glides, semivowels and consonants.

A grapheme is a letter or a group of letters that represents a sound in our speech. So, a grapheme is the letter or
letters that represent a phoneme. English has a complex written code, and in this code a grapheme can be 1, 2, 3 or 4
letters. For example:

1 letter grapheme – m a t (m)

2 letter grapheme – sh i p (sh) DIGRAPH

3 letter grapheme – n igh t (igh) TRIGRAPH

4 letter grapheme – eigh t (eigh) QUADGRAPH

Phonetics broadly deals with two aspects of human speech:


1. production—the ways humans make sounds
2. perception—the way speech is understood.
Language production consists of several interdependent processes which transform a non-linguistic message into a
spoken or signed linguistic signal.
1. After identifying a message to be linguistically encoded, a speaker must select the individual words—known
as lexical items—to represent that message in a process called lexical selection.
2. During phonological encoding, the mental representation of the words is assigned their phonological content as
a sequence of phonemes to be produced. The phonemes are specified for articulatory features which denote
particular goals such as closed lips or the tongue in a particular location.
3. These phonemes are then coordinated into a sequence of muscle commands that can be sent to the muscles, and
when these commands are executed properly the intended sounds are produced.

2. ARTICULATORY PHONETICS
Articulatory phonetics is the study of how phones are produced as the various organs in the mouth, throat, and nose
modify the airflow from the lungs.

Articulation is a process resulting in the production of speech sounds. It consists of a series of movements by a set of
organs of speech called the articulators. The articulators that move during the process of articulation are called active
articulators. Organs of speech which remain relatively motionless are called passive articulators. The point at which
an articulator moves towards or comes into contact with another organ is the place of articulation. The
type or nature of the movement made by the articulator is called the manner of articulation.

2.1 Vocal Organs – Sound is produced by the rapid movement of air. Humans produce most sounds in spoken languages
by expelling air from the lungs through the windpipe (technically, the trachea) and then out the mouth or nose. As it
passes through the trachea, the air passes through the larynx, commonly known as the Adam’s apple or voice box.
The larynx contains two small folds of muscle, the vocal folds or vocal cords, which can be moved together or apart.
The space between these two folds is called the glottis. If the folds are close together (but not tightly closed), they will
vibrate as air passes through them; if they are far apart, they won’t vibrate.

Sounds made with the vocal folds together and vibrating are called voiced; sounds made without this cord vibration are
called unvoiced or voiceless. Voiced sounds include [b], [d], [g], [v], [z], and all the English vowels, among others.
Unvoiced sounds include [p], [t], [k], [f], [s], and others.

The area above the trachea is called the vocal tract; it consists of the oral tract and the nasal tract. After the air leaves
the trachea, it can exit the body through the mouth or the nose. Most sounds are made by air passing through the mouth.
Sounds made by air passing through the nose are called nasal sounds; nasal sounds (like English [m], [n], and [ng]) use
both the oral and nasal tracts as resonating cavities.

Phones are divided into two main classes: consonants and vowels. Both kinds of sounds are formed by the motion of
air through the mouth, throat or nose.

Consonants are made by restriction or blocking of the airflow in some way, and can be voiced or unvoiced.

Vowels have less obstruction, are usually voiced, and are generally louder and longer-lasting than consonants. The
technical use of these terms is much like the common usage; [p], [b], [t], [d], [k], [g], [f], [v], [s], [z], [r], [l], etc., are
consonants; [aa], [ae], [ao], [ih], [aw], [ow], [uw], etc., are vowels.

Semivowels (such as [y] and [w]) have some of the properties of both: they are voiced like vowels, but, like consonants,
they are short and less syllabic.

Glides can be defined as vowel-like sounds, differing from vowels in lacking the continuant characteristic: glides
are produced during a fast dynamic change of the articulators. The first sounds of the words "you" and "what" provide
examples of glides.

Consonants can be classified on the basis of the manner of articulation as stops, nasals, fricatives, trills, flaps, laterals,
affricates, continuants, etc.

2.2 Consonants: Place of Articulation


Because consonants are made by restricting airflow, we can group them into classes by their point of maximum
restriction, their place of articulation.

Labial: Consonants whose main restriction is formed by the two lips coming together have a bilabial place of
articulation. In English these include [p] as in possum, [b] as in bear, and [m] as in marmot. The English labiodental
consonants [v] and [f] are made by pressing the bottom lip against the upper row of teeth and letting the air flow through
the space in the upper teeth.

Dental: Sounds that are made by placing the tongue against the teeth are dentals. The main dentals in English are the
[th] of thing and the [dh] of though, which are made by placing the tongue behind the teeth with the tip slightly between
the teeth.

Alveolar: The alveolar ridge is the portion of the roof of the mouth just behind the upper teeth. Most speakers of
American English make the phones [s], [z], [t], and [d] by placing the tip of the tongue against the alveolar ridge. The
word coronal is often used to refer to both dental and alveolar.
Palatal: The roof of the mouth (the palate) rises sharply from the back of the alveolar ridge. The palato-alveolar
sounds [sh] (shrimp), [ch] (china), [zh] (Asian), and [jh] (jar) are made with the blade of the tongue against the rising
back of the alveolar ridge. The palatal sound [y] of yak is made by placing the front of the tongue up close to the palate.

Velar: The velum, or soft palate, is a movable muscular flap at the very back of the roof of the mouth. The sounds [k]
(cuckoo), [g] (goose), and [N] (kingfisher) are made by pressing the back of the tongue up against the velum.

Glottal: The glottal stop [q] is made by closing the glottis (by bringing the vocal folds together).

2.3 Consonants: Manner of Articulation

Consonants are also distinguished by how the restriction in airflow is made, for example, by a complete stoppage of air
or by a partial blockage. This feature is called the manner of articulation of a consonant.

The combination of place and manner of articulation is usually sufficient to uniquely identify a consonant. Following
are the major manners of articulation for English consonants:

Stop: A stop is a consonant in which airflow is completely blocked for a short time. This blockage is followed by an
explosive sound as the air is released. The period of blockage is called the closure, and the explosion is called the release.
English has voiced stops like [b], [d], and [g] as well as unvoiced stops like [p], [t], and [k]. Stops are also called plosives.

Nasal: The nasal sounds [n], [m], and [ng] are made by lowering the velum and allowing air to pass into the nasal cavity.

Fricatives : In fricatives, airflow is constricted but not cut off completely. The turbulent airflow that results from the
constriction produces a characteristic “hissing” sound. The English labiodental fricatives [f] and [v] are produced by
pressing the lower lip against the upper teeth, allowing a restricted airflow between the upper teeth. The dental fricatives
[th] and [dh] allow air to flow around the tongue between the teeth. The alveolar fricatives [s] and [z] are produced with
the tongue against the alveolar ridge, forcing air over the edge of the teeth.

Sibilants: The higher-pitched fricatives (in English [s], [z], [sh] and [zh]) are called sibilants.

Affricates: Stops that are followed immediately by fricatives are called affricates; these include English [ch] (chicken)
and [jh] (giraffe).

Approximants: In approximants, the two articulators are close together but not close enough to cause turbulent airflow.
In English [y] (yellow), the tongue moves close to the roof of the mouth but not close enough to cause the turbulence
that would characterize a fricative. In English [w] (wood), the back of the tongue comes close to the velum.

Tap: A tap or flap [dx] is a quick motion of the tongue against the alveolar ridge. The consonant in the middle of the
word lotus ([l ow dx ax s]) is a tap in most dialects of American English.

2.4 Vowels
Like consonants, vowels can be characterized by the position of the articulators as they are made. The three most relevant
parameters for vowels are
a. vowel height, which correlates roughly with the height of the highest part of the tongue
b. vowel frontness or backness, indicating whether this high point is toward the front or back of the oral tract
c. lip rounding, i.e., whether the shape of the lips is rounded or not.

Front Vowels: Vowels in which the tongue is raised toward the front are called front vowels. Both [ih] and [eh] are
front vowels; the tongue is higher for [ih] than for [eh].

Back Vowels: Vowels in which the tongue is raised toward the back are called back vowels. Eg. [ow], [aw].

High vowels: Vowels in which the highest point of the tongue is comparatively high are called high vowels.
Eg. [uw], [iy]

Mid and low vowels: Vowels with mid or low values of maximum tongue height are called mid vowels or low vowels,
respectively. Eg. [aa], [ae].

Diphthong: A vowel in which the tongue position changes markedly during the production of the vowel is a diphthong.
A diphthong is a sound made by combining two vowels (from the Greek diphthongos, meaning "having two sounds"):
it starts as one vowel sound and moves to another, like the oy sound in oil, or the vowel sounds in chair,
fear, and pout.

2.5 Syllables
Consonants and vowels combine to make a syllable. A syllable is a vowel-like sound together with some of the
surrounding consonants that are most closely associated with it. The word dog has one syllable, [d aa g]; the word
catnip has two syllables, [k ae t] and [n ih p].

The vowel at the core of a syllable is called the nucleus. Initial consonants, if any, are called the onset. Onsets with
more than one consonant (as in strike [s t r ay k]) are called complex onsets. The coda is the optional consonant or
sequence of consonants following the nucleus. Thus [d] is the onset of dog, and [g] is the coda. The rime, or
rhyme, is the nucleus plus coda.
The figure below shows some sample syllable structures.

[Figure: syllable-structure trees for ham, green, and eggs. Each syllable (σ) splits into an optional onset and a rime,
and the rime splits into a nucleus and an optional coda: ham has onset [h], a nucleus, and coda [m]; green has the
complex onset [g r], a nucleus, and coda [n]; eggs has no onset, only a nucleus and a coda.]

Fig: Syllable structure of ham, green, eggs. σ = syllable.

Syllabification: The task of automatically breaking up a word into syllables is called syllabification. Syllable structure
is also closely related to the phonotactics of a language.
Phonotactics: The term phonotactics means the constraints on which phones can follow each other in a language. For
example, English has strong constraints on what kinds of consonants can appear together in an onset; the sequence
[zdr], for example, cannot be a legal English syllable onset. Phonotactics can be represented by a language model or
finite-state model of phone sequences.
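As a tiny illustration of a phonotactic check, the sketch below tests candidate onsets against a small set of legal English onsets. The onset inventory here is a deliberately incomplete, hypothetical subset chosen for illustration; a real system would induce the constraints from a pronunciation lexicon or a finite-state/language model of phone sequences.

    # Toy phonotactic check: is a candidate onset a legal English onset?
    # LEGAL_ONSETS is a small illustrative subset, not a complete inventory.
    LEGAL_ONSETS = {
        (), ("p",), ("t",), ("k",), ("s",), ("sh",), ("b",), ("d",), ("g",),
        ("s", "t"), ("s", "p"), ("s", "k"),
        ("s", "t", "r"), ("s", "p", "r"), ("s", "k", "r"),
    }

    def is_legal_onset(phones):
        """Return True if the phone sequence is in our (toy) onset inventory."""
        return tuple(phones) in LEGAL_ONSETS

    print(is_legal_onset(["s", "t", "r"]))  # True  (as in 'strike')
    print(is_legal_onset(["z", "d", "r"]))  # False (not a legal English onset)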
3. ACOUSTIC PHONETICS

3.1 Waves - Acoustic analysis is based on the sine and cosine functions. The sine wave can be
described using the function y=a*sin(bx), where:

• a is known as the amplitude of the sine wave


• b is known as the periodicity

The amplitude dictates the magnitude of the swings, whereas the periodicity dictates how often the swings
happen.

The frequency f is the number of times per second that the wave repeats itself, that is, the number of cycles
per second. Frequency is usually measured in cycles per second, called hertz (Hz). The amplitude A of a periodic
sine wave is the maximum value on the y-axis. The period T of the wave is the time it takes for one cycle to complete,
defined as
T = 1/f
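As a small sketch of these definitions (assuming NumPy is available, and taking b = 2πf so that the wave completes f cycles per second), the following generates a sine wave with a given amplitude and frequency and reports its period T = 1/f:

    import numpy as np

    a = 1.0        # amplitude: magnitude of the swings
    f = 100.0      # frequency in hertz (cycles per second)
    fs = 8000.0    # sampling rate used to evaluate the function

    t = np.arange(0.0, 0.02, 1.0 / fs)         # 20 ms of time points
    y = a * np.sin(2 * np.pi * f * t)          # y = a*sin(bx) with b = 2*pi*f

    T = 1.0 / f                                # period of one cycle
    print(f"period T = {T*1000:.1f} ms, i.e. {f:.0f} cycles fit into one second")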

3.2 Speech Sound Waves
The input to a speech recognizer, like the input to the human ear, is a complex series of changes in air
pressure. These changes in air pressure originate with the speaker. Sound waves are represented by plotting the
change in air pressure over time; such a graph measures the amount of compression or rarefaction (uncompression)
of the air molecules at some measuring point, such as a microphone's diaphragm or the eardrum.

The first step in digitizing a sound wave is to convert the analog representation into a digital signal.
This analog-to-digital conversion has two steps: sampling and quantization.

To sample a signal, we measure its amplitude at a particular time; the sampling rate is the number of samples
taken per second. To accurately measure a wave, we must have at least two samples in each cycle: one
measuring the positive part of the wave and one measuring the negative part. More than two samples per cycle
increases the amplitude accuracy, but fewer than two samples causes the frequency of the wave to be
completely missed. Thus, the maximum frequency wave that can be measured is one whose frequency is half
the sample rate (since every cycle needs two samples). This maximum frequency for a given sampling rate is
called the Nyquist frequency. Most information in human speech is in frequencies below
10,000 Hz; thus, a 20,000 Hz sampling rate would be necessary for complete accuracy.

But telephone speech is filtered by the switching network, and only frequencies less than 4,000 Hz are
transmitted by telephones. Thus, an 8,000 Hz sampling rate is sufficient for telephone-bandwidth speech like
the Switchboard corpus, while 16,000 Hz sampling is often used for microphone speech.

Even an 8,000 Hz sampling rate requires 8000 amplitude measurements for each second of speech, so it is
important to store amplitude measurements efficiently. They are usually stored as integers, either 8-bit (values
from −128 to 127) or 16-bit (values from −32768 to 32767). This process of representing real-valued numbers as
integers is called quantization because the difference between two integers acts as a minimum granularity (a
quantum size) and all values that are closer together than this quantum size are represented identically.
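A minimal sketch of these two steps, assuming NumPy and a synthetic input tone (real recognizers read already-digitized samples; this only makes the sampling rate, Nyquist frequency, and 16-bit quantization concrete):

    import numpy as np

    fs = 16000                     # sampling rate in Hz (microphone speech)
    f0 = 440.0                     # a test tone well below the Nyquist frequency fs/2
    t = np.arange(0, 0.01, 1.0 / fs)           # sample times, 10 ms of signal
    analog = 0.5 * np.sin(2 * np.pi * f0 * t)  # "analog" amplitudes in [-1, 1]

    # Quantization: map real-valued amplitudes to 16-bit integers.
    quantized = np.clip(np.round(analog * 32767), -32768, 32767).astype(np.int16)

    print(f"Nyquist frequency: {fs / 2} Hz")
    print(f"first samples: {quantized[:5]}")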

Once data is quantized, it is stored in various formats. One parameter of these formats is the sample rate and
sample size as discussed earlier. Another parameter is the number of channels. For stereo data or for two-
party conversations, we can store both channels in the same file or we can store them in separate files. A final
parameter is individual sample storage—linearly or compressed. One common compression format used for
telephone speech is µ-law. The equation for compressing a linear PCM (Pulse Code Modulation) sample value
x to 8-bit µ-law (where µ = 255 for 8 bits) is:

F(x) = sgn(x) * log(1 + µ|x|) / log(1 + µ),   −1 ≤ x ≤ 1
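A small sketch of this compression formula, assuming NumPy and samples already normalized to [−1, 1] (the final scaling to actual 8-bit code words is omitted):

    import numpy as np

    def mu_law_compress(x, mu=255):
        """Compress linear PCM samples x in [-1, 1] with the mu-law formula."""
        x = np.asarray(x, dtype=float)
        return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

    samples = np.array([0.0, 0.01, 0.1, 0.5, 1.0, -0.5])
    print(mu_law_compress(samples))   # small amplitudes are boosted relative to large ones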

There are a number of standard file formats for storing the resulting digitized wavefile, such as Microsoft's
.wav and Apple's AIFF, all of which have special headers; simple headerless "raw" files are also used. For
example, the .wav format is a subset of Microsoft's RIFF format for multimedia files; RIFF is a general
format that can represent a series of nested chunks of data and control information. The standard Microsoft
wavefile header for a simple file with one chunk is 44 bytes long; following this 44-byte header comes
the data chunk.
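As an example, the basic header fields of a .wav file can be inspected with Python's standard wave module (a sketch only; "example.wav" is a placeholder path to any PCM wavefile):

    import wave

    # "example.wav" is a placeholder path to any PCM wavefile.
    with wave.open("example.wav", "rb") as w:
        print("channels:     ", w.getnchannels())   # mono or stereo
        print("sample width: ", w.getsampwidth())   # bytes per sample (2 = 16 bit)
        print("sample rate:  ", w.getframerate())   # e.g. 8000 or 16000 Hz
        print("num samples:  ", w.getnframes())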

3.3 Pitch and Loudness
Voiced sounds are caused by regular openings and closings of the vocal folds. When the vocal folds are open, air
pushes up from the lungs, creating a region of high pressure; when the folds are closed, there is no pressure
from the lungs. Thus, when the vocal folds are vibrating, the waveform shows regular peaks in amplitude, each
major peak corresponding to an opening of the vocal folds. The frequency of the vocal fold vibration, or the
frequency of the complex wave, is called the fundamental frequency of the waveform, often abbreviated
F0. We can plot F0 over time in a pitch track. Below Figure shows the pitch track of a short question, “Three
o’clock?” represented below the waveform. Note the rise in F0 at the end of the question.

[Figure: pitch track of the question "Three o'clock?", shown below the waveform; frequency axis 0 to 500 Hz,
time axis 0 to about 0.54 s.]

Figure: Pitch track of the question "Three o'clock?", shown below the wavefile. Note the rise in F0 at the end
of the question. Note also the lack of a pitch trace during the very quiet part (the "o'" of "o'clock"); automatic
pitch tracking is based on counting the pulses in the voiced regions and does not work if there is no voicing
(or insufficient sound).

We also often need to know the average amplitude over some time range. We generally
use the RMS (root-mean-square) amplitude, which squares each number before averaging (making it
positive), and then takes the square root at the end.

The power of the signal is related to the square of the amplitude. If the number of samples of a sound is N, the
power is

Power = (1/N) * Σ_{i=1}^{N} x_i^2
The intensity of sound normalizes the power to the human auditory threshold and is measured in dB. If P_0 is
the auditory threshold pressure (2 × 10^−5 Pa), then intensity is defined as:

Intensity = 10 * log_10( (1 / (N * P_0)) * Σ_{i=1}^{N} x_i^2 )
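A short sketch of these three quantities, assuming NumPy and treating the samples x as air-pressure values on the same scale as P_0 (a simplifying assumption for illustration):

    import numpy as np

    def rms(x):
        """Root-mean-square amplitude: square, average, then take the square root."""
        x = np.asarray(x, dtype=float)
        return np.sqrt(np.mean(x ** 2))

    def power(x):
        """Power = (1/N) * sum of squared samples."""
        x = np.asarray(x, dtype=float)
        return np.mean(x ** 2)

    def intensity_db(x, p0=2e-5):
        """Intensity in dB relative to the auditory threshold pressure p0."""
        x = np.asarray(x, dtype=float)
        return 10.0 * np.log10(np.sum(x ** 2) / (len(x) * p0))

    x = 0.02 * np.sin(2 * np.pi * 200 * np.arange(0, 0.05, 1 / 8000.0))
    print(rms(x), power(x), intensity_db(x))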
Two important perceptual properties, pitch and loudness, are related to frequency and intensity. The pitch of
a sound is the mental sensation, or perceptual correlate, of fundamental frequency; in general, if a sound has
a higher fundamental frequency we perceive it as having a higher pitch.

Roughly speaking, human pitch perception is most accurate between 100 Hz and 1000 Hz and in this range
pitch correlates linearly with frequency. Human hearing represents frequencies above 1000 Hz less accurately,
and above this range, pitch correlates logarithmically with frequency. Logarithmic representation means
that the differences between high frequencies are compressed and hence not as accurately perceived. There
are various psychoacoustic models of pitch perception scales. One common model is the mel scale (Stevens
et al. 1937, Stevens and Volkmann 1940). A mel is a unit of pitch defined such that pairs of sounds which are
perceptually equidistant in pitch are separated by an equal number of mels. The mel frequency m can be
computed from the raw acoustic frequency as follows:
m = 1127 * ln(1 + f/700)
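A one-line sketch of this conversion (natural logarithm, as in the formula above; NumPy assumed):

    import numpy as np

    def hz_to_mel(f):
        """Convert a frequency in Hz to mels: m = 1127 * ln(1 + f/700)."""
        return 1127.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

    print(hz_to_mel([100, 1000, 4000, 8000]))   # mel values grow roughly logarithmically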

The loudness of a sound is the perceptual correlate of the power. So sounds with higher amplitudes are
perceived as louder, but again the relationship is not linear.

First of all, humans have greater resolution in the low-power range; the ear is more sensitive to small power
differences. Second, it turns out that there is a complex relationship between power, frequency, and perceived
loudness; sounds in certain frequency ranges are perceived as being louder than those in other frequency
ranges.
Various algorithms exist for automatically extracting F0, and many pitch extraction toolkits are available.
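As one illustration, here is a very rough autocorrelation-based F0 estimate over a single voiced frame (a sketch only, under the assumption that the frame is voiced; production pitch trackers also handle voicing decisions, octave errors, and smoothing):

    import numpy as np

    def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
        """Estimate F0 of one voiced frame by finding the autocorrelation peak."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(fs / fmax), int(fs / fmin)      # search only plausible pitch lags
        lag = lo + np.argmax(ac[lo:hi])
        return fs / lag

    fs = 8000
    t = np.arange(0, 0.04, 1.0 / fs)                 # one 40 ms frame
    frame = np.sin(2 * np.pi * 120 * t)              # synthetic 120 Hz "voicing"
    print(estimate_f0(frame, fs))                    # roughly 120 Hz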

3.4 Interpretation of Phones from a Waveform


Vowels are pretty easy to spot. Recall that vowels are voiced; another property of vowels is that they tend to
be long and are relatively loud. Length in time manifests itself directly on the x-axis, and loudness is related
to (the square of) amplitude on the y-axis. Each major peak in the graph corresponds to an opening of the
vocal folds. The figure below shows the waveform of the short sentence "she just had a baby", labeled with
word and phone labels.

[Figure: waveform of the sentence "she just had a baby" (0 to 1.059 s), segmented and labeled with its words and phones.]

For a stop consonant, which consists of a closure followed by a release, we can often see a period of silence
or near silence followed by a slight burst of amplitude. We can see this for both of the [b]'s in baby in the figure above.

Another phone that is often quite recognizable in a waveform is a fricative. Recall that fricatives, especially
very strident fricatives like [sh], are made when a narrow channel for airflow causes noisy, turbulent air. The
resulting hissy sounds have a noisy, irregular waveform. This can be seen somewhat in the figure above; it is even
clearer in the figure below, where we have magnified just the first word she.

[Figure: magnified waveform of the word "she" (about 0 to 0.26 s).]

Fig: A more detailed view of the first word "she" extracted from the wavefile above. Notice the difference
between the random noise of the fricative [sh] and the regular voicing of the vowel [iy].

3.5 Spectra and the Frequency Domain


While some broad phonetic features (such as energy, pitch, and the presence of voicing, stop closures, or
fricatives) can be interpreted directly from the waveform, most computational applications such as speech
recognition (as well as human auditory processing) are based on a different representation of the sound in
terms of its component frequencies. The insight of Fourier analysis is that every complex wave can be
represented as a sum of many sine waves of different frequencies. Consider the waveform in below fig. This
waveform was created by summing two sine waveforms, one of frequency 10 Hz and one of frequency 100
Hz.

We can represent these two component frequencies with a spectrum. The spectrum of a signal is a
representation of each of its frequency components and their amplitudes. The figure below shows the spectrum of
the waveform above. Frequency in Hz is on the x-axis and amplitude on the y-axis. Note the two spikes in the figure,
one at 10 Hz and one at 100 Hz. Thus, the spectrum is an alternative representation of the original waveform,
and we use the spectrum as a tool to study the component frequencies of a sound wave at a particular time
point.
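A sketch of this decomposition using NumPy's FFT: we sum a 10 Hz and a 100 Hz sine wave and recover the two spikes in the magnitude spectrum (the sampling rate and duration are illustrative choices):

    import numpy as np

    fs = 1000                                         # sampling rate in Hz
    t = np.arange(0, 1.0, 1.0 / fs)                   # one second of signal
    signal = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 100 * t)

    spectrum = np.abs(np.fft.rfft(signal))            # magnitude of each frequency component
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)  # frequency axis in Hz

    peaks = freqs[np.argsort(spectrum)[-2:]]          # the two strongest components
    print(sorted(peaks))                              # approximately [10.0, 100.0]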

[Figure: spectrum of the summed waveform, amplitude versus frequency in Hz (logarithmic axis from 1 to 200 Hz);
the two spikes appear at 10 Hz and 100 Hz.]

[Figure: the waveform of part of the vowel [ae] from the word had, cut out of the sentence waveform shown earlier;
amplitude versus time, about 0 to 0.0427 s.]

Note that there is a complex wave that repeats about ten times in the figure; but there is also a smaller repeated
wave that repeats four times for every larger pattern (notice the four small peaks inside each repeated wave).

The complex wave has a frequency of about 234 Hz (we can figure this out since it repeats roughly 10 times
in .0427 seconds, and 10 cycles/.0427 seconds = 234 Hz).

The smaller wave then should have a frequency of roughly four times the frequency of the larger wave, or
roughly 936 Hz. Then, if you look carefully, you can see two little waves on the peak of many of the 936 Hz
waves. The frequency of this tiniest wave must be roughly twice that of the 936 Hz wave, hence 1872 Hz.

While a spectrum shows the frequency components of a wave at one point in time, a spectrogram is a way of
envisioning how the different frequencies that make up a waveform change over time. The x-axis shows time,
as it did for the waveform, but the y-axis now shows frequencies in hertz. The darkness of a point on a
spectrogram corresponds to the amplitude of the frequency component. Very dark points have high amplitude,
light points have low amplitude. Thus, the spectrogram is a useful way of visualizing the three dimensions
(time x frequency x amplitude).

Below Figure shows spectrograms of three English vowels, [ih], [ae], and [ah]. Note that each vowel has a set
of dark bars at various frequency bands, slightly different bands for each vowel. Each of these bars represents the
same kind of spectral peak that we saw in the spectrum above.

[Figure: spectrograms of the vowels [ih], [ae], and [ah]; time on the x-axis, frequency (Hz) on the y-axis,
darkness indicating amplitude.]

Each dark bar (or spectral peak) is called a formant. As we discuss below, a formant is a frequency band that
is particularly amplified by the vocal tract. Since different vowels are produced with the vocal tract in different
positions, they will produce different kinds of amplifications or resonances.

3.6 The Source Filter Model


The source-filter model is a way of explaining the acoustics of a sound by modeling how the pulses produced
by the glottis (the source) are shaped by the vocal tract (the filter). Whenever we have a wave such as the
vibration in air caused by the glottal pulse, the wave also has harmonics. A harmonic is another
wave whose frequency is a multiple of the fundamental wave. Thus, for example, a 115 Hz glottal fold
vibration leads to harmonics (other waves) of 230 Hz, 345 Hz, 460 Hz, and so on. In general, each of these
waves will be weaker, that is, will have much less amplitude than the wave at the fundamental frequency.

It turns out, however, that the vocal tract acts as a kind of filter or amplifier; indeed any cavity, such as a tube,
causes waves of certain frequencies to be amplified and others to be damped. This amplification process is
caused by the shape of the cavity; a given shape will cause sounds of a certain frequency to resonate and hence
be amplified. Thus, by changing the shape of the cavity, we can cause different frequencies to be amplified.
When we produce particular vowels, we are essentially changing the shape of the vocal tract cavity by placing
the tongue and the other articulators in particular positions. The result is that different vowels cause different
harmonics to be amplified. So a wave of the same fundamental frequency passed through different vocal tract
positions will result in different harmonics being amplified.

We can see the result of this amplification by looking at the relationship between the shape of the vocal tract
and the corresponding spectrum. Below Figure shows the vocal tract position for three vowels and a typical
resulting spectrum. The formants are places in the spectrum where the vocal tract happens to amplify particular
harmonic frequencies.

[Figure: vocal tract positions (tongue shapes) for three English vowels, including [iy] as in tea and [ae] as in cat,
with the resulting smoothed spectra and their formants F1 and F2 marked.]

Figure: Visualizing the vocal tract position as a filter: the tongue positions for three English vowels and the
resulting smoothed spectra showing F1 and F2.
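To make the source-filter idea concrete, the sketch below passes a glottal-like impulse train (the source) through a simple two-resonance filter standing in for the vocal tract, so that different resonance settings amplify different harmonics. It assumes NumPy and SciPy, and the resonance frequencies and bandwidths are illustrative values, not measured vowel formants:

    import numpy as np
    from scipy.signal import lfilter

    fs = 8000
    f0 = 115                                      # fundamental frequency of the "glottal" source
    source = np.zeros(fs // 2)                    # half a second of signal
    source[::fs // f0] = 1.0                      # impulse train: one pulse per glottal cycle

    def resonator(freq, bw, fs):
        """Second-order all-pole resonator centered at freq with bandwidth bw (Hz)."""
        r = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * freq / fs
        return [1.0], [1.0, -2 * r * np.cos(theta), r ** 2]

    # Two illustrative "formant" resonances (assumed, roughly vowel-like values).
    b1, a1 = resonator(300, 80, fs)
    b2, a2 = resonator(2300, 120, fs)
    speechlike = lfilter(b2, a2, lfilter(b1, a1, source))
    print(speechlike[:5])

Changing the resonator frequencies (the "tongue position") changes which harmonics of the 115 Hz source are amplified, which is exactly the effect the text describes.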

4. SHORT-TIME (or SHORT-TERM) FOURIER TRANSFORM

• Speech is not a stationary signal, i.e., it has properties that change with time.
• Thus a single representation based on all the samples of a speech utterance, for the most part, has no meaning.
• Instead, we define a time-dependent Fourier transform (TDFT or STFT) of speech that changes periodically as
the speech properties change over time.

The short-time Fourier transform (STFT) is a Fourier-related transform used to determine the sinusoidal
frequency and phase content of local sections of a signal as it changes over time. Generally, the procedure for
computing STFTs is to divide a longer time signal into shorter segments of equal length and then compute the
Fourier transform separately on each shorter segment. This reveals the Fourier spectrum on each shorter
segment.
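A minimal sketch of this procedure (frame the signal, window each frame, and take a Fourier transform per frame), assuming NumPy; the frame length and hop size are illustrative choices, and libraries such as SciPy also provide ready-made STFT routines:

    import numpy as np

    def stft(x, frame_len=256, hop=128):
        """Naive STFT: split x into overlapping frames, window each, and FFT it."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop: i * hop + frame_len] * window
                           for i in range(n_frames)])
        return np.fft.rfft(frames, axis=1)       # rows = time frames, cols = frequency bins

    fs = 8000
    t = np.arange(0, 1.0, 1.0 / fs)
    x = np.sin(2 * np.pi * 440 * t)              # a test signal
    S = stft(x)
    print(S.shape)                               # (number of frames, frame_len // 2 + 1)

The magnitude of S, plotted with time on one axis and frequency on the other, is the spectrogram described in the previous section.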

It is a powerful general-purpose tool for audio signal processing. It defines a particularly useful class of time-
frequency distributions which specify complex amplitude versus time and frequency for any signal. STFT
parameters are tuned for the following applications:

1. Approximating the time-frequency analysis performed by the ear for purposes of spectral display.
2. Measuring model parameters in a short-time spectrum.

In the first case, applications of audio spectral display go beyond merely looking at the spectrum. They also
provide a basis for audio signal processing tasks intended to imitate human perception, such as auditory scene
recognition or automatic transcription of music.

Examples of the second case include estimating the decay-time-versus-frequency for vibrating strings and
body resonances, or measuring as precisely as possible the fundamental frequency of a periodic signal based
on tracking its many harmonics in the STFT.

The STFT is a function of two variables: the time index, n̂, which is discrete, and the frequency variable, ω̂,
which is continuous.

5. FILTER BANKS
In signal processing, a filter bank (or filterbank) is an array of bandpass filters that separates the input
signal into multiple components, each one carrying a single frequency sub-band of the original signal. One
application of a filter bank is a graphic equalizer, which can attenuate the components differently and
recombine them into a modified version of the original signal.

In other words, a class of systems that generate scaling and wavelet functions is known as a filter bank. A filter
bank has a simple structure: it is a set of bandpass filters with each filter centered at a different frequency.
A digital filter bank is a collection of filters having a common input or output. In digital signal processing, the
term filter bank is also commonly applied to a bank of receivers. The difference is that receivers also down-
convert the subbands to a low center frequency that can be re-sampled at a reduced rate. The same result can
sometimes be achieved by undersampling the bandpass subbands.
Filter banks play an important role in modern signal and image processing applications such as audio and
image coding.

How does a Filter Bank work?
A filter bank separates or splits the input signal into multiple components. The process of separating the
input signals into multiple components is known as analysis. The output of the analysis is referred to as a
sub-band of the original signal.

Now, the filter bank attenuates the components of the signal differently and reconstructs them into an
improved version of the original signal. This reconstruction process is known as synthesis.

For example, if we have an input audio signal x(n), then filter banks separate this input audio signal into a
set of analysis signals, i.e., x1(n), x2(n), x3(n), etc.; each of these analysis signals corresponds to a
different region in the spectrum of the input signal x(n).

This set of analysis signals x1(n), x2(n), x3(n)… can be obtained by filter banks with bandwidths BW1,
BW2, BW3… and centre frequencies fc1, fc2, fc3… respectively.

In the frequency response of such a filter bank, the bands of a three-band filter bank do not overlap but are
lined up one after the other, with adjacent band edges touching each other. These
three bands span the frequency range from fcl1 = 0 Hz to fch3 = fmax.
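A sketch of a three-band analysis along these lines, using SciPy Butterworth filters (the band edges, filter order, and test signal are illustrative assumptions, and simply summing the sub-bands is only a crude stand-in for a proper synthesis bank):

    import numpy as np
    from scipy.signal import butter, lfilter

    fs = 8000
    # Illustrative band edges covering 0 Hz up to fmax = fs/2 (low, mid, high bands).
    bands = [(None, 500), (500, 2000), (2000, None)]

    def band_filter(x, lo, hi, fs, order=4):
        """Filter x into one sub-band: lowpass, bandpass, or highpass as needed."""
        if lo is None:
            b, a = butter(order, hi, btype="lowpass", fs=fs)
        elif hi is None:
            b, a = butter(order, lo, btype="highpass", fs=fs)
        else:
            b, a = butter(order, [lo, hi], btype="bandpass", fs=fs)
        return lfilter(b, a, x)

    t = np.arange(0, 0.1, 1.0 / fs)
    x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
    subbands = [band_filter(x, lo, hi, fs) for lo, hi in bands]   # analysis step
    reconstructed = np.sum(subbands, axis=0)                      # crude synthesis step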

Analysis and Synthesis Filter Bank


There are two main types of filter banks: an analysis filter bank and a synthesis filter bank. An analysis
filter bank is a set of M analysis filters H_k(z) which split an input signal into M sub-band signals X_k(n), and a
synthesis filter bank is a set of M synthesis filters F_k(z) which combine M signals Y_k(n) into a reconstructed
signal x^(n).

The analysis filter bank decomposes the input signal into different sub-bands with different frequency
spectra, and the synthesis filter bank recombines the different sub-band signals and generates a modified
version of the original signal.

Types of Filter Banks
• DCT Filter Banks
• Polyphase Filter Banks
• Gabor Filter Banks
• Mel Filter Banks
• Filter Bank Multicarrier (FBMC)
• DFT Filter Banks
• Uniform DFT Filter Bank

Advantages of Filter Bank


• High resolution with low spectral leakage
• More stable peaks
• A more accurate and consistent noise floor
• By using filter banks, we can set channelizers to get different resolutions at different
frequency bands. This is used in audio spectral analysis.
Applications of Filter Banks
• The filter bank is used to compress the signal when some particular frequencies are more
important than other frequencies. They are used in speech processing, image compression,
communications systems, antenna systems, analog voice privacy systems, and in the digital
audio industry.
• The filter bank can be used as a graphic equalizer, which can attenuate the components of the
signal differently and reconstruct them into an improved version of the original signal.
• Filter banks are used for transmultiplexers, sub-band adaptive filtering, vocoders, etc. Here,
the vocoder uses a filter bank to determine and control the amplitude of the sub-bands of the
carrier signal.
• DCT filter Banks are used in JPEG image compression and modifications of sub-bands for
image modifications.
• DCT Filter banks are used in audio signal processing, digital audio, digital radio, speech
processing, etc.
• FBMC (Filter Bank Multicarrier) can be used as a tool for spectrum sensing.
• In document image processing, Gabor filter banks are used for identifying the script of a
word in a multilingual document.
• Gabor filter banks are widely used for decomposing an image into oriented sub bands such as
texture analysis, edge and texture detection, feature extraction, object and shape recognition,
scene segmentation, etc.

6. LPC (Linear Predictive Coding) METHODS


• LPC methods are the most widely used in speech coding, speech synthesis, speech recognition,
speaker recognition and verification, and for speech storage.
• Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech
processing for representing the spectral envelope of a digital signal of speech in compressed form,
using the information of a linear predictive model.
• It is a powerful speech analysis technique, and a useful method for encoding good quality speech at a
low bit rate.
• LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube
(for voiced sounds), with occasional added hissing and popping sounds (for voiceless sounds such
as sibilants and plosives). Although apparently crude, this Source–filter model is actually a close
approximation of the reality of speech production. The glottis (the space between the vocal folds)
produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal
tract (the throat and mouth) forms the tube, which is characterized by its resonances; these

resonances give rise to formants, or enhanced frequency bands in the sound produced. Hisses and
pops are generated by the action of the tongue, lips and throat during sibilants and plosives.
• LPC analyzes the speech signal by estimating the formants, removing their effects from the speech
signal, and estimating the intensity and frequency of the remaining buzz. The process of removing
the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered
modeled signal is called the residue.
• The numbers which describe the intensity and frequency of the buzz, the formants, and the residue
signal, can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing
the process: use the buzz parameters and the residue to create a source signal, use the formants to
create a filter (which represents the tube), and run the source through the filter, resulting in speech.
• Because speech signals vary with time, this process is done on short chunks of the speech signal,
which are called frames; generally, 30 to 50 frames per second give intelligible speech with good
compression.
• LPC methods provide extremely accurate estimates of speech parameters, and do so extremely
efficiently.
• Basic idea of Linear Prediction: the current speech sample can be closely approximated as a linear
combination of past samples:

s(n) ≈ Σ_{k=1}^{p} α_k * s(n − k),   for some order p and coefficients α_k

(a small numerical sketch of this is given at the end of this section)

• For periodic signals with period Np , it is obvious that


s(n) ≈ s(n - Np )
• But that is not what LP is doing; it is estimating s(n) from the p ( p << Np ) most recent values of
s(n) by linearly predicting its value

• For LP, the predictor coefficients (the α_k's) are determined (computed) by minimizing the sum of
squared differences (over a finite interval) between the actual speech samples and the linearly
predicted ones.
• LP is based on speech production and synthesis models
o speech can be modeled as the output of a linear, time-varying system, excited
by either quasi-periodic pulses or noise;
o assume that the model parameters remain constant over the speech analysis
interval
• LP provides a robust, reliable and accurate method for
o estimating the parameters of the linear system (the combined
vocal tract, glottal pulse, and radiation characteristic for voiced speech)

• LP methods have been used in control and information theory, where they are called methods of system
estimation and system identification.
– Used extensively in speech under a group of names, including:

1. covariance method

2. autocorrelation method

3. lattice method

4. inverse filter formulation

5. spectral estimation formulation

6. maximum likelihood method

7. inner product method
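As promised above, here is a small numerical sketch of linear prediction: a plain least-squares (covariance-style) estimate of the predictor coefficients over one frame, assuming NumPy; the frame, order p, and test signal are illustrative, and real systems typically use the autocorrelation method with Levinson-Durbin recursion instead.

    import numpy as np

    def lpc_coeffs(frame, p=10):
        """Estimate p predictor coefficients alpha_k by least squares:
        minimize sum_n ( s(n) - sum_k alpha_k * s(n-k) )^2 over the frame."""
        s = np.asarray(frame, dtype=float)
        # Each row of A holds the p previous samples [s(n-1), ..., s(n-p)].
        A = np.stack([s[p - k: len(s) - k] for k in range(1, p + 1)], axis=1)
        target = s[p:]
        alpha, *_ = np.linalg.lstsq(A, target, rcond=None)
        return alpha

    fs = 8000
    t = np.arange(0, 0.03, 1.0 / fs)                      # one 30 ms frame
    frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)
    alpha = lpc_coeffs(frame, p=10)
    predicted = np.convolve(frame, np.concatenate(([0.0], alpha)))[:len(frame)]
    residual = frame - predicted                          # the prediction error ("residue")
    print(np.round(alpha, 3), float(np.mean(residual[10:] ** 2)))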
