Speech sounds and phonetic
transcription
Phonetics
The study of speech sounds used in the languages of the world
Phone
A speech sound
Represented with phonetic symbols
Two types:
Consonants
Vowels
Phonology
Phonology is the area of linguistics that describes the systematic
way that sounds are realized differently in different
environments, and how this system of sounds is related to the
rest of the grammar.
Speech sounds and phonetic
transcription
Phonetic alphabets
IPA
Standard developed by International Phonetic Association
Alphabet + principles of transcription
ARPAbet
Designed for American English in ASCII symbols
Computer-friendly
Speech sounds and phonetic
transcription
[ɹ]
Articulatory phonetics
Definition:
The study of how phones are produced
The vocal organ
Articulatory phonetics
Consonants are defined by
Place of articulation
The point of maximum constriction
Manner of articulation
How the restriction of airflow is made
Voicing
State of the glottis
Articulatory phonetics
Sounds are formed by the motion of air through the
mouth
Consonants:
Made by restricting or blocking the airflow in some way
May be voiced (made with the vocal folds held together and
vibrating) or voiceless
Vowels:
Made with less obstruction
Usually voiced
Generally louder and longer than consonants
Articulatory phonetics
Place of articulation
coronal
SPEECH PROCESSING
Why Speech?
No visual contact required
No special equipment required
Can be done while doing other things
Speech Processing
Speech Coding
Speech Synthesis
Speech Recognition
Speaker Recognition/Verification
Speech Synthesis
Construct Speech waveform from words
Speaker Quality and Accent
Speech Recognition
Convert a sound waveform to words
The most relevant and important task in the industry
Accuracy of roughly 90% in lab conditions, much lower in factory conditions
Speaker Recognition
Concerned with Biometrics
Acceptable as a verification technique
How would this be different from Speech recognition?
Speaker Quality
Pitch, Accent etc.
Automatic Speech Recognition
Most Important Task
Hardest Task
Co-articulation: neighboring sounds overlap and influence each other's articulation
Speaker Variation
Spontaneity
Language Modeling
Noise Robustness
ASR: Problems
ASR: Method
ASR: Application
WHAT IS PHONOLOGY
Despite the infinite variations which occur when we speak, all speakers of
a language agree that certain utterances are the "same" and others are
"different". Phonology tells us why this is the case.
Note that, everyone who knows a language knows its phonology
unconsciously.
PHONETICS vs. PHONOLOGY
PHONETICS asks: how are speech sounds made?
PHONOLOGY asks: how are speech sounds classified?
PHONETICS is concerned with the physical properties of sounds;
PHONOLOGY is concerned with their abstract properties.
PHONOLOGY is the study of sounds in context; PHONETICS is the study of
sounds in isolation.
Phonology deals with the
following questions:
1. Why do related forms differ?
2. What is stored in the mind?
3. What sounds go together?
4. How are sounds organized into syllables?
5. What are the differences between languages?
1. Why do related forms differ?
Sane/sanity, electric/electricity, atom/atomic
Phonology finds the systematic ways in which the forms differ and
explains them
2. What is stored in the mind?
Phonology studies abstract mental entities, such as structures and
processes.
This contrasts with phonetics, which deals with the actual production
and acoustics of the sounds of language.
3. What sounds go together?
Looks at what sounds/sound combinations are accepted and why.
4. How are sounds organized
into syllables?
With the use of phonological trees syllables are broken up more easily.
Syllables are made up of a rhyme and an onset (any consonants before the
rhyme).
The rhyme is made up of a nucleus (the vowel sound(s) in the syllable, the key
component of all syllables) and a coda (any consonants following the
nucleus).
5. What are the differences
between languages?
For example, different languages can use different phonemes, or
different syllable structures (what sounds can go together to make
sequences or words) and phonology identifies these differences.
2. BASIC UNITS OF PHONOLOGY
1. PHONE
2. PHONEME
3. ALLOPHONE
Phone is a sound.
It is the smallest unit of sound.
The speech sounds we hear and produce are all
phones. It is represented with phonetic symbols.
e.g. [t]
PHONEME is a speech sound that signals a difference in
meaning
e.g. dime vs dine
The two words sound alike except for one sound, yet their meanings
are different. It is the /m/ and /n/ that make the difference in meaning.
Allophones are variations of the same phoneme.
phoneme:    /p/
allophones: [p] (spit)   [pʰ] (pit)
Phonemes: The Phonological
Units of Language
• Phonemes are the basic units of sound
and are sensed in your mind rather than
spoken or heard
• Each phoneme has one or more sounds
called allophones associated with it,
which represent the actual sound being
produced in various environments
Vowel phonemes in English
• When you do these substitutions you are creating minimal
pairs, such as in this list:
• This list demonstrates that this dialect of English has fourteen
different vowel phonemes: /i ɪ e ɛ æ u ʊ o ɔ a ʌ/ and /aɪ/, /aʊ/
and /ɔɪ/. And all of these phonemes have at least two allophones;
• Consonants also have allophones:
tick [tʰɪk] stick [stɪk] hits [hɪts] bitter [bɪɾər]
• /t/ is pronounced [tʰ] before a stressed vowel
• /t/ is pronounced [t] directly before or after [s]
• /t/ is pronounced [ɾ] between a stressed and unstressed vowel
• If we pronounce tick as [tɪk] or [ɾɪk] instead of [tʰɪk], we are still speaking the
same word, even if it sounds strange, because these allophones of /t/ do not
contrast
• However, if we tried to pronounce tick as [sɪk], we would be saying sick,
which has a different meaning
• The meaning changes because /t/ and /s/ are separate phonemes and do
contrast
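The three allophone rules for /t/ above can be sketched as a small decision function. This is a toy illustration, not from the slides: the function name, the single-letter vowel inventory, and the simplified environment encoding are all assumptions.

```python
# Toy sketch of the /t/ allophone rules; the environment encoding
# (neighboring segments plus a stress flag) is a simplifying assumption.

def t_allophone(prev, nxt, next_stressed):
    """Pick the surface allophone of /t/.

    prev / nxt: neighboring segments ("" at a word edge);
    next_stressed: True if the following vowel is stressed.
    """
    vowels = set("aeiou")        # toy inventory
    if prev == "s":
        return "t"               # plain [t] directly after [s]: stick
    if nxt in vowels and next_stressed:
        return "tʰ"              # aspirated before a stressed vowel: tick
    if prev in vowels and nxt in vowels and not next_stressed:
        return "ɾ"               # flap between stressed and unstressed vowel: bitter
    return "t"

print(t_allophone("", "i", True))    # tick   -> tʰ
print(t_allophone("s", "i", True))   # stick  -> t
print(t_allophone("i", "e", False))  # bitter -> ɾ
```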
Distinctive Features of Phonemes
• For two phones, or sounds, to contrast meaning
there must be some difference between them
– For example, the phonetic feature of voicing distinguishes
[s] from [z]
• When a feature distinguishes one phoneme from
another, it is a distinctive feature or a
phonemic feature
Feature Values
• Features have two values: [+ feature] and
[-feature] to indicate the presence or
absence of that particular feature
– For example, [b] is [+voiced] and [p] is [-voiced]
• At least one feature difference must distinguish each phoneme of a
language
Phonemic Patterns May Vary
Across Languages
• The same phones may occur in two languages
but pattern differently because the phonologies
of the languages are different
• While aspiration is not distinctive in English, it is
distinctive in Thai:
Phonological rules
• Phonological rules describe how phonemes
are realized as their allophones in the given
environment
• Environment in phonology typically refers to
the neighboring phonemes
• Example: In English, the phoneme /t/ is
realized as its allophone [ɾ] between a
stressed vowel and an unstressed vowel
Formal notation
• Formally a phonological rule is written as
X -> Y / W _ Z
• The variables (X, Y, etc.) can be phonemes,
allophones, or features
e.g. /t/ -> [ɾ] / V́ _ V
e.g. [-voice] -> [+voice] / V _ V
• Some symbols have special meanings
– # means word-boundary
– $ means syllable-boundary
– + means morpheme boundary
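As an illustration of how a rule such as /t/ -> [ɾ] / V́ _ V can be applied mechanically, here is a minimal sketch. The stress-marking convention (a trailing "1" on stressed vowels, "0" on unstressed ones, roughly ARPAbet-style) and the function name are assumptions for this example.

```python
def apply_flap(segments):
    """Rewrite /t/ as [ɾ] between a stressed and an unstressed vowel.

    segments: list of phonemes; stressed vowels end in "1",
    unstressed vowels in "0" (an ARPAbet-like convention).
    """
    out = list(segments)
    for i in range(1, len(segments) - 1):
        if (segments[i] == "t"
                and segments[i - 1].endswith("1")     # stressed vowel on the left
                and segments[i + 1].endswith("0")):   # unstressed vowel on the right
            out[i] = "ɾ"
    return out

# "butter" ~ /b ʌ1 t ə0 r/  ->  [b ʌ1 ɾ ə0 r]
print(apply_flap(["b", "ah1", "t", "ax0", "r"]))
# "stick": no flapping environment, so nothing changes
print(apply_flap(["s", "t", "ih1", "k"]))
```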
Syllable Structure
• Words are composed of one or more syllables, which
are phonological units composed of one or more
phonemes
– Every syllable has a nucleus, and the nucleus may be
preceded and/or followed by one or more phonemes called the
onset and the coda
– The rime is the nucleus + the coda
Assimilation Rules
• An assimilation rule is a rule that makes
neighboring segments more similar by
duplicating a phonetic property
– For example, the English vowel nasalization
rule states that vowels become nasalized
before a nasal consonant within the same
syllable
Assimilation Rules
• Assimilation rules reflect coarticulation
– Coarticulation is the spreading of phonetic features either in
anticipation or in the perseveration of articulatory processes
• For example, it is easier to lower the velum while a vowel is being
produced before a nasal stop than to wait for the completion of the
vowel to then lower the velum even more quickly
• There are many assimilation rules in English and other
languages
– English plural and past tense morphemes
– Akan negative morphemes
Dissimilation Rules
• Languages also have dissimilation rules, in
which a segment becomes less like another
segment
– It is sometimes easier to articulate dissimilar sounds
• The Latin suffix -alis, used to form adjectives, dissimilates
to -aris when an /l/ occurs in the noun, and the
dissimilation can be seen in the words
borrowed into English
Segment Insertion and Deletion
Rules
• Phonological rules may also add or delete
entire segments
– Adding a segment is known as epenthesis
• The rules for forming plurals, possessives, and third
person singular verb agreement in English all involve an
epenthesis rule:
Insert a [ə] before the plural morpheme /z/ when a regular noun
ends in a sibilant, giving [əz]
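The epenthesis rule above, together with the voicing assimilation mentioned earlier for the plural morpheme, can be sketched as follows. The segment inventories and the function name are assumptions for illustration, not part of the slides.

```python
# Toy segment classes (assumed, incomplete inventories)
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural(phones):
    """Attach the regular English plural to a noun given as a list of segments."""
    last = phones[-1]
    if last in SIBILANTS:
        return phones + ["ə", "z"]   # epenthesis: bus -> [bʌsəz]
    if last in VOICELESS:
        return phones + ["s"]        # voicing assimilation: cat -> [kæts]
    return phones + ["z"]            # default /z/: dog -> [dɔgz]

print(plural(["b", "ʌ", "s"]))   # ['b', 'ʌ', 's', 'ə', 'z']
print(plural(["k", "æ", "t"]))   # ['k', 'æ', 't', 's']
print(plural(["d", "ɔ", "g"]))   # ['d', 'ɔ', 'g', 'z']
```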
Segment Insertion and
Deletion Rules
• Segment deletion is more common than
insertion
– The word memory is often pronounced as if it
were spelled memry
– The deletion of [g]: e.g. sign [saɪn] vs. signature [sɪɡnətʃər]
The Function of Phonological Rules
• Phonological rules provide the phonetic
information necessary for the pronunciation of
utterances
– Derivation: the way the phonological rules apply to the
underlying phonemic representation to create the
phonetic representation:
Probabilistic Approaches to
Pronunciation and Spelling
Spoken and Written Word (Lexical)
Errors
Variation vs. error
Word formation errors:
I go to Columbia Universary.
Lexical access problems:
Turn to the right (left)
“I called my mother on the television and did not understand
the door. It was too breakfast, but they came from far to near.
My mother is not too old for me to be young."
Can humans understand ‘what is meant’ as opposed to ‘what
is said/written’?
Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't
mttaer in waht oredr the ltteers in a wrod are, the olny
iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit
pclae. The rset can be a toatl mses and you can sitll raed it
wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter
by itslef but the wrod as a wlohe.
How do we do this?
Detecting and Correcting Spelling
Errors
Applications:
Spell checking in MS word
OCR scanning errors
Hand-writing recognition of zip codes, signatures, Graffiti
Issues:
Correct non-words (dg for dog, but frplc?)
Correct “wrong” words in context (their for there, words of
rule formation)
Patterns of Error
Human typists make different types of errors from OCR
systems -- why?
Error classification I: performance-based:
Insertion: catt
Deletion: ct
Substitution: car
Transposition: cta
Error classification II: cognitive
People don’t know how to spell (nucular/nuclear)
Homonymous errors (their/there)
How do we decide if a (legal) word
is an error?
How likely is a word to occur?
They met there friends in Mumbai.
The Noisy Channel Model
Source → Noisy Channel → Decoder
Input to channel: the true (typed or spoken) word w
Output from channel: an observation O
Decoding task: find ŵ = argmax_{w ∈ V} P(w|O)
Probability
Experiment: a procedure involving chance that leads to
different results
Outcome: the result of a single trial of an experiment;
Event: one or more outcomes of an experiment;
Probability: the measure of how likely an event is;
Basic Approach
Bayes Rule:   P(h|D) = P(D|h) P(h) / P(D)

P(h)   = prior probability of hypothesis h
P(D)   = prior probability of training data D
P(h|D) = probability of h given D (posterior)
P(D|h) = probability of D given h (likelihood of D given h)

The Goal of Bayesian Learning: the most probable hypothesis given the
training data (Maximum A Posteriori hypothesis h_MAP)

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.
P(cancer) = .008          P(¬cancer) = .992
P(+ | cancer) = .98       P(− | cancer) = .02
P(+ | ¬cancer) = .03      P(− | ¬cancer) = .97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
              = 0.0078 / (0.0078 + 0.0298) = 0.20745
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
              = 0.0298 / (0.0078 + 0.0298) = 0.79255
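The arithmetic is easy to check directly (using unrounded intermediate products gives ≈0.2085, slightly different from the slide's figure, which rounds the products first):

```python
# Bayes' rule on the cancer example; all probabilities are the slide's.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# P(+) by the law of total probability
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer

posterior = p_pos_given_cancer * p_cancer / p_pos   # P(cancer | +)
print(round(posterior, 4))   # 0.2085: even after a positive test, no-cancer is more likely
```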
Bayesian Inference
Population: 10 Columbia students
4 vegetarians, 3 CS majors
Probability that a randomly chosen student (rcs) is a vegetarian? p(v) = .4
That a rcs who is a vegetarian is also a CS major? p(c|v) = .5
That a rcs is a vegetarian (and) CS major? p(c,v) = .2
Returning to Spelling...
Source → Noisy Channel → Decoder
Channel input: w; output: O
Decoding: hypothesis ŵ = argmax_{w ∈ V} P(w|O)
or, by Bayes Rule,
ŵ = argmax_{w ∈ V} P(O|w) P(w) / P(O)
and, since P(O) doesn't change for any entries in our lexicon
we are going to consider, we can ignore it as constant, so
ŵ = argmax_{w ∈ V} P(O|w) P(w)
(given that w was intended, how likely are we to see O)
Applications for spelling
correction
Word processing
Phones
Web search
Spelling Tasks
Spelling Error Detection
Spelling Error Correction:
Autocorrect
hte→the
Suggest a correction
Suggestion lists
Types of spelling errors
Non-word Errors
graffe →giraffe
Real-word Errors
Typographical errors
three →there
Cognitive Errors (homophones)
piece→peace,
too → two
your →you’re
Non-word correction was historically mainly context insensitive
Real-word correction almost needs to be context sensitive
Rates of spelling errors
Depending on the application, ~1–20% error rates:
26%: Web queries (Wang et al. 2003)
13%: Retyping, no backspace (Whitelaw et al., English & German)
7%: Words corrected retyping on phone-sized organizer
2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
1–2%: Retyping (Kane and Wobbrock 2007, Grudin et al. 1983)
Non-word spelling errors
Non-word spelling error detection:
Any word not in a dictionary is an error
The larger the dictionary the better … up to a point
(The Web is full of mis-spellings, so the Web isn’t necessarily a
great dictionary …)
Non-word spelling error correction:
Generate candidates: real words that are similar to error
Choose the one which is best:
Shortest weighted edit distance
Highest noisy channel probability
How could we use the Bayes model to
correct spelling errors?
Simplifying assumptions
We only have to correct non-word errors
Each non-word (O) differs from its correct word (w) by one
step (insertion, deletion, substitution, transposition)
From O, generate a list of candidates differing by one step
and appearing in the lexicon, e.g.
Error   Corr    Corr letter   Error letter   Pos   Type
caat    cat     -             a              2     ins
caat    carat   r             -              3     del
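One way to build such a candidate list is to enumerate every string one edit away from the typo and keep those that appear in the lexicon. This generate-and-filter sketch is in the style of Norvig's well-known spelling corrector; the function name and toy lexicon are assumptions.

```python
def one_edit_candidates(word, lexicon):
    """All lexicon words reachable from `word` by one ins/del/subst/transposition."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes     = [a + b[1:]               for a, b in splits if b]
    transposes  = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:]           for a, b in splits if b for c in letters]
    inserts     = [a + c + b               for a, b in splits for c in letters]
    return {w for w in deletes + transposes + substitutes + inserts if w in lexicon}

print(sorted(one_edit_candidates("caat", {"cat", "carat", "coat", "chat", "dog"})))
# ['carat', 'cat', 'chat', 'coat']
```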
How do we decide which correction
is most likely?
We want to find the lexicon entry w that maximizes
P(typo|w) P(w)
How do we estimate the likelihood P(typo|w) and the prior
P(w)?
First, find some corpora
Different corpora needed for different purposes
Some need to be labeled -- others do not
For spelling correction, what do we need?
Word occurrence information (unlabeled)
A corpus of labeled spelling errors
Cat vs Carat
Suppose we look at the occurrence of cat and carat in a large
(50M word) AP news corpus
cat occurs 6500 times, so p(cat) = .00013
carat occurs 3000 times, so p(carat) = .00006
Now we need to find out if inserting an ‘a’ after an ‘a’ is
more likely than deleting an ‘r’ after an ‘a’ in a corrections
corpus of 50K corrections ( p(typo|word))
suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a)=.1)
and ‘r’ deletion occurs 7500 times (p(-r)=.15)
Then p(word|typo) ∝ p(typo|word) * p(word)
p(cat|caat) ∝ p(+a) * p(cat) = .1 * .00013 = .000013
p(carat|caat) ∝ p(-r) * p(carat) = .15 * .00006 = .000009
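Putting the slide's counts together, the noisy channel choice is the candidate maximizing p(typo|word) * p(word):

```python
# Priors P(w) from the 50M-word corpus counts
p_cat, p_carat = 6500 / 50_000_000, 3000 / 50_000_000
# Channel probabilities P(typo|w) from the 50K-correction corpus
p_ins_a, p_del_r = 5000 / 50_000, 7500 / 50_000

scores = {
    "cat":   p_ins_a * p_cat,     # caat arises from cat by inserting 'a'
    "carat": p_del_r * p_carat,   # caat arises from carat by deleting 'r'
}
best = max(scores, key=scores.get)
print(best)   # cat  (1.3e-05 beats 9e-06)
```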
Issues:
What if there are no instances of carat in corpus?
Smoothing algorithms
Estimate of P(typo|word) may not be accurate
Training probabilities on typo/word pairs
What if there is more than one error per word?
A More General Approach: Minimum Edit Distance
How can we measure how different one word is from
another word?
How many operations will it take to transform one word into
another?
caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
Levenshtein distance: smallest number of insertion, deletion, or
substitution operations that transform one string into another
(ins=del=subst=1)
Alternative: weight each operation by training on a corpus of
spelling errors to see which most frequent
Dynamic Programming
Dynamic Programming is an algorithm design technique for
optimization problems: often minimizing or maximizing.
Like divide and conquer, DP solves problems by combining
solutions to subproblems.
Unlike divide and conquer, subproblems are not independent.
Subproblems may share subsubproblems,
However, solution to one subproblem may not affect the solutions to other
subproblems of the same problem. (More on this later.)
DP reduces computation by
Solving subproblems in a bottom-up fashion.
Storing solution to a subproblem the first time it is solved.
Looking up the solution when subproblem is encountered again.
Key: determine structure of optimal solutions
Edit Distance
One measure of similarity between two strings is their edit
distance.
This is a measure of the number of operations required to
transform the first string into the other.
Single character operations:
Deletion of a character in the first string
Insertion of a character in the first string
Substitution of a character in the first string with a character from the second string
Match of a character in the first string with a character of the second.
Edit Distance
Example from textbook: transform vintner to writers
vintner replace v with w → wintner
wintner insert r after w → wrintner
wrintner match i → wrintner
wrintner delete n → writner
writner match t → writner
writner delete n → writer
writer match e → writer
writer match r → writer
writer insert s → writers
Edit Distance
Let Σ = {I, D, R, M} be the edit alphabet
Defn. An edit transcript of two strings is a string over Σ
describing a transformation of one string into another.
Defn. The edit distance between two strings is defined as the
minimum number of edit operations needed to transform
the first into the second. Matches are not included in the count.
Edit distance is also called Levenshtein distance.
Edit Distance
Defn. An optimal transcript is an edit transcript with the
minimal number of edit operations for transforming one
string into another.
Note: optimal transcripts may not be unique.
Defn. The edit distance problem entails computing the edit
distance between two strings along with an optimal
transcript.
Dynamic Programming
Observation 1: There are many possible ways to transform one
string into another.
Observation 2: This is like the knapsack problem
Recall: dynamic programming is used to solve knapsack-like
problems.
Defn. Let D(i,j) denote the edit distance of S1[1..i] and S2[1..j].
That is, D(i,j) is the minimum number of edit ops needed to transform
the first i characters of S1 into the first j characters of S2.
Dynamic Programming
Notice that we can solve D(i,j) for all combination of lengths of
prefixes of S1 and S2.
Examples: D(0,0),.., D(0,j), D(1,0),..,D(1,j), … D(i,j)
Like divide and conquer, dynamic programming solves a problem by combining solutions to subproblems.
The three parts to dynamic programming are:
The recurrence relation
Tabular computation
Traceback
Dynamic Programming
The recurrence relation expresses the recursive relation
between a problem and smaller instances of the problem.
For any recursive relation, the base condition(s) must be
specified.
Base conditions for D(i,j) are:
D(i,0) = i
Q: Why is this true? What does it mean in terms of edit ops?
D(0,j) = j
Q: Why is this true? What does it mean in terms of edit ops?
Dynamic Programming
The general recurrence is given by:
D(i,j) = min[ D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i,j) ]
Here t(i,j) = 1 if S1(i) ≠ S2(j), otherwise t(i,j) = 0.
Basic argument: D(i,j) must be one of :
1. D(i - 1, j) + 1
2. D(i, j - 1) + 1
3. D(i - 1, j - 1) + t (i,j)
There are NO other ways of creating S2[1..j] from S1[1..i].
Dynamic Programming
Note: In calculating D(n,m), there are only (n + 1) × (m + 1)
unique combinations of i and j.
Clearly an exponential number of computations is NOT
required.
Soln: instead of going top-down with recursion, go bottom-up.
Compute each combination only once.
Decide on a data structure to hold intermediate results.
Start from base conditions. These are the smallest D(i,j) values and are
already defined.
Compute D(i,j) for larger values of i and j.
Dynamic Programming
Q: What kind of data structure should we use for edit
distance?
1. Has to be a random access data structure.
2. Has to support the dimensionality of the problem.
D(i,j) is two-dimensional: S1 and S2.
We will use a two-dimensional array, i.e., a table.
Dynamic Programming
Example: edit distance from vintner to writer
Fill in the base condition values.
D(i,j) w r i t e r s
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
v 1 1
i 2 2
n 3 3
t 4 4
n 5 5
e 6 6
r 7 7
Dynamic Programming
Q: How do we fill in the other values?
A: use the recurrence:
D(i,j) = min[ D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i,j) ]
where t(i,j) = 1 if S1(i) ≠ S2(j), otherwise t(i,j) = 0.
We can first compute D(1,1) because we have D(0,0), D(0,1),
and D(1,0)
D(1,1) = min[ 1+1, 1+1, 0+1] = 1
Then we have all the values needed to compute in turn D(1,2),
D(1,3),..,D(1,m)
Dynamic Programming
First compute D(1,1) because we have D(0,0), D(0,1), and D(1,0)
Then compute in turn D(1,2), D(1,3),..,D(1,m)
D(i,j) w r i t e r s
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
v 1 1 1 2 3 4 5 6 7
i 2 2
n 3 3
t 4 4
n 5 5
e 6 6
r 7 7
Dynamic Programming
Fill in subsequent values, row by row, from left to right
D(i,j) w r i t e r s
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
v 1 1 1 2 3 4 5 6 7
i 2 2 2 2 2 3 4 5 6
n 3 3 3 3 3 3 4 5 6
t 4 4 4 4 4 3 4 5 6
n 5 5 5 5 5 4 4 5 6
e 6 6 6 6 6 5 4 5 6
r 7 7 7 6 7 6 5 4 5
Dynamic Programming
Alternatively, first compute D(1,1) from D(0,0), D(0,1), and D(1,0)
Then compute in turn D(2,1), D(3,1),..,D(n,1)
D(i,j) w r i t e r s
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
v 1 1 1 2 3 4 5 6 7
i 2 2 2
n 3 3 3
t 4 4 4
n 5 5 5
e 6 6 6
r 7 7 7
Dynamic Programming
Fill in subsequent values, column by column, from top to bottom
D(i,j) w r i t e r s
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
v 1 1 1 2 3 4 5 6 7
i 2 2 2 2 2 3 4 5 6
n 3 3 3 3 3 3 4 5 6
t 4 4 4 4 4 3 4 5 6
n 5 5 5 5 5 4 4 5 6
e 6 6 6 6 6 5 4 5 6
r 7 7 7 6 7 6 5 4 5
Dynamic Programming
Filling each cell entails a constant number of operations.
Cell (i,j) depends only on characters S1(i) and S2(j) and cells
(i-1, j-1), (i, j-1), and (i-1, j).
There are O(nm) cells in the table
Consequently, we can compute the edit distance D(n, m) in
O(nm) time by computing the table in O(nm).
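The whole computation above fits in a few lines. This sketch follows the recurrence and base conditions exactly (ins = del = subst = 1); only the function name is an assumption.

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the bottom-up DP table D."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):                    # base condition D(i,0) = i
        D[i][0] = i
    for j in range(m + 1):                    # base condition D(0,j) = j
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # delete from s1
                          D[i][j - 1] + 1,        # insert into s1
                          D[i - 1][j - 1] + t)    # match or substitute
    return D[n][m]

print(edit_distance("vintner", "writers"))   # 5, the table's bottom-right cell
```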
Summary
We can apply probabilistic modeling to NL problems like
spell-checking
Noisy channel model, Bayesian method
Training priors and likelihoods on a corpus
Dynamic programming approaches allow us to solve large
problems that can be decomposed into subproblems
e.g. MED algorithm
Apply similar methods to modeling pronunciation variation