FIT3036 Final Report
Kevin Yue & Da Wei Wang
Introduction
The aim of this project is to implement an Error Corrector to accompany the Error Detector coded by ++++. The Error Corrector uses the errors found by the Error Detector to provide multiple candidate versions of the spoken sentence.
Background
Before attempting the project, we had to familiarise ourselves with several topics related to the subject matter (speech).
J.A.R.V.I.S. Speech API & the Java FlacEncoder
The Just A Reliable Vocal Interpreter & Synthesizer (J.A.R.V.I.S.) Speech API is used to record audio and communicate with Google's servers. The API depends on the Java FLAC Encoder. The classes used in the project include the Microphone class, which records wave files. These wave files are converted to FLAC files by the FLAC Encoder so that they can be sent to Google by a RecognizerChunked object, which identifies the words in the sound file. The output from Google is fed into the Error Detector as input.
Error Detector
The Error Detector is an application used to detect errors in vocally identified words. It is programmed by Fashid. The program takes as input an XML file containing the possible sentences from the ASR output. The detector analyses each word in each sentence and checks whether there is a possibility that it is wrong. It returns the sentences as a text file, with each word annotated with its tag (from the Stanford PoS tagger).
Stanford Part-of-Speech (PoS) Tagger
The tagger is used to identify what role a word has in a sentence. The tag returned for a word represents the role the word plays in the sentence. There are 36 tags in total.
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
Table 1: List of tags from the Penn Treebank Project [University of Pennsylvania, 2003].
CMU Pronunciation Dictionary
The dictionary is distributed by Carnegie Mellon University. The project uses three files: cmudict.0.7a, cmudict.0.7a.phones and cmudict.0.7a.symbols. The file cmudict.0.7a is used as a reference describing how a word is pronounced.
The pronunciation of a word is represented as a sequence of phonemes, the atomic units of speech, which are listed in cmudict.0.7a.symbols. These phonemes are categorised into groups, known as Broad Sound Groups (BSGs), which are described in the cmudict.0.7a.phones file. There are 39 phonemes separated into 8 groups: vowels, stops, affricates, fricatives, aspirates, liquids, nasals and semivowels.
CMU Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict#phones
Phonemes and Broad sound groups
Phonemes are used to visually represent the pronunciation of English words. The phoneme transcription used by the CMU dictionary is based on the Arpabet transcription code developed in the 1970s. In Arpabet, one or two letters represent each phoneme, and the phonemes are divided into 8 groups known as broad sound groups: vowels, which are pronounced with an open vocal tract; stops, where all airflow ceases; affricates, which start as a stop and end as a fricative; fricatives, where airflow is forced through a narrow channel; nasals, where air is allowed to escape freely through the nose; semivowels, which sound similar to vowels but are used differently; liquids, where the airstream travels along the sides of the tongue but is blocked from the centre of the mouth; and the single aspirate, HH.
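To make the dictionary format concrete, the sketch below parses a single cmudict-style entry into its phoneme sequence, stripping the stress digits carried by vowels so the result matches the symbols listed in cmudict.0.7a.symbols. The class and method names are our own illustration, not part of the project code:

```java
import java.util.Arrays;
import java.util.List;

public class CmudictLineDemo {

    // Parses one dictionary line of the form "WORD  PH1 PH2 ..." and
    // returns the phoneme sequence with stress digits (0-2) removed.
    public static List<String> phonemes(String line) {
        String[] tokens = line.trim().split("\\s+");
        String[] phones = new String[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            phones[i - 1] = tokens[i].replaceAll("[0-9]", "");
        }
        return Arrays.asList(phones);
    }

    public static void main(String[] args) {
        // "HELLO  HH AH0 L OW1" -> [HH, AH, L, OW]
        System.out.println(phonemes("HELLO  HH AH0 L OW1"));
    }
}
```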
Edit Distance
Edit distance is used to compare two strings to see how similar they are. The algorithm takes two strings and transforms the first into the second by inserting, deleting and replacing elements, computing the cheapest sequence of operations to get from string 1 to string 2. Each insertion, deletion and replacement has an associated cost. After each action the cost is added to the total cost; the lower the total cost, the more similar the strings. The minimum edit distance is zero, where the two strings are the same, and the maximum is the length of the longer string, where every element has to be replaced or inserted.
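The algorithm can be sketched in Java as follows, using the standard matrix formulation with unit costs over phoneme sequences (the class name is illustrative; our actual implementation weights the costs by BSG, as described later):

```java
public class EditDistanceDemo {

    // Matrix-based edit distance over phoneme arrays with unit costs
    // for insertion, deletion and replacement.
    public static int distance(String[] a, String[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= b.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int sub = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,      // replace/match
                          Math.min(d[i - 1][j] + 1,            // delete
                                   d[i][j - 1] + 1));          // insert
            }
        }
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // "cat" (K AE T) vs "cut" (K AH T): one substitution, distance 1
        System.out.println(distance("K AE T".split(" "), "K AH T".split(" ")));
    }
}
```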
WEKA's Simple K Means Clustering
The WEKA Java API is used to perform K-Means clustering to organise the words. K-Means clustering is a form of unsupervised learning and is used to group elements into clusters. Each element consists of the item (in our case, the word) and a vector of attributes (the number of BSGs and the length of the phoneme sequence).
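The project performs the clustering itself with WEKA's SimpleKMeans; as a dependency-free sketch of the underlying idea, the snippet below assigns a word's two-attribute vector (number of BSGs, phoneme length) to the nearest of a set of centroids by Euclidean distance. The centroid values are made up for illustration:

```java
public class NearestClusterDemo {

    // Returns the index of the centroid closest to the point,
    // using squared Euclidean distance (no square root needed
    // when only comparing distances).
    public static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int k = 0; k < point.length; k++) {
                double diff = point[k] - centroids[c][k];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = { { 2, 3 }, { 5, 8 } }; // hypothetical cluster centres
        double[] word = { 2, 4 };                      // a word with 2 BSGs, 4 phonemes
        System.out.println(nearest(word, centroids));  // closest to centroid 0
    }
}
```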
The User Interface
Fig. 1: The GUI
The final user interface resembles the general layout described in the Project Description. The text fields for the speaker name, speaker ID and object ID, the spinner for the number of alternatives and the Record button are arranged on the top left side. The program uses the speaker name as the folder name, while the speaker ID and the object ID are used to name all files related to this particular recording. The alternatives spinner sets the number of sentences for the ASR output and restricts it to a non-negative amount.
Fig 2: The top left portion of the GUI
Fig. 3: Record window brought up by the Record wave button
The Record wave button brings up the Record window. This window has a text box and three buttons: Record, Cancel and OK. The text box shows whether or not the program is currently recording the user. Clicking OK will create a wave file and send it to the ASR (Automatic Speech Recogniser), while Cancel will exit the window without a file being created.
Fig. 4: The top right part of the GUI
On the top right side are the text field for the file input, the text input for the actual sentence, the Do Not Save checkbox, the Replay audio button and the Detect & Correct Error button. The Open file button brings up the file chooser, allowing the user to easily navigate to the desired text or wave file. The checkbox ensures that files are temporary and will be deleted after the program ends. The Replay button makes it easy to listen to the wave file, if it is available.
Fig. 5: ASR, Error Detection and Error Detection Output displaying text.
Finally, at the bottom of the window, the ASR output, the Error Detector output and the Error Corrector output are printed in three text boxes. The output from the Error Corrector highlights the suggested words in red, with the probabilities in green and surrounded by brackets.
Error Correction Approach
An Example
The user fills in the Speaker name, Speaker ID and Object ID fields and brings up the record window. The moment the user clicks Record, the Microphone Analyser object starts recording, stopping when Stop is pressed. When OK is pressed, a wave file is created and converted into a FLAC file before being sent to the ASR. The program waits for a response before it prints the output on screen.
The user then clicks the Detect & Correct Errors button. The output from the ASR is converted into an XML file and given to the Error Detector as input. The Error Detector itself is executed as a Process object, and the program waits until the Error Detector has terminated. The newly created ED text file is then printed to the text pane.
Finally, the Error Detector output is put through the Error Corrector. Each word marked as wrong is passed through the correction process described below.
Clustering
The clusters are created programmatically using WEKA's API. The kitchen corpus file is loaded, and each word in it is converted into an instance with the two attributes described above (the number of BSGs and the length of its phoneme sequence), which SimpleKMeans groups into clusters.
Finding Clusters
For each word marked wrong by the Error Detector, the cluster closest to that word's attribute vector is located, and its members become the candidate replacements.
Edit Distance
For each alternative sentence returned from the Google ASR, and for each word marked wrong by the Error Detector, a number of replacement options is selected from the words in the closest cluster. To compute the probability of replacement for each word, we first get the phonemes of each replacement word from the pronunciation dictionary. Since phonemes visually represent how a word is pronounced, computing the edit distance between the phonemes of the wrong word and those of each possible replacement gives an idea of how similar the two words are in pronunciation.
When implementing the edit distance, we followed the matrix-based pseudocode of the algorithm found on Wikipedia. Once the edit distance is known, we calculate the probability of replacement. Since the maximum edit distance is the number of phonemes in the longer word, dividing the distance by the number of phonemes in the longer word gives the percentage of replacement required. Because a lower requirement means a higher probability that the replacement word is correct, the probability of replacement is calculated as 1 minus the percentage of replacement required.
When computing the edit distance we take BSGs into account: replacements within the same group have a cheaper cost, since such phonemes are pronounced in a similar way, and replacements between vowels and semivowels have a reduced cost for the same reason. Insertions and deletions of vowels have a slightly higher cost, because a vowel is harder to miss: for example, the words marc and marco differ by an insertion cost of 1, yet because the inserted phoneme is a vowel the two words sound completely different.
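A sketch of how these BSG-sensitive costs can be folded into the matrix algorithm is shown below. The specific cost values (0.5 for a same-group substitution, 1.5 for a vowel insertion or deletion) and the small hard-coded BSG table are illustrative only, not the values used in our implementation:

```java
import java.util.Map;

public class WeightedEditDistanceDemo {

    // Hypothetical BSG lookup covering only the phonemes used below;
    // the real mapping comes from cmudict.0.7a.phones.
    static final Map<String, String> BSG = Map.of(
            "M", "nasal", "AA", "vowel", "R", "liquid",
            "K", "stop", "OW", "vowel", "AH", "vowel");

    // Substitution is cheaper within a BSG, or between vowel and semivowel.
    static double subCost(String a, String b) {
        if (a.equals(b)) return 0;
        String ga = BSG.get(a), gb = BSG.get(b);
        if (ga.equals(gb)) return 0.5;
        if ((ga.equals("vowel") && gb.equals("semivowel"))
                || (ga.equals("semivowel") && gb.equals("vowel"))) return 0.5;
        return 1;
    }

    // Inserting or deleting a vowel costs more, since it changes the
    // sound of the word more noticeably.
    static double indelCost(String p) {
        return BSG.get(p).equals("vowel") ? 1.5 : 1;
    }

    public static double distance(String[] a, String[] b) {
        double[][] d = new double[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) d[i][0] = d[i - 1][0] + indelCost(a[i - 1]);
        for (int j = 1; j <= b.length; j++) d[0][j] = d[0][j - 1] + indelCost(b[j - 1]);
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                d[i][j] = Math.min(d[i - 1][j - 1] + subCost(a[i - 1], b[j - 1]),
                          Math.min(d[i - 1][j] + indelCost(a[i - 1]),
                                   d[i][j - 1] + indelCost(b[j - 1])));
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // "marc" (M AA R K) vs "marco" (M AA R K OW): the extra vowel costs 1.5
        System.out.println(distance("M AA R K".split(" "), "M AA R K OW".split(" ")));
    }
}
```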
Since we only replace words when we are confident that the original word is wrong, we next compute the probability of replacement given that the original word is wrong. The information provided to us suggests that the Error Detector has an accuracy of 0.92, so we multiply the probability given by the Error Detector by 0.92, giving the probability that the original word is wrong given that the Error Detector was right. Finally, we multiply the probability of replacement by the probability that the original word is wrong given that the Error Detector was right.
The result is inserted into a sorted list based on the probability of replacement given that the original word is wrong. If the probability of replacement is less than the probability of keeping the wrong word, which is given by 1 - (probability given by the Error Detector * 0.92), then the wrong word is kept.
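The combination of probabilities described above can be sketched as follows. The 0.92 accuracy figure is the one given to us; the Error Detector confidence of 0.8 and the edit distance values are made up for the example:

```java
public class ReplacementProbabilityDemo {

    static final double DETECTOR_ACCURACY = 0.92; // accuracy figure given to us

    // p(replacement) = 1 - editDistance / phonemes in the longer word
    static double pReplacement(double editDistance, int longerLength) {
        return 1.0 - editDistance / longerLength;
    }

    public static void main(String[] args) {
        double pDetector = 0.8;                        // hypothetical detector confidence
        double pWrong = pDetector * DETECTOR_ACCURACY; // original word is wrong
        double pKeep = 1.0 - pWrong;                   // probability of keeping the word
        double score = pReplacement(1.0, 4) * pWrong;  // final ranking score
        System.out.println(score + " vs keep " + pKeep);
        // The replacement is used only if score exceeds pKeep.
    }
}
```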
Output Options
For each alternative given by the Google ASR we take the 3 replacement words with the highest probability of replacement and swap them into the sentence. Hence the number of results given by the Error Corrector is 3 times the number of alternatives. The probability of each sentence is then calculated by averaging the probabilities of replacement of all the wrong words in the alternative.
Merging
If multiple output sentences are identical, they are merged into one sentence whose probability is the average of their sentence probabilities. For example, if two identical sentences have probabilities of 0.90 and 0.95, they are merged and the probability 0.925 is used. In our testing this method seems to work well, since identical sentences generally have very similar probabilities; a more accurate way of merging, however, would need to consider the different weight of each word.
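The merging step can be sketched as follows, mirroring the 0.90/0.95 example above (the sentence text is made up, and the class is our own illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeDemo {

    // Groups identical sentences together and replaces each group's
    // probabilities with their average, as described above.
    public static Map<String, Double> merge(List<String> sentences, List<Double> probs) {
        Map<String, List<Double>> grouped = new LinkedHashMap<>();
        for (int i = 0; i < sentences.size(); i++)
            grouped.computeIfAbsent(sentences.get(i), k -> new ArrayList<>()).add(probs.get(i));
        Map<String, Double> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet())
            merged.put(e.getKey(),
                    e.getValue().stream().mapToDouble(Double::doubleValue).average().orElse(0));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> out = merge(
                Arrays.asList("put the cup down", "put the cup down"),
                Arrays.asList(0.90, 0.95));
        System.out.println(out); // the two duplicates collapse into one entry
    }
}
```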
Ranking
After all outputs are merged, the results are inserted into a sorted list based on the probability of each sentence. To measure the effectiveness of the program, each result is run through the Error Detector and its output is compared with the Error Detector output from the ASR to see if there is any improvement. All of the outputs are also compared with the original text input using a word-level edit distance, and we check whether the top result matches the original text input. Sentences with no word marked wrong are excluded from the edit distance comparison.
References & Links
University of Pennsylvania. (2003). Alphabetical list of part-of-speech tags used in the Penn
Treebank Project. Accessed: 23/10/2014. Available:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
J.A.R.V.I.S. API: https://github.com/The-Shadow/java-speech-api
JavaFlacEncoder: http://sourceforge.net/projects/javaflacencoder/files/