
FIT3036 Final Report

Kevin Yue & Da Wei Wang

Introduction
The aim of the project is to implement an Error Corrector to accompany the Error Detector, coded
by ++++. The Error Corrector takes the errors found by the Error Detector and provides multiple
possible corrections for the spoken sentence.

Background
Before attempting the project, we had to study several topics related to the subject matter (speech).

J.A.R.V.I.S. Speech API & the Java FlacEncoder


The Just A Reliable Vocal Interpreter & Synthesizer (J.A.R.V.I.S.) Speech API is used to record
audio and communicate with Google's servers. The API depends on the Java FLAC Encoder. The
classes used in the project include the Microphone class, which records wave files. These wave
files are converted to FLAC files by the FLAC Encoder so that they can be sent to Google by a
RecognizerChunked object, which identifies the words in the sound file. The output from Google
is fed into the Error Detector as input.

Error Detector
The Error Detector is an application used to detect errors in the words identified by the ASR. It
was programmed by Fashid. The program takes as input an XML file containing the possible sentences
from the ASR output. The detector analyses each word in each sentence and checks whether there is a
possibility that it is wrong. It returns the sentences as a text file in which each word is
annotated with its tag (from the Stanford PoS tagger).

Stanford Part-of-Speech (PoS) Tagger


The tagger is used to identify the role a word plays in a sentence. The tag returned for a word
represents that role. There are 36 tags in total.
CC      Coordinating conjunction
CD      Cardinal number
DT      Determiner
EX      Existential there
FW      Foreign word
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
LS      List item marker
MD      Modal
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
TO      to
UH      Interjection
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund or present participle
VBN     Verb, past participle
VBP     Verb, non-3rd person singular present
VBZ     Verb, 3rd person singular present
WDT     Wh-determiner
WP      Wh-pronoun
WP$     Possessive wh-pronoun
WRB     Wh-adverb

Table 1: List of tags from the Penn Treebank Project (see References & Links).

CMU Pronunciation Dictionary


The dictionary is distributed by Carnegie Mellon University. The project uses three files:
cmudict.0.7a, cmudict.0.7a.phones and cmudict.0.7a.symbols. The file cmudict.0.7a
is used as a reference describing how each word is pronounced.
The pronunciation of a word is represented as a sequence of phonemes, the atomic units of speech,
which are listed in cmudict.0.7a.symbols. These phonemes are categorized into groups, known as
Broad Sound Groups (BSGs), which are described in the cmudict.0.7a.phones file. There are 39
phonemes, separated into 8 groups: vowels, stops, affricates, fricatives, aspirates, liquids,
nasals and semivowels.
CMU Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict#phones
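Looking a word up in the dictionary can be sketched as follows. This is a minimal Python sketch (the project itself is written in Java); the two-space separator reflects the cmudict.0.7a plain-text format, and the lexical stress digits (0, 1, 2) that cmudict appends to vowels are stripped before a phoneme's BSG is looked up.

```python
def parse_cmudict_line(line):
    """Split one cmudict entry, e.g. 'HELLO  HH AH0 L OW1',
    into the word and its phoneme sequence."""
    word, _, phones = line.partition("  ")
    return word, phones.split()

def base_phoneme(phoneme):
    """Drop the lexical stress digit so 'AH0' maps to the symbol 'AH'."""
    return phoneme.rstrip("012")

word, phonemes = parse_cmudict_line("HELLO  HH AH0 L OW1")
bases = [base_phoneme(p) for p in phonemes]  # ['HH', 'AH', 'L', 'OW']
```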

Phonemes and Broad sound groups


Phonemes are used to represent the pronunciation of English words in writing. The phoneme
transcription used by the CMU dictionary is based on the Arpabet transcription code developed in
the 1970s. In Arpabet, one or two letters represent each phoneme, and the phonemes are divided into
8 groups known as broad sound groups: vowels, which are pronounced with an open vocal tract; stops,
where all airflow ceases; affricates, which start as a stop and end as a fricative; fricatives,
where airflow is forced through a narrow channel; aspirates, produced with a burst of breath;
nasals, where air is allowed to escape freely through the nose; semivowels, which sound similar to
vowels but are used differently; and liquids, where the airstream travels along the sides of the
tongue but is blocked from the centre of the mouth.

Edit Distance
Edit distance is used to compare two strings to see how similar they are. The algorithm transforms
the first string into the second by inserting, deleting and replacing elements, and while doing so
computes the cheapest sequence of operations to get from string 1 to string 2. Each insertion,
deletion and replacement has an associated cost. After each action the cost is added to the total
cost; the lower the total cost, the more similar the strings. The minimum edit distance is zero,
where the two strings are the same, and the maximum (with unit costs) is the length of the longer
string, where every element has to be replaced or inserted.
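The description above corresponds to the standard dynamic-programming formulation of Levenshtein distance. A minimal Python sketch with unit costs (the project applies this to phoneme sequences rather than characters):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences with unit costs.
    dp[i][j] = cheapest cost of transforming a[:i] into b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(a)][len(b)]
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion).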

WEKA's Simple K Means Clustering


The WEKA Java API was used to perform k-means clustering to organise the words. K-means clustering
is a form of unsupervised learning used to group elements into clusters. Each element consists of
the item (in our case, the word) and a vector of attributes (the number of phonemes in each BSG and
the length of the phoneme sequence).
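The project uses WEKA's SimpleKMeans class; the following is a generic Python sketch of the same idea, not WEKA's API. Each point stands for a word's attribute vector (e.g. counts per BSG plus phoneme length); the choice of k, the fixed seed and the iteration count are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # recompute centroids; keep the old one if a cluster is empty
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# two obvious groups of 2-D attribute vectors
centroids, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```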

The User Interface

Fig. 1: The GUI


The final user interface resembles the general layout described in the Project Description. The
text fields for the speaker name, speaker ID and object ID, the spinner for the number of
alternatives and the Record button are arranged on the top left side. The program uses the speaker
name as the folder name, while the speaker ID and the object ID are used to name all files related
to this particular recording. The alternatives spinner sets the number of sentences for the ASR
output and restricts it to a non-negative amount.

Fig 2: The top left portion of the GUI


Fig. 3: Record window brought up by the Record wave button
The Record wave button brings up the Record window. This window has a text box and three buttons:
Record, Cancel and OK. The text box shows whether or not the program is currently recording the
user. Clicking OK creates a wave file and sends it to the ASR (Automatic Speech Recogniser), while
Cancel exits the window without a file being created.

Fig. 4: The top right part of the GUI

On the top right side are the text field for the file input, the text input for the actual
sentence, the Do Not Save checkbox, the Replay audio button and the Detect & Correct Error button.
The Open file button brings up the file chooser, allowing the user to navigate easily to the
desired text or wave file. The checkbox ensures that files are temporary and will be deleted after
the program ends. The Replay button makes it easy to listen to the wave file, if it is available.

Fig. 5: ASR, Error Detector and Error Corrector output displaying text.
Finally, at the bottom of the window, the ASR output, the Error Detector output and the Error
Corrector output are printed in three text boxes. The output from the Error Corrector highlights
the suggested words in red, with the probabilities in green and surrounded by brackets.

Error Correction Approach


An Example
The user fills in the Speaker name, Speaker ID and Object ID fields and brings up the record
window. The moment the user clicks Record, the Microphone Analyser object starts recording, and it
stops when Stop is pressed. When OK is pressed, a wave file is created and converted into a FLAC
file before being sent to the ASR. The program waits for a response before it prints the output on
screen.
The user proceeds to the Detect & Correct Errors button. The output from the ASR is converted
into an XML file and given to the Error Detector as input. The Error Detector itself is executed as
a Process object. The program waits until the Error Detector has terminated, and the newly created
ED text file is printed to the text pane.

Finally, the Error Detector output is put through the Error Corrector. Each word marked as wrong is
given a set of suggested replacements, as described in the following sections.

Clustering
The clusters are created programmatically using WEKA's API. The kitchen corpus file is loaded, and
each word is converted into its attribute vector before being clustered.

Finding Clusters
Edit Distance
For each alternative sentence returned from the Google ASR, and each word in it marked wrong by the
Error Detector, a number of replacement options is selected from the words in the closest cluster.
To compute the probability of replacement for each option, we first get the phonemes of each
replacement word from the Pronunciation Dictionary. Since phonemes represent how a word is
pronounced, computing the edit distance between the phonemes of the wrong word and those of each
possible replacement gives an idea of how similar the two words are in pronunciation.

When implementing the edit distance, we followed the matrix-based pseudocode of the algorithm found
on Wikipedia. Once the edit distance is known, we calculate the probability of replacement. Since
the maximum edit distance is the number of phonemes in the longer word, dividing the distance by
that number gives the proportion of replacement required. Because a lower proportion means a higher
probability that the replacement word is correct, we calculate the probability of replacement as
1 minus the proportion of replacement required.

When computing the edit distance we take BSGs into account: substitutions within the same group
have a cheaper cost, since those phonemes are pronounced in a similar way, and substitutions between
vowels and semivowels have a reduced cost for the same reason. Insertions and deletions of vowels
have a slightly higher cost, because a vowel is harder to miss: for example, transforming "marc"
into "marco" has an insertion cost of only 1, yet since a vowel is inserted the two words sound
completely different.

We only replace words when we believe the original word is wrong, so next we compute the probability
of replacement given that the original word is wrong. The information provided to us suggests that
the Error Detector has an accuracy of 0.92, so we multiply the probability given by the Error
Detector by 0.92, which gives the probability that the original word is wrong given that the Error
Detector was right. Finally, we multiply the probability of replacement by this value.

The final result is inserted into a sorted list based on the probability of replacement given that
the original word is wrong. If the probability of replacement is less than the probability of
keeping the wrong word, which is given by 1 − (probability given by the Error Detector × 0.92),
then the wrong word is kept.
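The probability computation above can be sketched as follows. This is a hedged Python sketch: the BSG table, the specific cost values (0.5 for a same-group substitution, unit insert/delete) and the function names are illustrative assumptions, not the project's exact constants, and the vowel insertion/deletion surcharge is omitted for brevity.

```python
def weighted_edit_distance(a, b, bsg):
    """Edit distance over phoneme sequences; substitutions within the
    same broad sound group are cheaper than cross-group substitutions."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif bsg.get(a[i - 1]) == bsg.get(b[j - 1]):
                sub = 0.5  # same-group substitution: cheaper
            else:
                sub = 1.0
            dp[i][j] = min(dp[i - 1][j] + 1.0,        # deletion
                           dp[i][j - 1] + 1.0,        # insertion
                           dp[i - 1][j - 1] + sub)    # substitution
    return dp[m][n]

def replacement_probability(wrong, candidate, bsg):
    """1 - (distance / phoneme count of the longer word)."""
    dist = weighted_edit_distance(wrong, candidate, bsg)
    return 1.0 - dist / max(len(wrong), len(candidate))

def combined_probability(repl_prob, detector_prob, accuracy=0.92):
    """Scale the replacement probability by the probability that the
    original word is wrong (detector output times 0.92 accuracy)."""
    return repl_prob * (detector_prob * accuracy)
```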

Output Options
For each alternative given by the Google ASR, we take the 3 replacement words with the highest
probability of replacement and swap them into the sentence. Hence the number of results given by
the Error Corrector is 3 × the number of alternatives. The probability of each sentence is then
calculated by averaging the probability of replacement over all wrong words in the alternative.
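A sketch of the swap-in step, assuming each wrong word carries a list of (replacement, probability) pairs; the function name and argument shapes are illustrative:

```python
def expand_alternatives(sentence, wrong_index, replacements, top_n=3):
    """Swap each of the top-N most probable replacements into the
    sentence at the position of the wrong word."""
    best = sorted(replacements, key=lambda r: r[1], reverse=True)[:top_n]
    results = []
    for word, prob in best:
        candidate = list(sentence)
        candidate[wrong_index] = word
        results.append((" ".join(candidate), prob))
    return results
```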

Merging
If multiple identical sentences have different probabilities, the average of their probabilities is
computed and a single sentence with that average is kept. For example, if two identical sentences
have probabilities of 0.90 and 0.95, the two are merged and the probability 0.925 is used. In the
testing examples this method seems to work well, since identical sentences generally have very
similar probabilities; however, a more accurate way of merging would need to consider the different
weight of each word.
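The merging step can be sketched directly from this description (the function name is illustrative):

```python
from collections import defaultdict

def merge_duplicates(candidates):
    """Group identical sentences and average their probabilities,
    keeping one entry per distinct sentence."""
    groups = defaultdict(list)
    for sentence, prob in candidates:
        groups[sentence].append(prob)
    return [(s, sum(ps) / len(ps)) for s, ps in groups.items()]
```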

Ranking
After all outputs are merged, the results are inserted into a sorted list based on the probability
of each sentence. To measure the effectiveness of the program, each result is run through the Error
Detector and the output is compared with the Error Detector's output for the original ASR result,
to see if there is any improvement. All the outputs are also compared with the original text input
using a word-level edit distance, and we check whether the top result matches the original text
input. During the edit-distance comparison, sentences with no word marked wrong are excluded.

References & Links


University of Pennsylvania. (2003). Alphabetical list of part-of-speech tags used in the Penn
Treebank Project. Accessed: 23/10/2014. Available:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Levenshtein distance. Available: http://en.wikipedia.org/wiki/Levenshtein_distance
J.A.R.V.I.S. API. Available: https://github.com/The-Shadow/java-speech-api
JavaFlacEncoder. Available: http://sourceforge.net/projects/javaflacencoder/files/
