FIT3036 Final Report
Kevin Yue & Da Wei Wang
Introduction
The aim of this project is to implement an Error Corrector to accompany the Error Detector coded by ++++. The Error Corrector uses the errors found by the Error Detector to provide multiple candidate versions of the spoken sentence.
Background
Before attempting the project, we had to familiarise ourselves with several topics related to the subject matter (speech).
J.A.R.V.I.S. Speech API & the Java FlacEncoder
The Just A Reliable Vocal Interpreter & Synthesizer (J.A.R.V.I.S.) Speech API is used to record audio and communicate with Google's servers. The API depends on the Java FLAC Encoder. The classes used in the project include the Microphone class, which records wave files. These wave files are converted to FLAC files by the FLAC Encoder so that they can be sent to Google by a RecognizerChunked object, which identifies the words in the sound file. The output from Google is fed into the Error Detector as input.
Error Detector
The Error Detector is an application used to detect errors in vocally identified words. It is programmed by Fashid. The program takes as input an XML file containing the possible sentences from the ASR output. The detector analyses each word in each sentence and checks whether there is a possibility that it is wrong. It returns the sentences as a text file, with each word annotated with its tag (from the Stanford PoS tagger).
Stanford Part-of-Speech (PoS) Tagger
The tagger is used to identify what role a word has in a sentence. The tag returned for a word represents the role the word plays in the sentence. There are 36 tags in total.
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
Table 1: List of tags from the Penn Treebank Project [University of Pennsylvania, 2003].
CMU Pronunciation Dictionary
The dictionary is distributed by Carnegie Mellon University. The project uses three files: cmudict.0.7a, cmudict.0.7a.phones and cmudict.0.7a.symbols. The file cmudict.0.7a is used as a reference describing how a word is pronounced.
The pronunciation of a word is represented as a sequence of phonemes, the atomic units of speech, which are listed in cmudict.0.7a.symbols. These phonemes are categorised into groups, known as Broad Sound Groups (BSGs), which are described in the cmudict.0.7a.phones file. There are 39 phonemes separated into 8 groups: vowels, stops, affricates, fricatives, aspirates, liquids, nasals and semivowels.
CMU Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict#phones
Phonemes and Broad sound groups
Phonemes are used to visually represent the pronunciation of English words. The phoneme transcription used by the CMU dictionary is based on the Arpabet transcription code developed in the 1970s. In Arpabet, one or two letters represent each phoneme, and the phonemes are divided into 8 groups known as broad sound groups: vowels, which are pronounced with an open vocal tract; stops, where all airflow ceases; affricates, which start as a stop and end as a fricative; fricatives, where airflow is forced through a narrow channel; nasals, where air is allowed to escape freely through the nose; semivowels, which sound similar to vowels but are used differently; liquids, where the airstream travels along the sides of the tongue but is blocked from the centre of the mouth; and the single aspirate, HH.
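To make the dictionary format concrete, the sketch below parses a single cmudict-style entry into its phoneme sequence, stripping the stress digits carried by vowels so the result matches the symbols listed in cmudict.0.7a.symbols. The class and method names are our own illustration, not part of the project code:

```java
import java.util.Arrays;
import java.util.List;

public class CmudictLineDemo {

    // Parses one dictionary line of the form "WORD  PH1 PH2 ..." and
    // returns the phoneme sequence with stress digits (0-2) removed.
    public static List<String> phonemes(String line) {
        String[] tokens = line.trim().split("\\s+");
        String[] phones = new String[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            phones[i - 1] = tokens[i].replaceAll("[0-9]", "");
        }
        return Arrays.asList(phones);
    }

    public static void main(String[] args) {
        // "HELLO  HH AH0 L OW1" -> [HH, AH, L, OW]
        System.out.println(phonemes("HELLO  HH AH0 L OW1"));
    }
}
```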
Edit Distance
Edit distance is used to compare two strings to see how similar they are. The algorithm takes two strings and transforms the first into the second by inserting, deleting and replacing elements, computing the cheapest sequence of operations to get from string 1 to string 2. Each insertion, deletion and replacement has an associated cost. After each action the cost is added to the total cost; the lower the total cost, the more similar the strings. The minimum edit distance is zero, where the two strings are the same, and the maximum is the length of the longer string, where every element has to be replaced or inserted.
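The algorithm can be sketched in Java as follows, using the standard matrix formulation with unit costs over phoneme sequences (the class name is illustrative; our actual implementation weights the costs by BSG, as described later):

```java
public class EditDistanceDemo {

    // Matrix-based edit distance over phoneme arrays with unit costs
    // for insertion, deletion and replacement.
    public static int distance(String[] a, String[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= b.length; j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int sub = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,      // replace/match
                          Math.min(d[i - 1][j] + 1,            // delete
                                   d[i][j - 1] + 1));          // insert
            }
        }
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // "cat" (K AE T) vs "cut" (K AH T): one substitution, distance 1
        System.out.println(distance("K AE T".split(" "), "K AH T".split(" ")));
    }
}
```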
WEKA's Simple K Means Clustering
The WEKA Java API is used to perform K-Means clustering to organise the words. K-Means clustering is a form of unsupervised learning and is used to group elements into clusters. Each element consists of the item (in our case, the word) and a vector of attributes (the number of BSGs and the length of the phoneme sequence).
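The project performs the clustering itself with WEKA's SimpleKMeans; as a dependency-free sketch of the underlying idea, the snippet below assigns a word's two-attribute vector (number of BSGs, phoneme length) to the nearest of a set of centroids by Euclidean distance. The centroid values are made up for illustration:

```java
public class NearestClusterDemo {

    // Returns the index of the centroid closest to the point,
    // using squared Euclidean distance (no square root needed
    // when only comparing distances).
    public static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int k = 0; k < point.length; k++) {
                double diff = point[k] - centroids[c][k];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = { { 2, 3 }, { 5, 8 } }; // hypothetical cluster centres
        double[] word = { 2, 4 };                      // a word with 2 BSGs, 4 phonemes
        System.out.println(nearest(word, centroids));  // closest to centroid 0
    }
}
```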
The User Interface
Fig. 1: The GUI
The final user interface resembles the general layout described in the Project Description. The text fields for the speaker name, speaker ID and object ID, the spinner for the number of alternatives and the Record button are arranged on the top left side. The program uses the speaker name as the folder name, while the speaker ID and the object ID are used to name all files related to this particular recording. The alternatives spinner sets the number of sentences for the ASR output and restricts it to a non-negative amount.
Fig 2: The top left portion of the GUI
Fig. 3: Record window brought up by the Record wave button
The Record wave button brings up the Record window. This window has a text box and three buttons: Record, Cancel and OK. The text box shows whether or not the program is currently recording the user. Clicking OK will create a wave file and send it to the ASR (Automatic Speech Recogniser), while Cancel will exit the window without a file being created.
Fig. 4: The top right part of the GUI
On the top right side are the text field for the file input, the text input for the actual sentence, the Do Not Save checkbox, the Replay audio button and the Detect & Correct Error button. The Open file button brings up the file chooser, allowing the user to easily navigate to the desired text or wave file. The checkbox ensures that files are temporary and will be deleted after the program ends. The Replay button makes it easy to listen to the wave file, if it is available.
Fig. 5: ASR, Error Detection and Error Detection Output displaying text.
Finally, at the bottom of the window, the ASR output, the Error Detector output and the Error Corrector output are printed in three text boxes. The output from the Error Corrector highlights the suggested words in red, with the probabilities in green and surrounded by brackets.
Error Correction Approach
An Example
The user fills in the Speaker name, Speaker ID and Object ID fields and brings up the record window. The moment the user clicks Record, the Microphone Analyser object starts recording, stopping when Stop is pressed. When OK is pressed, a wave file is created and converted into a FLAC file before being sent to the ASR. The program waits for a response before it prints the output on screen.
The user then clicks the Detect & Correct Errors button. The output from the ASR is converted into an XML file and given to the Error Detector as input. The Error Detector itself is executed as a Process object, and the program waits until the Error Detector has terminated. The newly created ED text file is then printed to the text pane.
Finally, the Error Detector output is put through the Error Corrector. Each word marked as wrong is passed through the correction process described below.
Clustering
The clusters are created programmatically using WEKA's API. The kitchen corpus file is loaded, and each word in it is converted into an instance with the two attributes described above (the number of BSGs and the length of its phoneme sequence), which SimpleKMeans groups into clusters.
Finding Clusters
For each word marked wrong by the Error Detector, the cluster closest to that word's attribute vector is located, and its members become the candidate replacements.
Edit Distance
For each alternative sentence returned from the Google ASR, and for each word marked wrong by the Error Detector, a number of replacement options is selected from the words in the closest cluster. To compute the probability of replacement for each word, we first get the phonemes of each replacement word from the pronunciation dictionary. Since phonemes visually represent how a word is pronounced, computing the edit distance between the phonemes of the wrong word and those of each possible replacement gives an idea of how similar the two words are in pronunciation.
When implementing the edit distance, we followed the matrix-based pseudocode of the algorithm found on Wikipedia. Once the edit distance is known, we calculate the probability of replacement. Since the maximum edit distance is the number of phonemes in the longer word, dividing the distance by the number of phonemes in the longer word gives the percentage of replacement required. Because a lower requirement means a higher probability that the replacement word is correct, the probability of replacement is calculated as 1 minus the percentage of replacement required.
When computing the edit distance we take BSGs into account: replacements within the same group have a cheaper cost, since such phonemes are pronounced in a similar way, and replacements between vowels and semivowels have a reduced cost for the same reason. Insertions and deletions of vowels have a slightly higher cost, because a vowel is harder to miss: for example, the words marc and marco differ by an insertion cost of 1, yet because the inserted phoneme is a vowel the two words sound completely different.
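A sketch of how these BSG-sensitive costs can be folded into the matrix algorithm is shown below. The specific cost values (0.5 for a same-group substitution, 1.5 for a vowel insertion or deletion) and the small hard-coded BSG table are illustrative only, not the values used in our implementation:

```java
import java.util.Map;

public class WeightedEditDistanceDemo {

    // Hypothetical BSG lookup covering only the phonemes used below;
    // the real mapping comes from cmudict.0.7a.phones.
    static final Map<String, String> BSG = Map.of(
            "M", "nasal", "AA", "vowel", "R", "liquid",
            "K", "stop", "OW", "vowel", "AH", "vowel");

    // Substitution is cheaper within a BSG, or between vowel and semivowel.
    static double subCost(String a, String b) {
        if (a.equals(b)) return 0;
        String ga = BSG.get(a), gb = BSG.get(b);
        if (ga.equals(gb)) return 0.5;
        if ((ga.equals("vowel") && gb.equals("semivowel"))
                || (ga.equals("semivowel") && gb.equals("vowel"))) return 0.5;
        return 1;
    }

    // Inserting or deleting a vowel costs more, since it changes the
    // sound of the word more noticeably.
    static double indelCost(String p) {
        return BSG.get(p).equals("vowel") ? 1.5 : 1;
    }

    public static double distance(String[] a, String[] b) {
        double[][] d = new double[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) d[i][0] = d[i - 1][0] + indelCost(a[i - 1]);
        for (int j = 1; j <= b.length; j++) d[0][j] = d[0][j - 1] + indelCost(b[j - 1]);
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                d[i][j] = Math.min(d[i - 1][j - 1] + subCost(a[i - 1], b[j - 1]),
                          Math.min(d[i - 1][j] + indelCost(a[i - 1]),
                                   d[i][j - 1] + indelCost(b[j - 1])));
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // "marc" (M AA R K) vs "marco" (M AA R K OW): the extra vowel costs 1.5
        System.out.println(distance("M AA R K".split(" "), "M AA R K OW".split(" ")));
    }
}
```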
Since we only replace words when we are confident that the original word is wrong, we next compute the probability of replacement given that the original word is wrong. The information provided to us suggests that the Error Detector has an accuracy of 0.92, so we multiply the probability given by the Error Detector by 0.92, giving the probability that the original word is wrong given that the Error Detector was right. Finally, we multiply the probability of replacement by the probability that the original word is wrong given that the Error Detector was right.
The result is inserted into a sorted list based on the probability of replacement given that the original word is wrong. If the probability of replacement is less than the probability of keeping the wrong word, which is given by 1 - (probability given by the Error Detector * 0.92), then the wrong word is kept.
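The combination of probabilities described above can be sketched as follows. The 0.92 accuracy figure is the one given to us; the Error Detector confidence of 0.8 and the edit distance values are made up for the example:

```java
public class ReplacementProbabilityDemo {

    static final double DETECTOR_ACCURACY = 0.92; // accuracy figure given to us

    // p(replacement) = 1 - editDistance / phonemes in the longer word
    static double pReplacement(double editDistance, int longerLength) {
        return 1.0 - editDistance / longerLength;
    }

    public static void main(String[] args) {
        double pDetector = 0.8;                        // hypothetical detector confidence
        double pWrong = pDetector * DETECTOR_ACCURACY; // original word is wrong
        double pKeep = 1.0 - pWrong;                   // probability of keeping the word
        double score = pReplacement(1.0, 4) * pWrong;  // final ranking score
        System.out.println(score + " vs keep " + pKeep);
        // The replacement is used only if score exceeds pKeep.
    }
}
```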
Output Options
For each alternative given by the Google ASR we take the 3 replacement words with the highest probability of replacement and swap them into the sentence. Hence the number of results given by the Error Corrector is 3 times the number of alternatives. The probability of each sentence is then calculated by averaging the probabilities of replacement of all the wrong words in the alternative.
Merging
If multiple output sentences are identical, they are merged into one sentence whose probability is the average of their sentence probabilities. For example, if two identical sentences have probabilities of 0.90 and 0.95, they are merged and the probability 0.925 is used. In our testing this method seems to work well, since identical sentences generally have very similar probabilities; a more accurate way of merging, however, would need to consider the different weight of each word.
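The merging step can be sketched as follows, mirroring the 0.90/0.95 example above (the sentence text is made up, and the class is our own illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeDemo {

    // Groups identical sentences together and replaces each group's
    // probabilities with their average, as described above.
    public static Map<String, Double> merge(List<String> sentences, List<Double> probs) {
        Map<String, List<Double>> grouped = new LinkedHashMap<>();
        for (int i = 0; i < sentences.size(); i++)
            grouped.computeIfAbsent(sentences.get(i), k -> new ArrayList<>()).add(probs.get(i));
        Map<String, Double> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet())
            merged.put(e.getKey(),
                    e.getValue().stream().mapToDouble(Double::doubleValue).average().orElse(0));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> out = merge(
                Arrays.asList("put the cup down", "put the cup down"),
                Arrays.asList(0.90, 0.95));
        System.out.println(out); // the two duplicates collapse into one entry
    }
}
```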
Ranking
After all outputs are merged, the results are inserted into a sorted list based on the probability of each sentence. To measure the effectiveness of the program, each result is run through the Error Detector and its output is compared with the Error Detector output from the ASR to see if there is any improvement. All of the outputs are also compared with the original text input using a word-level edit distance, and we check whether the top result matches the original text input. Sentences with no word marked wrong are excluded from the edit distance comparison.
References & Links
University of Pennsylvania. (2003). Alphabetical list of part-of-speech tags used in the Penn
Treebank Project. Accessed: 23/10/2014. Available:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
J.A.R.V.I.S. API: https://github.com/The-Shadow/java-speech-api
JavaFlacEncoder: http://sourceforge.net/projects/javaflacencoder/files/