Automatic Speech Recognition (Attempt)
ECE 113DB Final Project, Winter 2019
Fong Chi Ho, Zijun Sun, Shao Xiong Lee

March 23, 2019. This project is an undergraduate signal processing capstone project under the supervision of Professor Mike Briggs, Electrical and Computer Engineering Department, UCLA.

Abstract— Our goal is to design and implement a simple automatic speech recognition system. With the increasing ubiquity of mobile devices with powerful processing capabilities, the applications of speech recognition are increasingly relevant to everyday life, and our aim was to better understand the concepts of automated speech recognition by implementing one ourselves. Our project was based on the vowel recognition mini-project attempted last quarter, extending the technique used to recognize vowels to recognize words and short phrases. Instead of using the machine learning tool in MATLAB for training, we use language and acoustic models built with statistical tools to obtain our results. As a result, our project successfully recognized sentences and phrases, but its accuracy is still far from satisfactory. Future work to improve the accuracy is needed to make the system useful for practical applications.

Index Terms— Automated Speech Recognition, Hidden Markov Model, Signal Processing, Viterbi Algorithm

I. INTRODUCTION

Even after half a century of work, automated speech recognition (ASR) is considered a difficult problem due to its complexity and the multidisciplinary nature of the work. To build a functional ASR system, one must be familiar with a number of disciplines, such as signal processing, pattern recognition, linguistics and computer science. With modern technology and computational power, ASR is more powerful than ever, but 'natural language processing', the capacity to interpret speech in a conversation while taking into account nuances like context and intonation, remains an elusive prospect.

A. History

The first attempts at automated speech recognition date back more than half a century. In the 1950s, the digit recognizer 'Audrey' was developed by Bell Laboratories to recognize spoken numerical digits [3]. Audrey was designed to look for 'formants', a type of audio fingerprint. In the 1960s, IBM built the 'Shoebox', a system that could recognize not only spoken numerical digits but also spoken arithmetic commands. The system, essentially a more capable version of 'Audrey', would take the spoken commands and numbers as input and produce the calculated results as output [3].

As Cold War tensions heightened in the 1970s, the U.S. Department of Defense's Advanced Research Projects Agency (DARPA) funded a five-year 'Speech Understanding Research' program. The lead product of the program, created by Carnegie Mellon University, was a device called 'Harpy' that could recognize over 1,000 words and entire sentences constructed from those words [3]. 'Harpy' shaped the model of the modern automatic speech recognition system.

After the Hidden Markov Model was introduced in the 1980s, the accuracy of automated speech recognition increased greatly. With the increase in computational power, this statistical approach proved ideal for the problem. Hidden Markov Models remain the predominant algorithm for speech recognition, but machine learning and AI promise even greater accuracy.

B. Global Constraints

Speech recognition algorithms are sensitive to all sounds in the input, so the signal-to-noise ratio must be maximized. For the best performance, the input must be recorded in a quiet environment.

The publicly available data set we use is American English spoken by a male speaker, so accuracy will drop for inputs spoken in other accents of English or by other speakers.

II. MOTIVATION

Our motivation is rooted in the previous mini-project on vowel recognition, where we saw the potential of signal processing. Virtual voice assistants like Google Assistant, Siri and Alexa are becoming ubiquitous, and automated speech recognition of complete words and sentences is a direct extension of the mini-project's goals. Since this is our first time working on a real application of this multidisciplinary subject, the progress matters more to us than the result: we see the project more as a learning tool than as a real application.

III. APPROACH

Our approach was based on the vowel recognition mini-project. All theory was taken from the book "Spoken Language Processing" by Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, former professors at CMU who transferred to Microsoft to work on ASR development.

We separated our project into three stages: pre-processing, decoding and post-processing. The whole project is written in the C programming language.
A. Team Organization

● Fong Chi Ho: pre-processing/feature extraction; detecting the signal energy to segment the phases and determine the silence; using a Gaussian Mixture Model to separate silence from speech, as well as to separate the syllables.

● Zijun Sun: decoding; using an HMM as the acoustic model for each unit (phoneme or word), ARPAbet as the phoneme dictionary, and the Viterbi algorithm to produce an n-best list.

● Shao Xiong Lee: applying an N-gram model to the n-best list for rescoring (grammar), with the N-gram as a language model providing word and word-sequence likelihoods; also taking the measurements and checking accuracy and correctness.

B. Plan

Get familiar with the following subjects.

Signal processing: Fourier transforms, DFT, FFT.

Acoustics: physics of sound and speech, models of the vocal tract.

Pattern recognition: clustering and pattern-matching techniques.

Artificial intelligence: knowledge representation and search, natural language processing.

Computer science: hardware, parallel systems, algorithm optimization.

Statistics: probability theory, Hidden Markov Models, dynamic programming.

Linguistics: acoustic phonetics, lexical representation, syntax, semantics.

C. Standard

De jure: ISO/IEC 9899 C language standard.

De facto: FFT, Mel frequency, Gaussian mixture model, Hidden Markov Model, Viterbi algorithm, N-gram model.

D. Theory

Pre-processing

The input speech is recorded with a mono-channel microphone and discretized at a sampling frequency of 16 kHz. The Shannon sampling theorem states that a band-limited signal can be reconstructed perfectly if the sampling frequency is more than double the maximum frequency. A typical telephone network is limited to the 5 Hz–3.7 kHz range, so a 16 kHz sampling frequency is more than double the maximum frequency required and is sufficient for speech recognition applications. By separating the recording into sampling arrays of 50 ms to 100 ms, feature extraction can be performed and the features passed on to the decoding stage. In speech/non-speech segmentation, the system detects the sound of any phoneme with a probability calculated using a Hidden Markov Model; background noise can cause insertions and substitutions of phonemes or words in the output if the noise matches the parameters of a phoneme model better than those of a silence model [1]. The errors caused by insertion and substitution can be reduced by removing the sampling arrays between the start of the recording and the point at which the user starts to speak, and after the end of the utterance. This segmentation process is called end-point detection [1].

A threshold value is obtained by taking the average of the sampling arrays during end-point detection, which indicates the likely background noise. Using a high-pass filter, arrays with frequencies below the threshold value can be removed, as they have little relevance to human speech. For phonemes with low signal energy and short pauses between words, Gaussian mixture models are used [2].

Decoding

Hidden Markov Model: a hidden Markov model is essentially a Markov chain where the output observation is a random variable X generated according to an output probabilistic function associated with each state. It can be written as:

Φ = (A, B, π)

where A is the transition probability matrix, B is the output probability matrix, and π is the initial state distribution. All of the above parameters are acquired during pre-processing.

For our problem, given the model Φ and a sequence of observations X = (X_1, X_2, …, X_T), the task is to find the most likely state sequence S = (s_1, s_2, …, s_T), which corresponds to the original sentence spoken.

Viterbi Algorithm: the algorithm can be regarded as dynamic programming applied to the HMM. Instead of summing the probabilities of the different paths arriving at the same destination state, the Viterbi algorithm picks and remembers the best path. Define the best-path probability as:

V_t(i) = P(X_1 … X_t, s_1 … s_(t−1), s_t = i | Φ)

where V_t(i) is the probability of the most likely state sequence at time t which has generated the first t observations and ends in state i. The implementation is as follows:

Step 1: Initialization
V_1(i) = π_i b_i(X_1),   1 ≤ i ≤ N
B_1(i) = 0

Step 2: Induction
V_t(j) = max_(1≤i≤N) [V_(t−1)(i) a_ij] b_j(X_t),   2 ≤ t ≤ T; 1 ≤ j ≤ N
B_t(j) = argmax_(1≤i≤N) [V_(t−1)(i) a_ij],   2 ≤ t ≤ T; 1 ≤ j ≤ N

Step 3: Termination
Best score = max_(1≤i≤N) V_T(i)
s*_T = argmax_(1≤i≤N) V_T(i)

Step 4: Backtracking
s*_t = B_(t+1)(s*_(t+1)),   t = T−1, T−2, …, 1
S* = (s*_1, s*_2, …, s*_T) yields the best state sequence.

Post-processing

The result of the Viterbi algorithm is a list of possible candidates for the final sentence. Post-processing techniques are used to ensure that the selected result is the one most likely to be a coherent sentence. To select the most likely sentence candidate, a statistical n-gram model is used. The n-gram model is built from a large database of text of the type the speech recognition algorithm is trained to work on. The conditional probability of the next word in a sentence (or the next phoneme in a multisyllabic word) given the n words preceding it can be determined from the database by counting the number of occurrences of the word following the given n words and dividing by the total number of occurrences of those n words.

In a bigram model, the probability of a word A given that the preceding word is B, where N(·) is the number of occurrences of a word or phrase in the entirety of the training data, is:

P("A" | "B") = N("B A") / N("B")

Using these probabilities, the probability of a given phrase can be calculated using the chain rule. For the sentence "I want to eat an apple", the probability of it appearing would be given as follows:

P("I want to eat an apple") = P("I" | <s>*) P("want" | "I") P("to" | "want") P("eat" | "to") P("an" | "eat") P("apple" | "an")

*<s> is a placeholder indicating the lack of a preceding word.

Each probability can be retrieved from the corpus of sentences in the database. The n-best list of sentence candidates generated by the decoding step is analyzed, and the sentence or phrase with the highest conditional probability is selected.

For speech recognition purposes, a 'trigram' model is frequently used, where the probability takes into account the preceding two words instead of one, e.g.:

P("A" | "B C") = N("B C A") / N("B C")

4-gram or 5-gram models are possible if a large quantity of data is available, but they are not commonly used.

The CMU trigram language model is utilized in this project in two parts: first, the trigram model is applied at the phoneme level to choose the most likely set of words, and then at the sentence level to determine the most probable words in the sentence.

E. Software / Hardware

Software IDE: Microsoft Visual Studio 2017 with the C development package.

Software dependencies: gcc (C compiler), autoconf, libtool, bison, swig, the Python development package, and the pulseaudio development package.

F. System Build / Operation

After successfully installing the software, run the program from the command prompt. Go to the program root folder and type the following command to load the acoustic and language models and run the program:

bin\Release\Win32\finalproject.exe -inmic yes -hmm model\en-us\en-us -lm model\en-us\en-us.lm.bin -dict model\en-us\cmudict-en-us.dict

The system will prompt the user for audio input and print the corresponding text output.

IV. RESULTS

Testing the results is part of our project. We categorize the test metrics into three types:

Correctness = (N − D − S)/N
- N = # of words in the correct transcription
- D = # of deletions (words not identified)
- S = # of substitutions (misidentified words)

Accuracy = (N − D − S − I)/N
- I = # of insertions (inserted words that don't exist)

Error rate = (D + S + I)/N

A. Description of Results

After 50 tests on 50 samples, we have the following results:

Correctness: on average 18 wrong out of 50 samples.
Accuracy: on average 16 wrong out of 50 samples.
Error rate: 23 wrong out of 50 samples.

All tests were performed in a relatively quiet environment. The error rate increases drastically if any noise is present; our program has poor noise tolerance.

Also, all sample tests were conducted by male non-native speakers with slight accents. Test results were better than expected, since the available data only contains speech from male American English speakers.
B. Discussion of Results

The Word Error Rate (WER) is the standard metric for evaluating an ASR system's performance. Our WER is almost 50%, which is poor compared with Google Assistant, which achieved a 4.7% WER in 2017. The high WER is due to the limited computational power and statistical data available to construct the language model. With a larger data set, or if the list of possible sentences were constrained to a subset of all possible sentences, the algorithm could perform significantly better.
REFERENCES
[1] R. Gruhn, W. Minker, S. Nakamura, "Statistical Pronunciation Modeling for Non-Native Speech Processing," ch. 2, 2011.
[2] M. Stuttle, "A Gaussian Mixture Model Spectral Representation for Speech Recognition," Cambridge, UK: University of Cambridge, 2003.
[3] X. Huang, A. Acero, H. Hon, "Spoken Language Processing," Upper Saddle River, NJ, USA: Prentice Hall PTR, 2001.