Automatic Speech Recognition (Attempt)
ECE 113DB Final Project, Winter 2019
Fong Chi Ho, Zijun Sun, Shao Xiong Lee

March 23, 2019. This project is an undergraduate signal processing capstone project under the supervision of Professor Mike Briggs, Electrical and Computer Engineering Department, UCLA.

Abstract— Our goal is to design and implement a simple automatic speech recognition system. With the increasing ubiquity of mobile devices with powerful processing capabilities, the applications of speech recognition are increasingly relevant to everyday life, and our aim was to better understand the concepts of automated speech recognition by implementing one ourselves. Our project was based on the vowel recognition mini-project attempted last quarter, extending the technique used to recognize vowels to recognize words and short phrases. Instead of using the machine learning tool in MATLAB for training, we use language and acoustic models built with statistical tools to obtain our results. As a result, our project successfully recognized sentences and phrases, but its accuracy is still far from satisfactory. Future work to improve the accuracy is needed to make the system useful for practical applications.

Index Terms— Automated Speech Recognition, Hidden Markov Model, Signal Processing, Viterbi Algorithm

I. INTRODUCTION

Even after half a century of work, automated speech recognition (ASR) is considered a difficult problem due to its complexity and the multidisciplinary nature of the work. To build a functional ASR system, one must be familiar with a number of disciplines, such as signal processing, pattern recognition, linguistics and computer science. With modern technology and computational power, ASR is more powerful than ever, but 'natural language processing', the capacity to interpret speech in a conversation while taking into account nuances like context and intonation, remains an elusive prospect.

A. History

The first attempts at automated speech recognition date back more than half a century. In the 1950s, the digit recognizer 'Audrey' was developed by Bell Laboratories to recognize spoken numerical digits [3]. Audrey was designed to look for 'formants', a type of audio fingerprint. In the 1960s, IBM built the 'Shoebox', a system that could recognize not only spoken numerical digits but also spoken arithmetic commands. The system, essentially a more capable version of 'Audrey', would take the spoken commands and numbers as input and produce the calculated results as output [3].

As Cold War tensions heightened in the 1970s, the U.S. Department of Defense's Advanced Research Projects Agency (DARPA) funded a five-year 'Speech Understanding Research' program. The lead product of the program, created by Carnegie Mellon University, was a device called 'Harpy' that could recognize over 1,000 words and entire sentences constructed from those words [3]. 'Harpy' shaped the model of the modern automatic speech recognition system.

After the Hidden Markov Model was introduced in the 1980s, the accuracy of automated speech recognition increased greatly. With the increase in computational power, this statistical approach proved ideal for the problem. Hidden Markov Models remain the predominant algorithm for speech recognition, but machine learning and AI promise even greater accuracy.

B. Global Constraints

Speech recognition algorithms are sensitive to all sounds in the input, so the signal-to-noise ratio must be maximized. For the best performance, the input must be recorded in a quiet environment.

The publicly available data set we use is American English spoken by a male speaker, so accuracy will drop for inputs spoken in other accents of English or by other speakers.

II. MOTIVATION

Our motivation is rooted in the previous mini-project on vowel recognition, where we saw the potential of signal processing. Virtual voice assistants like Google Assistant, Siri and Alexa are becoming ubiquitous, and automated speech recognition of complete words and sentences is a direct extension of the mini-project's goals. Since this is our first time working on a real application of this multidisciplinary subject, the progress matters more to us than the result: we see the project more as a learning tool than as a real application.

III. APPROACH

Our approach was based on the vowel recognition mini-project. All theory was taken from the book "Spoken Language Processing" by Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, former professors at CMU who transferred to Microsoft to work on ASR development.

We separated our project into three stages: pre-processing, decoding and post-processing. The whole project is written in the C programming language.
A. Team Organization

● Fong Chi Ho: pre-processing/feature extraction; detecting the signal energy to segment the phases and determine the silence; using a Gaussian Mixture Model to separate silence from speech, as well as to separate the syllables.

● Zijun Sun: decoding; using an HMM as the acoustic model for each unit (phoneme or word), ARPAbet as the phoneme dictionary, and the Viterbi algorithm to produce an n-best list.

● Shao Xiong Lee: applying an N-gram model to the n-best list for rescoring (grammar), with the N-gram as a language model providing word and word-sequence likelihoods; also taking the measurements and checking accuracy and correctness.

B. Plan

Get familiar with the following subjects.

Signal processing: Fourier transforms, DFT, FFT.

Acoustics: physics of sound and speech, models of the vocal tract.

Pattern recognition: clustering and pattern-matching techniques.

Artificial intelligence: knowledge representation and search, natural language processing.

Computer science: hardware, parallel systems, algorithm optimization.

Statistics: probability theory, Hidden Markov Models, dynamic programming.

Linguistics: acoustic phonetics, lexical representation, syntax, semantics.

C. Standard

De jure: ISO/IEC 9899 C language standard.

De facto: FFT, Mel frequency, Gaussian mixture model, Hidden Markov Model, Viterbi algorithm, N-gram model.

D. Theory

Pre-processing

The input speech is recorded with a mono-channel microphone and discretized at a sampling frequency of 16 kHz. The Shannon sampling theorem states that a band-limited signal can be reconstructed perfectly if the sampling frequency is more than double the maximum frequency. A typical telephone network is limited to the 5 Hz–3.7 kHz range, so a 16 kHz sampling frequency is more than double the maximum frequency required and is sufficient for speech recognition applications. By separating the recording into sampling arrays of 50 ms to 100 ms, feature extraction can be performed and the features passed on to the decoding stage. In speech/non-speech segmentation, the system detects the sound of any phoneme with a probability calculated using a Hidden Markov Model; background noise can cause insertions and substitutions of phonemes or words in the output if the noise matches the parameters of a phoneme model better than those of a silence model [1]. The errors caused by insertion and substitution can be reduced by removing the sampling arrays between the start of the recording and the point at which the user starts to speak, and after the end of the utterance. This segmentation process is called end-point detection [1].

A threshold value is obtained by taking the average of the sampling arrays during end-point detection, which indicates the likely background noise. Using a high-pass filter, arrays with frequencies below the threshold value can be removed, as they have little relevance to human speech. For phonemes with low signal energy and short pauses between words, Gaussian mixture models are used [2].

Decoding

Hidden Markov Model: a hidden Markov model is essentially a Markov chain where the output observation is a random variable X generated according to an output probabilistic function associated with each state. It can be written as:

Φ = (A, B, π)

where A is the transition probability matrix, B is the output probability matrix, and π is the initial state distribution. All of the above parameters are acquired during pre-processing.

For our problem, given the model Φ and a sequence of observations X = (X_1, X_2, …, X_T), the task is to find the most likely state sequence S = (s_1, s_2, …, s_T), which corresponds to the original sentence spoken.

Viterbi Algorithm: the algorithm can be regarded as dynamic programming applied to the HMM. Instead of summing the probabilities of the different paths arriving at the same destination state, the Viterbi algorithm picks and remembers the best path. Define the best-path probability as:

V_t(i) = P(X_1 … X_t, s_1 … s_(t−1), s_t = i | Φ)

where V_t(i) is the probability of the most likely state sequence at time t which has generated the first t observations and ends in state i. The implementation is as follows:

Step 1: Initialization
V_1(i) = π_i b_i(X_1),   1 ≤ i ≤ N
B_1(i) = 0

Step 2: Induction
V_t(j) = max_(1≤i≤N) [V_(t−1)(i) a_ij] b_j(X_t),   2 ≤ t ≤ T; 1 ≤ j ≤ N
B_t(j) = argmax_(1≤i≤N) [V_(t−1)(i) a_ij],   2 ≤ t ≤ T; 1 ≤ j ≤ N

Step 3: Termination
Best score = max_(1≤i≤N) V_T(i)
s*_T = argmax_(1≤i≤N) V_T(i)

Step 4: Backtracking
s*_t = B_(t+1)(s*_(t+1)),   t = T−1, T−2, …, 1
S* = (s*_1, s*_2, …, s*_T) yields the best state sequence.

Post-processing

The result of the Viterbi algorithm is a list of possible candidates for the final sentence. Post-processing techniques are used to ensure that the selected result is the one most likely to be a coherent sentence. To select the most likely sentence candidate, a statistical n-gram model is used. The n-gram model is built from a large database of text of the type the speech recognition algorithm is trained to work on. The conditional probability of the next word in a sentence (or the next phoneme in a multisyllabic word) given the n words preceding it can be determined from the database by counting the number of occurrences of the word following the given n words and dividing by the total number of occurrences of those n words.

In a bigram model, the probability of a word A given that the preceding word is B, where N(·) is the number of occurrences of a word or phrase in the entirety of the training data, is:

P("A" | "B") = N("B A") / N("B")

Using these probabilities, the probability of a given phrase can be calculated using the chain rule. For the sentence "I want to eat an apple", the probability of it appearing would be given as follows:

P("I want to eat an apple") = P("I" | <s>*) P("want" | "I") P("to" | "want") P("eat" | "to") P("an" | "eat") P("apple" | "an")

*<s> is a placeholder indicating the lack of a preceding word.

Each probability can be retrieved from the corpus of sentences in the database. The n-best list of sentence candidates generated by the decoding step is analyzed, and the sentence or phrase with the highest conditional probability is selected.

For speech recognition purposes, a 'trigram' model is frequently used, where the probability takes into account the preceding two words instead of one, e.g.:

P("A" | "B C") = N("B C A") / N("B C")

4-gram or 5-gram models are possible if a large quantity of data is available, but they are not commonly used.

The CMU trigram language model is utilized in this project in two parts: first, the trigram model is applied at the phoneme level to choose the most likely set of words, and then at the sentence level to determine the most probable words in the sentence.

E. Software / Hardware

Software IDE: Microsoft Visual Studio 2017 with the C development package.

Software dependencies: gcc (C compiler), autoconf, libtool, bison, swig, the Python development package, and the pulseaudio development package.

F. System Build / Operation

After successfully installing the software, run the program from the command prompt. Go to the program root folder and type the following command to load the acoustic and language models and run the program:

bin\Release\Win32\finalproject.exe -inmic yes -hmm model\en-us\en-us -lm model\en-us\en-us.lm.bin -dict model\en-us\cmudict-en-us.dict

The system will prompt the user for audio input and print the corresponding text output.

IV. RESULTS

Testing the results is part of our project. We categorize the test metrics into three types:

Correctness = (N − D − S)/N
- N = # of words in the correct transcription
- D = # of deletions (words not identified)
- S = # of substitutions (misidentified words)

Accuracy = (N − D − S − I)/N
- I = # of insertions (inserted words that don't exist)

Error rate = (D + S + I)/N

A. Description of Results

After 50 tests on 50 samples, we have the following results:

Correctness: on average 18 wrong out of 50 samples.
Accuracy: on average 16 wrong out of 50 samples.
Error rate: 23 wrong out of 50 samples.

All tests were performed in a relatively quiet environment. The error rate increases drastically if any noise is present; our program has poor noise tolerance.

Also, all sample tests were conducted by male non-native speakers with slight accents. Test results were better than expected, since the available data only contains speech from male American English speakers.
B. Discussion of Results

The Word Error Rate (WER) is the standard metric for evaluating an ASR system's performance. Our WER is almost 50%, which is poor compared with Google Assistant, which achieved a 4.7% WER in 2017. The high WER is due to the limited computational power and statistical data available to construct the language model. With a larger data set, or if the list of possible sentences were constrained to a subset of all possible sentences, the algorithm could perform significantly better.
REFERENCES
[1] R. Gruhn, W. Minker, S. Nakamura, "Statistical Pronunciation Modeling for Non-Native Speech Processing," ch. 2, 2011.
[2] M. Stuttle, "A Gaussian Mixture Model Spectral Representation for Speech Recognition," Cambridge, UK: University of Cambridge, 2003.
[3] X. Huang, A. Acero, H. Hon, "Spoken Language Processing," Upper Saddle River, NJ, USA: Prentice Hall PTR, 2001.