
Bringing Contextual Information to Google Speech Recognition

Petar Aleksic, Mohammadreza Ghodsi, Assaf Michaely, Cyril Allauzen,
Keith Hall, Brian Roark, David Rybach, Pedro Moreno

Google Inc.
{apetar,ghodsi,amichaely,allauzen,kbhall,roark,rybach,pedro}@google.com

Abstract

In automatic speech recognition on mobile devices, very often what a user says strongly depends on the particular context he or she is in. The n-grams relevant to the context are often not known in advance. The context can depend on, for example, the particular dialog state, the options presented to the user, the conversation topic, the location, etc. Speech recognition of sentences that include these n-grams can be challenging, as they are often not well represented in a language model (LM) or may even include out-of-vocabulary (OOV) words.

In this paper, we propose a solution for using contextual information to improve speech recognition accuracy. We utilize an on-the-fly rescoring mechanism to adjust the LM weights of a small set of n-grams relevant to the particular context during speech decoding.

Our solution handles out-of-vocabulary words. It also addresses efficient combination of multiple sources of context, and it even allows biasing class-based language models. We show significant speech recognition accuracy improvements on several datasets, using various types of contexts, without negatively impacting the overall system. The improvements are obtained in both offline and live experiments.

1. Introduction

The impact of speech recognition quality on the user experience on mobile devices has been increasing significantly with the growth of voice input usage. Voice input is used to perform search by voice, give specific voice commands, or ask general questions. Users expect their phones to keep getting smarter and to take into account various signals that would improve the quality of communication with the device and the overall user experience.

In this effort, utilizing contextual information plays a major role. The context can be defined in a number of ways. It can depend on the location the user is in, the time of day, the user's search history, the particular dialog state the user is in, the conversation topic, the content on the screen the user is looking at, etc. Very often the amount of information about the context is very small, consisting of only a few words or sentences. However, if the context is relevant, it can significantly improve speech recognition accuracy, provided it is consumed appropriately by the speech recognition system.

In this paper we present a system that uses contextual information to improve speech recognition accuracy. Our solution works well both for large contexts and for contexts consisting of only several words or phrases. We use a framework for biasing language models (LMs) using n-grams as the biasing context [1]. The n-grams and corresponding weights, calculated based on the reliability of the context, are represented as a weighted finite-state transducer [2, 3], the contextual model. We introduce several approaches for creating contextual models from the context, as well as methods for combining the score from the main language model and the contextual model. In addition, we address the issue of handling out-of-vocabulary (OOV) words present in the provided context by using a class-specific language model, as described in section 2.

One can view this approach as a generalization of cache models [4, 5, 6], which have been used to personalize language models based on recent language produced by the individual whose utterance is being recognized. Our approach derives the biasing n-grams from varied sources beyond an individual's prior utterances and makes use of more complex methods for mixing with the baseline model than the fixed interpolation or decay functions typically used with recency cache models [4, 5]. See also the discussion of related work in [1].

We organize the paper as follows. In section 2 we present the approach we used to perform on-the-fly n-gram biasing of the language model towards context present in the contextual model. In section 3 we present various approaches for creating a contextual model from the provided context. Finally, in section 4, we describe the test sets used in our experiments and present all of our experimental results.

2. Contextual language model biasing

In this section we describe the language model biasing framework we use and how it handles class-based language models and OOVs.

2.1. General approach

We used the framework for biasing language models using n-grams introduced in [1]. In this framework, a small set of n-grams is compactly represented as a weighted finite-state transducer [7]. An on-the-fly rescoring algorithm allows biasing the recognition towards these n-grams. The cost from the main language model G is combined with the cost from the contextual model B as follows:

    s(w|H) = sG(w|H)                     if (w|H) ∉ B
             C(sG(w|H), sB(w|H))         if (w|H) ∈ B        (1)

where sG(w|H) is the raw score from the main model G for the word w leaving history state H, and sB(w|H) is the raw score from the biasing model. Observe that this approach only modifies the LM scores of n-grams, Hw, for which the biasing model provides an explicit score. This differs from regular language model interpolation and is motivated by the fact that the support of the biasing model is much sparser than that of the main language model.
Figure 1: Example of a class grammar with decorators for the "$CONTACTS" class.

Figure 2: Transducer T maps decorator-delimited phrases back to the corresponding class label.
[1] offers the following alternatives for the operation C used to combine the scores. The first approach corresponds to using log-linear interpolation:

    C'(sG(w|H), sB(w|H)) = α·sG(w|H) + β·sB(w|H).        (2)

Since our costs are negative-log conditional probabilities, this simply corresponds to linear interpolation in the log domain.

Finally, [1] also provides a mechanism that restricts the biasing to be applied only if it reduces the cost. In equation (3) we define the positive biasing function, which applies this restriction:

    C(sG(w|H), sB(w|H)) = min(sG(w|H), C'(sG(w|H), sB(w|H))).        (3)
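The combination defined by equations (1)-(3) can be sketched in a few lines of code. The following is a minimal Python illustration, not the production implementation: the dictionary stands in for the WFST representation of the biasing model B, and all names and example values are illustrative.

def biased_cost(history, word, s_G, biasing_model, alpha=0.0, beta=1.0):
    """Combine the main LM cost sG(w|H) with the biasing cost sB(w|H).

    biasing_model maps (history, word) pairs to raw biasing costs
    (negative log probabilities). Per equation (1), n-grams the biasing
    model does not score keep their original LM cost.
    """
    key = (history, word)
    if key not in biasing_model:         # (w|H) not in B: first case of equation (1)
        return s_G
    s_B = biasing_model[key]
    c_prime = alpha * s_G + beta * s_B   # log-linear interpolation, equation (2)
    return min(s_G, c_prime)             # positive biasing restriction, equation (3)

# With (alpha, beta) = (0, 1) this reduces to min(sG, sB), the setting used
# for "bias 2" in section 4.2.
print(biased_cost(("cheap", "flights"), "to", 4.2,
                  {(("cheap", "flights"), "to"): 1.5}))   # -> 1.5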
Dynamic decoding of input speech is performed similarly to what is described in [8]. Specifically, given a vocabulary V, we generate a lattice from the alphabet Σ = V ∪ {ε}. Given a CLG (a composition of the context-dependent phone model, lexicon, and main language model), we perform time-synchronous decoding via beam search. As in [8], a pseudo-deterministic word lattice is built during decoding. It is at this point that we apply the on-the-fly rescoring [9] with the contextual biasing model as described in [1].
2.2. Biasing class-based language models

Our main language model is class-based [10, 11, 12]. Examples of classes are address numbers, street names, dates, and contact names, the last being an example of an utterance-dependent, user-specific class.

We might want to bias towards the whole class in some contexts. For instance, we might want to bias towards "call $CONTACTS" or "directions to $ADDRESSNUM $STREETNAME" instead of being limited to biasing towards some instantiations of the classes (e.g. "call Michael" or "directions to 111 Eight Avenue").

Our language model consists of (a) a top-level n-gram language model over regular words and class labels and (b), for each class c, a class grammar Gc over regular words that might be utterance-dependent. All components are represented as weighted automata. At run-time, this model is expanded on demand into a weighted automaton G using the replacement operation as described in [10]. In this approach, class-based biasing is achieved by: (1) modifying each class grammar to insert decorators that allow us to keep track of whether words in the hypothesis word lattice were generated by the top-level LM or by one of the class grammars; this corresponds to using G'c = <c> Gc </c> as the class grammar for class c, where (<c>, </c>) is the decorator pair for c (see Figure 1); (2) allowing n-grams containing class labels in the contextual biasing model; and (3) treating decorator-delimited phrases as their corresponding class label during rescoring, which is achieved by composing the word lattice on the fly with the transducer T described in Figure 2 and then applying the biasing model as described in the previous section.
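To make step (3) above concrete, the following Python sketch mimics, over a plain token sequence rather than a word lattice, the effect of composing with the transducer T: decorator-delimited spans are collapsed to their class label before biasing scores are looked up. The helper name and example strings are illustrative only; the actual system performs this as an on-the-fly FST composition.

def collapse_class_spans(tokens, class_labels):
    """Replace a decorator-delimited span such as "<contact> Michael Riley </contact>"
    by its class label, e.g. "$CONTACTS", mimicking the mapping of Figure 2."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("<") and not tok.startswith("</") and tok.endswith(">"):
            name = tok[1:-1]                               # e.g. "contact"
            j = tokens.index("</" + name + ">", i + 1)     # matching closing decorator
            out.append(class_labels[name])                 # e.g. "$CONTACTS"
            i = j + 1
        else:
            out.append(tok)
            i += 1
    return out

# "call <contact> Michael Riley </contact>" -> ["call", "$CONTACTS"], which can
# then be matched against biasing n-grams such as "call $CONTACTS".
print(collapse_class_spans(["call", "<contact>", "Michael", "Riley", "</contact>"],
                           {"contact": "$CONTACTS"}))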
might be utterance-dependent. All components are represented
as weighted automata. At run-time, this model is expanded on- We want to bias more heavily towards higher order (longer) n-
demand into a weighted automata G using the replacement op- grams. This is because of two related reasons: The first is that
eration as described in [10]. In this approach, class-based bi- we want to reward longer exact matches between the context
asing is achieved by: (1) Modifying each class grammar to in- and the recognition result. The second is that biasing towards
sert decorators allowing us to keep track of whether words in shorter n-grams has a larger negative effect on the recognition
the hypothesis word lattice were generated by the top level LM of general (out of context) queries.
or by one of the class grammars. This corresponds to using One simple scoring function that satisfies the above require-
G0c =< c > Gc < /c > as class grammar for class c where ments is the length-linear function, where n is the length of
(<c>, </c>) is the decorator pair for c (see Figure 1). (2) Al- Hw. That is:
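The first step of this procedure, collecting the OOV words that populate the per-utterance "$UNKNOWN" class grammar, amounts to a set difference between the words of the context and the base LM vocabulary. A minimal sketch; the function name and example words are illustrative.

def extract_oov_words(context_phrases, base_vocabulary):
    """Return context words missing from the base LM vocabulary; these become
    the members of the per-utterance "$UNKNOWN" class grammar."""
    context_words = {w for phrase in context_phrases for w in phrase.split()}
    return sorted(context_words - set(base_vocabulary))

# With a context phrase containing an unseen name, only that word is returned.
print(extract_oov_words(["call Zubizarreta"], {"call", "michael", "riley"}))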
3. Constructing the contextual model

The context we use for biasing can consist of hundreds of phrases or only a handful of phrases. Each phrase is a sequence of one or more words. For example, the following phrases may be used as the context for an utterance: "Hotels in Manhattan", "Holiday Inn", "Cheap flights to New York City".

When biasing, we want to allow partial matching to the context. For example, given the context above, we might also want to bias towards "Cheap hotels in New York".

In general, if the size of the context is large enough that a regular language model can be constructed from it, then one can use the LM costs as biasing scores. (In that case, the interpolation would be a standard interpolation between two LM costs.) However, the available context is often too small for this approach. We developed methods that address this case.

In this section, we discuss how we select biasing n-grams and their scores, given a set of context phrases such as the above.

3.1. Extracting and scoring n-grams

We want to bias more heavily towards higher-order (longer) n-grams, for two related reasons. The first is that we want to reward longer exact matches between the context and the recognition result. The second is that biasing towards shorter n-grams has a larger negative effect on the recognition of general (out-of-context) queries.

One simple scoring function that satisfies the above requirements is the length-linear function, where n is the length of Hw:

    sB(w|H) = f1(length(Hw)) = (n − 1)·p2 + p1        (4)

where p1 and p2 are parameters that control the strength of biasing, their values depending on the quality of the context. These parameters can be learned on a transcribed development data set with context.
The length-linear function would assign the same score to all n-grams of the same order. However, because the final cost used by the recognizer is an interpolation of the biasing score and the original LM cost, the effect of the biasing score depends on the interpolation function.

Since we want to bias more heavily towards longer n-grams, we would want sB(n) to be a decreasing function of n, i.e. p2 < 0.

The main limitation of the length-linear function is that the costs of the various n-gram orders are interdependent. A slightly more general function would assign independent scores to each of the n-gram orders. In our system, we observed diminishing gains beyond specifying scores for unigrams and bigrams only. (Note that, similar to the back-off mechanism in LMs, the biasing model will use the score of the lower-order n-gram if the longer one is absent.)

We define the unigram-and-bigram function as:

    sB(w|H) = f2(length(Hw)) = p1   if n = 1
                               p2   if n ≥ 2        (5)

The unigram-and-bigram function is more robust and easier to interpret than the length-linear function. We therefore used the unigram-and-bigram function in most of the experiments presented in this paper.
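Both scoring functions follow directly from equations (4) and (5); the short Python sketch below treats p1 and p2 as externally tuned parameters (the function names are illustrative).

def length_linear_score(ngram_length, p1, p2):
    """Equation (4): sB = (n - 1) * p2 + p1 for an n-gram Hw of length n."""
    return (ngram_length - 1) * p2 + p1

def unigram_and_bigram_score(ngram_length, p1, p2):
    """Equation (5): p1 for unigrams, p2 for bigrams and longer n-grams."""
    return p1 if ngram_length == 1 else p2

# Scores are costs (negative log probabilities), so smaller values bias more
# strongly. At the conservative operating point (p1, p2) = (7, 3) from
# section 4.2, unigrams get cost 7 and longer n-grams get cost 3.
print(unigram_and_bigram_score(1, 7, 3), unigram_and_bigram_score(3, 7, 3))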
3.2. Sentence boundaries

As mentioned, biasing towards unigrams can be detrimental to the general query recognition performance. But what if some or all of the context phrases contain only one word? For example, in one of our test sets (confirmation) the context consists of the phrases "yes", "no", and "cancel". If we were to bias towards these unigrams heavily, we may get recognition results that contain repetitions of these words, such as "no no no ...".

We can avoid this outcome by appending sentence boundary tokens ("<S>" and "</S>" in our case) to each phrase in the context before extracting the biasing n-grams. Then, in the above example, we would bias towards bigrams such as "<S> no" and "no </S>" much more than we bias towards the unigrams.
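A sketch of this preprocessing step, assuming whitespace tokenization and extraction of all n-grams up to the bigram order (the helper is illustrative, not the production pipeline):

def biasing_ngrams(context_phrases, max_order=2):
    """Append sentence boundary tokens to each context phrase and extract all
    n-grams up to max_order for the biasing model."""
    ngrams = set()
    for phrase in context_phrases:
        tokens = ["<S>"] + phrase.split() + ["</S>"]
        for order in range(1, max_order + 1):
            for i in range(len(tokens) - order + 1):
                ngrams.add(tuple(tokens[i:i + order]))
    return ngrams

# For the confirmation context this yields bigrams such as ("<S>", "no") and
# ("no", "</S>"), which can be biased more heavily than the bare unigram ("no",).
print(sorted(biasing_ngrams(["yes", "no", "cancel"])))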

4. Experimental results

In this section we describe our test sets and experimental setup, and analyze the experimental results. All test sets used have been anonymized.

4.1. Corpora

The experiments described below use various test sets in American English. All of the test sets were manually transcribed. The context for some test sets is defined per utterance (e.g. test set "Entities and location"), whereas for others the context is constant for the whole test set (e.g. test set "Confirmation"). Several experimental setups were used to evaluate the positive effect of relevant context and the negative effect (overtriggering) of irrelevant context. In the baseline setup, experiments are run with no context provided. In order to evaluate the positive effect of relevant context we use the following setups: (1) in sets with per-utterance context, we attach to each utterance its relevant context; (2) in sets with fixed context, the same context is attached to every utterance.

In order to evaluate the negative effect of irrelevant context we use the following setups: (1) in sets with per-utterance context, we attach to each utterance 100 irrelevant contexts randomly selected from other utterances; we call this a negative set. (2) In sets with fixed context, we attach the fixed context to a set of utterances for which the context is irrelevant; we call this an anti-set.

4.1.1. Entities and location

This test set contains 876 utterances. Each utterance contains the name of an entity and/or the name of a location, e.g. "Directions to Sky Song in Phoenix, Arizona". The context is defined per utterance, and is a list of locations and entities, e.g. {"Sky Song", "Phoenix, Arizona"}.
Test set variants: entities pos, entities neg, entities baseline.

4.1.2. Confirmation

This test set contains 1000 utterances. All queries correspond to a state where the user is provided with the choice to confirm or cancel some action. The context is the same for all utterances, and it consists of the words {"yes", "no", "cancel"}.
Test set variants: ync pos, ync baseline, and anti ync. anti ync is an anti-set consisting of 22k utterances not related to confirmation/cancellation states.

4.1.3. Hard n-grams

This test set consists of 2,704 utterances. All utterances in this test set contain n-grams with high LM costs, for n ∈ [2, 7]. The context, defined per utterance, is a list of high-cost n-grams.
Test set variants: costly pos, costly neg, costly baseline.

4.1.4. Class based (numeric)

This test set contains 816 utterances, each containing some type of number in the transcript, e.g. "Set alarm for 5:30 p.m. today". The context is defined per utterance and consists of a list of the transcripts with class members replaced by their class symbol (e.g. "Set alarm for $TIME p.m. today"). The context for each utterance contains the utterance's modified transcript.
Test set variants: numeric, numeric baseline.

4.1.5. Class based (contacts)

This test set contains 10,670 utterances. All utterances correspond to contact calling voice commands, e.g. "Call James Brown". Similar to the numeric test set, the context is created by replacing contact-name class members with "$CONTACTS" in the transcripts.

4.2. Recognition accuracy with biasing

We measured the effect of biasing on our test sets using both of the functions introduced in section 3.1. We then measured the effects of each of the features that our biasing implementation supports. Finally, we show how we can control the strength of the biasing by varying a range of parameters of our biasing score function.

Table 1 shows the effect of biasing versus the baseline (i.e. no biasing). "bias 1" uses the length-linear scoring function, and "bias 2" uses the unigram-and-bigram function. Both biasing tests use the positive biasing interpolation function in equation (3); however, "bias 1" uses (α, β) = (0.25, 1), whereas "bias 2" uses (α, β) = (0, 1), which is effectively the same as using min(sG(w|H), sB(w|H)) for interpolation. The values of α and β control the interpolation of main LM costs and biasing scores based on equation (3).
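This equivalence can be verified directly from equations (2) and (3): with (α, β) = (0, 1),

    C'(sG(w|H), sB(w|H)) = 0·sG(w|H) + 1·sB(w|H) = sB(w|H),

so equation (3) reduces to C(sG(w|H), sB(w|H)) = min(sG(w|H), sB(w|H)).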
Test set       Baseline   bias 1   bias 2
entities pos   8.9        7.2      7.2
entities neg   8.9        9.0      9.0
ync pos        18.8       10.4     11.0
anti ync       10.9       10.9     10.9
costly pos     12.9       4.2      6.1
costly neg     12.9       13.8     13.8
numeric        11.0       4.7      5.7
contacts       15.0       2.8      3.2

Table 1: WER(%) for baseline vs. two biasing methods. bias 1: length-linear, (α, β) = (0.25, 1) and (p1, p2) = (0, −0.4). bias 2: unigram-and-bigram, (α, β) = (0, 1) and (p1, p2) = (7, 3).

Test set       bias 2   bias 2.a   bias 2.b   bias 2.c
entities pos   7.2      7.2        7.3        7.4
entities neg   9.0      8.9        9.0        9.0
ync pos        11.0     15.0       11.6       11.0
anti ync       10.9     10.9       10.9       10.9
costly pos     6.1      6.5        6.7        6.1
costly neg     13.8     13.6       13.8       13.8
numeric        5.7      6.1        6.0        5.9
contacts       3.2      5.1        3.2        3.2

Table 2: The effect of biasing features on WER(%). bias 2: with all features, same as in Table 1. bias 2.a: without sentence boundaries. bias 2.b: without case variants. bias 2.c: without OOV support.

Table 2 compares the effect of having each of the following features disabled:
bias 2.a. Add sentence boundaries to the context.
bias 2.b. Include upper/lower case variants of the context.
bias 2.c. Support OOV words in the context.

Disabling sentence boundaries has a particularly detrimental effect on our ync pos test set. It also negatively affects the contacts, numeric, and costly pos test sets. As mentioned in section 3.2, the reason is that in the ync pos test set, the context and most of the expected transcripts are unigrams. Adding sentence boundaries allows us to bias these contexts at the bigram level, which can be safely biased more heavily. In contrast, trying to achieve the same result by increasing the unigram bias would result in a sharp increase in insertion errors.

Disabling case variants has a negative effect on the recognition results for the test sets for which the context includes the case variant that is less probable in the LM. The worst effect is on the costly pos and numeric test sets. This feature does not have a noticeable effect on the negative test sets, so it can be safely turned on by default. The third feature, support for OOV words in the context, reduces WER on test sets that include OOVs and has no effect otherwise.

Finally, we present WER for our entities and location test sets at a range of operating points (sets of values for the parameters in the scoring function). This test set contains context phrases of various lengths, case variants, and OOVs. In Table 3 the WERs for the positive and negative tests are shown side by side, for various pairs of values of (p1, p2) in equation (5). The point (0, 0) corresponds to baseline WERs, as the context is ignored.

p1 \ p2    -1            0             1             5
-2         30.8 / 43.3   30.5 / 42.8   32.7 / 42.0   35.9 / 42.1
 0         7.2 / 9.3     8.9 / 8.9     7.3 / 9.0     7.7 / 8.9
 6         6.6 / 9.6     7.3 / 9.1     6.6 / 9.2     7.3 / 9.0
10         6.7 / 9.4     8.2 / 8.9     7.0 / 9.0     7.5 / 8.9

Table 3: WER(%) for entities pos (first value in each cell) and entities neg (second value) over a range of (p1, p2) values for the unigram-and-bigram scoring function (equation (5)).

At the lowest level of biasing, (p1, p2) = (10, 5), the negative test is not affected (the WER is equal to baseline); however, the positive WER is already better than the baseline. As the strength of bias is increased, the WER for the negative test increases monotonically, but the WER for the positive test set decreases, up to a certain minimum, after which it also starts increasing. This is because the context starts to cause errors in the parts of the utterance that are not supposed to be biased. At the extremely high biasing level of (−2, −1), both the positive and negative tests are significantly worse than baseline.

The operating point of (7, 3) used in Table 1 and Table 2 is a relatively conservative operating point, which has minimal effect on the negative test. This operating point was chosen to balance positive and negative performance on several different test sets. For the test set used in Table 3, a more aggressive operating point of (6, 1) results in WERs of 6.6% and 9.2%, respectively, on the positive and negative tests (baseline is 8.9%).

4.3. Live biasing experiments

In order to further validate that our system improvements are beneficial, we ran a live experiment. In this experiment, a percentage of the production traffic is cloned and sent to two speech recognition systems. We focused only on the traffic corresponding to the confirmation dialog state, that is, the state in which a user is asked to respond with one of the words "yes", "no", "cancel". The first system was used as the baseline, while the second used the biasing methodology described in this paper. In the biasing system, for each of the utterances we used the fixed biasing context consisting of the three words described above.

During our experiment approximately 30,000 utterances were processed by each system. This was done anonymously and on the fly. We compared the performance of the two systems using sentence accuracy as the metric. Using the biasing methodology and the optimal operating point described in section 4.2 resulted in a relative sentence accuracy increase of 8%. This was significant with p < 0.1.

5. Conclusion

In this paper, we describe an approach for biasing speech recognition towards provided contextual information. We analyze various types of context, describe context preprocessing techniques, and provide a solution for OOVs present in the context. We also present biasing functions used to adjust LM scores based on the provided context. We conducted experiments using several datasets with various types of contextual information. The results show that the proposed methodology can significantly improve speech recognition accuracy when reliable contextual information is available. For example, on our confirmation (ync) test set, a relative WER reduction of 44% is achieved on the positive (ync pos) test set without any WER change on the negative test set (anti ync). Furthermore, we show that these speech recognition gains are achieved without causing overtriggering on queries not related to the context.
6. References
[1] K. B. Hall, E. Cho, C. Allauzen, F. Beaufays, N. Coc-
caro, K. Nakajima, M. Riley, B. Roark, D. Rybach, and
L. Zhang, “Composition-based on-the-fly rescoring for
salient n-gram biasing,” in Interspeech 2015, 2015.
[2] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and
M. Mohri, “OpenFst: A general and efficient weighted
finite-state transducer library,” in CIAA 2007, ser. LNCS,
vol. 4783, 2007, pp. 11–23, http://www.openfst.org.
[3] M. Mohri, F. Pereira, and M. Riley, “Speech recognition
with weighted finite-state transducers,” in Handbook of
Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang,
Eds. Springer, 2008, pp. 559–582.
[4] R. Kuhn and R. De Mori, “A cache-based natural language
model for speech recognition,” Pattern Analysis and Ma-
chine Intelligence, IEEE Transactions on, vol. 12, no. 6,
pp. 570–583, 1990.
[5] P. R. Clarkson and A. J. Robinson, “Language model
adaptation using mixtures and an exponentially decay-
ing cache,” in Acoustics, Speech, and Signal Processing,
1997. ICASSP-97., 1997 IEEE International Conference
on, vol. 2. IEEE, 1997, pp. 799–802.
[6] S. Besling and H.-G. Meier, “Language model speaker
adaptation,” in Fourth European Conference on Speech
Communication and Technology, 1995.
[7] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state
transducers in speech recognition,” Computer Speech and
Language, vol. 16, pp. 69–88, 2002.
[8] G. Saon, D. Povey, and G. Zweig, “Anatomy of an ex-
tremely fast LVCSR decoder,” in Proc. Interspeech,
2005, pp. 549–552.
[9] T. Hori, C. Hori, Y. Minami, and A. Nakamura, “Efficient
WFST-based one-pass decoding with on-the-fly hypoth-
esis rescoring in extremely large vocabulary continuous
speech recognition,” Audio, Speech, and Language Pro-
cessing, IEEE Transactions on, vol. 15, no. 4, pp. 1352–
1365, 2007.
[10] P. Aleksic, C. Allauzen, D. Elson, A. Kracun, D. M. Casado,
and P. J. Moreno, “Improved recognition of contact names
in voice commands,” in ICASSP 2015, 2015.
[11] L. Vasserman, V. Schogol, and K. Hall, “Sequence-based
class tagging for robust transcription in ASR,” in Submit-
ted to Interspeech, 2015.
[12] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai,
and R. L. Mercer, “Class-based n-gram models of natu-
ral language,” Computational Linguistics, vol. 18, no. 4,
pp. 467–479, 1992.
