End-to-end systems 1: CTC
(Connectionist Temporal Classification)
Steve Renals
Automatic Speech Recognition – ASR Lecture 15
11 March 2019
End-to-end systems
End-to-end systems are systems which learn to directly map from an input
sequence X to an output sequence Y , estimating P(Y |X )
Y can be a sequence of words or subwords
ML-trained HMMs are a kind of end-to-end system – the HMM estimates P(X |Y ),
and when combined with a language model gives an estimate of P(Y |X )
Sequence discriminative training of HMMs (using GMMs or DNNs) can be
regarded as end-to-end
But training is quite complicated: the denominator (total likelihood) must be
estimated using lattices, and the model is first trained conventionally (ML for GMMs,
CE for NNs) and then fine-tuned using sequence discriminative training
Lattice-free MMI is one way to address these issues
Other approaches are based on recurrent networks which directly map input to output
sequences:
CTC – Connectionist Temporal Classification
Encoder-decoder approaches (next lecture)
Deep Speech
[Figure 1 of Hannun et al (2014): structure of the Deep Speech RNN model]
Input: filter bank features (spectrogram)
3 feed-forward hidden layers
Bidirectional recurrent hidden layer
Softmax output layer
Output: character probabilities (a-z, <apostrophe>, <space>, <blank>)
Trained using CTC
Hannun et al (2014), “Deep Speech: Scaling up end-to-end speech recognition”,
https://arxiv.org/abs/1412.5567
Deep Speech: Results
Model                                     SWB    CH     Full
Vesely et al. (GMM-HMM BMMI) [44]         18.6   33.0   25.8
Vesely et al. (DNN-HMM sMBR) [44]         12.6   24.1   18.4
Maas et al. (DNN-HMM SWB) [28]            14.6   26.3   20.5
Maas et al. (DNN-HMM FSH) [28]            16.0   23.7   19.9
Seide et al. (CD-DNN) [39]                16.1   n/a    n/a
Kingsbury et al. (DNN-HMM sMBR HF) [22]   13.3   n/a    n/a
Sainath et al. (CNN-HMM) [36]             11.5   n/a    n/a
Soltau et al. (MLP/CNN+I-Vector) [40]     10.4   n/a    n/a
Deep Speech SWB                           20.0   31.8   25.9
Deep Speech SWB + FSH                     12.6   19.3   16.0
Published error rates (%WER) on Switchboard dataset splits; the columns labeled “SWB”
and “CH” are respectively the easy and hard subsets of Hub5’00 (Table 3 of Hannun et al, 2014).
Deep Speech Training
Maps from acoustic frames X to subword sequences S, where S is a sequence of
characters (in some other CTC approaches, S can be a sequence of phones)
CTC loss function
Makes good use of large training data
Additional synthetic training data generated by jittering the signal and adding noise
Many computational optimisations
n-gram language model to impose word-level constraints
Competitive results on standard tasks
Connectionist Temporal Classification (CTC)
Train a recurrent network to map from input sequence X to output sequence S
sequences can be different lengths – for speech, input sequence X (acoustic frames)
is much longer than output sequence S (characters or phonemes)
CTC does not require frame-level alignment (matching each input frame to an
output token)
CTC sums over all possible alignments (similar to forward-backward algorithm) –
“alignment free”
Possible to back-propagate gradients through CTC
Good overview of CTC: Awni Hannun, “Sequence Modeling with CTC”, Distill.
https://distill.pub/2017/ctc
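CTC losses of this kind are built into common toolkits. As a rough, minimal sketch (not the Deep Speech implementation), alignment-free training with PyTorch's nn.CTCLoss might look like the following; the tensor shapes and random data are placeholders standing in for real network outputs and character targets:

    import torch
    import torch.nn as nn

    T, N, C, S = 50, 4, 29, 12   # frames, batch size, output symbols (incl. blank), target length

    # Stand-in for the network's per-frame log-probabilities over output symbols
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
    targets = torch.randint(1, C, (N, S), dtype=torch.long)    # character indices; 0 is the blank
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    ctc_loss = nn.CTCLoss(blank=0)                             # sums over alignments internally
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                                            # gradients w.r.t. the network outputs

No frame-level alignment is supplied anywhere: only the input and target lengths are given, and the loss marginalises over alignments itself.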
CTC: Alignment
Imagine mapping (x1, x2, x3, x4, x5, x6) to [a, b, c]
Possible alignments: aaabbc, aabbcc, abbbbc, ...
However:
We don’t always want to map every input frame to an output symbol (e.g. if there is
“inter-symbol silence”)
We want to be able to have two identical symbols adjacent to each other – and keep the
distinction between outputs such as [h, e, l, l, o] and [h, e, l, o]
Solve this using an additional blank symbol (ϵ)
CTC output compression:
1 Merge repeated characters
2 Remove blanks
Thus to model the same character twice in succession, separate the repetitions with a blank
Some possible alignments for [h, e, l, l, o] and [h, e, l, o] given a 10-element input
sequence (the blank between the two l’s is required in the first case):
[h, e, l, l, o]: h h e ϵ l ϵ l ϵ o o;  h ϵ e e l l ϵ l o o
[h, e, l, o]: h h e e e l l l o o;  h h h e e ϵ l l o o
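A minimal sketch of this compression (merge repeats, then remove blanks); the function name and the use of ‘ϵ’ as the blank symbol are just illustrative choices:

    from itertools import groupby

    def ctc_collapse(alignment, blank="ϵ"):
        """Apply the CTC compression rules to one alignment."""
        merged = [symbol for symbol, _ in groupby(alignment)]     # 1. merge repeated characters
        return [symbol for symbol in merged if symbol != blank]   # 2. remove blanks

    print(ctc_collapse(list("hheϵlϵlϵoo")))   # ['h', 'e', 'l', 'l', 'o']
    print(ctc_collapse(list("hhheeϵlloo")))   # ['h', 'e', 'l', 'o']

Note that the blank between the two l’s is what preserves the double letter; without it the repeats would be merged away.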
CTC: Alignment example
Alignment: h h e ϵ ϵ l l l ϵ l l o
First, merge repeated characters: h e ϵ l ϵ l o
Then, remove any ϵ tokens: h e l l o
The remaining characters are the output: h e l l o
CTC: Valid and invalid alignments
Consider an output [c, a, t] with an input of length six
Valid alignments: ϵ c c ϵ a t;  c c a a t t;  c a ϵ ϵ ϵ t
Invalid alignments: c ϵ c ϵ a t (corresponds to Y = [c, c, a, t]);  c c a a t (has length 5);
c ϵ ϵ ϵ t t (missing the 'a')
CTC: Alignment properties
Monotonic – Alignments are monotonic (left-to-right model); no re-ordering
(unlike neural machine translation)
Many-to-one – Alignments are many-to-one; many inputs can map to the same
output (however a single input cannot map to many outputs)
CTC doesn’t find a single alignment: it sums over all possible alignments
CTC: Loss function (1)
Let C be an output label sequence, including blanks and repetitions – same length
as input sequence X
Posterior probability of output labels C = (c1, ..., ct, ..., cT) given the input
sequence X = (x1, ..., xt, ..., xT):
    P(C|X) = ∏_{t=1}^{T} y(ct, t)
where y(ct, t) is the network output for label ct at time t
This is the probability of a single alignment
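To make the product concrete, a tiny sketch with a made-up 4-frame posterior matrix over the labels {a, b, ϵ} (all numbers invented for illustration):

    import numpy as np

    labels = ["a", "b", "ϵ"]
    # y[k, t]: network output for label k at frame t (each column sums to 1)
    y = np.array([[0.6, 0.5, 0.1, 0.2],   # a
                  [0.1, 0.2, 0.7, 0.6],   # b
                  [0.3, 0.3, 0.2, 0.2]])  # ϵ

    def alignment_prob(alignment):
        """P(C|X) for a single alignment C: the product of y(ct, t) over frames."""
        return np.prod([y[labels.index(c), t] for t, c in enumerate(alignment)])

    print(alignment_prob(["a", "a", "b", "b"]))   # 0.126  - one alignment of [a, b]
    print(alignment_prob(["a", "ϵ", "b", "b"]))   # 0.0756 - another alignment of [a, b]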
CTC: Loss function (2)
Let S be the target output sequence after compression
Compute the posterior probability of the target sequence S = (s1, ..., sm, ..., sM)
(M ≤ T) given X by summing over the possible CTC alignments:
    P(S|X) = Σ_{C ∈ A(S)} P(C|X)
where A(S) is the set of output label sequences C that can be mapped to S
using the CTC compression rules (merge repeated labels, then remove blanks)
The CTC loss function L_CTC is the negative log probability of the target sequence,
i.e. the negative log of the summed alignment probabilities:
    L_CTC = − log P(S|X)
Perform the sum over alignments using dynamic programming – with a similar structure
to that used in the forward-backward and Viterbi algorithms (see Hannun for details;
a sketch is given below)
Various NN architectures can be used for CTC – usually use a deep bidirectional
LSTM RNN
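A minimal sketch of that dynamic programming (the CTC forward recursion), using the same toy posterior matrix as before; this is an illustrative re-implementation, not code from any toolkit, and for clarity it works in the probability domain rather than in log space:

    import numpy as np

    def ctc_neg_log_prob(y, labels, target, blank="ϵ"):
        """-log P(S|X): sum over all CTC alignments via the forward recursion.

        y: array (num_labels, T) of per-frame output probabilities.
        labels: label symbols indexing the rows of y.
        target: the compressed target sequence S (no blanks, non-empty).
        """
        T = y.shape[1]
        # Extended target with blanks between and around labels: ϵ s1 ϵ s2 ... sM ϵ
        ext = [blank]
        for s in target:
            ext += [s, blank]
        L = len(ext)
        idx = [labels.index(s) for s in ext]

        alpha = np.zeros((T, L))
        alpha[0, 0] = y[idx[0], 0]   # start with the initial blank ...
        alpha[0, 1] = y[idx[1], 0]   # ... or with the first target label
        for t in range(1, T):
            for i in range(L):
                a = alpha[t - 1, i]
                if i > 0:
                    a += alpha[t - 1, i - 1]
                # Skipping the previous blank is allowed unless it would merge a repeat
                if i > 1 and ext[i] != blank and ext[i] != ext[i - 2]:
                    a += alpha[t - 1, i - 2]
                alpha[t, i] = a * y[idx[i], t]
        p = alpha[T - 1, L - 1] + alpha[T - 1, L - 2]   # end in the final blank or final label
        return -np.log(p)

    labels = ["a", "b", "ϵ"]
    y = np.array([[0.6, 0.5, 0.1, 0.2],
                  [0.1, 0.2, 0.7, 0.6],
                  [0.3, 0.3, 0.2, 0.2]])
    print(ctc_neg_log_prob(y, labels, ["a", "b"]))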
CTC: Distribution over alignments
We start with an input sequence, such as a spectrogram of audio
The input is fed into an RNN, for example
The network gives pt(a|X), a distribution over the outputs {h, e, l, o, ϵ} for each input step
With the per time-step output distributions, we compute the probability of different
alignments, e.g. h e ϵ l l ϵ l l o o;  h h e l l ϵ ϵ l ϵ o;  ϵ e ϵ l l ϵ ϵ l o o
By marginalising over alignments, we get a distribution over output sequences,
e.g. [h, e, l, l, o]; [e, l, l, o]; [h, e, l, o]
[Figure after Hannun (2017), “Sequence Modeling with CTC”, Distill]
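To make “marginalising over alignments” concrete, a brute-force sketch that enumerates every alignment of a tiny 4-frame input over {h, e, l, o, ϵ} and accumulates probability per collapsed output (made-up numbers; the dynamic programming of the previous slides computes the same sums without enumeration):

    from itertools import product, groupby

    labels = ["h", "e", "l", "o", "ϵ"]
    # Toy per-frame distributions p_t(a|X) for T = 4 frames (each row sums to 1)
    p = [
        {"h": 0.6, "e": 0.1, "l": 0.1, "o": 0.1, "ϵ": 0.1},
        {"h": 0.1, "e": 0.6, "l": 0.1, "o": 0.1, "ϵ": 0.1},
        {"h": 0.1, "e": 0.1, "l": 0.6, "o": 0.1, "ϵ": 0.1},
        {"h": 0.1, "e": 0.1, "l": 0.1, "o": 0.6, "ϵ": 0.1},
    ]

    def collapse(alignment, blank="ϵ"):
        merged = [a for a, _ in groupby(alignment)]         # merge repeats
        return tuple(a for a in merged if a != blank)       # remove blanks

    output_probs = {}
    for alignment in product(labels, repeat=len(p)):        # all 5^4 alignments
        prob = 1.0
        for t, a in enumerate(alignment):
            prob *= p[t][a]
        out = collapse(alignment)
        output_probs[out] = output_probs.get(out, 0.0) + prob   # marginalise over alignments

    best = max(output_probs, key=output_probs.get)
    print(best, output_probs[best])                          # most probable output sequence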
Understanding CTC: Conditional independence assumption
Each output is dependent on the entire input sequence (in Deep Speech this is
achieved using a bidirectional recurrent layer)
Given the inputs, each output is independent of the other outputs (conditional
independence)
CTC does not learn a language model over the outputs, although a language
model can be applied later
Graphical model showing the dependencies in CTC: each output a1, a2, ..., aT depends on
the entire input X, with no edges between the outputs
Understanding CTC: CTC and HMM
[Figure: a left-to-right HMM with states a, b, compared with the corresponding CTC HMM
with skippable blank states: ϵ a ϵ b ϵ]
CTC can be interpreted as an HMM with additional (skippable) blank states,
trained discriminatively
Applying language models to CTC
Direct interpolation of a language model with the CTC acoustic model:
    Ŵ = arg max_W (α log P(S|X) + log P(W))
Only consider word sequences W which correspond to the subword sequence S
(using a lexicon)
α is an empirically determined scale factor to match the acoustic model to the
language model
Lexicon-free CTC: use a “subword language model” P(S) (Maas et al, 2015)
WFST implementation: create an FST T which transforms a framewise label
sequence c into the subword sequence S, then compose with L and G :
T ◦ min(det(L ◦ G )) (Miao et al, 2015)
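As a minimal sketch of the first option (direct interpolation), rescoring an n-best list of hypotheses; the hypothesis list, the scores, and the value of α are all invented for illustration:

    # Each hypothesis carries the CTC score log P(S|X) for its character sequence
    # and the language-model score log P(W) for the corresponding word sequence.
    hypotheses = [
        {"words": "their house", "log_p_s_given_x": -12.3, "log_p_w": -8.1},
        {"words": "there house", "log_p_s_given_x": -11.9, "log_p_w": -10.4},
    ]

    alpha = 0.8   # acoustic scale factor, tuned empirically on a development set

    def combined_score(hyp):
        return alpha * hyp["log_p_s_given_x"] + hyp["log_p_w"]

    best = max(hypotheses, key=combined_score)
    print(best["words"])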
Mozilla Deep Speech
Mozilla have released an open-source TensorFlow implementation of the Deep
Speech architecture:
https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
https://github.com/mozilla/DeepSpeech
Close to state-of-the-art results on LibriSpeech
Mozilla Common Voice project: https://voice.mozilla.org/en
Summary and reading
CTC is an alternative approach to sequence discriminative training, typically
applied to RNN systems
Used in “Deep Speech” architecture for end-to-end speech recognition
Reading
A Hannun et al (2014), “Deep Speech: Scaling up end-to-end speech recognition”,
arXiv:1412.5567. https://arxiv.org/abs/1412.5567
A Hannun (2017), “Sequence Modeling with CTC”, Distill.
https://distill.pub/2017/ctc
Background reading
Y Miao et al (2015), “EESEN: End-to-end speech recognition using deep RNN
models and WFST-based decoding”, ASRU 2015.
https://ieeexplore.ieee.org/abstract/document/7404790
A Maas et al (2015). “Lexicon-free conversational speech recognition with neural
networks”, NAACL HLT 2015, http://www.aclweb.org/anthology/N15-1038