End-to-end systems 1: CTC
(Connectionist Temporal Classification)
Steve Renals
Automatic Speech Recognition – ASR Lecture 15
11 March 2019
End-to-end systems
End-to-end systems are systems which learn to directly map from an input
sequence X to an output sequence Y , estimating P(Y |X )
Y can be a sequence of words or subwords
ML-trained HMMs are a kind of end-to-end system – the HMM estimates P(X |Y ),
and when combined with a language model gives an estimate of P(Y |X )
Sequence discriminative training of HMMs (using GMMs or DNNs) can be
regarded as end-to-end
But training is quite complicated: the denominator (total likelihood) must be
estimated using lattices, and the model is first trained conventionally (ML for GMMs,
CE for NNs) and then fine-tuned using sequence discriminative training
Lattice-free MMI is one way to address these issues
Other approaches are based on recurrent networks which directly map input to output
sequences:
CTC – Connectionist Temporal Classification
Encoder-decoder approaches (next lecture)
Deep Speech
[Figure 1 of Hannun et al (2014): structure of the Deep Speech RNN model]
Input: filter bank features (spectrogram)
3 feed-forward hidden layers
Bidirectional recurrent hidden layer
Softmax output layer
Output: character probabilities (a-z, <apostrophe>, <space>, <blank>)
Trained using CTC
Hannun et al (2014), “Deep Speech: Scaling up end-to-end speech recognition”,
https://arxiv.org/abs/1412.5567
Deep Speech: Results
Model                                     SWB    CH     Full
Vesely et al. (GMM-HMM BMMI) [44]         18.6   33.0   25.8
Vesely et al. (DNN-HMM sMBR) [44]         12.6   24.1   18.4
Maas et al. (DNN-HMM SWB) [28]            14.6   26.3   20.5
Maas et al. (DNN-HMM FSH) [28]            16.0   23.7   19.9
Seide et al. (CD-DNN) [39]                16.1   n/a    n/a
Kingsbury et al. (DNN-HMM sMBR HF) [22]   13.3   n/a    n/a
Sainath et al. (CNN-HMM) [36]             11.5   n/a    n/a
Soltau et al. (MLP/CNN+I-Vector) [40]     10.4   n/a    n/a
Deep Speech SWB                           20.0   31.8   25.9
Deep Speech SWB + FSH                     12.6   19.3   16.0
Published error rates (%WER) on Switchboard dataset splits; the columns labeled “SWB”
and “CH” are respectively the easy and hard subsets of Hub5’00 (Table 3 of Hannun et al, 2014).
Deep Speech Training
Maps from acoustic frames X to subword sequences S, where S is a sequence of
characters (in some other CTC approaches, S can be a sequence of phones)
CTC loss function
Makes good use of large training data
Additional synthetic training data generated by jittering the signal and adding noise
Many computational optimisations
n-gram language model to impose word-level constraints
Competitive results on standard tasks
Connectionist Temporal Classification (CTC)
Train a recurrent network to map from input sequence X to output sequence S
sequences can be different lengths – for speech, input sequence X (acoustic frames)
is much longer than output sequence S (characters or phonemes)
CTC does not require frame-level alignment (matching each input frame to an
output token)
CTC sums over all possible alignments (similar to forward-backward algorithm) –
“alignment free”
Possible to back-propagate gradients through CTC
Good overview of CTC: Awni Hannun, “Sequence Modeling with CTC”, Distill.
https://distill.pub/2017/ctc
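CTC losses of this kind are built into common toolkits. As a rough, minimal sketch (not the Deep Speech implementation), alignment-free training with PyTorch's nn.CTCLoss might look like the following; the tensor shapes and random data are placeholders standing in for real network outputs and character targets:

    import torch
    import torch.nn as nn

    T, N, C, S = 50, 4, 29, 12   # frames, batch size, output symbols (incl. blank), target length

    # Stand-in for the network's per-frame log-probabilities over output symbols
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
    targets = torch.randint(1, C, (N, S), dtype=torch.long)    # character indices; 0 is the blank
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    ctc_loss = nn.CTCLoss(blank=0)                             # sums over alignments internally
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                                            # gradients w.r.t. the network outputs

No frame-level alignment is supplied anywhere: only the input and target lengths are given, and the loss marginalises over alignments itself.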
CTC: Alignment
Imagine mapping (x1, x2, x3, x4, x5, x6) to [a, b, c]
Possible alignments: aaabbc, aabbcc, abbbbc, ...
However:
We don’t always want to map every input frame to an output symbol (e.g. if there is
“inter-symbol silence”)
We want to be able to have two identical symbols adjacent to each other – and keep the
distinction between outputs such as [h, e, l, l, o] and [h, e, l, o]
Solve this using an additional blank symbol (ϵ)
CTC output compression:
1 Merge repeated characters
2 Remove blanks
Thus to model the same character twice in succession, separate the repetitions with a blank
Some possible alignments for [h, e, l, l, o] and [h, e, l, o] given a 10-element input
sequence (the blank between the two l’s is required in the first case):
[h, e, l, l, o]: h h e ϵ l ϵ l ϵ o o;  h ϵ e e l l ϵ l o o
[h, e, l, o]: h h e e e l l l o o;  h h h e e ϵ l l o o
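A minimal sketch of this compression (merge repeats, then remove blanks); the function name and the use of ‘ϵ’ as the blank symbol are just illustrative choices:

    from itertools import groupby

    def ctc_collapse(alignment, blank="ϵ"):
        """Apply the CTC compression rules to one alignment."""
        merged = [symbol for symbol, _ in groupby(alignment)]     # 1. merge repeated characters
        return [symbol for symbol in merged if symbol != blank]   # 2. remove blanks

    print(ctc_collapse(list("hheϵlϵlϵoo")))   # ['h', 'e', 'l', 'l', 'o']
    print(ctc_collapse(list("hhheeϵlloo")))   # ['h', 'e', 'l', 'o']

Note that the blank between the two l’s is what preserves the double letter; without it the repeats would be merged away.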
CTC: Alignment example
Alignment: h h e ϵ ϵ l l l ϵ l l o
First, merge repeated characters: h e ϵ l ϵ l o
Then, remove any ϵ tokens: h e l l o
The remaining characters are the output: h e l l o
CTC: Valid and invalid alignments
Consider an output [c, a, t] with an input of length six
Valid alignments: ϵ c c ϵ a t;  c c a a t t;  c a ϵ ϵ ϵ t
Invalid alignments: c ϵ c ϵ a t (corresponds to Y = [c, c, a, t]);  c c a a t (has length 5);
c ϵ ϵ ϵ t t (missing the 'a')
CTC: Alignment properties
Monotonic – Alignments are monotonic (left-to-right model); no re-ordering
(unlike neural machine translation)
Many-to-one – Alignments are many-to-one; many inputs can map to the same
output (however a single input cannot map to many outputs)
CTC doesn’t find a single alignment: it sums over all possible alignments
CTC: Loss function (1)
Let C be an output label sequence, including blanks and repetitions – same length
as input sequence X
Posterior probability of output labels C = (c1, ..., ct, ..., cT) given the input
sequence X = (x1, ..., xt, ..., xT):
    P(C|X) = ∏_{t=1}^{T} y(ct, t)
where y(ct, t) is the network output for label ct at time t
This is the probability of a single alignment
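To make the product concrete, a tiny sketch with a made-up 4-frame posterior matrix over the labels {a, b, ϵ} (all numbers invented for illustration):

    import numpy as np

    labels = ["a", "b", "ϵ"]
    # y[k, t]: network output for label k at frame t (each column sums to 1)
    y = np.array([[0.6, 0.5, 0.1, 0.2],   # a
                  [0.1, 0.2, 0.7, 0.6],   # b
                  [0.3, 0.3, 0.2, 0.2]])  # ϵ

    def alignment_prob(alignment):
        """P(C|X) for a single alignment C: the product of y(ct, t) over frames."""
        return np.prod([y[labels.index(c), t] for t, c in enumerate(alignment)])

    print(alignment_prob(["a", "a", "b", "b"]))   # 0.126  - one alignment of [a, b]
    print(alignment_prob(["a", "ϵ", "b", "b"]))   # 0.0756 - another alignment of [a, b]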
CTC: Loss function (2)
Let S be the target output sequence after compression
Compute the posterior probability of the target sequence S = (s1, ..., sm, ..., sM)
(M ≤ T) given X by summing over the possible CTC alignments:
    P(S|X) = Σ_{C ∈ A(S)} P(C|X)
where A(S) is the set of output label sequences C that can be mapped to S
using the CTC compression rules (merge repeated labels, then remove blanks)
The CTC loss function L_CTC is the negative log probability of the target sequence,
i.e. the negative log of the summed alignment probabilities:
    L_CTC = − log P(S|X)
Perform the sum over alignments using dynamic programming – with a similar structure
to that used in the forward-backward and Viterbi algorithms (see Hannun for details;
a sketch is given below)
Various NN architectures can be used for CTC – usually use a deep bidirectional
LSTM RNN
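A minimal sketch of that dynamic programming (the CTC forward recursion), using the same toy posterior matrix as before; this is an illustrative re-implementation, not code from any toolkit, and for clarity it works in the probability domain rather than in log space:

    import numpy as np

    def ctc_neg_log_prob(y, labels, target, blank="ϵ"):
        """-log P(S|X): sum over all CTC alignments via the forward recursion.

        y: array (num_labels, T) of per-frame output probabilities.
        labels: label symbols indexing the rows of y.
        target: the compressed target sequence S (no blanks, non-empty).
        """
        T = y.shape[1]
        # Extended target with blanks between and around labels: ϵ s1 ϵ s2 ... sM ϵ
        ext = [blank]
        for s in target:
            ext += [s, blank]
        L = len(ext)
        idx = [labels.index(s) for s in ext]

        alpha = np.zeros((T, L))
        alpha[0, 0] = y[idx[0], 0]   # start with the initial blank ...
        alpha[0, 1] = y[idx[1], 0]   # ... or with the first target label
        for t in range(1, T):
            for i in range(L):
                a = alpha[t - 1, i]
                if i > 0:
                    a += alpha[t - 1, i - 1]
                # Skipping the previous blank is allowed unless it would merge a repeat
                if i > 1 and ext[i] != blank and ext[i] != ext[i - 2]:
                    a += alpha[t - 1, i - 2]
                alpha[t, i] = a * y[idx[i], t]
        p = alpha[T - 1, L - 1] + alpha[T - 1, L - 2]   # end in the final blank or final label
        return -np.log(p)

    labels = ["a", "b", "ϵ"]
    y = np.array([[0.6, 0.5, 0.1, 0.2],
                  [0.1, 0.2, 0.7, 0.6],
                  [0.3, 0.3, 0.2, 0.2]])
    print(ctc_neg_log_prob(y, labels, ["a", "b"]))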
CTC: Distribution over alignments
We start with an input sequence, such as a spectrogram of audio
The input is fed into an RNN, for example
The network gives pt(a|X), a distribution over the outputs {h, e, l, o, ϵ} for each input step
With the per time-step output distributions, we compute the probability of different
alignments, e.g. h e ϵ l l ϵ l l o o;  h h e l l ϵ ϵ l ϵ o;  ϵ e ϵ l l ϵ ϵ l o o
By marginalising over alignments, we get a distribution over output sequences,
e.g. [h, e, l, l, o]; [e, l, l, o]; [h, e, l, o]
[Figure after Hannun (2017), “Sequence Modeling with CTC”, Distill]
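To make “marginalising over alignments” concrete, a brute-force sketch that enumerates every alignment of a tiny 4-frame input over {h, e, l, o, ϵ} and accumulates probability per collapsed output (made-up numbers; the dynamic programming of the previous slides computes the same sums without enumeration):

    from itertools import product, groupby

    labels = ["h", "e", "l", "o", "ϵ"]
    # Toy per-frame distributions p_t(a|X) for T = 4 frames (each row sums to 1)
    p = [
        {"h": 0.6, "e": 0.1, "l": 0.1, "o": 0.1, "ϵ": 0.1},
        {"h": 0.1, "e": 0.6, "l": 0.1, "o": 0.1, "ϵ": 0.1},
        {"h": 0.1, "e": 0.1, "l": 0.6, "o": 0.1, "ϵ": 0.1},
        {"h": 0.1, "e": 0.1, "l": 0.1, "o": 0.6, "ϵ": 0.1},
    ]

    def collapse(alignment, blank="ϵ"):
        merged = [a for a, _ in groupby(alignment)]         # merge repeats
        return tuple(a for a in merged if a != blank)       # remove blanks

    output_probs = {}
    for alignment in product(labels, repeat=len(p)):        # all 5^4 alignments
        prob = 1.0
        for t, a in enumerate(alignment):
            prob *= p[t][a]
        out = collapse(alignment)
        output_probs[out] = output_probs.get(out, 0.0) + prob   # marginalise over alignments

    best = max(output_probs, key=output_probs.get)
    print(best, output_probs[best])                          # most probable output sequence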
Understanding CTC: Conditional independence assumption
Each output is dependent on the entire input sequence (in Deep Speech this is
achieved using a bidirectional recurrent layer)
Given the inputs, each output is independent of the other outputs (conditional
independence)
CTC does not learn a language model over the outputs, although a language
model can be applied later
Graphical model showing the dependencies in CTC: each output a1, a2, ..., aT depends on
the entire input X, with no edges between the outputs
Understanding CTC: CTC and HMM
[Figure: a left-to-right HMM with states a, b, compared with the corresponding CTC HMM
with skippable blank states: ϵ a ϵ b ϵ]
CTC can be interpreted as an HMM with additional (skippable) blank states,
trained discriminatively
Applying language models to CTC
Direct interpolation of a language model with the CTC acoustic model:
    Ŵ = arg max_W (α log P(S|X) + log P(W))
Only consider word sequences W which correspond to the subword sequence S
(using a lexicon)
α is an empirically determined scale factor to match the acoustic model to the
language model
Lexicon-free CTC: use a “subword language model” P(S) (Maas et al, 2015)
WFST implementation: create an FST T which transforms a framewise label
sequence c into the subword sequence S, then compose with L and G :
T ◦ min(det(L ◦ G )) (Miao et al, 2015)
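As a minimal sketch of the first option (direct interpolation), rescoring an n-best list of hypotheses; the hypothesis list, the scores, and the value of α are all invented for illustration:

    # Each hypothesis carries the CTC score log P(S|X) for its character sequence
    # and the language-model score log P(W) for the corresponding word sequence.
    hypotheses = [
        {"words": "their house", "log_p_s_given_x": -12.3, "log_p_w": -8.1},
        {"words": "there house", "log_p_s_given_x": -11.9, "log_p_w": -10.4},
    ]

    alpha = 0.8   # acoustic scale factor, tuned empirically on a development set

    def combined_score(hyp):
        return alpha * hyp["log_p_s_given_x"] + hyp["log_p_w"]

    best = max(hypotheses, key=combined_score)
    print(best["words"])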
Mozilla Deep Speech
Mozilla have released an open-source TensorFlow implementation of the Deep
Speech architecture:
https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
https://github.com/mozilla/DeepSpeech
Close to state-of-the-art results on LibriSpeech
Mozilla Common Voice project: https://voice.mozilla.org/en
Summary and reading
CTC is an alternative approach to sequence discriminative training, typically
applied to RNN systems
Used in “Deep Speech” architecture for end-to-end speech recognition
Reading
A Hannun et al (2014), “Deep Speech: Scaling up end-to-end speech recognition”,
arXiv:1412.5567. https://arxiv.org/abs/1412.5567
A Hannun (2017), “Sequence Modeling with CTC”, Distill.
https://distill.pub/2017/ctc
Background reading
Y Miao et al (2015), “EESEN: End-to-end speech recognition using deep RNN
models and WFST-based decoding”, ASRU 2015.
https://ieeexplore.ieee.org/abstract/document/7404790
A Maas et al (2015). “Lexicon-free conversational speech recognition with neural
networks”, NAACL HLT 2015, http://www.aclweb.org/anthology/N15-1038