Lecture01 Introduction

The document outlines a course on Speech Recognition at Shanghai University of Engineering Science, detailing the schedule, assessment criteria, and course content. It includes a list of students, reference materials, and a framework for Automatic Speech Recognition (ASR). The course spans 12 weeks with lectures and experiments focusing on various aspects of speech processing and recognition.


To Get Started

before the Course

Steven (吴 中)

Mobile: 13641865488
E-mail: stevenwuzhong@sues.edu.cn
#01 To Know You and Me
You

SN  ID         Name          Chinese Name  Nationality
1   027121101  AKTERN.
2   027121102  AL-ADEMIB.
3   027121103  AURNABM.
4   027121104  BARUAS.
5   027121105  CHIRWAK.
6   027121106  DELANIEG.
7   027121107  ELM.
8   027121108  GHAFRIF.
9   027121109  HAQUEI.
10  027121110  HAQUEM.
11  027121111  HOSSAINM.A.
12  027121112  HOSSAINM.S.
13  027121113  HUBBLEA.
14  027121114  JAHIDS.
15  027121115  JALLAHW.
16  027121116  KAMALI.
17  027121117  KAULAC.
18  027121118  KOLISONC.
19  027121119  LAMTOUEHR.
20  027121120  M’KADDAMY.
21  027121121  MAHMUDM.
22  027121122  MOSHAROFM.
23  027121123  PAVELM.
24  027121124  PONEMASHO.
25  027121125  SABBIRM.
26  027121126  SHAIKATM.
27  027121127  SHAKERINM.
28  027121128  TOKPAHE.
29  037121119  ISKAKOVK.

Me

Steven (吴 中)
Mobile: 13641865488
E-mail: stevenwuzhong@sues.edu.cn

2024/3/1 Shanghai University of Engineering Science
#02 Reference Materials
• Spoken Language Processing: A Guide to Theory, Algorithm, and System Development
• Speech and Language Processing
• Automatic Speech Recognition



#03 Contents for Theory and Experiments

Theory: 16 lectures

• Introduction: 1
• Fundamental Theory: 2
• Speech Features: 2
• Hidden Markov Model: 4
• Language Model: 2
• DNN for SR: 4
• Course Review and Q&A: 1

Experiments: 8 lectures

• Spectrum Analysis: 1
• Spectrogram and MFCC: 1
• Observation Probability: 1
• Optimal State Path: 1
• N-Gram: 1
• DNN for SR: 3



#04 Timeline in the Semester
• 12 weeks, delivered live, starting now (1st week)
• 2 lectures per week: Tuesday and Friday
• The 13th week: final exam

Timeline (February to May):

• 2/27: the 1st lecture
• 5/17: the last lecture



#05 Assessment

• Attendance: 10%
• Homework: 20%
• Experiment: 20%
• Final Exam: 50%



Chapter 1

Introduction

Lecture 01: Objectives

⭐ What is Speech Recognition?

⭐ A Typical Speech Recognition System

⭐ Speech Recognition Components

⭐ History of Speech Recognition



#01 What is Speech Recognition?

Speech: “Hi, Siri!”

• Speech-to-Text transcription (STT)
• Transform recorded audio into a sequence of words.
• Just the words, no meaning. But we do need to deal with acoustic ambiguity: “Recognize speech?” or “Wreck a nice beach?”



#02 An Example of Speech Recognition
Utterance: 今天天气很好 (“The weather is very nice today”)

• Word sequence: 今天 天气 很 好 (scored by the Language Model, P(W))
• Phonemes: j in1 t ian1 t ian1 q i1 h en2 h ao3
• States: s1 s2 s3 s4 ⋯ (scored by the Acoustic Model, P(O|W))
• Features



#02 An Example of Speech Recognition
Utterance: 今天天气很好 (“The weather is very nice today”)

Word sequence: 今天 天气 很 好

Lexicon:
今天  j in1 t ian1
天气  t ian1 q i1
很    h en2
好    h ao3

Phonemes: j in1 t ian1 t ian1 q i1 h en2 h ao3

HMM states (stochastic): s1 s2 s3 s4 s5 ⋯ s1 s2 s3 s4 s5

Observations: o1 o2 o3 o4 o5 o6 o7 o8 o9 ⋯

• Speak what?
• How to speak?
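
The lexicon step in this example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the dictionary copies the four entries shown on the slide, while the names `LEXICON` and `words_to_phonemes` are hypothetical and do not come from the course materials.

```python
# Pronunciation lexicon copied from the slide: word -> phoneme units
# (initial/final units with tone digits).
LEXICON = {
    "今天": ["j", "in1", "t", "ian1"],
    "天气": ["t", "ian1", "q", "i1"],
    "很": ["h", "en2"],
    "好": ["h", "ao3"],
}


def words_to_phonemes(words, lexicon=LEXICON):
    """Concatenate the lexicon pronunciation of each word, in order."""
    phonemes = []
    for word in words:
        if word not in lexicon:
            # An out-of-vocabulary word has no pronunciation to expand.
            raise KeyError(f"word not in lexicon: {word}")
        phonemes.extend(lexicon[word])
    return phonemes


words = ["今天", "天气", "很", "好"]
phonemes = words_to_phonemes(words)
print(" ".join(phonemes))  # j in1 t ian1 t ian1 q i1 h en2 h ao3
```

The printed phoneme string matches the phoneme row of the example; each phoneme would then be modelled by a short chain of HMM states.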



#03 What is “Automatic” SR?

• Computer recognition of speech
• Enabling a computer to “recognize” what was spoken
  • Usually understood as the ability to faithfully transcribe what was spoken
    • Something even humans often cannot do
  • More completely, the ability to understand what was spoken
    • Which humans do extremely well



#04 Why Speech?
• Most natural form of human communication
• With modern telephones, people can communicate over long distances
• For natural human-machine interaction, like voice search
• For spoken document processing, like speech mining and retrieval
• For fun: artificially intelligent robots that talk like humans
• Voice commands can free hands and eyes for other tasks
  • Especially in cars, where hands and eyes are busy



#05 A Framework of ASR

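[Figure: block diagram of an ASR system]

• Recognition path: Speech → Feature Extraction → Decoder → Text
• The Decoder draws on three knowledge sources:
  • Lexicon
  • Acoustic Model, built by Acoustic Modeling from Speech Corpora
  • Language Model, built by Language Modeling from Text Corpora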



#06 Hierarchical Modelling of Speech

• We generally represent recorded speech as a sequence of acoustic feature vectors (observations) X, and the output word sequence as W
• At recognition time, our aim is to find the most likely W, given X
• To achieve this, statistical models are trained using a corpus of paired examples (X_n, W_n)

Use an acoustic model, language model, and lexicon to obtain the most probable word sequence W* given the observed acoustics X:

$$W^* = \arg\max_W P(W \mid X)$$
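
To make the idea of X as a sequence of feature vectors more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the course's feature pipeline: the signal is a synthetic sine wave, the 25 ms window and 10 ms hop are commonly used values chosen here for the example, and the single log-energy feature stands in for richer features such as MFCC. The function names are hypothetical.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def log_energy(frame):
    """Log of the summed squared amplitude; the small constant avoids log(0)."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# Assumed setup: 0.1 s of a 100 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(1600)]

# 25 ms window (400 samples) and 10 ms hop (160 samples).
frames = frame_signal(samples, frame_len=400, hop=160)

# X is a sequence of feature vectors, here one-dimensional per frame.
X = [[log_energy(frame)] for frame in frames]
print(len(X), "feature vectors of dimension", len(X[0]))
```

Each element of `X` corresponds to one short time slice of the recording, which is the form in which the acoustic model sees the speech.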



#07 Fundamental Equation of Statistical SR
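If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

$$W^* = \arg\max_W P(W \mid X)$$

Applying Bayes’ Theorem:

$$P(W \mid X) = \frac{P(X \mid W)\,P(W)}{P(X)} \propto P(X \mid W)\,P(W)$$

Since P(X) does not depend on W, it can be dropped from the maximization:

$$W^* = \arg\max_W \underbrace{P(X \mid W)}_{\text{Acoustic Model}}\;\underbrace{P(W)}_{\text{Language Model}}$$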
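
The decision rule above can be tried out on the ambiguity example from earlier in the lecture. The sketch below is a toy illustration only: the two probability tables hold made-up numbers (an assumption, not measured values), the candidate list has just two entries, and the names `log_score`, `decode` and `posterior` are hypothetical.

```python
import math

# Hypothetical scores for illustration only; a real system would compute
# them from trained acoustic and language models.
ACOUSTIC_LIKELIHOOD = {  # P(X | W): how well each word sequence fits the audio
    "recognize speech": 1.0e-5,
    "wreck a nice beach": 1.2e-5,
}
LANGUAGE_PRIOR = {  # P(W): how plausible each word sequence is as text
    "recognize speech": 1.0e-3,
    "wreck a nice beach": 1.0e-6,
}


def log_score(w):
    """log P(X|W) + log P(W); logs avoid underflow when probabilities are tiny."""
    return math.log(ACOUSTIC_LIKELIHOOD[w]) + math.log(LANGUAGE_PRIOR[w])


def decode(candidates):
    """Return the candidate W that maximizes P(X|W) P(W)."""
    return max(candidates, key=log_score)


def posterior(candidates):
    """Normalize by P(X), approximated as the sum over the candidate list."""
    joint = {w: ACOUSTIC_LIKELIHOOD[w] * LANGUAGE_PRIOR[w] for w in candidates}
    evidence = sum(joint.values())
    return {w: p / evidence for w, p in joint.items()}


candidates = list(ACOUSTIC_LIKELIHOOD)
best = decode(candidates)
acoustic_only = max(candidates, key=ACOUSTIC_LIKELIHOOD.get)
post = posterior(candidates)

print("acoustic model alone:", acoustic_only)
print("acoustic + language model:", best)
```

With these assumed numbers the acoustic model alone slightly prefers the wrong transcription, and the language model prior corrects it. Dividing by P(X) in `posterior` rescales the scores but does not change which candidate wins, which is why the decoder can ignore it.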



#08 ASR History

Phase                         Time        Models                                          Task
Phase I: Template Matching    1950s       Template matching (10 words)                    Isolated word
                              1970s       DTW, VQ
Phase II: Statistical Models  1980~1990s  GMM-HMM                                         Continuous speech
Phase III: Deep Learning      2006        DNN                                             Complex scenarios
                              2012        DNN-HMM
                              2015~       End-to-End: CTC, RNN-T, Attention, Transformer



#09 Further Reading…

• Chapter 1 in Spoken Language Processing
• Chapter 1 in Automatic Speech Recognition: A Deep Learning Approach
• A Historical Perspective of Speech Recognition



#10 Coursework

• What are the input and the output of a typical ASR system?
• What are the main parts of an ASR system? Please draw the Framework of ASR.
• What are the three phases in ASR history?



THANK YOU
