ELEN 6820
Speech and audio signal processing
Instructor: Nima Mesgarani (nm2764)
3 credits
TA: Yi Luo (yl3364)
Office hours: TBD
Course overview
• Brief history of speech recognition
• Discrete Signal Processing (DSP) overview
• Pattern recognition and deep learning overview
• Speech signal production
• Speech signal representation
• Auditory Scene Analysis, speech enhancement and separation
• Speech processing in the auditory system
• Acoustic modeling
• Sequence recognition and Hidden Markov Models
• Language models
• Music signal processing
Homeworks
• HW1: Discrete signal processing (written) (W2)
• HW2: Neural networks and voice activity detection (programming) (W3&4)
• HW3: Speech signal production and representation (written) (W5)
• HW3: Speech enhancement and separation (programming) (W6)
• HW4: Acoustic event detection and speaker identification (programming) (W7&8)
• HW5: Phoneme recognition and automatic speech recognition (programming) (W9&10)
• Final project (programming) (W11-13)
Week  Topic                              HW
1     Introduction and history           -
2     Discrete signal processing         DSP (W)
3     Machine learning 1                 Neural network and VAD (P)
4     Machine learning 2                 -
5     Speech signal production           Speech production (W)
6     Speech signal representation       Speech enhancement (P)
7     Speech enhancement and separation  -
8     Human speech perception            Acoustic event detection (P)
9     Acoustic modeling                  -
10    Sequence modeling and HMMs         Phoneme recognition and ASR (P)
11    Language modeling                  -
12    Automatic speech recognition       Project
13    Music signal processing            -
Course evaluation
• Two written homeworks (20%)
• Four programming homeworks (60%)
• Final project (20%)
• Late submission: 10% penalty per day
Final project
• Preferably choose the predefined course project
• Alternatively, define a project that is similar in scope and workload, in discussion with me and Yi
How to install Python with a graphical interface on Mac/Windows/Linux
Install Jupyter Notebook using Anaconda and conda
Download it from the following address and follow the instructions:
www.anaconda.com/products/individual
• Anaconda installs Python and Jupyter Notebook together, along with some necessary packages (e.g. numpy, scipy, etc.)
• On Mac OS you can use either the graphical installer or the command-line installer
• If you have Windows 10 and want to use bash commands, it is highly recommended that you enable the Linux subsystem bash environment and install a Linux version of Anaconda on it (using the command-line installer)
• After installing Anaconda, you can run Jupyter Notebook either from the Anaconda app or by running the command jupyter notebook in a terminal
• You can also use other environments for interacting with Python, but the one recommended for this course is Jupyter Notebook, especially if you want to run your code on a server (e.g. for TensorFlow)
• More information is available at: http://jupyter.readthedocs.io/en/latest/index.html
• More instructions and tutorials for getting started with Python will be given next session
• The assignments will be checked in Jupyter Notebook using Python 3
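A quick sanity check after installation, meant to be pasted into a notebook cell. This is only a sketch: the helper name check_environment is made up for this example, and the package list simply reuses the two packages named above.

```python
import importlib.util
import sys

def check_environment(packages=("numpy", "scipy")):
    """Map each package name to True if Python can locate it (hypothetical helper)."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

print("Python version:", sys.version.split()[0])
assert sys.version_info.major == 3, "the assignments are checked with Python 3"

for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as MISSING, the notebook is probably running outside the Anaconda environment.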
Signal processing background (chapter 2)
Speech communication
[Figure: speech production and perception; labeled part: ear drum]
Cocktail party problem (Cherry, 1953)
Discrete Signal Processing
• Discrete-time signals and systems
• Discrete-time Fourier transform, z-transform
• Digital filters, IIR and FIR
• Sampling theorem, changing the sampling rate
• Emphasis on intuition
Discrete-time Signals and Systems
• Speech signal: a continuously varying pattern represented as a function of a continuous variable t, which represents time.
• Discrete signal: x[n] = x_a(nT), where T = 1/Fs
• Telephone bandwidth speech: Fs = 6.4 kHz
• Wide-band speech: Fs = 16 kHz
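The relation x[n] = x_a(nT) can be tried out directly. The sketch below samples a continuous-time function at the two rates listed above; the 1 kHz cosine and the function names are assumptions chosen for illustration.

```python
import math

def sample(xa, fs, num_samples):
    """Return x[n] = xa(n * T) for n = 0..num_samples-1, with T = 1 / fs."""
    T = 1.0 / fs
    return [xa(n * T) for n in range(num_samples)]

def xa(t):
    """Hypothetical continuous-time signal: a 1 kHz cosine."""
    return math.cos(2 * math.pi * 1000.0 * t)

x_wide = sample(xa, fs=16000, num_samples=16)  # wide-band rate: 16 samples per period
x_tel = sample(xa, fs=6400, num_samples=16)    # telephone rate: 6.4 samples per period

print([round(v, 3) for v in x_wide[:5]])
print([round(v, 3) for v in x_tel[:5]])
```

The same waveform yields different sequences at different rates, so a sequence is only meaningful together with its Fs.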
A few basics
• Unit impulse function, unit step function, exponential sequence
• Convolution
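These building blocks fit in a few lines of plain Python. This is a minimal sketch: the function names are invented here, sequences are assumed to start at n = 0, and in practice a library routine would replace the double loop.

```python
def unit_impulse(n):
    """delta[n]: 1 at n = 0, else 0."""
    return 1.0 if n == 0 else 0.0

def unit_step(n):
    """u[n]: 1 for n >= 0, else 0."""
    return 1.0 if n >= 0 else 0.0

def exponential(n, a):
    """Causal exponential sequence a**n * u[n]."""
    return a ** n if n >= 0 else 0.0

def convolve(x, h):
    """Linear convolution y[n] = sum_k x[k] h[n-k] of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm
    return y

x = [1.0, 2.0, 3.0]
delta = [unit_impulse(n) for n in range(3)]     # 1, 0, 0
h = [exponential(n, 0.5) for n in range(4)]     # 1, 0.5, 0.25, 0.125

print(convolve(x, delta))  # convolving with the impulse returns x, zero-padded
print(convolve(x, h))
```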
Transformations of Signals and Systems
• Fourier Transform
• z-Transform
The Continuous-Time Fourier Transform
• What did Fourier show?
• What's the big deal? (1822)
• Decomposing signals into fast and slow components
• Importance of sine function for linear systems
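One way to see why sinusoids matter for linear systems, sketched for a linear time-invariant system with impulse response h(t). The symbols h, H and Ω are notation assumed for this illustration, not taken from the slides. A complex exponential input comes out as the same exponential, scaled by a constant:

```latex
x(t) = e^{j\Omega t}
\;\Longrightarrow\;
y(t) = \int_{-\infty}^{\infty} h(\tau)\, e^{j\Omega (t-\tau)}\, d\tau
     = \left[ \int_{-\infty}^{\infty} h(\tau)\, e^{-j\Omega \tau}\, d\tau \right] e^{j\Omega t}
     = H(j\Omega)\, e^{j\Omega t}
```

So once a signal is decomposed into sinusoidal components, the system's effect on each component is just a gain and a phase shift.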
The z-Transform
• A powerful tool for analyzing linear systems described by difference equations
• Definition
• Inverse z-Transform
• Examples: delayed unit response, box pulse, exponential
• Properties of the z-Transform: linearity, shift, exponential weighting, linear weighting, convolution, multiplication of sequences
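Using the standard definition X(z) = sum over n of x[n] z^(-n), two of the items above can be checked numerically for finite sequences. This is a sketch: the test point z and the sequences are arbitrary choices, and z_transform is a name invented for this example.

```python
def z_transform(x, z):
    """Evaluate X(z) = sum_n x[n] * z**(-n) for a finite sequence x[0..N-1]."""
    return sum(xn * z ** (-n) for n, xn in enumerate(x))

z = complex(1.2, 0.9)          # arbitrary test point with |z| = 1.5

# Box pulse of length N: X(z) = (1 - z**-N) / (1 - z**-1).
N = 5
box = [1.0] * N
box_error = abs(z_transform(box, z) - (1 - z ** (-N)) / (1 - z ** (-1)))

# Shift property: delaying x by n0 samples multiplies X(z) by z**(-n0).
x = [1.0, -2.0, 0.5]
n0 = 2
delayed = [0.0] * n0 + x
shift_error = abs(z_transform(delayed, z) - z ** (-n0) * z_transform(x, z))

print(box_error, shift_error)  # both should be at rounding-error level
```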
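The Discrete-Time Fourier Transform
• Definition, periodic:
  $X(e^{j\omega}) = \sum_{n=-\infty}^{+\infty} x[n]\, e^{-j\omega n}$
• Sufficient condition for convergence:
  $\sum_{n=-\infty}^{+\infty} |x[n]| < +\infty$
• Although x[n] is discrete, $X(e^{j\omega})$ is continuous and periodic with period 2π.
• Inverse DTFT:
  $x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$
• Convolution/multiplication duality:
  $y[n] = x[n] * h[n] \;\Leftrightarrow\; Y(e^{j\omega}) = X(e^{j\omega})\, H(e^{j\omega})$
  $y[n] = x[n]\, w[n] \;\Leftrightarrow\; Y(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, X(e^{j(\omega-\theta)})\, d\theta$
• DTFT of a cosine signal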
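For finite sequences the DTFT sum can be evaluated directly at any frequency, which makes the periodicity and the convolution property easy to check. A minimal sketch follows; the sequences and the frequency 0.7 rad/sample are arbitrary choices made for this example.

```python
import cmath
import math

def dtft(x, omega):
    """Evaluate X(e^{j omega}) = sum_n x[n] e^{-j omega n} for a finite x[0..N-1]."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

def convolve(x, h):
    """Linear convolution of two finite sequences that both start at n = 0."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm
    return y

x = [1.0, 2.0, 0.5, -1.0]   # hypothetical test sequences
h = [0.25, 0.5, 0.25]
omega = 0.7                 # arbitrary frequency in radians per sample

# Periodicity: shifting omega by 2*pi leaves the DTFT unchanged.
periodicity_error = abs(dtft(x, omega + 2 * math.pi) - dtft(x, omega))

# Duality: the DTFT of x * h equals the product of the two DTFTs.
duality_error = abs(dtft(convolve(x, h), omega) - dtft(x, omega) * dtft(h, omega))

print(periodicity_error, duality_error)  # both at rounding-error level
```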
The Discrete Fourier Transform
• Sampling the DTFT: Discrete Fourier Transform (DFT)
Practical implications
• Periodic signals, or finite-length sequences
• Which frequency does each DFT bin correspond to?
• Circular shift of x[n]
• Boundary conditions, importance of windowing
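The bin-to-frequency question has a short answer: with N samples at rate Fs, bin k sits at k·Fs/N Hz, and bins above N/2 represent negative frequencies. The sketch below uses a direct O(N²) DFT in plain Python (in practice numpy.fft.fft would be used); the 2 kHz test tone, Fs = 16 kHz and N = 32 are assumptions for the example.

```python
import cmath
import math

def dft(x):
    """Direct DFT: X[k] = sum_n x[n] e^{-j 2 pi k n / N}."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def bin_frequency(k, N, fs):
    """Frequency in Hz of DFT bin k; bins above N/2 map to negative frequencies."""
    return (k if k <= N // 2 else k - N) * fs / N

fs, N = 16000, 32
x = [math.cos(2 * math.pi * 2000 * n / fs) for n in range(N)]  # exactly 4 periods
X = dft(x)

peak = max(range(N // 2 + 1), key=lambda k: abs(X[k]))
print(peak, bin_frequency(peak, N, fs))   # bin index and its frequency in Hz
print(bin_frequency(N - peak, N, fs))     # mirror bin at the negative frequency
```

The tone lands in a single bin here only because it completes a whole number of periods within the N samples; otherwise energy leaks into neighboring bins, which is where windowing comes in.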
Time-Dependent Fourier Transform
Create a finite-length sequence by windowing:
[Figure: signal x[m] with the window shown at w[50 − m], w[100 − m] and w[200 − m], i.e. n = 50, n = 100 and n = 200]
$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$
If n is fixed, then it can be shown that:
$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$
The above equation is meaningful only if we assume that $X(e^{j\omega})$ represents the Fourier transform of a signal whose properties continue outside the window, or simply that the signal is zero outside the window.
In order for $X_n(e^{j\omega})$ to correspond to $X(e^{j\omega})$, $W(e^{j\omega})$ must resemble an impulse.
Rectangular window
$w[n] = 1, \quad 0 \le n \le N-1$
6.345 Automatic Speech Recognition (2003) Speech Signal Representation
Hamming window
$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$
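Both window definitions translate directly into code. A small sketch, with function names chosen for this example and N ≥ 2 assumed for the Hamming window:

```python
import math

def rectangular(N):
    """w[n] = 1 for 0 <= n <= N-1."""
    return [1.0] * N

def hamming(N):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N-1)) for 0 <= n <= N-1 (requires N >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

w = hamming(5)
print([round(v, 3) for v in w])  # tapers toward 0.08 at both ends, 1.0 in the middle
```

The taper is the point: the Hamming window fades the frame edges instead of cutting them off abruptly as the rectangular window does.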
Comparison of Windows
Spectrogram
• Use a sliding window over the signal, and display the magnitude of the DFT for each step.
• Large vs. small window?
• Overlapping vs. non-overlapping?
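The sliding-window recipe can be sketched in plain Python as follows. The Hamming window, the 64-sample window with 50% overlap, Fs = 8 kHz and the two-tone test signal are all assumptions for this example; a real implementation would use an FFT rather than the direct DFT below.

```python
import cmath
import math

def hamming(N):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def dft_magnitude(frame):
    """Magnitudes |X[k]| for k = 0..N/2 of one windowed frame."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N // 2 + 1)]

def spectrogram(x, win_len, hop):
    """One column of DFT magnitudes per window position; hop < win_len means overlap."""
    w = hamming(win_len)
    columns = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = [x[start + n] * w[n] for n in range(win_len)]
        columns.append(dft_magnitude(frame))
    return columns

fs = 8000
# Test signal: 500 Hz for the first 256 samples, then 1500 Hz.
x = [math.sin(2 * math.pi * (500 if n < 256 else 1500) * n / fs) for n in range(512)]

S = spectrogram(x, win_len=64, hop=32)
peaks = [max(range(len(col)), key=lambda k: col[k]) for col in S]
print(len(S), peaks[0], peaks[-1])  # number of columns, first and last peak bins
```

With a 64-sample window the bins are 125 Hz apart, so the peak moves from bin 4 to bin 12 when the tone changes. A longer window would give finer frequency spacing but a blurrier estimate of when the change happens.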
A Wideband Spectrogram
[Figure: wideband spectrogram of "Two plus seven is less than ten"]
A Narrowband Spectrogram
[Figure: narrowband spectrogram of "Two plus seven is less than ten"]
Tradeoff between DFT length (temporal resolution) and spectral resolution
Digital filters
• A digital filter is a discrete-time linear shift-invariant system
• Convolution equation: unit response, transfer function, system function
• All useful systems satisfy the linear difference equation
FIR vs. IIR filters
• Linear vs. nonlinear phase
• Large vs. small impulse response duration
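The difference between the two filter types shows up in their impulse responses. The sketch below implements a linear difference equation with zero initial conditions. The sign convention y[n] = Σ b[r] x[n−r] + Σ a[k] y[n−k] and the coefficient values are assumptions for this example (other texts and libraries put the feedback terms on the left-hand side with the opposite sign).

```python
def difference_equation(b, a, x):
    """y[n] = sum_r b[r] x[n-r] + sum_{k>=1} a[k-1] y[n-k], zero initial conditions."""
    y = []
    for n in range(len(x)):
        acc = sum(b[r] * x[n - r] for r in range(len(b)) if n - r >= 0)
        acc += sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y

impulse = [1.0] + [0.0] * 7

# FIR: no feedback terms, so the impulse response is just b and then stops.
fir = difference_equation([0.5, 0.5], [], impulse)

# IIR: one feedback term, y[n] = x[n] + 0.5 y[n-1], so the response is 0.5**n forever.
iir = difference_equation([1.0], [0.5], impulse)

print(fir)
print(iir)
```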
Sampling
• Represent a continuous-time signal as a sequence of numbers
• The Sampling Theorem
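What goes wrong when the sampling theorem is violated can be shown in a few lines. In this sketch (the rate and tone frequencies are chosen for illustration), a 5 kHz cosine sampled at 8 kHz lies above Fs/2 = 4 kHz and produces exactly the same samples as a 3 kHz cosine:

```python
import math

def sample_cosine(freq_hz, fs, num_samples):
    """Samples of cos(2 pi f t) taken at t = n / fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs) for n in range(num_samples)]

fs = 8000
above_nyquist = sample_cosine(5000, fs, 16)  # 5 kHz > fs / 2
alias = sample_cosine(3000, fs, 16)          # fs - 5 kHz = 3 kHz

max_diff = max(abs(p - q) for p, q in zip(above_nyquist, alias))
print(max_diff)  # rounding-error level: the two sequences are indistinguishable
```

Once sampled, nothing in the sequence says which of the two tones was present, which is why the signal must be band-limited below Fs/2 before sampling.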
Changing the sampling rate of a signal