Easy Does It: Robust Spectro-Temporal Many-Stream ASR without Fine Tuning Streams
Ravuri, Morgan, UC Berkeley
Presented by JJ
Motivation
Physiological experiments in different mammal species: a large percentage of neurons in the primary auditory cortex (A1) respond differently to upward- versus downward-moving ripples in the spectrogram of the input (Depireux et al., 2001).
Spectro-temporal receptive fields (STRFs): individual neurons are sensitive to specific spectro-temporal modulation frequencies in the incoming sound signal.
Introduction
Cortically inspired time-frequency (TF) features capture spectral and temporal modulations and are used for speech recognition and discrimination. Spectro-temporal features are derived by filtering spectrograms with particular filters. In this case, a Gabor filter is applied to the auditory spectrogram.
Gabor Filters
A Gabor filter is the product of a Gaussian envelope and a complex sinusoid s(n, k).
[Figures: examples of a 1D Gabor filter and a 2D Gabor filter, each shown with its Gaussian envelope and complex sinusoid s(n, k)]
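As a rough illustration of the envelope-times-sinusoid construction, here is a minimal NumPy sketch of a 2D Gabor filter. The function name gabor_2d, the modulation frequencies omega_n and omega_k, and the envelope widths sigma_n and sigma_k are hypothetical names and values chosen for this example; they are not the parameters used in the paper.

```python
import numpy as np

def gabor_2d(n_size, k_size, omega_n, omega_k, sigma_n, sigma_k):
    """Return a complex 2D Gabor filter of shape (k_size, n_size).

    n indexes time frames and k indexes frequency channels.
    All parameter names and values here are illustrative assumptions.
    """
    # Index grids centered on the middle of the filter.
    n = np.arange(n_size) - (n_size - 1) / 2.0
    k = np.arange(k_size) - (k_size - 1) / 2.0
    kk, nn = np.meshgrid(k, n, indexing="ij")

    # Gaussian envelope: peaks at the center, decays with distance.
    envelope = np.exp(-0.5 * ((nn / sigma_n) ** 2 + (kk / sigma_k) ** 2))
    # Complex sinusoid s(n, k) with temporal and spectral modulation frequencies.
    sinusoid = np.exp(1j * (omega_n * nn + omega_k * kk))
    return envelope * sinusoid

g = gabor_2d(n_size=25, k_size=11, omega_n=0.25, omega_k=0.5,
             sigma_n=6.0, sigma_k=3.0)
print(g.shape)  # (11, 25)
```

A filter like this would then be applied to the spectrogram by 2D filtering, giving one spectro-temporal feature map per choice of modulation frequencies.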
Their Gabor Filters
[Table: filter parameters and indices]
Tons of Combinations!
System
[Diagram: two streams; merge MLP outputs; PCA; append MFCC; output]
System
MLP (Multilayer Perceptron): the structure of the MLP depends on the type of feature and corpus.
Feature type    Input units    Frames of context
Spectral        567            9
Cepstral        351            9
Hidden units: 160 for Aurora2, 500 for Numbers95
Output units: 56
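The input sizes above are consistent with stacking 9 frames of context: 567 = 9 x 63 and 351 = 9 x 39. The per-frame sizes 63 and 39 are inferred from that arithmetic rather than stated on the slide, and the helper name stack_context and the edge-padding choice are assumptions made for this sketch.

```python
import numpy as np

def stack_context(features, context=9):
    """Stack `context` consecutive frames centered on each frame.

    features: array of shape (num_frames, dim).
    Returns an array of shape (num_frames, dim * context).
    Edge frames are repeated as padding (an assumption for this sketch).
    """
    half = context // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    # Window i holds the frame at offset (i - half) from the current frame.
    windows = [padded[i:i + len(features)] for i in range(context)]
    return np.concatenate(windows, axis=1)

spectral = np.zeros((100, 63))  # assumed 63 spectral features per frame
cepstral = np.zeros((100, 39))  # assumed 39 cepstral features per frame
print(stack_context(spectral).shape)  # (100, 567)
print(stack_context(cepstral).shape)  # (100, 351)
```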
System
The outputs of each MLP stream provide an estimate of the posterior probability distribution over phones. These phone probability estimates are then combined across streams by inverse entropy weighting.
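One common form of inverse entropy combination weights each stream, frame by frame, by the reciprocal of the entropy of its posterior, so that confident (low-entropy) streams count more. The sketch below assumes that plain form; the function name inverse_entropy_merge is hypothetical, and the paper may use a variant (for example with entropy thresholds).

```python
import numpy as np

def inverse_entropy_merge(posteriors, eps=1e-12):
    """Merge per-stream phone posteriors with inverse entropy weights.

    posteriors: array of shape (num_streams, num_frames, num_phones),
                each row along the last axis summing to 1.
    Returns an array of shape (num_frames, num_phones).
    """
    p = np.asarray(posteriors, dtype=float)
    # Entropy of each stream's posterior at each frame: shape (S, T).
    entropy = -np.sum(p * np.log(p + eps), axis=-1)
    inv = 1.0 / (entropy + eps)
    # Normalize the weights across streams so they sum to 1 per frame.
    weights = inv / inv.sum(axis=0, keepdims=True)
    return np.sum(weights[..., None] * p, axis=0)

# Toy example: 2 streams, 1 frame, 4 phones (the slides use 56 outputs).
confident = [[0.97, 0.01, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
merged = inverse_entropy_merge([confident, uncertain])
print(merged.shape)  # (1, 4)
```

In the toy example the merged posterior stays much closer to the confident stream than a plain average would.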
System
The Karhunen-Loeve (KL) transform, i.e. Principal Components Analysis (PCA), is then applied to the log probabilities of the merged MLP outputs. The result is orthogonalized and reduced to 32D. The features are mean and variance normalized by utterance and finally appended to the MFCC features.
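The post-processing steps can be sketched as follows: take the log of the merged posteriors, project onto 32 principal components, normalize mean and variance over the utterance, and append to the 39D MFCC features to obtain 71D. The function names fit_pca and tandem_features are hypothetical, the data is random, and for brevity the PCA basis is estimated on the same toy utterance, whereas in practice it would presumably be estimated on training data.

```python
import numpy as np

def fit_pca(log_probs, num_components=32):
    """Estimate a PCA (KL transform) basis from log probabilities."""
    mean = log_probs.mean(axis=0)
    cov = np.cov(log_probs - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:num_components]
    return mean, eigvecs[:, order]

def tandem_features(merged_posteriors, mfcc, pca_mean, pca_basis, eps=1e-10):
    """Log, PCA-project, normalize per utterance, and append to MFCCs."""
    log_p = np.log(merged_posteriors + eps)
    projected = (log_p - pca_mean) @ pca_basis
    # Mean and variance normalization over the utterance.
    normalized = (projected - projected.mean(axis=0)) / (projected.std(axis=0) + eps)
    return np.hstack([mfcc, normalized])

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(56), size=200)  # toy merged MLP outputs
mfcc = rng.standard_normal((200, 39))              # toy MFCC features
mean, basis = fit_pca(np.log(posteriors + 1e-10))
features = tandem_features(posteriors, mfcc, mean, basis)
print(features.shape)  # (200, 71)
```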
System
[Diagram: two streams (56D MLP outputs); merge MLP outputs; PCA (32D); append to MFCC (39D); 71D output features; HMM]
Experiments
Database
Aurora 2 (0 to 20 dB)
Numbers95: various numeric portions extracted from telephone dialogues
  Vocabulary size of 32 words
  Training set: 3590 utterances of clean data, totaling roughly 3 hrs
  Two test sets of 1227 utterances each. The first contains only clean data; the second contains the same utterances with noise added at five SNRs (20 dB, 15 dB, 10 dB, 5 dB, and 0 dB).
  Additive noise
Systems
Baseline: 39 MFCC
4-stream system
28-stream system
Uni-modulation system 150 stream spectral only and spectral/cepstral
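For readers unfamiliar with noisy test sets, the sketch below shows one simple way to add noise to a clean signal at a target SNR by scaling the noise power. This is only an illustration under simple assumptions (average power over the whole signal, no filtering); it is not the procedure used to build Aurora 2 or the noisy Numbers95 set, and the function name add_noise_at_snr is hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled to the requested SNR in dB.

    Power is averaged over the whole signal (a simplifying assumption).
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that speech_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)  # toy "speech"
noise = rng.standard_normal(8000)
noisy_sets = {snr: add_noise_at_snr(clean, noise, snr) for snr in (20, 15, 10, 5, 0)}
```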
Metric: Word Error Rate (WER)
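WER is the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognized word sequence, divided by the number of reference words. A minimal sketch, assuming whitespace-separated words and a non-empty reference; the function name word_error_rate is chosen for this example.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("two" -> "too") and one deletion ("four") over 4 words.
print(word_error_rate("one two three four", "one too three"))  # 0.5
```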
Results
[Figures: WER results on Aurora 2 and Numbers 95, with Discussion 1, Discussion 2, and Discussion 3]
Future Work
Not just additive noise
Another TF feature might not work: log-mel filterbank? Or power, like PNCC?
How to combine MLPs? Inverse entropy?