Effect of Noise Suppression Losses on Speech Distortion and ASR Performance

ABSTRACT

Deep learning based speech enhancement has made rapid progress towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, speech quality scales directly with model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade-off. We further investigate integrating pre-trained reference-less predictors for mean opinion score (MOS) and word error rate (WER), as well as pre-trained embeddings from ASR and sound event detection models. Our analyses reveal that none of the pre-trained networks added significant performance over the strong spectral loss.

Index Terms: speech enhancement, noise reduction, speech distortion reduction, speech quality

1. INTRODUCTION

Speech enhancement techniques are present in almost any device with voice communication or voice command capabilities. The goal is to extract only the speaker's voice, reducing disturbing background noise to improve listening comfort and aid intelligibility for human or machine listeners. In the past few years, neural network-based speech enhancement techniques have shown tremendous improvements in noise reduction capability [1, 2]. Data-driven methods can learn the tempo-spectral properties of any type of speech and noise, in contrast to traditional statistical model-based approaches that often mismatch certain types of signals. While one big current challenge in this field is still to find smaller and more efficient network architectures that are computationally light enough for real-time processing while delivering good results, another major challenge, addressed here, is obtaining good, natural sounding speech quality without processing artifacts.

In the third deep noise suppression (DNS) challenge [2], the separate evaluation of speech distortion (SIG), background noise reduction (BAK), and overall quality (OVL) according to ITU-T P.835 revealed that while current state-of-the-art methods achieve outstanding noise reduction, only one submission did not degrade SIG on average while still showing high BAK. Degradations in SIG potentially also harm the performance of subsequent automatic speech recognition (ASR) systems, as well as human intelligibility.

In [3], a speech enhancement model was trained on a signal-based loss and an ASR loss with alternating updates. This method requires either transcriptions for all training data, or using a transcribed subset for the ASR loss. The authors also proposed to update the ASR model during training, which creates a dependency between the speech enhancement model and the jointly trained ASR engine that is often undesired in practice.

In this work, we explore several loss functions for a real-time deep noise suppressor with the goal of improving SIG without harming ASR rates. The contribution of this paper is threefold. First, we show that by decoupling the loss from the speech enhancement inference engine using end-to-end training, choosing a higher resolution in a spectral signal-based loss can improve SIG. Second, we propose ways to integrate MOS and WER estimates from pre-trained networks into the loss as weighting. Third, we evaluate additional supervised loss terms computed using pre-trained networks, similar to the deep feature loss [4]. In [5], six different networks pre-trained on different tasks were used to extract embeddings from output and target signals to form an additional loss, where a benefit was reported only for the sound event detection model published in [6]. We obtain different results when training on a larger dataset and evaluating on real data using decisive metrics for speech quality and ASR performance, finding that none of the pre-trained networks improved ASR rates or speech quality significantly.

2. SYSTEM AND TRAINING OBJECTIVE

A captured microphone signal can generally be described by

    y(t) = m\{ s(t) + r(t) + v(t) \},                                  (1)

where s(t) is the non-reverberant desired speech signal, r(t) undesired late reverberation, v(t) additive noise or interfering sounds, and m{·} can model linear and non-linear acoustical, electrical, or digital effects that the signals encounter.

2.1. End-to-end optimization

We use an end-to-end enhancement system applying a complex enhancement filter in the short-time Fourier transform (STFT) domain,

    \hat{s}(t) = \mathcal{F}_P^{-1}\{ G(k,n)\, \mathcal{F}_P\{ y(t) \} \},     (2)

where F_P{a(t)} = A(k, n) denotes the linear STFT operator yielding a complex signal representation at frequency index k and time index n, and G(k, n) is a complex enhancement filter. The end-to-end optimization objective is to train a neural network predicting G(k, n), while optimizing a loss on the time-domain output signal ŝ(t) as shown in Fig. 1.
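For illustration, a minimal sketch of the filtering in (2) using scipy's STFT/ISTFT pair as a stand-in for F_P; the sketch uses scipy's default Hann window, whereas the actual system uses square-root Hann windows (cf. Sec. 5.2), and the filter G would come from the CRUSE network of Sec. 2.2.

import numpy as np
from scipy.signal import stft, istft

def apply_enhancement_filter(y, G, fs=16000, win_ms=20):
    """Apply a complex enhancement filter G(k, n) in the STFT domain, cf. (2).

    y : 1-D time-domain microphone signal
    G : complex filter of shape (frequency bins, frames), e.g. a DNN output
    """
    nperseg = int(fs * win_ms / 1000)      # 20 ms analysis window
    noverlap = nperseg // 2                # 50% overlap
    _, _, Y = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)         # F_P{y(t)}
    S_hat = G * Y                          # element-wise complex filtering G(k, n) Y(k, n)
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)   # F_P^{-1}{.}
    return s_hat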
[Fig. 1: End-to-end trained system with various losses. The processing path (STFT_P, feature extraction, DNN, iSTFT_P) produces the output audio; during training only, spectral, MOS/WER, and embedding losses (e.g., from a sound event detection model) are computed between the output and target audio using a separate loss transform STFT_L.]

2.2. Features and network architecture

We use complex compressed features by feeding the real and imaginary parts of the complex FFT spectrum as channels into the first convolutional layer. Magnitude compression is applied to the complex spectra by

    Y^c(k,n) = |Y(k,n)|^c \, \frac{Y(k,n)}{\max(|Y(k,n)|, \eta)},          (3)

where the small positive constant η avoids division by zero.

We use the Convolutional Recurrent U-net for Speech Enhancement (CRUSE) model proposed in [7] with 4 convolutional encoder/decoder layers with time-frequency kernels (2,3) and stride (1,2) with pReLU activations, a group of 4 parallel GRUs in the bottleneck, and skip connections with 1×1 convolutions. The network outputs two channels for the real and imaginary part of the complex filter G(k, n). To ensure stability, we use a tanh output activation restraining the filter values to [−1, 1] as in [8].
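As an illustration of (3), a small numpy sketch of the compressed two-channel input features; the function name is ours, and we assume the same compression factor c = 0.3 that Sec. 4.1 uses for the loss.

import numpy as np

def compressed_input_features(Y, c=0.3, eta=1e-12):
    """Magnitude-compressed complex spectrum, cf. (3): Y^c = |Y|^c * Y / max(|Y|, eta)."""
    mag = np.abs(Y)
    Yc = (mag ** c) * Y / np.maximum(mag, eta)
    # real and imaginary parts become the two channels fed to the first conv layer
    return np.stack([Yc.real, Yc.imag], axis=0)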
3. DATA GENERATION AND AUGMENTATION

We use an online data generation and augmentation technique, exploiting randomness to generate a virtually unlimited amount of training data. Speech and noise portions are randomly selected from raw audio files with random start times to form 10 s clips. If a file is too short, one or more files are concatenated to obtain the 10 s length. 80% of the speech and noise clips are augmented with random biquad filters [9] and 20% are pitch-shifted within [−2, 8] semitones. If the speech is non-reverberant, a random room impulse response (RIR) is applied. The non-reverberant speech training target is obtained by windowing the RIR with a cosine decay of length 50 ms, starting 20 ms (one frame) after the direct path. Speech and noise are mixed with a signal-to-noise ratio (SNR) drawn from a normal distribution N(5, 10) dB. The signal levels are varied with a normal distribution N(−26, 10) dB.

We use 246 h of noise data, consisting of the DNS challenge noise data (180 h), internal recordings (65 h), and stationary noise (1 h), as well as 115 k RIRs published as part of the DNS challenges [10]. Speech data is taken mainly from the 500 h of high quality-rated Librivox data from [10], in addition to high-SNR data from AVspeech [11], the Mandarin, Spanish, singing, and emotional CremaD corpora published within [10], an internal collection of 8 h of emotional speech, and 2 h of laughing sourced from Freesound. Figure 2 shows the distributions of reverberant speech-to-noise ratio (RSNR) and signal-to-reverberation ratio (SRR) as predicted by a slightly modified neural network following [12]. The singing data is not shown in Fig. 2 as it is studio quality and our RSNR estimator is not trained on singing. While the Librivox data has both high RSNR and SRR, this is not the case for the other datasets, which have broader distributions and lower peaks. Therefore, we select only speech data from the AVspeech, Spanish, Mandarin, emotion, and laughing subsets with segSNR > 30 dB and SRR > 35 dB for training.

[Fig. 2: SNR and SRR distributions of the speech datasets, shown as normalized histograms per corpus, e.g., Spanish SLR (46 h), emotion CremaD (2 h), internal emotion (8 h), Freesound laughing (2 h); x-axes: RSNR (dB) and SRR (dB).]

4. LOSS FUNCTIONS

In this section, we describe the training loss functions used to optimize the enhanced signal ŝ(t). We always use a standard signal-based spectral loss described in Sec. 4.1, which is extended or modified in several ways as described in the following subsections.

4.1. Magnitude-regularized compressed spectral loss

As a standard spectral distance-based loss function L_SD, we use the complex compressed loss [11, 13], which outperformed other spectral distance-based losses in [14], given by

    L_{SD} = \frac{1}{\sigma_s^c} \Big[ \lambda \sum_{\kappa,\eta} \big| S^c - \hat{S}^c \big|^2 + (1-\lambda) \sum_{\kappa,\eta} \big( |S|^c - |\hat{S}|^c \big)^2 \Big],     (4)

where the spectra S(κ, η) = F_L{s(t)} and Ŝ(κ, η) = F_L{ŝ(t)} are computed with an STFT operation whose settings are independent of F_P{·} in (2), A^c = |A|^c A/|A| is the magnitude compression operation, and the frequency and time indices κ, η are omitted for brevity. The loss for each sequence is normalized by the active speech energy σ_s [15], which is computed from s(t) using a voice activity detector (VAD). The complex and magnitude loss terms are linearly weighted by λ, and the compression factor is c = 0.3. The processing STFT F_P and the loss frequency transform F_L can differ in, e.g., window and hop-size parameters, or can even be different types of frequency transforms, as shown in Fig. 1. In the following subsections, we propose several extensions to the spectral distance loss L_SD to potentially improve quality and generalization.

Optional frequency weighting: Frequency weightings for simple spectral distances are often used in perceptually motivated evaluation metrics, and attempts have been made to integrate them as optimization targets for speech enhancement [16]. While we already showed that the AMR-wideband based frequency weighting did not yield improvements for the experiments in [14], here we explore another attempt, applying a simple equivalent rectangular bandwidth (ERB) weighting [17] to the spectra in (4) using 20 bands.
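A minimal numpy sketch of (4), assuming the loss-domain spectra S and Ŝ have already been computed with F_L (e.g., 64 ms windows with 75% overlap, independently of the 20 ms processing STFT) and that the VAD-based active speech energy σ_s is given; the placement of the compression exponent on the normalizer follows our reading of the equation above.

import numpy as np

def spectral_loss(S, S_hat, sigma_s, lam=0.3, c=0.3, eta=1e-12):
    """Magnitude-regularized complex compressed spectral loss, cf. (4)."""
    def compress(A):
        mag = np.abs(A)
        return (mag ** c) * A / np.maximum(mag, eta)    # A^c = |A|^c * A / |A|

    complex_term = np.sum(np.abs(compress(S) - compress(S_hat)) ** 2)    # phase-aware term
    magnitude_term = np.sum((np.abs(S) ** c - np.abs(S_hat) ** c) ** 2)  # magnitude-only term
    return (lam * complex_term + (1.0 - lam) * magnitude_term) / (sigma_s ** c)

With λ = 1 only the phase-aware complex term remains, and λ = 0 yields the magnitude-only loss; the trade-off between these two extremes is analyzed in Fig. 4.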
4.2. Additional cepstral distance term

The cepstral distance (CD) [18] is one of the few intrusive objective metrics that is also sensitive to speech distortion artifacts caused by speech enhancement algorithms, in contrast to most other metrics, which are mainly sensitive to noise reduction. This motivated extending (4) with a CD term as

    L_{CD} = \beta \, L_{SD}(s, \hat{s}) + (1 - \beta) \, CD(s, \hat{s}),     (5)

where CD(·,·) denotes the cepstral distance and we chose β = 0.001, but did not find a different weight that improved the results.
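For reference, a numpy sketch of a common real-cepstrum formulation of the cepstral distance; [18] defines the metric precisely and may differ in scaling and coefficient range, so this is only an illustrative assumption.

import numpy as np

def cepstral_distance(S_mag, S_hat_mag, n_coeffs=24, eps=1e-12):
    """Average frame-wise cepstral distance between two magnitude spectrograms.

    S_mag, S_hat_mag : one-sided magnitude spectrograms of shape (freq bins, frames)
    """
    # real cepstrum per frame: inverse FFT of the log magnitude spectrum
    ceps_s = np.fft.irfft(np.log(S_mag + eps), axis=0)[1:n_coeffs + 1]
    ceps_sh = np.fft.irfft(np.log(S_hat_mag + eps), axis=0)[1:n_coeffs + 1]
    dist = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((ceps_s - ceps_sh) ** 2, axis=0))
    return float(np.mean(dist))    # average over frames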
4.3. Non-intrusive speech quality and ASR weighted loss

As a second extension, we explore the use of non-intrusive estimators for mean opinion score (MOS) [19] and word error rate (WER) [20]. The MOS predictor has been re-trained with subjective ratings of various speech enhancement algorithms, including the MOS ratings collected in the three DNS challenges. Note that both are blind (non-intrusive) estimators, meaning they give predictions without requiring any reference, which also makes them interesting for unsupervised training, to be explored in future work. To avoid introducing an additional hyper-parameter when extending the loss function with additive terms (e.g., λ in (4)), we use the predictions to weight the spectral distance loss for each sequence b in a training batch by

    L_{MOS,WER} = \sum_b \frac{n_{WER}(\hat{s}_b)}{n_{MOS}(\hat{s}_b)} \, L_{SD}(s_b, \hat{s}_b),     (6)

where s_b(t) is the b-th sequence, and n_WER(·) and n_MOS(·) are the WER and MOS predictors. We also explore MOS-only and WER-only weighted losses by setting the corresponding other prediction to one.
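A sketch of the weighting in (6), reusing the spectral_loss function sketched after (4); predict_wer and predict_mos stand in for the pre-trained non-intrusive estimators of [19, 20], whose interfaces are not specified in this paper, so the wrappers are hypothetical.

def mos_wer_weighted_loss(S_batch, S_hat_batch, s_hat_batch, sigma_batch,
                          predict_wer, predict_mos):
    """Weight the per-sequence spectral loss by estimated WER / MOS, cf. (6).

    S_batch, S_hat_batch : loss-domain spectra of target and enhanced sequences
    s_hat_batch          : enhanced time-domain sequences (input to the blind predictors)
    predict_wer/predict_mos : hypothetical callables returning scalar estimates
    """
    total = 0.0
    for S_b, Sh_b, sh_b, sigma_b in zip(S_batch, S_hat_batch, s_hat_batch, sigma_batch):
        weight = predict_wer(sh_b) / predict_mos(sh_b)   # n_WER(s_hat_b) / n_MOS(s_hat_b)
        total += weight * spectral_loss(S_b, Sh_b, sigma_b)
    return total

Setting one of the two predictions to a constant one recovers the MOS-only or WER-only weighted variants.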
4.4. Multi-task embedding losses

As a third extension, similar to [5], we add a distance loss using embeddings pre-trained on different audio tasks, such as ASR or sound event detection (SED), by

    L_{emb} = \sum_b \Big[ L_{SD}(s_b, \hat{s}_b) + \gamma \, \frac{\| u(s_b) - u(\hat{s}_b) \|_p}{\| u(s_b) \|_p} \Big],     (7)

where u(s_b) and u(ŝ_b) are the embedding vectors of the target speech and output signals, respectively, and we use the normalized p-norm as distance metric. This differs from [5], where an L1 spectral loss was used for L_SD. We verified a small benefit from the normalization term and chose p according to the embedding distributions in preliminary experiments.

In this work, we use two different embedding extractors u(·): a) the PANN SED model [6], which was the only embedding that showed a benefit in [5], and b) an ASR embedding using wav2vec 2.0 models [21]. For PANN, we use the pre-trained 14-layer CNN model, taking the first 2-6 double CNN layers with p = 1 and γ = 0.05. For wav2vec 2.0, we explore three versions of pre-trained models with p = 2 and γ = 0.1 to extract embeddings, which could help to improve ASR performance of the speech enhancement network: i) the small wav2vec-base model trained on LibriSpeech, ii) the large wav2vec-lv60 model trained on LibriSpeech, and iii) the large wav2vec-robust model trained on Libri-Light, CommonVoice, Switchboard, and Fisher, i.e., more realistic and noisy data. We use the full wav2vec models, taking the logits as output. The pre-trained embedding extractor networks are frozen while training the speech enhancement network.

Understanding the embeddings: To provide insight into how to choose useful embedding extractors, and why some embeddings work better than others, we conduct a preliminary experiment. We show the embedding loss terms for a selection of signal degradations: corrupting a speech signal with the same noise at three different SNRs, a 3 kHz lowpass degradation, and the impact of a delay, i.e., a linear phase shift. The degradations are applied to 20 speech signals with different noise signals and the results are averaged. Fig. 3 shows the PANN embedding loss using different numbers of CNN layers, and the three wav2vec models mentioned above. The embedding loss is normalized to the maximum per embedding (i.e., per column) for better visibility, as we are interested in the differences created by certain degradations.

[Fig. 3: Sensitivity of embedding losses to various degradations (SNR 10/25/50 dB, 3 kHz lowpass, 3 ms delay) for PANN with 2-6 CNN layers and for wav2vec-base, wav2vec-lv60, and wav2vec-robust; y-axis: embedding loss. Losses are normalized per embedding type (column).]

We observe that the PANN models are rather insensitive to lowpass degradations, attributing to them a penalty similar to that of background noise at 50 dB SNR. The wav2vec embeddings are much more sensitive to the lowpass distortion, and rate moderate background noise at 25 dB SNR as comparably less harmful than the PANN embeddings do. This might be closer to a human importance or intelligibility rating, where moderate background noise may be perceived as less disturbing than speech signal distortions, and therefore seems more useful for guiding the network towards preserving speech components rather than suppressing more (probably hardly audible) background noise. We consequently choose the 4-layer PANN and wav2vec-robust embeddings for our later experiments.
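A sketch of (7), again reusing the spectral_loss function sketched after (4); embed stands in for a frozen pre-trained extractor u(·), such as the 4-layer PANN features or the wav2vec 2.0 logits used in our experiments, and the wrapper itself is hypothetical.

import numpy as np

def embedding_loss(S_batch, S_hat_batch, s_batch, s_hat_batch, sigma_batch,
                   embed, gamma=0.05, p=1):
    """Spectral loss plus normalized embedding distance per sequence, cf. (7).

    embed : hypothetical callable mapping a time-domain signal to an embedding vector,
            e.g. a frozen PANN (p=1, gamma=0.05) or wav2vec 2.0 (p=2, gamma=0.1) model.
    """
    total = 0.0
    for S_b, Sh_b, s_b, sh_b, sigma_b in zip(S_batch, S_hat_batch, s_batch,
                                             s_hat_batch, sigma_batch):
        u_s, u_sh = embed(s_b), embed(sh_b)
        emb_term = np.linalg.norm(u_s - u_sh, ord=p) / np.linalg.norm(u_s, ord=p)
        total += spectral_loss(S_b, Sh_b, sigma_b) + gamma * emb_term
    return total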
5. EVALUATION

5.1. Test sets and metrics

We show results on the public third DNS challenge test set, consisting of 600 actual device recordings under noisy conditions [2]. The impact on subsequent ASR systems is measured using three subsets: the transcribed DNS challenge dataset (challenging noisy conditions), a collection of internal real meetings (18 h, realistic moderately noisy conditions), and 200 high-quality, high-SNR recordings.

Speech quality is measured using a non-intrusive P.835 DNN-based estimator similar to DNSMOS [22], trained on the available data from the three DNS challenges and internally collected data. We term the non-intrusive predictions for signal distortion, background noise, and overall quality from P.835 DNSMOS nSIG, nBAK, and nOVL. The P.835 DNSMOS model predicts SIG with > 0.9 correlation, and BAK and OVL with > 0.95 correlation per model. The impact on production-grade ASR systems is measured using the public Microsoft Azure Speech SDK service [23] for transcription.

5.2. Experimental settings

The CRUSE processing parameters are implemented using an STFT with square-root Hann windows of 20 ms length, 50% overlap, and an FFT size of 320. To achieve results in line with prior work, we use a network size that is on the larger side for most CPU-bound real-world applications, although it still runs in real time on standard CPUs and is several times less complex than most research speech enhancement architectures: the network has 4 encoder conv layers with channels [32, 64, 128, 256], which are mirrored in the decoder, a GRU bottleneck split into 4 parallel subgroups, and conv skip connections [7]. The resulting network has 8.4 M trainable parameters and 12.8 M MACs/frame, and the ONNX runtime has a processing time of 45 ms per second of audio on a standard laptop CPU. For reference, the first two ranks in the 3rd DNS challenge [2] reported consuming about 60 M MACs and 93 M FLOPs per frame. The network is trained using the AdamW optimizer with an initial learning rate of 0.001, which is halved after plateauing on the validation metric for 200 epochs. Training is stopped after a validation metric plateau of 400 epochs. One epoch is defined as 5000 training sequences of 10 s. We use the synthetic validation set and heuristic validation metric proposed in [7], a weighted sum of PESQ, siSDR, and CD.

5.3. Results

In the first experiment, we study the not yet well understood linear weighting between the complex compressed and magnitude compressed loss terms in (4). Fig. 4 shows the nSIG vs. nBAK trade-off when changing the contribution of the complex loss term via the weight λ in (4). The magnitude term acts as a regularizer for distortion, while a higher weight on the complex term yields stronger noise suppression at the price of speech distortion. An optimal weighting is therefore found as a trade-off. We use λ = 0.3 in the following experiments, which we found to be a good balance between SIG and BAK.

[Fig. 4: Controlling the speech distortion vs. noise reduction trade-off for (4) using the complex loss weight λ, shown as nSIG over nBAK with λ swept from 0.0 to 1.0, where λ = 1 gives the complex loss term only and λ = 0 gives the magnitude-only loss.]

Table 1 shows the results in terms of nSIG, nBAK, nOVL, and WER for all losses under test, where DNS, HQ, and meet refer to the three test sets described in Sec. 5.1.

    loss                       nSIG   nBAK   nOVL   WER (%)
    dataset                           DNS                 DNS    HQ    meet
    noisy                      3.87   3.05   3.11        27.9   5.7   16.5
    L_SD^20 (20 ms, 50%)       3.77   4.23   3.50        31.0   5.9   18.7
    L_SD^32 (32 ms, 50%)       3.79   4.26   3.53        30.6   5.9   18.6
    L_SD^64 (64 ms, 75%)       3.79   4.28   3.54        30.1   5.9   18.4
    L_SD^64 ERB                3.73   4.22   3.46        31.9   6.0   18.6
    L_SD^64 + CD               3.79   4.26   3.53        30.4   5.8   18.1
    L_SD^64 -MOS               3.78   4.27   3.53        30.2   6.0   18.0
    L_SD^64 -WER               3.79   4.27   3.53        30.5   5.8   18.2
    L_SD^64 -MOS-WER           3.79   4.26   3.53        30.1   5.8   18.4
    L_SD^64 + PANN4            3.79   4.27   3.54        30.4   5.8   18.5
    L_SD^64 + wav2vec          3.79   4.26   3.53        30.3   5.9   18.6

    Table 1. Impact of modifying and extending the spectral loss on perceived quality and ASR.

The first three rows show the influence of the STFT resolution F_L{·} used to compute the end-to-end complex compressed loss (4), where we used Hann windows of {20, 32, 64} ms with {50%, 50%, 75%} overlap. The superscript of L_SD indicates the STFT window length. We observe that larger windows lead to improvements in all metrics. This is an interesting finding, also highlighting the fact that speech enhancement approaches implemented with different window sizes or look-ahead are not directly comparable. With the decoupled end-to-end training, we can improve performance with larger STFT resolutions of the loss, while keeping the smaller STFT resolution for processing to keep the processing delay low. Similar to many other frequency weightings, the ERB-weighted spectral distance did not work well, showing a significant degradation compared to the linear frequency resolution.

Contrary to expectations, the additive CD term did not help to improve SIG further, but slightly reduced BAK. It did, however, improve the WER for the high-quality and meeting test data. Disappointingly, the non-intrusive MOS weighting did not improve any MOS metrics over the plain best spectral distance loss, and shows no clear trend for WER. A reason could be that the overall MOS still emphasizes BAK more, whereas we would need a loss that improves SIG to achieve a better overall result. The non-intrusive WER weighting shows a minor WER improvement for the high-quality and meeting data, with a small degradation on the DNS test set compared to L_SD^64 only. As the ASR models used to train the non-intrusive WER predictor were trained on mostly clean speech, this could be a reason for the WER model not helping in the noisy cases. The MOS+WER weighting ranks between the MOS-only and WER-only weighted losses.

Table 1 also shows results for the PANN-augmented loss using the 4-layer PANN only, which does not show an improvement either. Using more PANN layers or a higher PANN weight γ > 0.05 resulted in worse performance, and could not exceed the standalone L_SD loss. Possible reasons have already been indicated in Fig. 3. The wav2vec ASR embedding loss term also shows no significant improvement in terms of WER or MOS. Note that the P.835 DNSMOS absolute values, especially nOVL, are somewhat compressed. The CRUSE model with L_SD^20 achieved ΔSIG = −0.17, ΔBAK = 1.98, ΔOVL = 0.85 in the subjective P.835 tests [24].

6. CONCLUSIONS

In this work, we provided insight into the advantage of magnitude regularization in the complex compressed spectral loss to trade off speech distortion and noise reduction. We further showed that an increased spectral resolution of this loss can lead to significantly better results. Beyond the increased-resolution loss, modifications that also aimed at improving signal quality and distortion, e.g., integrating pre-trained networks, could not provide a measurable improvement. Also, loss extensions that introduced knowledge from pre-trained ASR systems showed no improvements in generalization for human or machine listeners. The small improvements in signal distortion and WER indicate that more research is required to improve these metrics significantly.
7. REFERENCES

[1] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018.

[2] C. K. A. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "INTERSPEECH 2021 deep noise suppression challenge," in Proc. Interspeech, 2021.

[3] S. E. Eskimez, X. Wang, M. Tang, H. Yang, Z. Zhu, Z. Chen, H. Wang, and T. Yoshioka, "Human listening and live captioning: Multi-task training for speech enhancement," in Proc. Interspeech, 2021.

[4] F. G. Germain, Q. Chen, and V. Koltun, "Speech denoising with deep feature losses," arXiv:1806.10522, 2018.

[5] S. Kataria, J. Villalba, and N. Dehak, "Perceptual loss based speech denoising with an ensemble of audio pattern recognition and self-supervised models," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7118–7122.

[6] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2880–2894, 2020.

[7] S. Braun, H. Gamper, C. K. A. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[8] H.-S. Choi, J. Kim, J. Hur, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Intl. Conf. on Learning Representations (ICLR), 2019.

[9] J. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in Proc. 20th Intl. Workshop on Multimedia Signal Processing (MMSP), Aug. 2018, pp. 1–5.

[10] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 deep noise suppression challenge," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[11] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. Graph., vol. 37, no. 4, July 2018.

[12] S. Braun and I. Tashev, "On training targets for noise-robust voice activity detection," in Proc. European Signal Processing Conf. (EUSIPCO), 2021.

[13] J. Lee, J. Skoglund, T. Shabestary, and H. Kang, "Phase-sensitive joint learning algorithms for deep learning-based speech enhancement," IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1276–1280, 2018.

[14] S. Braun and I. Tashev, "A consolidated view of loss functions for supervised deep learning-based speech enhancement," in Proc. Intl. Conf. on Telecommunications and Signal Processing (TSP), 2021.

[15] S. Braun and I. Tashev, "Data augmentation and loss normalization for deep noise suppression," in Proc. Speech and Computer (SPECOM), 2020.

[16] Z. Zhao, S. Elshamy, and T. Fingscheidt, "A perceptual weighting filter loss for DNN training in speech enhancement," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2019, pp. 229–233.

[17] B. C. J. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.

[18] N. Kitawaki, H. Nagabuchi, and K. Itoh, "Objective quality evaluation for low bit-rate speech coding systems," IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 262–273, 1988.

[19] H. Gamper, C. K. A. Reddy, R. Cutler, I. J. Tashev, and J. Gehrke, "Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 85–89.

[20] H. Gamper, D. Emmanouilidou, S. Braun, and I. Tashev, "Predicting word error rate for reverberant speech," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[21] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Conf. Neural Information Processing Systems (NeurIPS), 2020.

[22] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[23] Microsoft, "Azure Cognitive Services Speech SDK," https://github.com/azure-samples/cognitive-services-speech-sdk, 2021.

[24] 3rd Deep Noise Suppression Challenge Organizers, "Challenge results," www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/#!results, Aug. 2021.