
EFFECT OF NOISE SUPPRESSION LOSSES ON SPEECH DISTORTION AND ASR PERFORMANCE

Sebastian Braun, Hannes Gamper

Microsoft Research, Redmond, WA, USA
{sebastian.braun, hannes.gamper}@microsoft.com

arXiv:2111.11606v1 [eess.AS] 23 Nov 2021
Submitted to Proc. ICASSP 2022, May 2022, Singapore. © IEEE 2021

ABSTRACT

Deep learning based speech enhancement has made rapid development towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, speech quality scales directly with model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade-off. We further investigate integrating pre-trained reference-less predictors for mean opinion score (MOS) and word error rate (WER), and pre-trained embeddings on ASR and sound event detection. Our analyses reveal that none of the pre-trained networks added significant performance over the strong spectral loss.

Index Terms— speech enhancement, noise reduction, speech distortion reduction, speech quality

1. INTRODUCTION

Speech enhancement techniques are present in almost any device with voice communication or voice command capabilities. The goal is to extract only the speaker's voice, reducing disturbing background noise to improve listening comfort and to aid intelligibility for human or machine listeners. In the past few years, neural network-based speech enhancement techniques have shown tremendous improvements in noise reduction capability [1, 2]. Data-driven methods can learn the tempo-spectral properties of any type of speech and noise, in contrast to traditional statistical model-based approaches, which often mismatch certain types of signals. While one major current challenge in this field is still to find smaller and more efficient network architectures that are computationally light enough for real-time processing while delivering good results, another major challenge, addressed here, is obtaining good, natural-sounding speech quality without processing artifacts.

In the third deep noise suppression (DNS) challenge [2], the separate evaluation of speech distortion (SIG), background noise reduction (BAK), and overall quality (OVL) according to ITU-T P.835 revealed that, while current state-of-the-art methods achieve outstanding noise reduction, only one submission did not degrade SIG on average while still showing high BAK. Degradations in SIG potentially also harm the performance of downstream automatic speech recognition (ASR) systems, as well as human intelligibility.

In [3], a speech enhancement model was trained on a signal-based loss and an ASR loss with alternating updates. This method requires either transcriptions for all training data, or using a transcribed subset for the ASR loss. The authors also proposed to update the ASR model during training, which creates an often undesired dependency between the speech enhancement model and the jointly trained ASR engine.

In this work, we explore several loss functions for a real-time deep noise suppressor with the goal of improving SIG without harming ASR rates. The contribution of this paper is threefold. First, we show that by decoupling the loss from the speech enhancement inference engine using end-to-end training, choosing a higher resolution for a spectral signal-based loss can improve SIG. Second, we propose ways to integrate MOS and WER estimates from pre-trained networks into the loss as weighting. Third, we evaluate additional supervised loss terms computed using pre-trained networks, similar to the deep feature loss [4]. In [5], six different networks pre-trained on different tasks were used to extract embeddings from output and target signals to form an additional loss, where a benefit was reported only for the sound event detection model published in [6]. We show different results when training on a larger dataset and evaluating on real data using decisive metrics for speech quality and ASR performance, and find that none of the pre-trained networks improved ASR rates or speech quality significantly.

2. SYSTEM AND TRAINING OBJECTIVE

A captured microphone signal can generally be described by

    y(t) = m{ s(t) + r(t) + v(t) },    (1)

where s(t) is the non-reverberant desired speech signal, r(t) undesired late reverberation, v(t) additive noise or interfering sounds, and m{·} can model linear and non-linear acoustical, electrical, or digital effects that the signals encounter.

2.1. End-to-end optimization

We use an end-to-end enhancement system applying a complex enhancement filter in the short-time Fourier transform (STFT) domain:

    ŝ(t) = F_P^{-1}{ G(k, n) · F_P{ y(t) } },    (2)

where F_P{a(t)} = A(k, n) denotes the linear STFT operator yielding a complex signal representation at frequency index k and time index n, and G(k, n) is a complex enhancement filter. The end-to-end optimization objective is to train a neural network predicting G(k, n), while optimizing a loss on the time-domain output signal ŝ(t), as shown in Fig. 1.
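To make the end-to-end objective of Sec. 2.1 concrete, the following minimal PyTorch-style sketch applies a predicted complex filter in the STFT domain and resynthesizes a time-domain output on which the losses of Sec. 4 can be computed. It is an illustration under assumptions, not the authors' code: `model` stands for any network that maps the two real/imaginary input channels to the two output channels of G(k, n), and the STFT parameters follow the processing settings reported later in Sec. 5.2 (square-root Hann window, 20 ms, 50% overlap, FFT size 320 at 16 kHz).

    import torch

    def enhance(model, y, n_fft=320, hop=160):
        """Apply a complex STFT-domain enhancement filter, cf. Eq. (2)."""
        win = torch.sqrt(torch.hann_window(n_fft, device=y.device))  # sqrt-Hann analysis/synthesis window
        Y = torch.stft(y, n_fft, hop_length=hop, window=win, return_complex=True)  # (batch, freq, frames)
        feats = torch.view_as_real(Y).permute(0, 3, 1, 2)            # real/imag parts as two input channels
        G = model(feats)                                             # (batch, 2, freq, frames), tanh-bounded
        G = torch.complex(G[:, 0], G[:, 1])                          # complex filter G(k, n)
        S_hat = G * Y                                                # filtering in the STFT domain
        return torch.istft(S_hat, n_fft, hop_length=hop, window=win, length=y.shape[-1])

Because the filter is applied in the processing STFT F_P but the loss is computed on the resynthesized waveform, the loss transform F_L in Sec. 4 is free to use different, e.g. longer, windows.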


[Fig. 1. End-to-end trained system on various losses. Block diagram: noisy audio → STFT_P → features → DNN → iSTFT_P → enhanced audio on the inference path; spectral, MOS/WER-weighted, and embedding losses (e.g. a sound event detection model) are computed against the target audio via a loss STFT_L during training only.]

2.2. Features and network architecture

We use complex compressed features by feeding the real and imaginary part of the complex FFT spectrum as channels into the first convolutional layer. Magnitude compression is applied to the complex spectra by

    Y^c(k, n) = |Y(k, n)|^c · Y(k, n) / max(|Y(k, n)|, η),    (3)

where the small positive constant η avoids division by zero.

We use the Convolutional Recurrent U-net for Speech Enhancement (CRUSE) model proposed in [7], with 4 convolutional encoder/decoder layers with time-frequency kernels (2,3) and stride (1,2) with pReLU activations, a group of 4 parallel GRUs in the bottleneck, and skip connections with 1×1 convolutions. The network outputs two channels for the real and imaginary part of the complex filter G(k, n). To ensure stability, we use a tanh output activation restraining the filter values to [−1, 1], as in [8].

3. DATA GENERATION AND AUGMENTATION

We use an online data generation and augmentation technique, using the power of randomness to generate virtually infinitely large training data. Speech and noise portions are randomly selected from raw audio files with random start times to form 10 s clips. If a section is too short, one or more files are concatenated to obtain the 10 s length. 80% of speech and noise clips are augmented with random biquad filters [9] and 20% are pitch shifted within [−2, 8] semitones. If the speech is non-reverberant, a random room impulse response (RIR) is applied. The non-reverberant speech training target is obtained by windowing the RIR with a cosine decay of length 50 ms, starting 20 ms (one frame) after the direct path. Speech and noise are mixed with a signal-to-noise ratio (SNR) drawn from a normal distribution with mean and standard deviation N(5, 10) dB. The signal levels are varied with a normal distribution N(−26, 10) dB.

We use 246 h of noise data, consisting of the DNS challenge noise data (180 h), internal recordings (65 h), and stationary noise (1 h), as well as 115 k RIRs published as part of the DNS challenges [10]. Speech data is taken mainly from the 500 h of high quality-rated Librivox data from [10], in addition to high-SNR data from AVspeech [11], the Mandarin, Spanish, singing, and emotional CremaD corpora published within [10], an internal collection of 8 h of emotional speech, and 2 h of laughing sourced from Freesound. Figure 2 shows the distributions of reverberant speech-to-noise ratio (RSNR) and signal-to-reverberation ratio (SRR) as predicted by a slightly modified neural network following [12]. The singing data is not shown in Fig. 2 as it is studio quality and our RSNR estimator is not trained on singing. While the Librivox data has both high RSNR and SRR, this is not the case for the other datasets, which have broader distributions and lower peaks. Therefore, we select only speech data from the AVspeech, Spanish, Mandarin, emotion, and laughing subsets with segSNR > 30 dB and SRR > 35 dB for training.

[Fig. 2. SNR and SRR distributions of speech datasets: normalized histograms of RSNR and SRR (in dB) for LibriVox (500 h), AVspeech mini (4580 h), Mandarin SLR (54 h), Spanish SLR (46 h), Emotion CremaD (2 h), Emotion Internal (8 h), and Freesound Laughing (2 h).]

4. LOSS FUNCTIONS

In this section, we describe the training loss functions used to optimize the enhanced signal ŝ(t). We always use a standard signal-based spectral loss described in Sec. 4.1, which is extended or modified in several ways as described in the following subsections.

4.1. Magnitude-regularized compressed spectral loss

As a standard spectral distance-based loss function L_SD, we use the complex compressed loss [11, 13], which outperformed other spectral distance-based losses in [14], given by

    L_SD = (1/σ_s^c) [ λ Σ_{κ,η} |S^c − Ŝ^c|^2 + (1 − λ) Σ_{κ,η} ( |S|^c − |Ŝ|^c )^2 ],    (4)

where the spectra S(κ, η) = F_L{s(t)} and Ŝ(κ, η) = F_L{ŝ(t)} are computed with an STFT operation whose settings are independent from F_P{·} in (2), A^c = |A|^c · A/|A| is the magnitude compression operation, and the frequency and time indices κ, η are omitted for brevity. The loss for each sequence is normalized by the active speech energy σ_s [15], which is computed from s(t) using a voice activity detector (VAD). The complex and magnitude loss terms are linearly weighted by λ, and the compression factor is c = 0.3. The processing STFT F_P and the loss frequency transform F_L can differ in e.g. window and hop-size parameters, or can even be different types of frequency transforms, as shown in Fig. 1. In the following subsections, we propose several extensions to the spectral distance loss L_SD to potentially improve quality and generalization.

Optional frequency weighting: Frequency weightings for simple spectral distances are often used in perceptually motivated evaluation metrics, and attempts have been made to integrate them as optimization targets for speech enhancement [16]. While we already showed that the AMR-wideband based frequency weighting did not yield improvements for the experiments in [14], here we explore another attempt, applying a simple equivalent rectangular bandwidth (ERB) weighting [17] with 20 bands to the spectra in (4).
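As a concrete illustration of (3) and (4), a minimal PyTorch-style sketch of the magnitude-regularized compressed spectral loss follows. It is a sketch under assumptions rather than the authors' implementation: the active speech energy σ_s is taken as a precomputed input (the paper obtains it with a VAD [15]), and the loss STFT uses a 64 ms Hann window with 75% overlap at 16 kHz, the best-performing loss resolution in Table 1.

    import torch

    def compress(X, c=0.3, eta=1e-12):
        # Phase-preserving magnitude compression: X^c = |X|^c * X / max(|X|, eta), cf. Eq. (3)
        mag = X.abs()
        return (mag ** c) * X / torch.clamp(mag, min=eta)

    def compressed_spectral_loss(s, s_hat, sigma_s, n_fft=1024, hop=256, lam=0.3, c=0.3):
        """Magnitude-regularized complex compressed loss, cf. Eq. (4).
        s, s_hat: (batch, samples) target and enhanced waveforms; sigma_s: (batch,) active speech energy."""
        win = torch.hann_window(n_fft, device=s.device)
        S = torch.stft(s, n_fft, hop_length=hop, window=win, return_complex=True)
        S_hat = torch.stft(s_hat, n_fft, hop_length=hop, window=win, return_complex=True)
        complex_term = (compress(S, c) - compress(S_hat, c)).abs().pow(2).sum(dim=(-2, -1))  # phase-aware term
        mag_term = (S.abs().pow(c) - S_hat.abs().pow(c)).pow(2).sum(dim=(-2, -1))            # magnitude-only term
        loss = (lam * complex_term + (1.0 - lam) * mag_term) / sigma_s.pow(c)
        return loss.mean()  # average over the batch

The weighting λ = 0.3 and compression factor c = 0.3 are the values used in the paper; per Sec. 5.3, a larger λ strengthens the phase-aware complex term (more noise reduction, more distortion), while a smaller λ emphasizes the magnitude regularizer.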

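Returning to the online data generation of Sec. 3, the sketch below illustrates only the random mixing step (SNR drawn from N(5, 10) dB, output level from N(−26, 10) dB); file selection, biquad and pitch augmentation, and RIR convolution are omitted, the level is interpreted as RMS in dBFS, and all function and variable names are ours, chosen for illustration.

    import numpy as np

    def mix_speech_noise(speech, noise, rng):
        """Mix a 10 s speech clip with noise at a random SNR and output level (Sec. 3)."""
        snr_db = rng.normal(5.0, 10.0)       # SNR ~ N(5, 10) dB
        level_db = rng.normal(-26.0, 10.0)   # output level ~ N(-26, 10) dB (assumed RMS dBFS)
        eps = 1e-12
        speech_pow = np.mean(speech ** 2) + eps
        noise_pow = np.mean(noise ** 2) + eps
        noise = noise * np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))  # reach the target SNR
        mix = speech + noise
        gain = 10 ** (level_db / 20) / (np.sqrt(np.mean(mix ** 2)) + eps)        # reach the target level
        return mix * gain, speech * gain     # noisy input and the level-aligned clean target

    # usage: rng = np.random.default_rng(0); y, s = mix_speech_noise(speech, noise, rng)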
4.2. Additional cepstral distance term

The cepstral distance (CD) [18] is one of the few intrusive objective metrics that is also sensitive to speech distortion artifacts caused by speech enhancement algorithms, in contrast to most other metrics, which are mainly sensitive to noise reduction. This motivated adding a CD term to (4) by

    L_CD = β L_SD(s, ŝ) + (1 − β) CD(s, ŝ),    (5)

where we chose β = 0.001, but did not find a different weight that helped improve the results.

4.3. Non-intrusive speech quality and ASR weighted loss

Secondly, we explore the use of non-intrusive estimators for mean opinion score (MOS) [19] and word error rate (WER) [20]. The MOS predictor has been re-trained with subjective ratings of various speech enhancement algorithms, including the MOS ratings collected in the three DNS challenges. Note that both are blind (non-intrusive) estimators, meaning they give predictions without requiring any reference, which also makes them interesting for unsupervised training, to be explored in future work. To avoid dependence on a hyper-parameter when extending the loss function by additive terms (e.g., λ in (4)), we use the predictions to weight the spectral distance loss for each sequence b in a training batch by

    L_MOS,WER = Σ_b [ nWER(ŝ_b) / nMOS(ŝ_b) ] L_SD(s_b, ŝ_b),    (6)

where ŝ_b(t) is the b-th sequence, and nWER(·) and nMOS(·) are the WER and MOS predictors. We also explore MOS-only and WER-only weighted losses by setting the corresponding other prediction to one.

4.4. Multi-task embedding losses

As a third extension, similar to [5], we add a distance loss using embeddings pre-trained on different audio tasks, such as ASR or sound event detection (SED), by

    L_emb = Σ_b [ L_SD(s_b, ŝ_b) + γ ‖u(s_b) − u(ŝ_b)‖_p / ‖u(s_b)‖_p ],    (7)

where u(s_b) and u(ŝ_b) are the embedding vectors from the target speech and output signals, respectively, and we use the normalized p-norm as distance metric. This differs from [5], where an L1 spectral loss was used for L_SD. We verified a small benefit from the normalization term and chose p according to the embedding distributions in preliminary experiments.

In this work, we use two different embedding extractors u(·): a) the PANN SED model [6], which was the only embedding that showed a benefit in [5], and b) an ASR embedding using wav2vec 2.0 models [21]. For PANN, we use the pre-trained 14-layer CNN model, taking the first 2-6 double CNN layers, with p = 1 and γ = 0.05. For wav2vec 2.0, we explore three versions of pre-trained models, with p = 2 and γ = 0.1, to extract embeddings that could help improve ASR performance of the speech enhancement networks: i) the small wav2vec-base model trained on LibriSpeech, ii) the large wav2vec-lv60 model trained on LibriSpeech, and iii) the large wav2vec-robust model trained on Libri-Light, CommonVoice, Switchboard, and Fisher, i.e., more realistic and noisy data. We use the full wav2vec models, taking the logits as output. The pre-trained embedding extractor networks are frozen while training the speech enhancement network.

Understanding the embeddings: To provide insight into how to choose useful embedding extractors, and why some embeddings work better than others, we conduct a preliminary experiment. We show the embedding loss terms for a selection of signal degradations, corrupting a speech signal with the same noise at three different SNRs, with a 3 kHz lowpass degradation, and with a delay, i.e. a linear phase shift. The degradations are applied to 20 speech signals with different noise signals, and the results are averaged. Fig. 3 shows the PANN embedding loss using different numbers of CNN layers, and the three wav2vec models mentioned above. The embedding loss is normalized to the maximum per embedding (i.e., per column) for better visibility, as we are interested in the differences created by certain degradations.

[Fig. 3. Sensitivity of embedding losses to various degradations (SNR 10/25/50 dB, 3 kHz lowpass, 3 ms delay) for PANN embeddings with 2-6 CNN layers and the wav2vec-base, wav2vec-lv60, and wav2vec-robust models. Losses are normalized per embedding type (column).]

We observe that the PANN models are rather insensitive to lowpass degradations, attributing to them a penalty similar to a −50 dB background noise. The wav2vec embeddings are much more sensitive to the lowpass distortion, and rate moderate background noise at −25 dB as comparably less harmful than the PANN embeddings do. This might be closer to a human importance or intelligibility rating, where moderate background noise might be perceived as less disturbing than speech signal distortions, and therefore seems more useful in guiding the networks towards preserving more speech components rather than suppressing more (probably hardly audible) background noise. We consequently choose the 4-layer PANN and wav2vec-robust embeddings for our later experiments.

5. EVALUATION

5.1. Test sets and metrics

We show results on the public third DNS challenge test set, consisting of 600 actual device recordings under noisy conditions [2]. The impact on subsequent ASR systems is measured using three subsets: the transcribed DNS challenge dataset (challenging noisy conditions), a collection of internal real meetings (18 h, realistic medium-noise conditions), and 200 high-quality, high-SNR recordings.

Speech quality is measured using a non-intrusive P.835 DNN-based estimator similar to DNSMOS [22], trained on the available data from the three DNS challenges and internally collected data. We term the non-intrusive predictions for signal distortion, background noise, and overall quality from this P.835 DNSMOS model nSIG, nBAK, and nOVL. The P.835 DNSMOS model predicts SIG with > 0.9 correlation, and BAK and OVL with > 0.95 correlation per model. The impact on production-grade ASR systems is measured using the public Microsoft Azure Speech SDK service [23] for transcription.
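A compact sketch of the MOS/WER weighting in (6) is given below. The predictor interfaces `mos_net` and `wer_net`, their output ranges, and the choice to detach them so that they act purely as per-sequence weights are assumptions for illustration; the paper's non-intrusive estimators [19, 20] are separate pre-trained models.

    import torch

    def mos_wer_weighted_loss(spectral_loss_per_seq, s_hat, mos_net, wer_net):
        """Weight the per-sequence spectral loss L_SD by nWER/nMOS, cf. Eq. (6).
        spectral_loss_per_seq: (batch,) values of L_SD; s_hat: (batch, samples) enhanced audio."""
        with torch.no_grad():                          # predictors act as weights only (assumption)
            n_mos = mos_net(s_hat).clamp(min=1.0)      # predicted MOS per sequence, e.g. in [1, 5]
            n_wer = wer_net(s_hat).clamp(min=0.0)      # predicted WER per sequence, >= 0
        return (n_wer / n_mos * spectral_loss_per_seq).sum()

Setting the WER or the MOS prediction to one recovers the MOS-only or WER-only weighted variants of Table 1, respectively.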

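Similarly, the multi-task embedding term in (7) can be sketched as follows; `embed_net` stands for any frozen pre-trained extractor returning one embedding vector per sequence (e.g. a PANN or wav2vec 2.0 model loaded elsewhere), and the single-vector interface is a simplification for illustration.

    import torch

    def embedding_loss(spectral_loss_per_seq, s, s_hat, embed_net, p=1, gamma=0.05):
        """Spectral loss plus a normalized p-norm embedding distance, cf. Eq. (7)."""
        for param in embed_net.parameters():
            param.requires_grad_(False)                # the embedding extractor stays frozen
        u_s = embed_net(s)                             # (batch, embed_dim) embeddings of the targets
        u_sh = embed_net(s_hat)                        # embeddings of the enhanced outputs
        dist = torch.norm(u_s - u_sh, p=p, dim=-1) / torch.norm(u_s, p=p, dim=-1).clamp(min=1e-8)
        return (spectral_loss_per_seq + gamma * dist).sum()

The paper uses p = 1 and γ = 0.05 for the PANN embeddings, and p = 2 and γ = 0.1 for the wav2vec 2.0 logits.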
5.2. Experimental settings

The CRUSE processing parameters are implemented using an STFT with square-root Hann windows of 20 ms length, 50% overlap, and an FFT size of 320. To achieve results in line with prior work, we use a network size that is on the larger side for most CPU-bound real-world applications, although it still runs in real time on standard CPUs and is several times less complex than most research speech enhancement architectures: the network has 4 encoder conv layers with channels [32, 64, 128, 256], which are mirrored in the decoder, a GRU bottleneck split into 4 parallel subgroups, and conv skip connections [7]. The resulting network has 8.4 M trainable parameters and 12.8 M MACs/frame, and the ONNX runtime has a processing time of 45 ms per second of audio on a standard laptop CPU. For reference, the first two ranks in the 3rd DNS challenge [2] reported about 60 M MACs and 93 M FLOPs per frame, respectively. The network is trained using the AdamW optimizer with an initial learning rate of 0.001, which is halved after the validation metric plateaus for 200 epochs. Training is stopped after a validation metric plateau of 400 epochs. One epoch is defined as 5000 training sequences of 10 s. We use the synthetic validation set and heuristic validation metric proposed in [7], a weighted sum of PESQ, siSDR and CD.

5.3. Results

In the first experiment, we study the not well understood linear weighting between the complex compressed and magnitude compressed loss terms in (4). Fig. 4 shows the nSIG vs. nBAK tradeoff when changing the contribution of the complex loss term with the weight λ in (4). The magnitude term acts as a regularizer for distortion, while higher weight on the complex term yields stronger noise suppression, at the price of speech distortion. An optimal weighting is therefore found as a trade-off. We use λ = 0.3 in the following experiments, which we found to be a good balance between SIG and BAK.

[Fig. 4. Controlling the speech distortion - noise reduction tradeoff for (4) using the complex loss weight λ (shown for λ = 0.0 ... 1.0 in the nBAK-nSIG plane), where λ = 1 gives the complex loss term only, and λ = 0 gives the magnitude-only loss.]

Table 1 shows the results in terms of nSIG, nBAK, nOVL and WER for all losses under test, where DNS, HQ, and meet refer to the three test sets described in Sec. 5.1. The first three rows show the influence of the STFT resolution F_L{·} used to compute the end-to-end complex compressed loss (4), where we used Hann windows of {20, 32, 64} ms with {50%, 50%, 75%} overlap. The superscript of L_SD indicates the STFT window length. We can observe that the larger windows lead to improvements in all metrics. This is an interesting finding, also highlighting the fact that speech enhancement approaches implemented with different window sizes or look-ahead are not directly comparable. With the decoupled end-to-end training, we can improve performance with larger STFT resolutions of the loss, while keeping the smaller STFT resolution for processing, to keep the processing delay low. Similar to many other frequency weightings, the ERB-weighted spectral distance did not work well, showing a significant degradation compared to the linear frequency resolution.

    loss                     nSIG   nBAK   nOVL   WER DNS   WER HQ   WER meet
    noisy                    3.87   3.05   3.11   27.9      5.7      16.5
    L_SD^20 (20 ms, 50%)     3.77   4.23   3.50   31.0      5.9      18.7
    L_SD^32 (32 ms, 50%)     3.79   4.26   3.53   30.6      5.9      18.6
    L_SD^64 (64 ms, 75%)     3.79   4.28   3.54   30.1      5.9      18.4
    L_SD^64 ERB              3.73   4.22   3.46   31.9      6.0      18.6
    L_SD^64 + CD             3.79   4.26   3.53   30.4      5.8      18.1
    L_SD^64 -MOS             3.78   4.27   3.53   30.2      6.0      18.0
    L_SD^64 -WER             3.79   4.27   3.53   30.5      5.8      18.2
    L_SD^64 -MOS-WER         3.79   4.26   3.53   30.1      5.8      18.4
    L_SD^64 + PANN4          3.79   4.27   3.54   30.4      5.8      18.5
    L_SD^64 + wav2vec        3.79   4.26   3.53   30.3      5.9      18.6

Table 1. Impact of modifying and extending the spectral loss on perceived quality and ASR. nSIG, nBAK, and nOVL are evaluated on the DNS test set; WER (%) is reported for the DNS, HQ (high quality), and meet (meetings) test sets.

Contrary to expectations, the additive CD term did not help to improve SIG further, but slightly reduced BAK. It did, however, improve the WER for the high quality and meeting test data. Disappointingly, the non-intrusive MOS weighting did not improve any MOS metrics over the plain best spectral distance loss, and shows no clear trend for WER. A reason could be that the overall MOS still emphasizes BAK more, whereas we would need a loss that improves SIG to achieve a better overall result. The non-intrusive WER weighting shows a minor WER improvement for the high-quality and meeting data, with a small degradation on the DNS test set compared to L_SD^64 alone. As the ASR models used to train the non-intrusive WER predictor were trained on mostly clean speech, this could be a reason for the WER model not helping in the noisy cases. The MOS+WER weighting ranks in between the MOS-only and WER-only weighted losses.

Tab. 1 shows results for the PANN-augmented loss using the 4-layer PANN only, which also does not show an improvement. Using more PANN layers or a higher PANN weight γ > 0.05 resulted in worse performance, and could not exceed the standalone L_SD loss. Possible reasons have already been indicated in Fig. 3. The wav2vec ASR embedding loss term also shows no significant improvement in terms of WER or MOS. Note that the P.835 DNSMOS absolute values, especially nOVL, are somewhat compressed. The CRUSE model with L_SD^20 achieved ∆SIG = −0.17, ∆BAK = 1.98, ∆OVL = 0.85 in the subjective P.835 tests [24].

6. CONCLUSIONS

In this work, we provided insight into the advantage of magnitude regularization in the complex compressed spectral loss to trade off speech distortion and noise reduction. We further showed that an increased spectral resolution of this loss can lead to significantly better results. Besides the increased-resolution loss, modifications that also aimed at improving signal quality and distortion, e.g. integrating pre-trained networks, could not provide a measurable improvement. Also, loss extensions that introduced knowledge from pre-trained ASR systems showed no improvements in generalization for human or machine listeners. The small improvements in signal distortion and WER indicate that more research is required to improve these metrics significantly.

7. REFERENCES

[1] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct 2018.

[2] C. K. A. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "INTERSPEECH 2021 deep noise suppression challenge," in Proc. Interspeech, 2021.

[3] S. E. Eskimez, X. Wang, M. Tang, H. Yang, Z. Zhu, Z. Chen, H. Wang, and T. Yoshioka, "Human listening and live captioning: Multi-task training for speech enhancement," in Proc. Interspeech, 2021.

[4] F. G. Germain, Q. Chen, and V. Koltun, "Speech denoising with deep feature losses," arXiv:1806.10522, 2018.

[5] S. Kataria, J. Villalba, and N. Dehak, "Perceptual loss based speech denoising with an ensemble of audio pattern recognition and self-supervised models," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7118–7122.

[6] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2880–2894, 2020.

[7] S. Braun, H. Gamper, C. K. A. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[8] H.-S. Choi, J. Kim, J. Hur, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Intl. Conf. on Learning Representations (ICLR), 2019.

[9] J. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in 20th Intl. Workshop on Multimedia Signal Processing (MMSP), Aug 2018, pp. 1–5.

[10] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 deep noise suppression challenge," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[11] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. Graph., vol. 37, no. 4, July 2018.

[12] S. Braun and I. Tashev, "On training targets for noise-robust voice activity detection," in Proc. European Signal Processing Conf. (EUSIPCO), 2021.

[13] J. Lee, J. Skoglund, T. Shabestary, and H. Kang, "Phase-sensitive joint learning algorithms for deep learning-based speech enhancement," IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1276–1280, 2018.

[14] S. Braun and I. Tashev, "A consolidated view of loss functions for supervised deep learning-based speech enhancement," in Intl. Conf. on Telecomm. and Sig. Proc. (TSP), 2021.

[15] S. Braun and I. Tashev, "Data augmentation and loss normalization for deep noise suppression," in Proc. Speech and Computers, 2020.

[16] Z. Zhao, S. Elshamy, and T. Fingscheidt, "A perceptual weighting filter loss for DNN training in speech enhancement," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2019, pp. 229–233.

[17] B. C. J. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.

[18] N. Kitawaki, H. Nagabuchi, and K. Itoh, "Objective quality evaluation for low bit-rate speech coding systems," IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 262–273, 1988.

[19] H. Gamper, C. K. A. Reddy, R. Cutler, I. J. Tashev, and J. Gehrke, "Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 85–89.

[20] H. Gamper, D. Emmanouilidou, S. Braun, and I. Tashev, "Predicting word error rate for reverberant speech," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[21] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Conf. Neural Information Processing Systems (NeurIPS), 2020.

[22] C. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[23] Microsoft, "Azure Cognitive Services Speech SDK," https://github.com/azure-samples/cognitive-services-speech-sdk, 2021.

[24] 3rd Deep Noise Suppression Challenge Organizers, "Challenge results," www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/#!results, Aug. 2021.
