Effect of Noise Suppression Losses on Speech Distortion and ASR Performance

ABSTRACT

Deep learning based speech enhancement has made rapid progress towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, speech quality scales directly with model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade-off. We further investigate integrating pre-trained reference-less predictors for mean opinion score (MOS) and word error rate (WER), as well as pre-trained embeddings from ASR and sound event detection models. Our analyses reveal that none of the pre-trained networks added significant performance over the strong spectral loss.

Index Terms: speech enhancement, noise reduction, speech distortion reduction, speech quality

1. INTRODUCTION

Speech enhancement techniques are present in almost any device with voice communication or voice command capabilities. The goal is to extract only the speaker's voice, reducing disturbing background noise to improve listening comfort and aid intelligibility for human or machine listeners. In the past few years, neural network-based speech enhancement techniques have shown tremendous improvements in noise reduction capability [1, 2]. Data-driven methods can learn the tempo-spectral properties of any type of speech and noise, in contrast to traditional statistical model-based approaches that often mismatch certain types of signals. While one big current challenge in this field is still to find smaller and more efficient network architectures that are computationally light enough for real-time processing while delivering good results, another major challenge, addressed here, is obtaining good, natural sounding speech quality without processing artifacts.

In the third deep noise suppression (DNS) challenge [2], the separate evaluation of speech distortion (SIG), background noise reduction (BAK), and overall quality (OVL) according to ITU-T P.835 revealed that while current state-of-the-art methods achieve outstanding noise reduction, only one submission did not degrade SIG on average while still showing high BAK. Degradations in SIG potentially also harm the performance of subsequent automatic speech recognition (ASR) systems, as well as human intelligibility.

In [3], a speech enhancement model was trained on a signal-based loss and an ASR loss with alternating updates. This method requires either transcriptions for all training data, or using a transcribed subset for the ASR loss. The authors also proposed to update the ASR model during training, which creates a dependency between the speech enhancement model and the jointly trained ASR engine that is often undesired in practice.

In this work, we explore several loss functions for a real-time deep noise suppressor with the goal of improving SIG without harming ASR rates. The contribution of this paper is threefold. First, we show that by decoupling the loss from the speech enhancement inference engine using end-to-end training, choosing a higher resolution in a spectral signal-based loss can improve SIG. Second, we propose ways to integrate MOS and WER estimates from pre-trained networks into the loss as weighting. Third, we evaluate additional supervised loss terms computed using pre-trained networks, similar to the deep feature loss [4]. In [5], six different networks pre-trained on different tasks were used to extract embeddings from output and target signals to form an additional loss, where a benefit was reported only for the sound event detection model published in [6]. We obtain different results when training on a larger dataset and evaluating on real data using decisive metrics for speech quality and ASR performance, finding that none of the pre-trained networks improved ASR rates or speech quality significantly.

2. SYSTEM AND TRAINING OBJECTIVE

A captured microphone signal can generally be described by

    y(t) = m\{ s(t) + r(t) + v(t) \},                                  (1)

where s(t) is the non-reverberant desired speech signal, r(t) undesired late reverberation, v(t) additive noise or interfering sounds, and m{·} can model linear and non-linear acoustical, electrical, or digital effects that the signals encounter.

2.1. End-to-end optimization

We use an end-to-end enhancement system applying a complex enhancement filter in the short-time Fourier transform (STFT) domain,

    \hat{s}(t) = \mathcal{F}_P^{-1}\{ G(k,n)\, \mathcal{F}_P\{ y(t) \} \},     (2)

where F_P{a(t)} = A(k, n) denotes the linear STFT operator yielding a complex signal representation at frequency index k and time index n, and G(k, n) is a complex enhancement filter. The end-to-end optimization objective is to train a neural network predicting G(k, n), while optimizing a loss on the time-domain output signal ŝ(t) as shown in Fig. 1.
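For illustration, a minimal sketch of the filtering in (2) using scipy's STFT/ISTFT pair as a stand-in for F_P; the sketch uses scipy's default Hann window, whereas the actual system uses square-root Hann windows (cf. Sec. 5.2), and the filter G would come from the CRUSE network of Sec. 2.2.

import numpy as np
from scipy.signal import stft, istft

def apply_enhancement_filter(y, G, fs=16000, win_ms=20):
    """Apply a complex enhancement filter G(k, n) in the STFT domain, cf. (2).

    y : 1-D time-domain microphone signal
    G : complex filter of shape (frequency bins, frames), e.g. a DNN output
    """
    nperseg = int(fs * win_ms / 1000)      # 20 ms analysis window
    noverlap = nperseg // 2                # 50% overlap
    _, _, Y = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)         # F_P{y(t)}
    S_hat = G * Y                          # element-wise complex filtering G(k, n) Y(k, n)
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)   # F_P^{-1}{.}
    return s_hat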
[Fig. 1: End-to-end trained system with various losses. The processing path (STFT_P, feature extraction, DNN, iSTFT_P) produces the output audio; during training only, spectral, MOS/WER, and embedding losses (e.g., from a sound event detection model) are computed between the output and target audio using a separate loss transform STFT_L.]

2.2. Features and network architecture

We use complex compressed features by feeding the real and imaginary parts of the complex FFT spectrum as channels into the first convolutional layer. Magnitude compression is applied to the complex spectra by

    Y^c(k,n) = |Y(k,n)|^c \, \frac{Y(k,n)}{\max(|Y(k,n)|, \eta)},          (3)

where the small positive constant η avoids division by zero.

We use the Convolutional Recurrent U-net for Speech Enhancement (CRUSE) model proposed in [7] with 4 convolutional encoder/decoder layers with time-frequency kernels (2,3) and stride (1,2) with pReLU activations, a group of 4 parallel GRUs in the bottleneck, and skip connections with 1×1 convolutions. The network outputs two channels for the real and imaginary part of the complex filter G(k, n). To ensure stability, we use a tanh output activation restraining the filter values to [−1, 1] as in [8].
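As an illustration of (3), a small numpy sketch of the compressed two-channel input features; the function name is ours, and we assume the same compression factor c = 0.3 that Sec. 4.1 uses for the loss.

import numpy as np

def compressed_input_features(Y, c=0.3, eta=1e-12):
    """Magnitude-compressed complex spectrum, cf. (3): Y^c = |Y|^c * Y / max(|Y|, eta)."""
    mag = np.abs(Y)
    Yc = (mag ** c) * Y / np.maximum(mag, eta)
    # real and imaginary parts become the two channels fed to the first conv layer
    return np.stack([Yc.real, Yc.imag], axis=0)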
3. DATA GENERATION AND AUGMENTATION

We use an online data generation and augmentation technique, exploiting randomness to generate a virtually unlimited amount of training data. Speech and noise portions are randomly selected from raw audio files with random start times to form 10 s clips. If a file is too short, one or more files are concatenated to obtain the 10 s length. 80% of the speech and noise clips are augmented with random biquad filters [9] and 20% are pitch-shifted within [−2, 8] semitones. If the speech is non-reverberant, a random room impulse response (RIR) is applied. The non-reverberant speech training target is obtained by windowing the RIR with a cosine decay of length 50 ms, starting 20 ms (one frame) after the direct path. Speech and noise are mixed with a signal-to-noise ratio (SNR) drawn from a normal distribution N(5, 10) dB. The signal levels are varied with a normal distribution N(−26, 10) dB.

We use 246 h of noise data, consisting of the DNS challenge noise data (180 h), internal recordings (65 h), and stationary noise (1 h), as well as 115 k RIRs published as part of the DNS challenges [10]. Speech data is taken mainly from the 500 h of high quality-rated Librivox data from [10], in addition to high-SNR data from AVspeech [11], the Mandarin, Spanish, singing, and emotional CremaD corpora published within [10], an internal collection of 8 h of emotional speech, and 2 h of laughing sourced from Freesound. Figure 2 shows the distributions of reverberant speech-to-noise ratio (RSNR) and signal-to-reverberation ratio (SRR) as predicted by a slightly modified neural network following [12]. The singing data is not shown in Fig. 2 as it is studio quality and our RSNR estimator is not trained on singing. While the Librivox data has both high RSNR and SRR, this is not the case for the other datasets, which have broader distributions and lower peaks. Therefore, we select only speech data from the AVspeech, Spanish, Mandarin, emotion, and laughing subsets with segSNR > 30 dB and SRR > 35 dB for training.

[Fig. 2: SNR and SRR distributions of the speech datasets, shown as normalized histograms per corpus, e.g., Spanish SLR (46 h), emotion CremaD (2 h), internal emotion (8 h), Freesound laughing (2 h); x-axes: RSNR (dB) and SRR (dB).]

4. LOSS FUNCTIONS

In this section, we describe the training loss functions used to optimize the enhanced signal ŝ(t). We always use a standard signal-based spectral loss described in Sec. 4.1, which is extended or modified in several ways as described in the following subsections.

4.1. Magnitude-regularized compressed spectral loss

As a standard spectral distance-based loss function L_SD, we use the complex compressed loss [11, 13], which outperformed other spectral distance-based losses in [14], given by

    L_{SD} = \frac{1}{\sigma_s^c} \Big[ \lambda \sum_{\kappa,\eta} \big| S^c - \hat{S}^c \big|^2 + (1-\lambda) \sum_{\kappa,\eta} \big( |S|^c - |\hat{S}|^c \big)^2 \Big],     (4)

where the spectra S(κ, η) = F_L{s(t)} and Ŝ(κ, η) = F_L{ŝ(t)} are computed with an STFT operation whose settings are independent of F_P{·} in (2), A^c = |A|^c A/|A| is the magnitude compression operation, and the frequency and time indices κ, η are omitted for brevity. The loss for each sequence is normalized by the active speech energy σ_s [15], which is computed from s(t) using a voice activity detector (VAD). The complex and magnitude loss terms are linearly weighted by λ, and the compression factor is c = 0.3. The processing STFT F_P and the loss frequency transform F_L can differ in, e.g., window and hop-size parameters, or can even be different types of frequency transforms, as shown in Fig. 1. In the following subsections, we propose several extensions to the spectral distance loss L_SD to potentially improve quality and generalization.

Optional frequency weighting: Frequency weightings for simple spectral distances are often used in perceptually motivated evaluation metrics, and attempts have been made to integrate them as optimization targets for speech enhancement [16]. While we already showed that the AMR-wideband based frequency weighting did not yield improvements for the experiments in [14], here we explore another attempt, applying a simple equivalent rectangular bandwidth (ERB) weighting [17] to the spectra in (4) using 20 bands.
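A minimal numpy sketch of (4), assuming the loss-domain spectra S and Ŝ have already been computed with F_L (e.g., 64 ms windows with 75% overlap, independently of the 20 ms processing STFT) and that the VAD-based active speech energy σ_s is given; the placement of the compression exponent on the normalizer follows our reading of the equation above.

import numpy as np

def spectral_loss(S, S_hat, sigma_s, lam=0.3, c=0.3, eta=1e-12):
    """Magnitude-regularized complex compressed spectral loss, cf. (4)."""
    def compress(A):
        mag = np.abs(A)
        return (mag ** c) * A / np.maximum(mag, eta)    # A^c = |A|^c * A / |A|

    complex_term = np.sum(np.abs(compress(S) - compress(S_hat)) ** 2)    # phase-aware term
    magnitude_term = np.sum((np.abs(S) ** c - np.abs(S_hat) ** c) ** 2)  # magnitude-only term
    return (lam * complex_term + (1.0 - lam) * magnitude_term) / (sigma_s ** c)

With λ = 1 only the phase-aware complex term remains, and λ = 0 yields the magnitude-only loss; the trade-off between these two extremes is analyzed in Fig. 4.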
4.2. Additional cepstral distance term

The cepstral distance (CD) [18] is one of the few intrusive objective metrics that is also sensitive to speech distortion artifacts caused by speech enhancement algorithms, in contrast to most other metrics, which are mainly sensitive to noise reduction. This motivated extending (4) with a CD term as

    L_{CD} = \beta \, L_{SD}(s, \hat{s}) + (1 - \beta) \, CD(s, \hat{s}),     (5)

where CD(·,·) denotes the cepstral distance and we chose β = 0.001, but did not find a different weight that improved the results.
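For reference, a numpy sketch of a common real-cepstrum formulation of the cepstral distance; [18] defines the metric precisely and may differ in scaling and coefficient range, so this is only an illustrative assumption.

import numpy as np

def cepstral_distance(S_mag, S_hat_mag, n_coeffs=24, eps=1e-12):
    """Average frame-wise cepstral distance between two magnitude spectrograms.

    S_mag, S_hat_mag : one-sided magnitude spectrograms of shape (freq bins, frames)
    """
    # real cepstrum per frame: inverse FFT of the log magnitude spectrum
    ceps_s = np.fft.irfft(np.log(S_mag + eps), axis=0)[1:n_coeffs + 1]
    ceps_sh = np.fft.irfft(np.log(S_hat_mag + eps), axis=0)[1:n_coeffs + 1]
    dist = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((ceps_s - ceps_sh) ** 2, axis=0))
    return float(np.mean(dist))    # average over frames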
4.3. Non-intrusive speech quality and ASR weighted loss

As a second extension, we explore the use of non-intrusive estimators for mean opinion score (MOS) [19] and word error rate (WER) [20]. The MOS predictor has been re-trained with subjective ratings of various speech enhancement algorithms, including the MOS ratings collected in the three DNS challenges. Note that both are blind (non-intrusive) estimators, meaning they give predictions without requiring any reference, which also makes them interesting for unsupervised training, to be explored in future work. To avoid introducing an additional hyper-parameter when extending the loss function with additive terms (e.g., λ in (4)), we use the predictions to weight the spectral distance loss for each sequence b in a training batch by

    L_{MOS,WER} = \sum_b \frac{n_{WER}(\hat{s}_b)}{n_{MOS}(\hat{s}_b)} \, L_{SD}(s_b, \hat{s}_b),     (6)

where s_b(t) is the b-th sequence, and n_WER(·) and n_MOS(·) are the WER and MOS predictors. We also explore MOS-only and WER-only weighted losses by setting the corresponding other prediction to one.
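A sketch of the weighting in (6), reusing the spectral_loss function sketched after (4); predict_wer and predict_mos stand in for the pre-trained non-intrusive estimators of [19, 20], whose interfaces are not specified in this paper, so the wrappers are hypothetical.

def mos_wer_weighted_loss(S_batch, S_hat_batch, s_hat_batch, sigma_batch,
                          predict_wer, predict_mos):
    """Weight the per-sequence spectral loss by estimated WER / MOS, cf. (6).

    S_batch, S_hat_batch : loss-domain spectra of target and enhanced sequences
    s_hat_batch          : enhanced time-domain sequences (input to the blind predictors)
    predict_wer/predict_mos : hypothetical callables returning scalar estimates
    """
    total = 0.0
    for S_b, Sh_b, sh_b, sigma_b in zip(S_batch, S_hat_batch, s_hat_batch, sigma_batch):
        weight = predict_wer(sh_b) / predict_mos(sh_b)   # n_WER(s_hat_b) / n_MOS(s_hat_b)
        total += weight * spectral_loss(S_b, Sh_b, sigma_b)
    return total

Setting one of the two predictions to a constant one recovers the MOS-only or WER-only weighted variants.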
4.4. Multi-task embedding losses

As a third extension, similar to [5], we add a distance loss using embeddings pre-trained on different audio tasks, such as ASR or sound event detection (SED), by

    L_{emb} = \sum_b \Big[ L_{SD}(s_b, \hat{s}_b) + \gamma \, \frac{\| u(s_b) - u(\hat{s}_b) \|_p}{\| u(s_b) \|_p} \Big],     (7)

where u(s_b) and u(ŝ_b) are the embedding vectors of the target speech and output signals, respectively, and we use the normalized p-norm as distance metric. This differs from [5], where an L1 spectral loss was used for L_SD. We verified a small benefit from the normalization term and chose p according to the embedding distributions in preliminary experiments.

In this work, we use two different embedding extractors u(·): a) the PANN SED model [6], which was the only embedding that showed a benefit in [5], and b) an ASR embedding using wav2vec 2.0 models [21]. For PANN, we use the pre-trained 14-layer CNN model, taking the first 2-6 double CNN layers with p = 1 and γ = 0.05. For wav2vec 2.0, we explore three versions of pre-trained models with p = 2 and γ = 0.1 to extract embeddings, which could help to improve ASR performance of the speech enhancement network: i) the small wav2vec-base model trained on LibriSpeech, ii) the large wav2vec-lv60 model trained on LibriSpeech, and iii) the large wav2vec-robust model trained on Libri-Light, CommonVoice, Switchboard, and Fisher, i.e., more realistic and noisy data. We use the full wav2vec models, taking the logits as output. The pre-trained embedding extractor networks are frozen while training the speech enhancement network.

Understanding the embeddings: To provide insight into how to choose useful embedding extractors, and why some embeddings work better than others, we conduct a preliminary experiment. We show the embedding loss terms for a selection of signal degradations: corrupting a speech signal with the same noise at three different SNRs, a 3 kHz lowpass degradation, and the impact of a delay, i.e., a linear phase shift. The degradations are applied to 20 speech signals with different noise signals and the results are averaged. Fig. 3 shows the PANN embedding loss using different numbers of CNN layers, and the three wav2vec models mentioned above. The embedding loss is normalized to the maximum per embedding (i.e., per column) for better visibility, as we are interested in the differences created by certain degradations.

[Fig. 3: Sensitivity of embedding losses to various degradations (SNR 10/25/50 dB, 3 kHz lowpass, 3 ms delay) for PANN with 2-6 CNN layers and for wav2vec-base, wav2vec-lv60, and wav2vec-robust; y-axis: embedding loss. Losses are normalized per embedding type (column).]

We observe that the PANN models are rather insensitive to lowpass degradations, attributing to them a penalty similar to that of background noise at 50 dB SNR. The wav2vec embeddings are much more sensitive to the lowpass distortion, and rate moderate background noise at 25 dB SNR as comparably less harmful than the PANN embeddings do. This might be closer to a human importance or intelligibility rating, where moderate background noise may be perceived as less disturbing than speech signal distortions, and therefore seems more useful for guiding the network towards preserving speech components rather than suppressing more (probably hardly audible) background noise. We consequently choose the 4-layer PANN and wav2vec-robust embeddings for our later experiments.
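A sketch of (7), again reusing the spectral_loss function sketched after (4); embed stands in for a frozen pre-trained extractor u(·), such as the 4-layer PANN features or the wav2vec 2.0 logits used in our experiments, and the wrapper itself is hypothetical.

import numpy as np

def embedding_loss(S_batch, S_hat_batch, s_batch, s_hat_batch, sigma_batch,
                   embed, gamma=0.05, p=1):
    """Spectral loss plus normalized embedding distance per sequence, cf. (7).

    embed : hypothetical callable mapping a time-domain signal to an embedding vector,
            e.g. a frozen PANN (p=1, gamma=0.05) or wav2vec 2.0 (p=2, gamma=0.1) model.
    """
    total = 0.0
    for S_b, Sh_b, s_b, sh_b, sigma_b in zip(S_batch, S_hat_batch, s_batch,
                                             s_hat_batch, sigma_batch):
        u_s, u_sh = embed(s_b), embed(sh_b)
        emb_term = np.linalg.norm(u_s - u_sh, ord=p) / np.linalg.norm(u_s, ord=p)
        total += spectral_loss(S_b, Sh_b, sigma_b) + gamma * emb_term
    return total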
5. EVALUATION

5.1. Test sets and metrics

We show results on the public third DNS challenge test set, consisting of 600 actual device recordings under noisy conditions [2]. The impact on subsequent ASR systems is measured using three subsets: the transcribed DNS challenge dataset (challenging noisy conditions), a collection of internal real meetings (18 h, realistic moderately noisy conditions), and 200 high-quality, high-SNR recordings.

Speech quality is measured using a non-intrusive P.835 DNN-based estimator similar to DNSMOS [22], trained on the available data from the three DNS challenges and internally collected data. We term the non-intrusive predictions for signal distortion, background noise, and overall quality from P.835 DNSMOS nSIG, nBAK, and nOVL. The P.835 DNSMOS model predicts SIG with > 0.9 correlation, and BAK and OVL with > 0.95 correlation per model. The impact on production-grade ASR systems is measured using the public Microsoft Azure Speech SDK service [23] for transcription.

5.2. Experimental settings

The CRUSE processing parameters are implemented using an STFT with square-root Hann windows of 20 ms length, 50% overlap, and an FFT size of 320. To achieve results in line with prior work, we use a network size that is on the larger side for most CPU-bound real-world applications, although it still runs in real time on standard CPUs and is several times less complex than most research speech enhancement architectures: the network has 4 encoder conv layers with channels [32, 64, 128, 256], which are mirrored in the decoder, a GRU bottleneck split into 4 parallel subgroups, and conv skip connections [7]. The resulting network has 8.4 M trainable parameters and 12.8 M MACs/frame, and the ONNX runtime has a processing time of 45 ms per second of audio on a standard laptop CPU. For reference, the first two ranks in the 3rd DNS challenge [2] reported consuming about 60 M MACs and 93 M FLOPs per frame. The network is trained using the AdamW optimizer with an initial learning rate of 0.001, which is halved after plateauing on the validation metric for 200 epochs. Training is stopped after a validation metric plateau of 400 epochs. One epoch is defined as 5000 training sequences of 10 s. We use the synthetic validation set and heuristic validation metric proposed in [7], a weighted sum of PESQ, siSDR, and CD.

5.3. Results

In the first experiment, we study the not yet well understood linear weighting between the complex compressed and magnitude compressed loss terms in (4). Fig. 4 shows the nSIG vs. nBAK trade-off when changing the contribution of the complex loss term via the weight λ in (4). The magnitude term acts as a regularizer for distortion, while a higher weight on the complex term yields stronger noise suppression at the price of speech distortion. An optimal weighting is therefore found as a trade-off. We use λ = 0.3 in the following experiments, which we found to be a good balance between SIG and BAK.

[Fig. 4: Controlling the speech distortion vs. noise reduction trade-off for (4) using the complex loss weight λ, shown as nSIG over nBAK with λ swept from 0.0 to 1.0, where λ = 1 gives the complex loss term only and λ = 0 gives the magnitude-only loss.]

Table 1 shows the results in terms of nSIG, nBAK, nOVL, and WER for all losses under test, where DNS, HQ, and meet refer to the three test sets described in Sec. 5.1.

    loss                       nSIG   nBAK   nOVL   WER (%)
    dataset                           DNS                 DNS    HQ    meet
    noisy                      3.87   3.05   3.11        27.9   5.7   16.5
    L_SD^20 (20 ms, 50%)       3.77   4.23   3.50        31.0   5.9   18.7
    L_SD^32 (32 ms, 50%)       3.79   4.26   3.53        30.6   5.9   18.6
    L_SD^64 (64 ms, 75%)       3.79   4.28   3.54        30.1   5.9   18.4
    L_SD^64 ERB                3.73   4.22   3.46        31.9   6.0   18.6
    L_SD^64 + CD               3.79   4.26   3.53        30.4   5.8   18.1
    L_SD^64 -MOS               3.78   4.27   3.53        30.2   6.0   18.0
    L_SD^64 -WER               3.79   4.27   3.53        30.5   5.8   18.2
    L_SD^64 -MOS-WER           3.79   4.26   3.53        30.1   5.8   18.4
    L_SD^64 + PANN4            3.79   4.27   3.54        30.4   5.8   18.5
    L_SD^64 + wav2vec          3.79   4.26   3.53        30.3   5.9   18.6

    Table 1. Impact of modifying and extending the spectral loss on perceived quality and ASR.

The first three rows show the influence of the STFT resolution F_L{·} used to compute the end-to-end complex compressed loss (4), where we used Hann windows of {20, 32, 64} ms with {50%, 50%, 75%} overlap. The superscript of L_SD indicates the STFT window length. We observe that larger windows lead to improvements in all metrics. This is an interesting finding, also highlighting the fact that speech enhancement approaches implemented with different window sizes or look-ahead are not directly comparable. With the decoupled end-to-end training, we can improve performance with larger STFT resolutions of the loss, while keeping the smaller STFT resolution for processing to keep the processing delay low. Similar to many other frequency weightings, the ERB-weighted spectral distance did not work well, showing a significant degradation compared to the linear frequency resolution.

Contrary to expectations, the additive CD term did not help to improve SIG further, but slightly reduced BAK. It did, however, improve the WER for the high-quality and meeting test data. Disappointingly, the non-intrusive MOS weighting did not improve any MOS metrics over the plain best spectral distance loss, and shows no clear trend for WER. A reason could be that the overall MOS still emphasizes BAK more, whereas we would need a loss that improves SIG to achieve a better overall result. The non-intrusive WER weighting shows a minor WER improvement for the high-quality and meeting data, with a small degradation on the DNS test set compared to L_SD^64 only. As the ASR models used to train the non-intrusive WER predictor were trained on mostly clean speech, this could be a reason for the WER model not helping in the noisy cases. The MOS+WER weighting ranks between the MOS-only and WER-only weighted losses.

Table 1 also shows results for the PANN-augmented loss using the 4-layer PANN only, which does not show an improvement either. Using more PANN layers or a higher PANN weight γ > 0.05 resulted in worse performance, and could not exceed the standalone L_SD loss. Possible reasons have already been indicated in Fig. 3. The wav2vec ASR embedding loss term also shows no significant improvement in terms of WER or MOS. Note that the P.835 DNSMOS absolute values, especially nOVL, are somewhat compressed. The CRUSE model with L_SD^20 achieved ΔSIG = −0.17, ΔBAK = 1.98, ΔOVL = 0.85 in the subjective P.835 tests [24].

6. CONCLUSIONS

In this work, we provided insight into the advantage of magnitude regularization in the complex compressed spectral loss to trade off speech distortion and noise reduction. We further showed that an increased spectral resolution of this loss can lead to significantly better results. Beyond the increased-resolution loss, modifications that also aimed at improving signal quality and distortion, e.g., integrating pre-trained networks, could not provide a measurable improvement. Also, loss extensions that introduced knowledge from pre-trained ASR systems showed no improvements in generalization for human or machine listeners. The small improvements in signal distortion and WER indicate that more research is required to improve these metrics significantly.
7. REFERENCES

[1] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018.

[2] C. K. A. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "INTERSPEECH 2021 deep noise suppression challenge," in Proc. Interspeech, 2021.

[3] S. E. Eskimez, X. Wang, M. Tang, H. Yang, Z. Zhu, Z. Chen, H. Wang, and T. Yoshioka, "Human listening and live captioning: Multi-task training for speech enhancement," in Proc. Interspeech, 2021.

[4] F. G. Germain, Q. Chen, and V. Koltun, "Speech denoising with deep feature losses," arXiv:1806.10522, 2018.

[5] S. Kataria, J. Villalba, and N. Dehak, "Perceptual loss based speech denoising with an ensemble of audio pattern recognition and self-supervised models," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7118–7122.

[6] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2880–2894, 2020.

[7] S. Braun, H. Gamper, C. K. A. Reddy, and I. Tashev, "Towards efficient models for real-time deep noise suppression," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[8] H.-S. Choi, J. Kim, J. Hur, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Intl. Conf. on Learning Representations (ICLR), 2019.

[9] J. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in Proc. 20th Intl. Workshop on Multimedia Signal Processing (MMSP), Aug. 2018, pp. 1–5.

[10] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 deep noise suppression challenge," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[11] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. Graph., vol. 37, no. 4, July 2018.

[12] S. Braun and I. Tashev, "On training targets for noise-robust voice activity detection," in Proc. European Signal Processing Conf. (EUSIPCO), 2021.

[13] J. Lee, J. Skoglund, T. Shabestary, and H. Kang, "Phase-sensitive joint learning algorithms for deep learning-based speech enhancement," IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1276–1280, 2018.

[14] S. Braun and I. Tashev, "A consolidated view of loss functions for supervised deep learning-based speech enhancement," in Proc. Intl. Conf. on Telecommunications and Signal Processing (TSP), 2021.

[15] S. Braun and I. Tashev, "Data augmentation and loss normalization for deep noise suppression," in Proc. Speech and Computer (SPECOM), 2020.

[16] Z. Zhao, S. Elshamy, and T. Fingscheidt, "A perceptual weighting filter loss for DNN training in speech enhancement," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2019, pp. 229–233.

[17] B. C. J. Moore and B. R. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.

[18] N. Kitawaki, H. Nagabuchi, and K. Itoh, "Objective quality evaluation for low bit-rate speech coding systems," IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 262–273, 1988.

[19] H. Gamper, C. K. A. Reddy, R. Cutler, I. J. Tashev, and J. Gehrke, "Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 85–89.

[20] H. Gamper, D. Emmanouilidou, S. Braun, and I. Tashev, "Predicting word error rate for reverberant speech," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.

[21] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Conf. Neural Information Processing Systems (NeurIPS), 2020.

[22] C. K. A. Reddy, V. Gopal, and R. Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[23] Microsoft, "Azure Cognitive Services Speech SDK," https://github.com/azure-samples/cognitive-services-speech-sdk, 2021.

[24] 3rd Deep Noise Suppression Challenge Organizers, "Challenge results," www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/#!results, Aug. 2021.