Meta-Learning Adaptive Filters for Audio
Abstract—Adaptive filtering algorithms are pervasive throughout signal processing and have had a material impact on a wide variety of domains including audio processing, telecommunications, biomedical sensing, astrophysics and cosmology, seismology, and many more. Adaptive filters typically operate via specialized online, iterative optimization methods such as least-mean squares or recursive least squares and aim to process signals in unknown or nonstationary environments. Such algorithms, however, can be slow and laborious to develop, require domain expertise to create, and necessitate mathematical insight for improvement. In this work, we seek to improve upon hand-derived adaptive filter algorithms and present a comprehensive framework for learning online, adaptive signal processing algorithms or update rules directly from data. To do so, we frame the development of adaptive filters as a meta-learning problem in the context of deep learning and use a form of self-supervision to learn online iterative update rules for adaptive filters. To demonstrate our approach, we focus on audio applications and systematically develop meta-learned adaptive filters for five canonical audio problems including system identification, acoustic echo cancellation, blind equalization, multi-channel dereverberation, and beamforming. We compare our approach against common baselines and/or recent state-of-the-art methods. We show we can learn high-performing adaptive filters that operate in real-time and, in most cases, significantly outperform each method we compare against, all using a single general-purpose configuration of our approach.

Index Terms—adaptive filtering, meta learning, online optimization, learning to learn, deep learning

I. INTRODUCTION

…

AF tasks are often grouped into one of four core categories: system identification, inverse modeling, prediction, and interference cancellation [4]. Each of these categories has numerous AF applications, and advances in one category or application can often be applied to many others. In the audio domain, acoustic echo cancellation (AEC) can be formulated as single- or multi-channel system identification and has been studied extensively [7]–[18]. Equalization can be formulated as an inverse modeling problem, has been explored in single- and multi-channel formats, and is particularly useful for sound zone reproduction and active noise control [19]–[24]. Dereverberation can be formulated as a prediction problem and has been studied considerably in this context [25]–[31]. Finally, multi-microphone enhancement or beamforming can be formulated as an informed interference cancellation task and also has a breadth of associated algorithms [32]–[38].

When we consider AFs in the context of deep neural networks (DNNs), we note two main observations. First, AFs continue to be used extensively, but mostly via hybrid approaches that combine neural networks with standard AF algorithms. Second, the underlying AF algorithms and tools for the development of new AFs have had little change in several decades. Hybrid approaches, however, have proven very successful. For example, in AEC, neural networks can be trained for residual echo suppression, noise suppression, reference estimation, and more [39]–[47]. In a similar vein, neural networks paired with AFs for active noise control tasks have shown impressive results [48]–[51]. For dereverberation, …
…and machine learning including optimal control, optimization, automatic machine learning, reinforcement learning, and meta-learning. Relevant works within these areas include automatic selection of step sizes [7], [66], [67], the direct control of model weights [68]–[70], rapid fine-tuning [71], and meta-learning optimization rules [72]. Out of these works, learning optimization rules or learned optimizers is of critical relevance [73]–[75] and presents the idea of using one neural network as a function that optimizes another function. Such works, however, focus on creating learned optimizers for training neural networks in an offline setting, where the latter network is the final product, and the learned optimizer is discarded (or otherwise used to train additional networks). Moreover, this work has had little application to AFs, except for our own work [63], which we extend here.

In this work, we formulate the development of AF algorithms as a meta-learning problem. We learn adaptive filter update rules directly from data using self-supervision and call our approach meta adaptive filtering (Meta-AF). Using our method, we no longer need humans to hand-derive update equations, do not need any supervised labeled data for learning, and do not need exhaustive tuning. To showcase our approach, we systematically develop learned AFs for exemplary applications of each of the four canonical AF architectures. Then, for each AF task, we compare our work to a suite of baselines and/or state-of-the-art approaches for the problems of system identification, acoustic echo cancellation, equalization, multi-channel dereverberation, and multi-channel interference cancellation (beamforming). For all tasks, we use identical hyperparameters, significantly reducing engineering and design time. We evaluate performance using signal-to-noise ratio (SNR)-like signal metrics and perceptual- or task-specific metrics as well as specific qualitative comparisons. Our results show we can learn high-performing AF algorithms that operate in real-time and, in most cases, outperform all methods we compare against.

The contributions of our work are as follows: 1) we present the first general-purpose method of meta-learning AF algorithms (update rules) directly from data via self-supervision (no labels required), 2) we apply our approach to all canonical AF architecture categories including system identification, inverse modeling, prediction, and (informed) interference cancellation, and 3) we show how a single hyperparameter configuration of our approach, trained with different datasets and losses, can outperform all common AF baselines and/or several past state-of-the-art methods we compare against according to one or more evaluation metrics while remaining suitable for real-time operation on commodity hardware. Compared to our previous work on single-channel single-talk AEC [63], we present several new improvements including 1) an improved loss, 2) additional inputs to our learned optimizer, 3) an updated development for multi-block, multi-channel AFs with coupled updates, and 4) extensive experimentation on four new applications. We release demos for each task and open source all code¹ including baselines.

¹For demos and code, please see https://jmcasebeer.github.io/projects/metaaf and https://github.com/adobe-research/MetaAF, respectively.

II. BACKGROUND

A. Notation

We provide an overview of the symbols and operators we use in Table I. We denote scalars via lower-case symbols, column vectors via bold, lower-case symbols, and matrices via bold upper-case symbols. We use bracket indexing [τ] to denote time-varying signals and an underline to denote the time-domain counterpart of a frequency-domain (FD) symbol. We index FD rows via the subscript k, columns via the subscript m, and elements via the subscripts km.

TABLE I: Symbols and operators.

Symbols          Description
x ∈ R            A real-valued scalar
x ∈ C            A complex-valued scalar
x̲ ∈ R            A real-valued time-domain scalar
x̲ ∈ R^N          A time-domain N-dimensional column vector
x ∈ C^N          A complex-valued N-dimensional column vector
X ∈ C^(M×N)      A complex-valued M × N matrix
x[τ]             A time-varying column vector
X[τ]             A time-varying matrix
w[τ]             FD AF linear filter
u[τ]             FD AF input
d[τ]             FD AF target or desired response
y[τ]             FD AF estimated response
s[τ]             FD AF true desired signal
I_N              An N × N identity matrix
0_(N×R)          N × R matrix of zeros
1_(N×R)          N × R matrix of ones
F_N              N-point discrete Fourier transform (DFT) matrix

Operators        Description
∗                Convolution
(·)^⊤            Transpose
(·)^*            Complex conjugate
(·)^H            Hermitian transpose
diag(·)          Vector to a diagonal matrix
cat(· · ·)       Column vector concatenation (vertical stack)
E                Expected value
(·)/(·)          Element-wise division
‖ · ‖            Euclidean norm
| · |            Element-wise magnitude of a complex value
∠                Phase of a complex value
ln               Natural logarithm
(·)^(−1)         Scalar or matrix inverse

B. Overlap-Save & Overlap-Add Filtering

We perform short-time (STFT) Fourier filtering using either overlap-save (OLS) or overlap-add (OLA) convolution. The OLS method uses block processing by splitting the input signal into overlapping windows and recombining complete non-overlapping components. We use u̲m[t] ∈ R to represent the time-domain sample recorded by microphone m at discrete time t. We collect N such samples from microphone m to form the time-domain frame u̲m[τ] = [u̲m[τR − N + 1], · · ·, u̲m[τR]] ∈ R^N, where τ is the frame index, N is the window length in samples, R is the hop size in samples, and O = N − R is the overlap between frames in samples. Finally, we collect samples from all M microphones to form a multi-channel time-domain signal, U̲[τ] = [u̲1[τ], · · ·, u̲M[τ]] ∈ R^(N×M). We compute the corresponding FD representation via U[τ] = F_N U̲[τ] ∈ C^(K×M), where K is the number of frequency bins, set to K = N for this work. We select the mth channel from a multi-channel FD representation using …
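For concreteness, the framing and transform steps above can be sketched in a few lines of JAX (the toolkit our released code uses); `fd_frame`, its arguments, and the slicing convention are illustrative stand-ins rather than the released API:

```python
import jax.numpy as jnp

def fd_frame(u, tau, N=1024, R=512):
    # u: (T, M) real multi-channel signal. Frame tau covers samples
    # [tau*R - N + 1, ..., tau*R] per Sec. II-B, so O = N - R samples
    # overlap with the previous frame; we assume tau*R >= N - 1.
    start = tau * R - N + 1
    U_time = u[start:start + N, :]        # time-domain frame U_[tau], (N, M)
    return jnp.fft.fft(U_time, axis=0)    # U[tau] = F_N U_[tau], with K = N bins
```

OLS filtering would then multiply per-bin filters with this FD frame and, after the inverse DFT, keep only the last R output samples of each block.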
…where γ is a forgetting factor, dkm[τ] and ykm[τ] are the desired and estimated signal at frequency k and reference microphone m, and the initialization of Pk[τ] is critical and commonly based on the input SNR. In the case of multi-block and/or multi-channel filters, there are multiple ways to formulate RLS, some of which differ from time-domain RLS. Common approaches include diagonalized RLS (D-RLS) and block diagonalized RLS (BD-RLS) as well as QR decomposition techniques [5], [78], [79]; they differ in which terms of the covariance (precision) matrix are modeled. D-RLS makes an independence assumption and optimizes each k, m, b filter tap separately, forming K diagonal BM × BM precision matrices. BD-RLS couples across frames and channels by forming K separate BM × BM precision matrices. In the case of single block/channel filters, D-RLS, BD-RLS, and NLMS reduce to the same algorithm with different parameterizations. In our case, we use identical BD-RLS implementations across all tasks.

When conventional optimizers are compared to each other, the order of performance commonly follows LMS, RMSProp, NLMS, and RLS, while the order of computational complexity is reversed. These algorithms, however, can be sensitive to tuning, nonstationarities, nonlinearities, and other issues that require engineering effort to mitigate failure cases and ensure stability. For multi-block BD-RLS filters, in particular, poor partition conditioning can also lead to degraded RLS performance [80] and/or stability issues compared to NLMS and other alternatives. For the purposes of this work, we exhaustively grid-search tune the hyperparameters on held-out validation sets of signals per task.

Beyond these basic optimizers, we also compare against several additional methods. For the AEC task, we compare against the double-talk robust diagonal Kalman filter (D-KF) [11], the open-source double-talk robust Speex algorithm [12], a weighted-RLS (wRLS) algorithm [81], and WebRTC-AEC3 [82]. The D-KF algorithm is recommended over other variants [15] and is a common AEC baseline [61]. In addition, the Speex and wRLS [81] algorithms were the linear AFs used (with a non-linear post-processor) in the first- [47] and second-place [81] winners of the ICASSP 2021 Acoustic Echo Cancellation Challenge [83], respectively. Since our work focuses on linear adaptive filters and can easily be combined with non-linear post-processors, we believe D-KF, wRLS, Speex, and WebRTC-AEC3 are reasonable baselines. For dereverberation, we compare against NARA-WPE [30], which is a highly effective normalized BD-RLS based optimizer [27], [30] and comparable to the original NTT implementation [84].

E. Related Work

Initial work on automatically tuning AFs has been explored in incremental delta-bar-delta [66], Autostep [67], and elsewhere. Recent machine learning work discusses the idea of using DNNs to learn entirely novel optimizer update rules from scratch [72]–[75]. We take inspiration from this work, but make numerous advances specific to AFs. In particular, past learned optimizers [72] are element-wise, offline, real-valued, only a function of the gradient, and are trained to optimize general purpose neural networks. In contrast, we design online AF optimizers that use multiple input signals, are complex-valued, adapt block FD linear filters, and integrate domain-specific insights to reduce complexity and improve performance (coupling across channels and time). Moreover, we deploy learned optimizers to solve AF tasks as the end-goal and do not use them to train downstream neural networks. We also note recent work that uses a supervised DNN to control the step-size of a D-KF for AEC [61] and another that uses a supervised DNN to predict both the step-size and a nonlinear reference signal for AEC [62]. Compared to these, we replace the entire update with a neural network, do not need supervisory signals, and investigate many tasks.

III. PROPOSED METHOD

A. Problem Formulation

We formulate AF algorithm design as a meta-learning problem and train neural networks to learn AFs from data, creating meta-learned adaptive filters. This is in contrast to typical AFs that are manually created by human engineers. To do so, we define a learned optimizer, gφ(·), as a neural network with one or more input signals parameterized by weights φ that optimizes an AF loss or optimizee L(hθ(·), · · ·), or L for short, using an additive update rule

θ[τ + 1] = θ[τ] + gφ(·).   (20)

We then seek an optimal AF optimizer gφ̂ over dataset D

φ̂ = arg min_φ E_D[ L_M( gφ, L(hθ(·), · · ·) ) ],   (21)

where L_M is a functional that defines the meta (or optimizer) loss that is a function of gφ and an AF loss L with one or more inputs and filtering function hθ that itself has one or more inputs and parameters θ. Intuitively, when we solve (21), we learn a network gφ(·) that solves the AF loss L when applied repeatedly via an additive update.

B. Optimizee Architecture & Loss

The optimizee, or the AF loss L that is optimized via (20), is a function of the filter or architecture hθ(·). The filter can be any reasonable differentiable filtering operator such as time-domain FIR filters, lattice FIR filters, non-linear filters, FD filters, multi-delayed block FD filters [3], etc. Similarly, the AF loss can be any reasonable differentiable loss such as the MSE, ISE, WSE, a regularized loss, negative log-likelihood, mutual information, etc. For our work, we focus on single- and multi-channel multi-frame linear block FD filters hθ applied via OLS or OLA with parameters θk[τ] = {wk[τ] ∈ C^(BM)} with B buffered frames and M channels per frequency k. We also set the FDAF loss L[τ] to be the ISE via (9) with gradient computed with respect to wk[τ]^H as ∇k[τ] = uk[τ](ykm[τ] − dkm[τ])*.
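The per-frequency quantities that both learned and conventional optimizers consume can be written compactly; a minimal JAX illustration of the filtering and ISE gradient above, with shapes and names that are ours rather than the released code:

```python
import jax.numpy as jnp

def fdaf_step(w, u, d):
    # w: (K, B*M) FD filter, u: (K, B*M) buffered FD input,
    # d: (K,) desired response on the reference channel.
    y = jnp.sum(jnp.conj(w) * u, axis=-1)   # y_k = w_k^H u_k
    e = d - y                                # error signal e_k
    grad = u * jnp.conj(y - d)[:, None]      # ISE gradient wrt w_k^H (Sec. III-B)
    return y, e, grad

# A hand-derived LMS update would be w <- w - lam * grad; Meta-AF instead
# feeds (grad, u, d, y, e) to g_phi, which emits the additive update of (20).
```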
TABLE II: Relationship between AF optimizers.

Optimizer   Inputs                   State     Params   ∆k[τ]
LMS         ∇k[τ]                    ∅         λ        (12)
NLMS        ∇k[τ], uk[τ]             vk[τ]     λ, γ     (14)
RMSProp     ∇k[τ]                    νk[τ]     λ, γ     (16)
BD-RLS      uk[τ], dk[τ], yk[τ]      Pk[τ]     γ        (19)
Ours        ξk[τ]                    ψk[τ]     φ        gφ

TABLE III: Optimizer complexity comparison.

Optimizer   Big-O                ≈ CMACS
LMS         O(KMB)               KMB
NLMS        O(KMB)               5KMB
RMSProp     O(KMB)               6KMB
BD-RLS      O(K(MB)²)            K(4(MB)² + 5MB)
Ours        O(K(H² + MBH))       K(12H² + (21 + 10MB)H)
C. Optimizer Architecture & Loss

1) Architecture: Our optimizer gφ is inspired by conventional AF optimizers such as LMS, NLMS, and BD-RLS, but updated to have a neural network form. In particular, we focus on making a generalized, stochastic variant of an RLS-like optimizer that is applied independently per frequency k to our optimizee parameters, but coupled across channels and time frames to allow our approach to model interactions between channels and frames and vectorize across frequency. To do so, we use a recurrent neural network (RNN) where the weights φ are shared across all frequency bins, but we maintain separate state ψk[τ] per frequency. The inputs to our optimizer at frequency k are ξk[τ] = [∇k[τ], uk[τ], dk[τ], yk[τ], ek[τ]], where ∇k[τ] is the gradient of the optimizee with respect to θk, and ek[τ] = dk[τ] − yk[τ]. Our optimizer outputs are the update ∆k[τ] and the internal state ψk[τ + 1], resulting in

(∆k[τ], ψk[τ + 1]) = gφ(ξk[τ], ψk[τ])   (22)
θk[τ + 1] = θk[τ] + ∆k[τ].   (23)

Our design is in contrast to LMS-, NLMS-, and RMSProp-like optimizers that have no state (e.g. LMS) or minimal state dynamics (e.g. NLMS, RMSProp). In addition, these optimizers as well as other learned optimizers [72] typically apply updates independently per element. For a comparison of optimizer inputs, state, parameters, and gradients, please see Table II.

To define our RNN in more detail, we use a small network composed of a linear layer, a nonlinearity, and two Gated Recurrent Unit (GRU) layers with hidden size H = 32, followed by two additional linear layers with nonlinearities, where all layers are complex-valued. We always re-scale the inputs element-wise via

ln(1 + |ξ|)e^(j∠ξ),   (24)

to reduce the dynamic range and facilitate training, but keep the phases unchanged. This pre-processing was found useful in several previous works [63], [72], although previous work used explicit clipping, which we found unnecessary.
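The re-scaling in (24) is a one-liner; a minimal sketch, with `xi` standing in for any stacked optimizer input:

```python
import jax.numpy as jnp

def rescale(xi):
    # Eq. (24): log-compress the magnitude, keep the phase untouched.
    return jnp.log1p(jnp.abs(xi)) * jnp.exp(1j * jnp.angle(xi))

# Example: compress a toy complex input with a large dynamic range.
xi = jnp.array([1e-4 + 1e-4j, 10.0 - 3.0j, 1e3 + 0.0j])
print(rescale(xi))  # magnitudes become ln(1 + |xi|); phases are unchanged
```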
2) Loss: We examine two meta losses L_M(·) to learn our optimizer parameters φ. First, we define the FD frame independent loss

L_M = (1/L) Σ_{τ}^{τ+L} ln E[ ‖dm[τ] − ym[τ]‖² ],   (25)

where dm[τ] and ym[τ] are the desired and estimated FD signal vectors of the reference channel m (e.g. m = 0). Intuitively, to compute this loss for a given optimizer gφ, we unroll (20) for a time horizon of L time frames, compute the FD mean-squared error over L frames, and then take the logarithm to reduce the dynamic range, which we found to empirically improve learning. This loss ignores the temporal order of AF updates and optimizes for filter coefficients that are unaware of any downstream STFT processing, but the idea of accumulating independent time-step losses is common [72].

Second, we define the time-domain frame-accumulated loss

L_M = ln E[ ‖d̄m[τ] − ȳm[τ]‖² ]   (26)
d̄m[τ] = cat(d̲m[τ], d̲m[τ + 1], · · ·, d̲m[τ + L − 1])   (27)
ȳm[τ] = cat(y̲m[τ], y̲m[τ + 1], · · ·, y̲m[τ + L − 1]),   (28)

where d̲m[τ] and y̲m[τ] are the time-domain desired and estimated responses of reference channel m, and d̄m[τ] ∈ R^(RL) and ȳm[τ] ∈ R^(RL). Intuitively, to compute this loss for a given optimizer gφ, we run (20) for a time horizon of L frames, concatenate the sequence of time-domain outputs and target signals to form longer signals, then compute the time-domain MSE loss, and take the logarithm. While both losses use the same time-horizon, the frame accumulated loss allows us to model boundaries between adjacent updates and implicitly learn updates that are STFT consistent [85]. To the best of our knowledge, our frame accumulated loss is novel for AFs.

3) Computational Complexity: We compare the computational complexity of our proposed approach to conventional optimizers in Table III. We note that the complexity of our approach is dependent on the hidden state size H of our RNN, but is linear in channels M and blocks B, whereas BD-RLS is quadratic in M and B, but does not depend on H. Note that prior work on learned optimizers, including our own [63], performs optimization completely element-wise, resulting in a larger complexity of O(KMBH²).

D. Learning the Optimizer

To learn an optimizer gφ from data, we solve (21) using standard deep learning tools (i.e. JAX [86], [87]), including the use of automatic differentiation for training and inference. In addition, we use truncated backpropagation through time (TBPTT) [88] with a stochastic gradient descent optimizer, Adam [89], that we call our meta optimizer. We show a simplified form of our training algorithm in Alg. 1 using our frame accumulated loss and a batch size of one, where STFT is an OLA or OLS processor, GRAD returns the gradient of the first argument with respect to the second, SAMPLE randomly samples signals from a dataset D, and NEXTL grabs the next L buffers from a longer signal. In practice, we use batching.

In more detail, we train gφ until the application specific mean SNR metric on the validation fold of a dataset D has not improved for four epochs. For each of our five applications, we use (29), (30), (31), (36), and (45), respectively. We halve …
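To make the procedure concrete, here is a condensed, batch-size-one sketch in the spirit of Alg. 1, using optax's Adam as the meta optimizer. `optimizer_apply` (the gφ forward pass) and `frames_to_time` (the OLA/OLS synthesis) are hypothetical stand-ins, and `fdaf_step` is the illustration from Sec. III-B above; this mirrors the TBPTT structure but omits batching and other bookkeeping:

```python
import jax
import jax.numpy as jnp
import optax

def unrolled_loss(phi, theta, psi, frames):
    # Apply Eq. (20) for L frames and score with the frame
    # accumulated loss of Eq. (26).
    d_frames, y_frames = [], []
    for (u, d) in frames:                                    # L buffered frames
        y, e, grad = fdaf_step(theta, u, d)
        delta, psi = optimizer_apply(phi, (grad, u, d, y, e), psi)
        theta = theta + delta                                # Eq. (23)
        d_frames.append(d)
        y_frames.append(y)
    d_bar = frames_to_time(d_frames)                         # cat(...), Eq. (27)
    y_bar = frames_to_time(y_frames)                         # cat(...), Eq. (28)
    return jnp.log(jnp.mean(jnp.abs(d_bar - y_bar) ** 2)), (theta, psi)

meta_opt = optax.adam(1e-4)  # the meta optimizer

def tbptt_step(phi, opt_state, theta, psi, frames):
    (loss, (theta, psi)), g = jax.value_and_grad(
        unrolled_loss, has_aux=True)(phi, theta, psi, frames)
    updates, opt_state = meta_opt.update(g, opt_state)
    phi = optax.apply_updates(phi, updates)
    # Truncate: carry the AF filter and optimizer state forward
    # without backpropagating across unroll boundaries.
    theta, psi = jax.lax.stop_gradient((theta, psi))
    return phi, opt_state, theta, psi, loss
```

Initialization follows the usual optax pattern, `opt_state = meta_opt.init(phi)`, and training repeats `tbptt_step` over segments of L frames drawn from D.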
Fig. 2: Optimizer design decision ablation. Using an accumulated log-loss leads to our best model, particularly for more complex tasks we address later on. RLS-like inputs are also useful. The exact value of L is not critical, but larger is better. [Bar chart residue removed; y-axis: Median SNRd (dB).]

Fig. 3: Evaluating the effect of different true system orders and adaptive filter orders. Our learned AFs can operate well across a variety of linear system orders, even when training is restricted to systems of a fixed length (1024 taps). [Plot residue removed; axes: Adaptive Filter Order (256, 512, 1024, 2048, 4096) vs. True System Order.]
…tuning except changing the dataset used for training and using different checkpoints caused by early stopping on validation performance. In contrast, we re-tune all conventional optimizer baselines for all subsequent tasks on held-out validation sets.

1) Loss Function: First, we compare our selected frame accumulated loss model (light blue) to the frame accumulated loss without log scaling (black) as well as the frame independent loss (yellow) and without log scaling (light-purple). As shown, the log-loss has the single largest effect on SNRd and yields an astounding 11.7/7.3 dB improvement compared to the no-log losses. When we compare the independent vs. accumulated loss, the accumulation provides a .2 dB improvement. However, when we listen to the estimated response, especially in more complex tasks, we found that the accumulated loss introduces fewer artifacts and perceptually sounds better. Thus, we fix the optimizer loss to be (26).

2) Input Features: Having selected the optimizer loss, we compare the model inputs. We compare setting the optimizer input to be only the gradient ξk[τ] = [∇k[τ]] for an LMS-like learned optimizer (dark purple) and setting the optimizer input to be the full signal set ξk[τ] = [∇k[τ], uk[τ], dk[τ], yk[τ], ek[τ]] for our selected RLS-like learned optimizer (light blue). As shown, the inputs have the second largest effect on SNRd and using the full signal set yields a notable 7.9 dB improvement. Thus, we set the input to be the full signal set.

3) AF Unroll: With the optimizer loss and inputs fixed, we evaluate four different values of AF unroll length, L = 2, 8, 16, 32, where L is the number of time-steps over which the optimizer loss is computed in (26). Intuitively, a larger unroll introduces less truncation bias but may be more unstable during training due to exploding or vanishing gradients. The case where L = 2 corresponds to no unroll, since for L = 1 the meta loss is not a function of the optimizer parameters and yields a zero gradient. As shown, for no unroll (L = 2, grey), we get a reduction of the SNRd by 1.8 dB compared to our best model. When selecting the unroll between 8, 16, and 32, however, there is a small (< 1 dB) overall effect. That said, we find that longer unroll values work better in challenging scenarios but take longer to train. As a result, we select an unroll length of 16, as it represents a good trade-off between performance and training time. Note, the unroll length only affects training and is not used at test time.

D. System Order Modeling Error Results & Discussion

Given our fixed set of optimizer and meta-optimizer parameters, we evaluate the robustness of our Meta-ID AF to modeling errors by studying what happens when we use an optimizee filter that is too short or too long compared to the true system. We do so by testing a learned optimizer with 1) optimizee filter lengths between 256 and 4096 taps and 2) held-out signals with true filter lengths between 256 and 4096, as well as full length systems (up to 32,000 taps).

Results are shown in Fig. 3. We measure performance by computing the segmental SNRd score via (29) at convergence. As expected, when the adaptive filter order is equal to or greater than the true system order, we achieve SNRs of ≈ 40 dB. It is interesting to note that our learned AF was only ever trained on optimizee filters with an order of 1024 and 1024 tap true systems. This experiment suggests our learned optimizers can generalize to new optimizee filter orders.

V. ACOUSTIC ECHO CANCELLATION ABLATION

A. Problem Formulation

In our second task, we train a Meta-AF for acoustic echo cancellation (AEC). The goal of AEC is to remove the far-end echo from a near-end signal for voice communication by mimicking a time-varying transfer function as shown in Fig. 4. The far-end refers to the signal transmitted to a local listener and the near-end is captured by a local mic. We model the unknown transfer function between the speaker and mic with a linear multi-delay frequency-domain filter hθ, measure the noisy response d which includes the input signal u, noise n, and near-end speech s, and adapt the filter weights θ using our learned Meta-AEC AF, gφ. The time-domain signal model is d̲[t] = u̲[t] ∗ w̲ + n̲[t] + s̲[t]. The AF loss is the ISE between the near-end and the predicted response. The FDAF near-end speech estimate is ŝk[τ] = dk[τ] − wk[τ]^H uk[τ].

B. Experimental Design

We compare our approach to LMS, RMSProp, NLMS, BD-RLS, a diagonal Kalman filter (D-KF) model [11], and Speex [12] for a variety of acoustic echo cancellation scenarios. Our scenarios, in increasing difficulty, include single-talk,
double-talk, double-talk with a path change, and noisy double-talk with a path change and non-linearity. Single-talk refers to the case where only the far-end input signal u is active. Double-talk refers to the case where both the far-end signal u and near-end talker signal s are active at the same time. A path change refers to the case where the true system transfer function is abruptly changed (e.g. during a phone call). Nonlinearities refer to the case where the true system is not strictly linear (e.g. harmonic loudspeaker distortion). We train a single gφ for AEC and then test it for each scene type against all hyperparameter-tuned baselines.

[Fig. 5 residue removed; panels: Median ERLE (dB) and ERLE vs. Time (Seconds) for LMS, RMSProp, NLMS, BD-RLS, D-KF, Speex, and Meta-AEC; caption not recovered.]

Fig. 6: AEC double-talk performance. Meta-AEC converges fastest and has similar peak performance to the D-KF, while preserving near-end speech quality. [Panels: Median STOI and ERLE vs. Time (Seconds).]
We measure echo cancellation performance using segmental echo-return loss enhancement (ERLE) [93] and short-time objective intelligibility (STOI) [94]. Segmental ERLE is

ERLE(d̲u[τ], y̲[τ]) = 10 · log10( ‖d̲u[τ]‖² / ‖d̲u[τ] − y̲[τ]‖² ),   (30)

where d̲u is the noiseless system response and higher values are better. When averaging, we discard silent frames using an energy-threshold VAD. In scenes with near-end speech, we use STOI ∈ [0, 1] to measure the preservation of near-end speech quality. Higher STOI values are better. We train gφ via Algorithm 1 using one GPU, which took ≈ 72 hours. We use a four-block multi-delay OLS filter (MDF) with window sizes of N = 1024 samples and a hop of R = 512 samples on 16 kHz audio. In double-talk scenarios, we use the noisy near-end d as the target and do not use oracle cancellation results (such as d̲u).
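A direct transcription of (30), assuming fixed-length frames and leaving the VAD-based silence removal to the caller; the function name and arguments are illustrative:

```python
import jax.numpy as jnp

def segmental_erle_db(d_u, y, frame=512, eps=1e-10):
    # Eq. (30): noiseless echo energy over residual energy, per frame, in dB.
    T = (d_u.shape[0] // frame) * frame
    du = d_u[:T].reshape(-1, frame)
    res = (d_u[:T] - y[:T]).reshape(-1, frame)
    return 10.0 * jnp.log10(
        (jnp.sum(du ** 2, axis=1) + eps) / (jnp.sum(res ** 2, axis=1) + eps))
```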
With respect to datasets for single-talk, double-talk, and double-talk with path-change experiments, we re-mix the synthetic fold of the ICASSP 2021 AEC Challenge dataset (ICASSP-2021-AEC) [91] with impulse responses from [92]. We partition [92] into non-overlapping train, test, and validation folds and set the signal-to-echo-ratio randomly between [−10, 10] with uniform distribution. To simulate a scene change, we splice two files such that the change occurs randomly between seconds four and six. For the noisy double-talk with nonlinearity experiments, we use the synthetic fold of [91]. We apply a random circular shift and random scale to all files, each ten seconds long. For each task, there are 9000 training, 500 validation, and 500 test files. Finally, we also use an unmodified version of the ICASSP-2021-AEC training, validation, and test set (does not include scene changes) to compare to other previously published works directly.

C. Results & Discussion

Overall, we find that our approach significantly outperforms all previous methods in all scenarios, but has a larger advantage in harder scenes; more details are discussed below.

1) Single-Talk: Our approach (light blue, x) exhibits strong single-talk performance and surpasses all baselines by ≈ 3 dB or more in both median and converged ERLE, as shown in Fig. 5. Additionally, Meta-AEC converges fastest, reaching steady-state ≈ 4 seconds before other baselines.

2) Double-Talk: Our method (light blue, x) converges fastest in double-talk, and matches the D-KF in converged performance, as shown in Fig. 6. Meta-AEC converges ≈ 5 seconds faster while scoring better in STOI. This result is striking as it is typically necessary to either explicitly freeze adaptation via double-talk detectors or implicitly freeze adaptation via carefully derived updates as found in both the D-KF model (dark blue, down triangle) and Speex (green, up triangle). We hypothesize our method automatically learns how to adapt in double-talk in a completely autonomous fashion.

3) Double-Talk with Path Change: Our method (light blue, x) is able to more robustly handle double-talk with path changes compared to other methods as shown in Fig. 7. Similar to straight double-talk, our approach effectively learns how to deal with adverse conditions (i.e. a path change) without explicit supervision, converging and reconverging in ≈ 2.5 seconds, with .044 better median STOI. All other algorithms similarly struggle, even Speex (green, up triangle), which has explicit self-resetting and dual-filter logic.
Fig. 7: AEC double-talk with a path change (shaded region) performance. Our approach re-converges rapidly with high speech quality. [Panels: Median STOI and STOI vs. Time (Seconds).]

TABLE IV: ICASSP-2021-AEC test set linear filter results. Our proposed method outperforms several past comparable linear-filtering approaches. A ⋆ denotes results when models were trained/tuned on the ICASSP-2021-AEC data.

Method             STOI    STOI⋆   ERLE    ERLE⋆
WebRTC-AEC3 [82]   0.82    N.A     N.A     N.A
Speex [12]         0.869   N.A     3.92    N.A
Meta-AEC           0.881   0.899   7.73    9.13

[Figure residue removed; panels: Median STOI and STOI vs. Time (Seconds) for LMS, RMSProp, NLMS, BD-RLS, D-KF, Speex, and Meta-AEC; caption not recovered.]
Fig. 10: Equalization results for signal (SNRd) and system (SNRw) SNR. Meta-EQ performance is the least impacted by constraints. [Bar chart residue removed; panels: Median SNRd (dB) and Median SNRw (dB) for LMS, RMSProp, NLMS, D-RLS, and Meta-EQ, in OLS and aliased configurations.]

Fig. 11: Comparison of true and estimated systems over time. The Meta-EQ system rapidly fits to the correct inverse model. The top plot shows an example system and the bottom shows SNRw over time across the test set. [Plot residue removed; axes: Gain (dB), and SNRw vs. Time (Seconds) for D-RLS, D-RLS Alias., Meta-EQ, and Meta-EQ Alias.]

…set hθ to use aliased OLS where Zw = I_K. This comparison lets us test if Meta-EQ is automatically learning constraint-aware update rules. We train a new gφ for each case (no separate tuning) and tune all baselines for each case.
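The constraint under comparison admits a compact sketch. In its standard form, the OLS gradient constraint projects the filter back onto a causal form by zeroing the aliased half of its time-domain response, while the aliased configuration (Zw = I_K) skips the projection entirely. The function below is our illustration of that standard projection, not an excerpt of the released code:

```python
import jax.numpy as jnp

def ols_constrain(w, valid=512):
    # Standard OLS constraint: keep the first `valid` time-domain taps of
    # the (K,) FD filter, zero the rest, and transform back.
    w_time = jnp.fft.ifft(w)
    w_time = w_time.at[valid:].set(0.0)
    return jnp.fft.fft(w_time)

# The aliased variant (Zw = I_K) is simply the identity: it returns w as-is.
```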
We measure performance with signal SNRd and system SNRw. We define these as

SNRd(d̲, y̲) = 10 · log10( ‖d̲‖² / ‖d̲ − y̲‖² )   (31)

SNRw(ŵ⁻¹, w⁻¹) = 10 · log10( ‖|w⁻¹|‖² / ‖|w⁻¹| − |ŵ⁻¹|‖² )   (32)

respectively, where higher is better. We compute SNRw using the inverse system magnitude, which ignores the phase. We train gφ via Algorithm 1 on one GPU, which took at most 36 h. We use an OLS filter with a window size of N = 1024 samples and a hop of R = 512 samples on 16 kHz audio.
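Both metrics reduce to one ratio; a minimal sketch, where `w_inv` and `w_inv_hat` are assumed arrays holding the true and estimated inverse responses:

```python
import jax.numpy as jnp

def snr_db(ref, est, eps=1e-10):
    # Shared form of Eqs. (31) and (32).
    return 10.0 * jnp.log10(
        jnp.sum(jnp.abs(ref) ** 2) / (jnp.sum(jnp.abs(ref - est) ** 2) + eps))

# snr_d = snr_db(d, y)                                 # Eq. (31), signals
# snr_w = snr_db(jnp.abs(w_inv), jnp.abs(w_inv_hat))   # Eq. (32), magnitudes only
```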
To construct the equalization dataset, we use speech from the DAPS dataset [95], take the cleanraw recordings as inputs, and apply random equalizer filters from the sox library to generate the outputs, where we randomly pick between [5, 15] filters with settings c ∈ [1, 8] kHz, g ∈ [−18, 18], and q ∈ [.1, 10]. All values are sampled uniformly at random to produce 16,384 train, 2048 validation, and 2048 test signals, all 5 seconds long. At train, validation, and test time we truncate the system response to 512 taps.
C. Results & Discussion

We find our approach (blue, solid) outperforms LMS, RMSProp, NLMS, and D-RLS for our equalization task by a noticeable margin as shown in Fig. 10, and we further verify this with a qualitative analysis plot in Fig. 11.

1) Constrained vs Unconstrained: For the unconstrained case, our method outperforms D-RLS in SNRd by .75 dB and by 4.67 dB in SNRw. When we look at the constrained case, the performance for all models is degraded. Interestingly, however, our performance is proportionally degraded the least. We hypothesize that our approach learns to perform updates which are aware of the constraint.

2) Temporal Performance Analysis: We display final system and convergence results in Fig. 11. Our Meta-EQ model finds better solutions more rapidly than D-RLS. D-RLS diverged ≈ 300 times but Meta-EQ never did.

3) Computational Complexity: Our learned AF has a single CPU core RTF of ≈ 0.24 and 32 ms latency. Our optimizer network alone has ≈ 14K complex-valued parameters and a single CPU core RTF of ≈ 0.19.

Fig. 12: Prediction block diagram. A buffer of past inputs is used to estimate a future, unknown signal. The delay, z^(−D), signifies a delay by D frames.

VII. DEREVERBERATION ABLATION

A. Problem Formulation

For our fourth task, we train a Meta-AF to perform dereverberation via multi-channel linear prediction (MCLP) or the weighted prediction error (WPE) formulation, as is commonly used for speech-to-text pre-processing. The WPE formulation is based on the idea of being able to predict the reverberant part of a signal from a linear combination of past samples, most commonly in the frequency-domain [25], [26], and is shown as a block diagram in Fig. 12. Using our method, we use a multi-channel linear frequency-domain filter hθ and adapt the filter weights θ using a learned AF gφ to minimize the normalized ISE AF loss below.

Assuming an array of M microphones, we estimate a dereverberated signal with a linear model via

ŝkm[τ] = dkm[τ] − wk[τ]^H uk[τ]   (33)

where ŝkm[τ] ∈ C is the current dereverberated signal estimate at frequency k and channel m, dkm[τ] ∈ C is the input microphone signal, and wk[τ] ∈ C^(BM) is a per frequency filter
with B time frames and M channels flattened into a vector. We then minimize the per channel and frequency loss

L(ŝkm[τ], λk[τ]) = ‖ŝkm[τ]‖² / λk²[τ],   (34)

λk²[τ] = (1 / (M(B + D))) Σ_n^τ dk[n]^H dk[n],   (35)
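Read as code, the normalized loss of (34)-(35) is a per-bin power ratio; a minimal sketch, assuming the normalizer averages over the B + D buffered frames (the exact summation limits in (35) did not survive extraction cleanly):

```python
import jax.numpy as jnp

def wpe_loss(s_hat, d_buf):
    # s_hat: (K,) dereverberated estimate from Eq. (33).
    # d_buf: (B + D, K, M) buffered multi-channel input frames.
    BD, _, M = d_buf.shape
    lam2 = jnp.sum(jnp.abs(d_buf) ** 2, axis=(0, 2)) / (M * BD)  # Eq. (35), (K,)
    return jnp.abs(s_hat) ** 2 / (lam2 + 1e-10)                  # Eq. (34), per bin
```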
[Figure residue removed; panels: SRR (dB) and Median STOI vs. Num. Mics; caption not recovered.]

…domain signal model for mic m via

u̲m[t] = r̲m[t] ∗ s̲[t] + n̲m[t]   (37)

where u̲m[t] ∈ R is the input signal, n̲m[t] ∈ R is the noise signal, s̲[t] ∈ R is the target signal, and r̲m[t] ∈ R is the impulse response from the source to mic m. In the time-frequency domain with a sufficiently long window, this can be reformulated as …

[Fig. 15 residue removed; panels: Median ∆ SI-SDR (dB) and Median ∆ STOI for Diffuse Interferer and Directional Interferer scenes; caption not recovered.]
…and first compute

Φk^ss[τ] = γΦk^ss[τ − 1] + (1 − γ)(sk[τ]sk[τ]^H + λI_M)   (39)

where Φk^ss[τ] ∈ C^(M×M) is a time-varying estimate of the target signal spatial covariance matrix, γ is a forgetting factor, …
Fig. 16: BSS eval comparison across interferers. Meta-GSC provides more suppression with less distortion and artifacts. [Bar chart residue removed; Diffuse Interferer and Directional Interferer rows.]

…where P(·) extracts the principal component and vk[τ] ∈ C^M is the final steering vector. We then use the steering vector to estimate a blocking matrix Bk[τ]. The blocking matrix is orthogonal to the steering vector and can be constructed as

Bk[τ] = [ −[vk1[τ], · · ·, vkM[τ]]^H / vk0[τ]^H ]
        [ I_(M−1×M−1)                          ]  ∈ C^(M×M−1).   (42)

The distortionless constraint is then satisfied by applying the GSC beamformer as

ŝk[τ] = (vk[τ] − Bk[τ]wk[τ])^H uk[τ]   (43)

where wk[τ] ∈ C^(M−1) is the adaptive filter weight, and the desired response for the loss is dk[τ] = vk[τ]^H uk[τ].

Our objective is to learn an optimizer gφ that minimizes the AF ISE loss using this GSC filter implementation. By doing so, we learn an online, adaptive beamformer that listens in one direction and suppresses interferers from all others.

B. Experimental Design

We compare our Meta-GSC beamformer to LMS, RMSProp, NLMS, and BD-RLS beamformers, in scenes with either diffuse or directional noise sources. We seek to test if our method can learn to process scenes with different spatial characteristics without modification. We train a single gφ and tune all baselines on a single dataset of all scene types. We measure performance using scale-invariant source-to-distortion ratio (SI-SDR) [98] and STOI. SI-SDR is computed as

a = (ŝ^⊤ s)/‖s‖²   (44)

SI-SDR(s, ŝ) = 10 · log10( ‖as‖² / ‖as − ŝ‖² ),   (45)

where larger values indicate better performance. STOI is computed between the output and desired speech signal. We also compute the BSS eval metrics, source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR) [99]. We train gφ via Algorithm 1 on one GPU, which took ≈ 24 h. We use an OLA filter with a Hann window size of N = 1024 samples and a hop of R = 512 samples on 6 channel 16 kHz audio.

We use the CHIME-3 challenge proposed in [100], [101]. This dataset contains scenes with simulated speech and relatively diffuse noise sources in a multi-channel environment. The array is rectangular and has six microphones spaced around the edge of a smart-tablet. There are 7,138 training files, 1,640 validation files, and 1,320 test files. When running this dataset with directional sources, we mix spatialized speech from one mixture with the spatialized speech from a random other mixture. We do not mix speech across folds.

C. Results and Discussion

We find that Meta-GSC outperforms LMS, RMSProp, NLMS, and BD-RLS in median performance metrics as shown in Fig. 15 and Fig. 16 and in a qualitative analysis in Fig. 17.

1) Diffuse Interferers: The diffuse scenario tests the ability to suppress omnidirectional noise in a perceptually pleasant fashion. We show these comparisons in the “Diffuse Interferer” rows of Fig. 15 and Fig. 16. The median input STOI was 0.675 and the median input SI-SDR was −0.67. Meta-GSC (blue, …
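For reference, the blocking-matrix construction of (42) and the GSC output of (43) at one frequency can be sketched as follows; the paper's exact indexing and normalization may differ, so treat this as a structural illustration only:

```python
import jax.numpy as jnp

def gsc_output(v, w, u):
    # v: (M,) steering vector, w: (M-1,) adaptive weights, u: (M,) FD mics.
    M = v.shape[0]
    top = -jnp.conj(v[1:] / v[0])[None, :]               # 1 x (M-1) top row
    B = jnp.concatenate([top, jnp.eye(M - 1)], axis=0)   # (M, M-1); B^H v = 0
    s_hat = jnp.conj(v - B @ w) @ u                      # Eq. (43)
    d = jnp.conj(v) @ u                                  # desired response v^H u
    return s_hat, d
```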
[57] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Interspeech, 2016.
[58] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP. IEEE, 2016.
[59] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florencio, and M. Hasegawa-Johnson, “Deep learning based speech beamforming,” in ICASSP. IEEE, 2018.
[60] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, “ADL-MVDR: All deep learning MVDR beamformer for target speech separation,” in ICASSP. IEEE, 2021.
[61] T. Haubner, A. Brendel, and W. Kellermann, “End-to-end deep learning-based adaptation control for frequency-domain adaptive system identification,” in ICASSP. IEEE, 2022.
[62] H. Zhang, S. Kandadai, H. Rao, M. Kim, T. Pruthi, and T. Kristjansson, “Deep adaptive AEC: Hybrid of deep learning and adaptive acoustic echo cancellation,” in ICASSP. IEEE, 2022.
[63] J. Casebeer, N. J. Bryan, and P. Smaragdis, “Auto-DSP: Learning to optimize acoustic echo cancellers,” in WASPAA. IEEE, 2021.
[64] J. Casebeer, J. Donley, D. Wong, B. Xu, and A. Kumar, “NICE-beam: Neural integrated covariance estimators for time-varying beamformers,” arXiv:2112.04613, 2021.
[65] R. Scheibler and M. Togami, “Surrogate source model learning for determined source separation,” in ICASSP. IEEE, 2021.
[66] R. S. Sutton, “Adapting bias by gradient descent: An incremental version of delta-bar-delta,” in AAAI, 1992.
[67] A. R. Mahmood, R. S. Sutton, T. Degris, and P. M. Pilarski, “Tuning-free step-size adaptation,” in ICASSP. IEEE, 2012.
[68] J. Schmidhuber, “Learning to control fast-weight memories: An alternative to dynamic recurrent networks,” Neural Computation, 1992.
[69] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, “Neural optimizer search with reinforcement learning,” in ICML, 2017.
[70] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” in ICLR, 2017.
[71] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
[72] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in NeurIPS, 2016.
[73] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. Freitas, and J. Sohl-Dickstein, “Learned optimizers that scale and generalize,” in ICML, 2017.
[74] L. Metz, N. Maheswaranathan, J. Nixon, D. Freeman, and J. Sohl-Dickstein, “Understanding and correcting pathologies in the training of learned optimizers,” in ICML, 2019.
[75] T. Chen, W. Zhang, Z. Jingyang, S. Chang, S. Liu, L. Amini, and Z. Wang, “Training stronger baselines for learning to optimize,” in NeurIPS, 2020.
[76] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Prentice-Hall, 1975.
[77] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent,” 2012.
[78] S. T. Alexander and A. L. Ghimikar, “A method for recursive least squares filtering based upon an inverse QR decomposition,” IEEE Transactions on Signal Processing, 1993.
[79] P. Strobach, “Low-rank adaptive filters,” IEEE Transactions on Signal Processing, 1996.
[80] F. Yang, G. Enzner, and J. Yang, “On the convergence behavior of partitioned-block frequency-domain adaptive filters,” IEEE TSP, vol. 69, pp. 4906–4920, 2021.
[81] Z. Wang, Y. Na, Z. Liu, B. Tian, and Q. Fu, “Weighted recursive least square filter and neural network based residual echo suppression for the AEC-challenge,” in ICASSP. IEEE, 2021.
[82] WebRTC Acoustic Echo Cancellation v3. https://webrtc.googlesource.com/src. Accessed: 2022-08-10.
[83] Acoustic Echo Cancellation Challenge – ICASSP 2021. https://www.microsoft.com/en-us/research/academic-program/acoustic-echo-cancellation-challenge-icassp-2021/results/. Accessed: 2022-08-10.
[84] V. Manohar, S.-J. Chen, Z. Wang, Y. Fujita, S. Watanabe, and S. Khudanpur, “Acoustic modeling for overlapping speech recognition: JHU CHiME-5 challenge system,” in ICASSP. IEEE, 2019.
[85] J. Le Roux, N. Ono, and S. Sagayama, “Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction,” in Interspeech, 2008.
[86] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne, “JAX: composable transformations of Python+NumPy programs, 2018,” URL http://github.com/google/jax, 2020.
[87] T. Hennigan, T. Cai, T. Norman, and I. Babuschkin, “Haiku: Sonnet for JAX,” 2020. [Online]. Available: http://github.com/deepmind/dm-haiku
[88] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, 1990.
[89] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
[90] M. Wolter and A. Yao, “Complex gated recurrent neural networks,” in NeurIPS, 2018.
[91] K. Sridhar, R. Cutler, A. Saabas, T. Parnamaa, M. Loide, H. Gamper, S. Braun, R. Aichner, and S. Srinivasan, “ICASSP 2021 acoustic echo cancellation challenge: Datasets, testing framework, and results,” in ICASSP. IEEE, 2021.
[92] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP. IEEE, 2017.
[93] G. Enzner, H. Buchner, A. Favrot, and F. Kuech, “Acoustic echo control,” in Academic Press Library in Signal Processing. Elsevier, 2014, vol. 4.
[94] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE TASLP, vol. 19, no. 7, 2011.
[95] G. J. Mysore, “Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges,” IEEE SPL, vol. 22, no. 8, 2014.
[96] K. Kinoshita, M. Delcroix, S. Gannot, E. A. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, 2016.
[97] A. Jukić, T. van Waterschoot, and S. Doclo, “Adaptive speech dereverberation using constrained sparse multichannel linear prediction,” IEEE SPL, vol. 24, no. 1, 2016.
[98] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in ICASSP. IEEE, 2019.
[99] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE TASLP, vol. 14, no. 4, 2006.
[100] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in ASRU. IEEE, 2015.
[101] ——, “The third CHiME speech separation and recognition challenge: Analysis and outcomes,” Computer Speech & Language, 2017.