

Meta-AF: Meta-Learning for Adaptive Filters


Jonah Casebeer, Student Member, IEEE, Nicholas J. Bryan, Member, IEEE, Paris Smaragdis, Fellow, IEEE

J. Casebeer is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: jonahmc2@illinois.edu). Work performed in part while interning at Adobe Research.
N. J. Bryan was the lead advisor for this work and is with Adobe Research, San Francisco, CA, 94103 USA (e-mail: njb@ieee.org).
P. Smaragdis is with the Department of Computer Science and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: paris@illinois.edu). Partially funded by NIFA award #2020-67021-32799.
Manuscript received Month Day, Year; revised Month Day, Year.

Abstract—Adaptive filtering algorithms are pervasive throughout signal processing and have had a material impact on a wide variety of domains including audio processing, telecommunications, biomedical sensing, astrophysics and cosmology, seismology, and many more. Adaptive filters typically operate via specialized online, iterative optimization methods such as least-mean squares or recursive least squares and aim to process signals in unknown or nonstationary environments. Such algorithms, however, can be slow and laborious to develop, require domain expertise to create, and necessitate mathematical insight for improvement. In this work, we seek to improve upon hand-derived adaptive filter algorithms and present a comprehensive framework for learning online, adaptive signal processing algorithms or update rules directly from data. To do so, we frame the development of adaptive filters as a meta-learning problem in the context of deep learning and use a form of self-supervision to learn online iterative update rules for adaptive filters. To demonstrate our approach, we focus on audio applications and systematically develop meta-learned adaptive filters for five canonical audio problems including system identification, acoustic echo cancellation, blind equalization, multi-channel dereverberation, and beamforming. We compare our approach against common baselines and/or recent state-of-the-art methods. We show we can learn high-performing adaptive filters that operate in real-time and, in most cases, significantly outperform each method we compare against, all using a single general-purpose configuration of our approach.

Index Terms—adaptive filtering, meta learning, online optimization, learning to learn, deep learning

I. INTRODUCTION

ADAPTIVE signal processing and adaptive filter theory are cornerstones of modern signal processing and have had a deep and significant impact on modern society. Applications of adaptive filters (AF) include audio processing, telecommunications, biomedical sensing, astrophysics and cosmology, seismology, and more. Audio applications, in particular, are of exceptional importance and find utility in many problems such as single- and multi-channel system identification, echo cancellation, prediction, dereverberation, beamforming, noise cancellation, and beyond. AFs typically operate via online iterative optimization methods, such as least mean square filtering (LMS), normalized LMS (NLMS), recursive least squares (RLS), or Kalman filtering (KF), to solve streaming or online optimization problems [1]–[6] and process signals in unknown and/or nonstationary environments.

AF tasks are often grouped into one of four core categories: system identification, inverse modeling, prediction, and interference cancellation [4]. Each of these categories has numerous AF applications, and advances in one category or application can often be applied to many others. In the audio domain, acoustic echo cancellation (AEC) can be formulated as single- or multi-channel system identification and has been studied extensively [7]–[18]. Equalization can be formulated as an inverse modeling problem, has been explored in single- and multi-channel formats, and is particularly useful for sound zone reproduction and active noise control [19]–[24]. Dereverberation can be formulated as a prediction problem and has been studied considerably in this context [25]–[31]. Finally, multi-microphone enhancement or beamforming can be formulated as an informed interference cancellation task and also has a breadth of associated algorithms [32]–[38].

When we consider AFs in the context of deep neural networks (DNNs), we note two main observations. First, AFs continue to be used extensively, but mostly via hybrid approaches that combine neural networks with standard AF algorithms. Second, the underlying AF algorithms and the tools for developing new AFs have changed little in several decades. Hybrid approaches, however, have proven very successful. For example, in AEC, neural networks can be trained for residual echo suppression, noise suppression, reference estimation, and more [39]–[47]. In a similar vein, neural networks paired with AFs for active noise control tasks have shown impressive results [48]–[51]. For dereverberation, DNNs have proven useful for both online and offline approaches by directly estimating statistics of the dereverberated speech [52]–[56]. This pattern has repeated itself for beamforming applications, where DNNs have led to many performance improvements [57]–[60]. In many of these works, DNNs are trained to predict ratio masks or otherwise directly enhance/separate the desired signal. In essence, they act as a distinct module within a larger pipeline that also uses AFs.

In contrast, a small number of works more tightly couple neural networks and AFs and use DNNs for optimal control of AFs. Recently, it was shown that DNNs can estimate statistics to control step-sizes [61], [62] or estimate entire updates [63] for a single-channel AEC. Similarly, past work has used DNNs to predict updates for the internal statistics of multi-channel beamformers [64] and to learn source models for multi-channel source separation [65]. These works differ from hybrid approaches in that they leverage neural networks to update or control AFs directly and thus focus on improving the performance of AFs themselves. Such improvement can be leveraged in isolation or, in theory, together with complementary hybrid approaches.
More broadly, the idea of controlling AFs via neural networks is related to several disciplines of signal processing and machine learning including optimal control, optimization, automatic machine learning, reinforcement learning, and meta-learning. Relevant works within these areas include automatic selection of step sizes [7], [66], [67], the direct control of model weights [68]–[70], rapid fine-tuning [71], and meta-learning optimization rules [72]. Out of these works, learning optimization rules, or learned optimizers, is of critical relevance [73]–[75] and presents the idea of using one neural network as a function that optimizes another function. Such works, however, focus on creating learned optimizers for training neural networks in an offline setting, where the latter network is the final product and the learned optimizer is discarded (or otherwise used to train additional networks). Moreover, this work has had little application to AFs, except for our own work [63], which we extend here.

In this work, we formulate the development of AF algorithms as a meta-learning problem. We learn adaptive filter update rules directly from data using self-supervision and call our approach meta adaptive filtering (Meta-AF). Using our method, we no longer need humans to hand-derive update equations, do not need any supervised labeled data for learning, and do not need exhaustive tuning. To showcase our approach, we systematically develop learned AFs for exemplary applications of each of the four canonical AF architectures. Then, for each AF task, we compare our work to a suite of baselines and/or state-of-the-art approaches for the problems of system identification, acoustic echo cancellation, equalization, multi-channel dereverberation, and multi-channel interference cancellation (beamforming). For all tasks, we use identical hyperparameters, significantly reducing engineering and design time. We evaluate performance using signal-to-noise ratio (SNR)-like signal metrics and perceptual- or task-specific metrics as well as specific qualitative comparisons. Our results show we can learn high-performing AF algorithms that operate in real-time and, in most cases, outperform all methods we compare against.

The contributions of our work are as follows: 1) we present the first general-purpose method of meta-learning AF algorithms (update rules) directly from data via self-supervision (no labels required), 2) we apply our approach to all canonical AF architecture categories including system identification, inverse modeling, prediction, and (informed) interference cancellation, and 3) we show how a single hyperparameter configuration of our approach, trained with different datasets and losses, can outperform all common AF baselines and/or several past state-of-the-art methods we compare against according to one or more evaluation metrics and is suitable for real-time operation on commodity hardware. Compared to our previous work on single-channel single-talk AEC [63], we present several new improvements including 1) an improved loss, 2) additional inputs to our learned optimizer, 3) an updated development for multi-block, multi-channel AFs with coupled updates, and 4) extensive experimentation on four new applications. We release demos for each task and open source all code, including baselines. (For demos and code, please see https://jmcasebeer.github.io/projects/metaaf and https://github.com/adobe-research/MetaAF, respectively.)

II. BACKGROUND

A. Notation

We provide an overview of the symbols and operators we use in Table I. We denote scalars via lower-case symbols, column vectors via bold, lower-case symbols, and matrices via bold upper-case symbols. We use bracket indexing [τ] to denote time-varying signals and an underline to denote the time-domain counterpart of a frequency-domain (FD) symbol. We index FD rows via the subscript k, columns via the subscript m, and elements via the subscripts km.

TABLE I: Symbols and operators.

  Symbols              Description
  x ∈ R                A real-valued scalar
  x ∈ C                A complex-valued scalar
  x ∈ R (underlined)   A real-valued time-domain scalar
  x ∈ R^N (underlined) A time-domain N-dimensional column vector
  x ∈ C^N              A complex-valued N-dimensional column vector
  X ∈ C^{M×N}          A complex-valued M × N matrix
  x[τ]                 A time-varying column vector
  X[τ]                 A time-varying matrix
  w[τ]                 FD AF linear filter
  u[τ]                 FD AF input
  d[τ]                 FD AF target or desired response
  y[τ]                 FD AF estimated response
  s[τ]                 FD AF true desired signal
  I_N                  An N × N identity matrix
  0_{N×R}              N × R matrix of zeros
  1_{N×R}              N × R matrix of ones
  F_N                  N-point discrete Fourier transform (DFT) matrix

  Operators            Description
  ∗                    Convolution
  (·)^⊤                Transpose
  (·)^∗                Complex conjugate
  (·)^H                Hermitian transpose
  diag(·)              Vector to a diagonal matrix
  cat(···)             Column vector concatenation (vertical stack)
  E                    Expected value
  (·)/(·)              Element-wise division
  ‖·‖                  Euclidean norm
  |·|                  Element-wise magnitude of a complex value
  ∠                    Phase of a complex value
  ln                   Natural logarithm
  (·)^{−1}             Scalar or matrix inverse
B. Overlap-Save & Overlap-Add Filtering

We perform short-time (STFT) Fourier filtering using either overlap-save (OLS) or overlap-add (OLA) convolution. The OLS method uses block processing by splitting the input signal into overlapping windows and recombining complete non-overlapping components. We use u_m[t] ∈ R to represent the time-domain sample recorded by microphone m at discrete time t. We collect N such samples from microphone m to form the time-domain frame u_m[τ] = [u_m[τR − N + 1], ···, u_m[τR]] ∈ R^N, where τ is the frame index, N is the window length in samples, R is the hop size in samples, and O = N − R is the overlap between frames in samples. Finally, we collect samples from all M microphones to form a multi-channel time-domain signal, U[τ] = [u_1[τ], ···, u_M[τ]] ∈ R^{N×M}. We compute the corresponding FD representation via U[τ] = F_N U[τ] ∈ C^{K×M}, where K is the number of frequency bins, set to K = N for this work. We select the mth channel from a multi-channel FD representation using u_m[τ] ∈ C^K. We define the FD filter w_m[τ] ∈ C^K and the frequency and time output for the mth channel as

  y_m[τ] = diag(u_m[τ]) Z_w w_m[τ] ∈ C^K    (1)
  y_m[τ] = Z_y y_m[τ] ∈ R^R,    (2)

where Z_w = F_K T_{K/2}^⊤ T_{K/2} F_K^{−1} ∈ C^{K×K} and Z_y = T̄_R F_K^{−1} ∈ C^{R×K} are anti-aliasing matrices, T_{K/2} = [I_{K/2}, 0_{K/2×K/2}] ∈ R^{R×K} trims the last K/2 samples from a vector, and T̄_R = [0_{R×O}, I_R] ∈ R^{R×K} trims the first O samples.

The counterpart to OLS is OLA filtering, which computes the frequency output, time output, and buffer b_m[τ] as

  y_m[τ] = diag(u_m[τ]) w_m[τ] ∈ C^K    (3)
  y_m[τ] = T_R F_K^{−1} y_m[τ] + T̄_R b_m[τ − 1] ∈ R^R    (4)
  b_m[τ] = F_K^{−1} y_m[τ] + T_R^⊤ T̄_R b_m[τ − 1] ∈ R^K.    (5)

Here, T_R = [I_R, 0_{R×O}] ∈ R^{R×K}. Typically, the forward and inverse DFTs are combined with analysis and synthesis windows and optionally zero-padded. We use Hann windows [76]. For multi-channel, multi-block input, single-channel output FD filters, the OLS/OLA equations above are applied per channel, and anti-aliasing is applied per block. The per-frequency (anti-aliased) filter is w_k[τ] ∈ C^{BM} with B buffered frames and M channels all stacked. The filter input is similarly stacked and is u_k[τ] ∈ C^{BM}, thus requiring (1) and (3) to be modified.
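To make the OLS mechanics above concrete, the sketch below implements the anti-aliasing projection Z_w and one block of OLS filtering from (1)–(2) in JAX. It assumes K = N with a 50% hop (R = K/2); the function names are ours for illustration and are not from the released Meta-AF code.

```python
import jax.numpy as jnp

def ols_constrain(w):
    # Z_w of eq. (1): zero the last K/2 time-domain taps of the FD filter so
    # the per-bin product implements linear (not circular) convolution.
    K = w.shape[0]
    w_td = jnp.fft.ifft(w).at[K // 2:].set(0.0)
    return jnp.fft.fft(w_td)

def ols_block(u_frame, w):
    # One OLS block, eqs. (1)-(2): filter a length-K FD frame, then keep only
    # the last R = K/2 time samples (the T_bar_R trim).
    K = u_frame.shape[0]
    y_fd = u_frame * ols_constrain(w)          # diag(u) Z_w w
    y_td = jnp.real(jnp.fft.ifft(y_fd))
    return y_fd, y_td[K // 2:]
```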
C. Adaptive Filter Problem Formulation

We define an AF as an algorithm or optimizer that solves

  θ̂[τ] = arg min over θ[τ] of L(···)[τ]    (6)

via an additive update rule of the form

  θ[τ + 1] = θ[τ] + ∆[τ],    (7)

where θ̂[τ] is a set of estimated time-varying filter parameters. In this work, we focus exclusively on linear FD adaptive filters (FDAFs), where θ[τ] = {w[τ]} without loss of generality. Common losses include the mean-square error (MSE),

  L_MSE[τ] = E[‖e_m[τ]‖²] = E[‖d_m[τ] − y_m[τ]‖²],    (8)

the instantaneous square error (ISE),

  L_ISE[τ] = ‖e_m[τ]‖² = ‖d_m[τ] − y_m[τ]‖²,    (9)

and the weighted least squares error (WSE),

  L_WSE[τ] = Σ_{n=0}^{τ} γ^{τ−n} ‖d_m[n] − y_m[n]‖²,    (10)

where γ is a forgetting factor, m is a reference mic, and the output y_m is computed via (1) or (3) for FD signals or losses and (2) or (4) for time-domain signals or losses.

D. Conventional Optimizers

For audio AFs, it is common to leverage OLS or OLA filtering and solve (6) via optimization methods per frequency bin. So, we modify (7) to be

  w_k[τ + 1] = w_k[τ] + ∆_k[τ],    (11)

where the update ∆_k[τ] is per frequency k. We focus on three common conventional AF optimizers in this form, as well as a machine learning optimizer, to ground the development of our method in familiar, fundamental algorithms. For gradient methods, we use the partial derivative with respect to w_k[τ]^H.

1) Least Mean Square: The least mean square optimizer (LMS) is a stochastic gradient descent method that uses the ISE loss and gradient. The LMS update is

  ∆_k[τ] = −λ∇_k[τ],    (12)

where λ is the step-size and ∇_k[τ] is the gradient of the ISE. Note, LMS is stateless and only a function of the gradient.

2) Normalized Least Mean Square: The normalized LMS (NLMS) algorithm modifies LMS via a running normalizer based on the input power. The NLMS update is

  v_k[τ] = γ v_k[τ − 1] + (1 − γ)‖u_k[τ]‖²    (13)
  ∆_k[τ] = −λ ∇_k[τ] / v_k[τ],    (14)

where the division is element-wise, λ is the step-size, and γ is a forgetting factor.

3) Root Mean Squared Propagation: The root mean square propagation (RMSProp) optimizer [77] modifies NLMS by replacing v_k[τ] with a gradient-based per-element running normalizer, ν_k[τ], using forgetting factor γ as

  ν_k[τ] = γ ν_k[τ − 1] + (1 − γ)‖∇_k[τ]‖²    (15)
  ∆_k[τ] = −λ ∇_k[τ] / √(ν_k[τ]).    (16)

The value 1/√(ν_k[τ]) supplements the fixed step-size λ and acts as an adaptive learning rate, λ/√(ν_k[τ]).

4) Recursive Least Squares: The aim of the recursive least squares (RLS) algorithm is to exactly solve the AF loss, most commonly the WSE error. This is accomplished by expanding the weighted least squares error into a function of the weighted empirical signal covariance matrix, Φ_k[τ] = Σ_n γ^{N−n} u_k[n]u_k[n]^H, where the summation time-indices are application dependent (e.g. causal vs. non-causal implementations), and the (weighted) empirical cross-correlation vector z_k[τ] = Σ_n γ^{N−n} u_k[n]d_k[n]^H, to compute the exact solution to the resulting normal equations, Φ_k[τ]w_k[τ] = z_k[τ]. Running estimates of Φ_k[τ] and z_k[τ] are also commonly used. However, instead of performing repeated matrix inversion, the matrix inversion lemma is used. Thus, RLS can be implemented using a time-varying precision (inverse covariance) matrix P_k[τ] and the Kalman gain κ_k[τ],

  κ_k[τ] = P_k[τ − 1]u_k[τ] / (γ + u_k[τ]^H P_k[τ − 1] u_k[τ])    (17)
  P_k[τ] = (P_k[τ − 1] − κ_k[τ]u_k[τ]^H P_k[τ − 1]) / γ    (18)
  ∆_k[τ] = κ_k[τ](d_km[τ] − y_km[τ])^∗,    (19)

where γ is a forgetting factor, d_km[τ] and y_km[τ] are the desired and estimated signal at frequency k and reference microphone m, and the initialization of P_k[τ] is critical and commonly based on the input SNR. In the case of multi-block and/or multi-channel filters, there are multiple ways to formulate RLS, some of which differ from time-domain RLS. Common approaches include diagonalized RLS (D-RLS) and block diagonalized RLS (BD-RLS) as well as QR decomposition techniques [5], [78], [79]; they differ in which terms of the covariance (precision) matrix are modeled. D-RLS makes an independence assumption and optimizes each (k, m, b) filter tap separately, forming K diagonal BM × BM precision matrices. BD-RLS couples across frames and channels by forming K separate BM × BM precision matrices. In the case of single block/channel filters, D-RLS, BD-RLS, and NLMS reduce to the same algorithm with different parameterizations. In our case, we use identical BD-RLS implementations across all tasks.
When conventional optimizers are compared to each other, the order of performance commonly follows LMS, RMSProp, NLMS, and RLS, while the order of computational complexity is reversed. These algorithms, however, can be sensitive to tuning, nonstationarities, nonlinearities, and other issues that require engineering effort to mitigate failure cases and ensure stability. For multi-block BD-RLS filters, in particular, poor partition conditioning can also lead to degraded RLS performance [80] and/or stability issues compared to NLMS and other alternatives. For the purposes of this work, we exhaustively grid-search tune the hyperparameters on held-out validation sets of signals per task.

Beyond these basic optimizers, we also compare against several additional methods. For the AEC task, we compare against the double-talk robust diagonal Kalman filter (D-KF) [11], the open-source double-talk robust Speex algorithm [12], a weighted-RLS (wRLS) algorithm [81], and WebRTC-AEC3 [82]. The D-KF algorithm is recommended over other variants [15] and is a common AEC baseline [61]. In addition, the Speex and wRLS [81] algorithms were the linear AFs used (with a non-linear post-processor) in the first- [47] and second-place [81] winners of the ICASSP 2021 Acoustic Echo Cancellation Challenge [83], respectively. Since our work focuses on linear adaptive filters and can easily be combined with non-linear post-processors, we believe D-KF, wRLS, Speex, and WebRTC-AEC3 are reasonable baselines. For dereverberation, we compare against NARA-WPE [30], which is a highly effective normalized BD-RLS based optimizer [27], [30] and comparable to the original NTT implementation [84].

E. Related Work

Initial work on automatically tuning AFs has been explored in incremental delta-bar-delta [66], Autostep [67], and elsewhere. Recent machine learning work discusses the idea of using DNNs to learn entirely novel optimizer update rules from scratch [72]–[75]. We take inspiration from this work, but make numerous advances specific to AFs. In particular, past learned optimizers [72] are element-wise, offline, real-valued, only a function of the gradient, and are trained to optimize general-purpose neural networks. In contrast, we design online AF optimizers that use multiple input signals, are complex-valued, adapt block FD linear filters, and integrate domain-specific insights to reduce complexity and improve performance (coupling across channels and time). Moreover, we deploy learned optimizers to solve AF tasks as the end-goal and do not use them to train downstream neural networks. We also note recent work that uses a supervised DNN to control the step-size of a D-KF for AEC [61] and another that uses a supervised DNN to predict both the step-size and a nonlinear reference signal for AEC [62]. Compared to these, we replace the entire update with a neural network, do not need supervisory signals, and investigate many tasks.

III. PROPOSED METHOD

A. Problem Formulation

We formulate AF algorithm design as a meta-learning problem and train neural networks to learn AFs from data, creating meta-learned adaptive filters. This is in contrast to typical AFs that are manually created by human engineers. To do so, we define a learned optimizer, g_φ(·), as a neural network with one or more input signals parameterized by weights φ that optimizes an AF loss or optimizee L(h_θ(·), ···), or L for short, using an additive update rule

  θ[τ + 1] = θ[τ] + g_φ(·).    (20)

We then seek an optimal AF optimizer g_φ̂ over dataset D,

  φ̂ = arg min over φ of E_D[ L_M( g_φ, L(h_θ(·), ···) ) ],    (21)

where L_M is a functional that defines the meta (or optimizer) loss that is a function of g_φ and an AF loss L with one or more inputs and filtering function h_θ that itself has one or more inputs and parameters θ. Intuitively, when we solve (21), we learn a network g_φ(·) that solves the AF loss L when applied repeatedly via an additive update.

B. Optimizee Architecture & Loss

The optimizee, or the AF loss L that is optimized via (20), is a function of the filter or architecture h_θ(·). The filter can be any reasonable differentiable filtering operator such as time-domain FIR filters, lattice FIR filters, non-linear filters, FD filters, multi-delayed block FD filters [3], etc. Similarly, the AF loss can be any reasonable differentiable loss such as the MSE, ISE, WSE, a regularized loss, negative log-likelihood, mutual information, etc. For our work, we focus on single- and multi-channel multi-frame linear block FD filters h_θ applied via OLS or OLA with parameters θ_k[τ] = {w_k[τ] ∈ C^{BM}} with B buffered frames and M channels per frequency k. We also set the FDAF loss L[τ] to be the ISE via (9) with gradient computed with respect to w_k[τ]^H as ∇_k[τ] = u_k[τ](y_km[τ] − d_km[τ])^∗.

C. Optimizer Architecture & Loss

1) Architecture: Our optimizer g_φ is inspired by conventional AF optimizers such as LMS, NLMS, and BD-RLS, but updated to have a neural network form. In particular, we focus on making a generalized, stochastic variant of an RLS-like optimizer that is applied independently per frequency k to our optimizee parameters, but coupled across channels and time frames to allow our approach to model interactions between channels and frames and vectorize across frequency. To do so, we use a recurrent neural network (RNN) where the weights φ are shared across all frequency bins, but we maintain separate state ψ_k[τ] per frequency. The inputs to our optimizer at frequency k are ξ_k[τ] = [∇_k[τ], u_k[τ], d_k[τ], y_k[τ], e_k[τ]], where ∇_k[τ] is the gradient of the optimizee with respect to θ_k, and e_k[τ] = d_k[τ] − y_k[τ]. Our optimizer outputs are the update ∆_k[τ] and the internal state ψ_k[τ + 1], resulting in

  (∆_k[τ], ψ_k[τ + 1]) = g_φ(ξ_k[τ], ψ_k[τ])    (22)
  θ_k[τ + 1] = θ_k[τ] + ∆_k[τ].    (23)

Our design is in contrast to LMS-, NLMS-, and RMSProp-like optimizers that have no state (e.g. LMS) or minimal state dynamics (e.g. NLMS, RMSProp). In addition, these optimizers, as well as other learned optimizers [72], typically apply updates independently per element. For a comparison of optimizer inputs, state, parameters, and updates, please see Table II.

TABLE II: Relationship between AF optimizers.

  Optimizer   Inputs                      State     Params   ∆_k[τ]
  LMS         ∇_k[τ]                      ∅         λ        (12)
  NLMS        ∇_k[τ], u_k[τ]              v_k[τ]    λ, γ     (14)
  RMSProp     ∇_k[τ]                      ν_k[τ]    λ, γ     (16)
  BD-RLS      u_k[τ], d_k[τ], y_k[τ]      P_k[τ]    γ        (19)
  Ours        ξ_k[τ]                      ψ_k[τ]    φ        g_φ

To define our RNN in more detail, we use a small network composed of a linear layer, a nonlinearity, and two Gated Recurrent Unit (GRU) layers with hidden size H = 32, followed by two additional linear layers with nonlinearities, where all layers are complex-valued. We always re-scale the inputs element-wise via

  ln(1 + |ξ|)e^{j∠ξ}    (24)

to reduce the dynamic range and facilitate training, but keep the phases unchanged. This pre-processing was found useful in several previous works [63], [72], although previous work used explicit clipping, which we found unnecessary.

2) Loss: We examine two meta losses L_M(·) to learn our optimizer parameters φ. First, we define the FD frame independent loss

  L_M = ln (1/L) Σ_{n=τ}^{τ+L−1} E[‖d_m[n] − y_m[n]‖²],    (25)

where d_m[τ] and y_m[τ] are the desired and estimated FD signal vectors of the reference channel m (e.g. m = 0). Intuitively, to compute this loss for a given optimizer g_φ, we unroll (20) for a time horizon of L time frames, compute the FD mean-squared error over L frames, and then take the logarithm to reduce the dynamic range, which we found to empirically improve learning. This loss ignores the temporal order of AF updates and optimizes for filter coefficients that are unaware of any downstream STFT processing, but the idea of accumulating independent time-step losses is common [72].

Second, we define the time-domain frame-accumulated loss

  L_M = ln E[‖d̄_m[τ] − ȳ_m[τ]‖²]    (26)
  d̄_m[τ] = cat(d_m[τ], d_m[τ + 1], ···, d_m[τ + L − 1])    (27)
  ȳ_m[τ] = cat(y_m[τ], y_m[τ + 1], ···, y_m[τ + L − 1]),    (28)

where d_m[τ] and y_m[τ] are the time-domain desired and estimated responses of reference channel m and d̄_m[τ] ∈ R^{RL} and ȳ_m[τ] ∈ R^{RL}. Intuitively, to compute this loss for a given optimizer g_φ, we run (20) for a time horizon of L frames, concatenate the sequence of time-domain outputs and target signals to form longer signals, then compute the time-domain MSE loss, and take the logarithm. While both losses use the same time horizon, the frame-accumulated loss allows us to model boundaries between adjacent updates and implicitly learn updates that are STFT-consistent [85]. To the best of our knowledge, our frame-accumulated loss is novel for AFs.

3) Computational Complexity: We compare the computational complexity of our proposed approach to conventional optimizers in Table III. We note that the complexity of our approach is dependent on the hidden state size H of our RNN, but is linear in channels M and blocks B, whereas BD-RLS is quadratic in M and B, but does not depend on H. Note that prior work on learned optimizers, including our own [63], performs optimization completely element-wise, resulting in a larger complexity of O(KMBH²).

TABLE III: Optimizer complexity comparison.

  Optimizer   Big-O              ≈ CMACs
  LMS         O(KMB)             KMB
  NLMS        O(KMB)             5KMB
  RMSProp     O(KMB)             6KMB
  BD-RLS      O(K(MB)²)          K(4(MB)² + 5MB)
  Ours        O(K(H² + MBH))     K(12H² + (21 + 10MB)H)
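The sketch below illustrates the overall shape of g_φ: the magnitude-compression pre-processing of (24), a GRU whose weights are shared across bins while the state ψ_k is kept per bin, and a linear head that emits a complex update per bin as in (22)–(23). To stay short, it uses a single real-valued GRU over stacked real/imaginary features; the paper's actual optimizer uses two truly complex-valued GRU layers plus extra linear layers, so treat this as illustrative only. All names are ours.

```python
import jax
import jax.numpy as jnp

def preprocess(xi):
    # Eq. (24): compress magnitudes, keep phases.
    return jnp.log1p(jnp.abs(xi)) * jnp.exp(1j * jnp.angle(xi))

def gru_cell(p, h, x):
    # A standard real-valued GRU cell (one of several common variants).
    Wz, Wr, Wh, Uz, Ur, Uh = p
    z = jax.nn.sigmoid(x @ Wz + h @ Uz)
    r = jax.nn.sigmoid(x @ Wr + h @ Ur)
    h_new = jnp.tanh(x @ Wh + (r * h) @ Uh)
    return (1.0 - z) * h + z * h_new

def optimizer_step(params, psi, xi):
    # xi: (K, 5) complex inputs [grad, u, d, y, e]; psi: (K, H) per-bin state.
    gru_params, W_out = params
    x = preprocess(xi)
    x = jnp.concatenate([x.real, x.imag], axis=-1)   # (K, 10) real features
    psi = gru_cell(gru_params, psi, x)               # weights shared over bins
    out = psi @ W_out                                # (K, 2)
    delta = out[:, 0] + 1j * out[:, 1]               # complex update per bin
    return delta, psi                                # eq. (22)

# Toy initialization and one call.
K, H, F = 1025, 32, 10
ks = jax.random.split(jax.random.PRNGKey(0), 8)
gru_params = tuple(0.1 * jax.random.normal(k, s)
                   for k, s in zip(ks[:6], [(F, H)] * 3 + [(H, H)] * 3))
W_out = 0.01 * jax.random.normal(ks[6], (H, 2))
xi = jax.random.normal(ks[7], (K, 5), dtype=jnp.complex64)
delta, psi = optimizer_step((gru_params, W_out), jnp.zeros((K, H)), xi)
```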
D. Learning the Optimizer

To learn an optimizer g_φ from data, we solve (21) using standard deep learning tools (i.e. JAX [86], [87]), including the use of automatic differentiation for training and inference. In addition, we use truncated backpropagation through time (TBPTT) [88] with a stochastic gradient descent optimizer, Adam [89], that we call our meta optimizer. We show a simplified form of our training algorithm in Alg. 1 using our frame-accumulated loss and a batch size of one, where STFT is an OLA or OLS processor, GRAD returns the gradient of the first argument with respect to the second, SAMPLE randomly samples signals from a dataset D, and NEXTL grabs the next L buffers from a longer signal. In practice, we use batching.

Algorithm 1 Training algorithm.

  function FORWARD(g_φ, ψ, h_θ, U, d_m)
      for τ ← 0 to L do                                 ▷ Unroll
          U[τ], d[τ] ← STFT(U[τ], d_m[τ])               ▷ Forward STFT
          y_m[τ] ← h_θ(U[τ])                            ▷ Save filter output
          y_m[τ] ← STFT⁻¹(y_m[τ])                       ▷ Inverse STFT
          L ← ‖d_m[τ] − y_m[τ]‖²                        ▷ AF frame loss
          ∇[τ] ← GRAD(L, θ)                             ▷ Filter gradient
          for k ← 0 to K do                             ▷ Apply update per freq
              ξ_k[τ] ← [∇_k[τ], u_k[τ], d_k[τ], y_k[τ], e_k[τ]]
              (∆_k[τ], ψ_k[τ + 1]) ← g_φ(ξ_k[τ], ψ_k[τ])
              θ_k[τ + 1] ← θ_k[τ] + ∆_k[τ]
      ȳ ← CAT(y[τ] ∀τ)                                  ▷ Concatenate accumulated frames
      return ȳ, ψ, h_θ

  function TRAIN(D)
      φ ← [90] init
      while φ not CONVERGED do                          ▷ Train loop
          U, d_m ← SAMPLE(D)                            ▷ Sample signals
          θ, ψ ← 0, 0                                   ▷ Init filter and optimizer state
          for n ← 0 to end do                           ▷ Loop across long signal
              Ū, d̄_m ← NEXTL(U, d_m)                    ▷ Get next L frames
              ȳ, ψ, h_θ ← FORWARD(g_φ, ψ, h_θ, Ū, d̄_m)
              L_M ← via (26)                            ▷ Meta loss
              ∇ ← GRAD(L_M, φ)                          ▷ Optimizer gradient
              φ[n + 1] ← METAOPT(φ[n], ∇)               ▷ Update optimizer
      return φ̂                                         ▷ Return best φ

In more detail, we train g_φ until the application-specific mean SNR metric on the validation fold of a dataset D has not improved for four epochs. For each of our five applications, we use (29), (30), (31), (36), and (45), respectively. We halve the step size after an epoch with no improvement and define an epoch as 10 passes through the dataset with a batch size of 32. We have a hard stop for training at 100 epochs. From initial experimentation, we note the choice of the meta-optimizer and meta-optimizer parameters has a large impact on performance. We initialize φ via [90] and, when using Adam, we found it was important to use a small learning rate of 10⁻⁴ and a large momentum term of .99. (We use a custom implementation of Adam in our source code due to a complex-value error issue in the JAX implementation.) To help stabilize training, we use gradient clipping with a clipping value of 10. We use identical g_φ settings for each task with input and output sizes set to M·B·5 and M·B, respectively, all intermediate layer sizes set to 32, and separate ReLU nonlinearities for real and imaginary parts.

Compared to our past work on single-channel single-talk AEC [63], we 1) change our loss from (25) without the log to (26), 2) change the input of our learned optimizer from ∇_k[τ] to ξ_k[τ] with updated pre-processing (24), and 3) develop multi-channel, multi-block Meta-AFs that perform time- and block-coupled updates instead of separate, independent updates per frequency, block, and channel. To validate our approach, we apply our algorithm to five audio tasks including system identification, acoustic echo cancellation, equalization, single/multi-channel dereverberation, and beamforming in Section IV, Section V, Section VI, Section VII, and Section VIII, respectively, and compare against conventional methods. For our first task, we also ablate key design decisions and then lock a single configuration for all remaining tasks and experiments.
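A self-contained sketch of the meta-training step of Alg. 1 is shown below: we unroll L adaptive-filter updates, evaluate an accumulated log-MSE in the spirit of (26) (computed directly on FD frames here for brevity, rather than on overlap-added time-domain signals), and backpropagate through the unroll. To keep it tiny, the "learned optimizer" is a single meta-learned scalar step size applied to the ISE gradient rather than the GRU network of Section III-C, and the meta step is plain gradient descent instead of Adam. Everything here is illustrative, not the paper's API.

```python
import jax
import jax.numpy as jnp

def apply_filter(w, u):
    # Toy single-tap FD filter per bin: y_k = w_k^H u_k.
    return jnp.conj(w) * u

def af_grad(w, u, d):
    # ISE gradient with respect to w^H: u_k (y_k - d_k)^*.
    return u * jnp.conj(apply_filter(w, u) - d)

def unrolled_loss(phi, w, frames, targets):
    # FORWARD of Alg. 1: unroll L frames, collecting filter outputs.
    outs = []
    for u, d in zip(frames, targets):
        outs.append(apply_filter(w, u))
        w = w - phi * af_grad(w, u, d)            # additive update, eq. (20)
    err = jnp.stack(targets) - jnp.stack(outs)
    return jnp.log(jnp.mean(jnp.abs(err) ** 2))   # accumulated log-MSE

# One meta step: differentiate the unrolled loss wrt the optimizer parameter.
K, L = 8, 4
frames = list(jax.random.normal(jax.random.PRNGKey(0), (L, K), dtype=jnp.complex64))
targets = [2.0 * f for f in frames]               # toy "true system": a gain of 2
phi, w0 = 0.1, jnp.zeros(K, dtype=jnp.complex64)
meta_grad = jax.grad(unrolled_loss)(phi, w0, frames, targets)
phi = phi - 1e-2 * meta_grad                      # the paper uses Adam here
```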
IV. SYSTEM IDENTIFICATION ABLATIONS

A. Problem Formulation

For our first task, we train a Meta-AF to perform online system identification (ID). We seek to estimate the transfer function between an audio source and a microphone over time, as shown in Fig. 1. This is commonly done in room acoustics and head-related transfer function measurements for virtual and augmented reality. To do so, we model the unknown system (e.g. an acoustic room) with a linear frequency-domain filter h_θ (optimizee architecture), inject input signal u into the system, measure the response d, and adapt the filter weights θ = {w_k} using our learned Meta-ID AF, g_φ. The AF loss is the ISE between the desired response, d_k[τ], and the AF predicted response, y_k[τ] = w_k[τ]^H u_k[τ].

Fig. 1: System identification block diagram. System inputs are fed to both the adaptive filter and the true system (shaded box). The adaptive filter is updated to mimic the true system.

B. Experimental Design

We 1) ablate key design decisions of our AF optimizer architecture and loss in Section IV-C and 2) study the robustness of our approach to modeling errors by changing the AF and true system order at test time in Section IV-D. For evaluation metrics, we use the segmental SNR (SNR_d) between the desired and estimated signals in dB. We compute this per-frame as

  SNR_d(d[τ], y[τ]) = 10 · log₁₀( ‖d[τ]‖² / ‖d[τ] − y[τ]‖² ),    (29)

where higher is better. We train g_φ via Algorithm 1 using a single GPU to adapt an OLS filter with a rectangular window of size N = 2048 and hop size R = 1024 on 16 kHz audio. For training data, we created a dataset by convolving the far-end speech from the single-talk portion of the ICASSP 2021 AEC Challenge data [91] with room impulse responses (RIRs) from [92] truncated to 1024 taps.

C. Optimizer Design Results & Discussion

In Fig. 2, we change one aspect of our final design at a time and show how each change negatively affects performance. Our final design, L = 16, is colored light blue, alternative configurations are colored differently, and our previous work [63] is approximately equivalent to a no-log, no-accumulation, no-extra-inputs setting. After this ablation, we fix these values for all remaining experiments and do not perform any further tuning except changing the dataset used for training and using different checkpoints caused by early stopping on validation performance. In contrast, we re-tune all conventional optimizer baselines for all subsequent tasks on held-out validation sets.

Fig. 2: Optimizer design decision ablation (median and converged SNR_d in dB per configuration). Using an accumulated log-loss leads to our best model, particularly for more complex tasks we address later on. RLS-like inputs are also useful. The exact value of L is not critical, but larger is better.

1) Loss Function: First, we compare our selected frame-accumulated loss model (light blue) to the frame-accumulated loss without log scaling (black) as well as the frame independent loss (yellow) and without log scaling (light purple). As shown, the log-loss has the single largest effect on SNR_d and yields an astounding 11.7/7.3 dB improvement compared to the no-log losses. When we compare the independent vs. accumulated loss, the accumulation provides a .2 dB improvement. However, when we listen to the estimated response, especially in more complex tasks, we found that the accumulated loss introduces fewer artifacts and perceptually sounds better. Thus, we fix the optimizer loss to be (26).

2) Input Features: Having selected the optimizer loss, we compare the model inputs. We compare setting the optimizer input to be only the gradient ξ_k[τ] = [∇_k[τ]] for an LMS-like learned optimizer (dark purple) and setting the optimizer input to be the full signal set ξ_k[τ] = [∇_k[τ], u_k[τ], d_k[τ], y_k[τ], e_k[τ]] for our selected RLS-like learned optimizer (light blue). As shown, the inputs have the second largest effect on SNR_d, and using the full signal set yields a notable 7.9 dB improvement. Thus, we set the input to be the full signal set.

3) AF Unroll: With the optimizer loss and inputs fixed, we evaluate four different values of AF unroll length, L = 2, 8, 16, 32, where L is the number of time-steps over which the optimizer loss is computed in (26). Intuitively, a larger unroll introduces less truncation bias but may be more unstable during training due to exploding or vanishing gradients. The case where L = 2 corresponds to no unroll, since for L = 1 the meta loss is not a function of the optimizer parameters and yields a zero gradient. As shown, for no unroll (L = 2, grey), we get a reduction of the SNR_d by 1.8 dB compared to our best model. When selecting the unroll between 8, 16, and 32, however, there is a small (< 1 dB) overall effect. That said, we find that longer unroll values work better in challenging scenarios but take longer to train. As a result, we select an unroll length of 16, as it represents a good trade-off between performance and training time. Note, the unroll length only affects training and is not used at test time.

D. System Order Modeling Error Results & Discussion

Given our fixed set of optimizer and meta-optimizer parameters, we evaluate the robustness of our Meta-ID AF to modeling errors by studying what happens when we use an optimizee filter that is too short or too long compared to the true system. We do so by testing a learned optimizer with 1) optimizee filter lengths between 256 and 4096 taps and 2) held-out signals with true filter lengths between 256 and 4096 taps, as well as full-length systems (up to 32,000 taps).

Results are shown in Fig. 3. We measure performance by computing the segmental SNR_d score via (29) at convergence. As expected, when the adaptive filter order is equal to or greater than the true system order, we achieve SNRs of ≈ 40 dB. It is interesting to note that our learned AF was only ever trained on optimizee filters with an order of 1024 and 1024-tap true systems. This experiment suggests our learned optimizers can generalize to new optimizee filter orders.

Fig. 3: Evaluating the effect of different true system orders (256 to 4096 taps) and adaptive filter orders (256 to 4096 taps) on converged SNR_d. Our learned AFs can operate well across a variety of linear system orders, even when training is restricted to systems of a fixed length (1024 taps).

V. ACOUSTIC ECHO CANCELLATION ABLATION

A. Problem Formulation

In our second task, we train a Meta-AF for acoustic echo cancellation (AEC). The goal of AEC is to remove the far-end echo from a near-end signal for voice communication by mimicking a time-varying transfer function, as shown in Fig. 4. The far-end refers to the signal transmitted to a local listener and the near-end is captured by a local mic. We model the unknown transfer function between the speaker and mic with a linear multi-delay frequency-domain filter h_θ, measure the noisy response d, which includes the input signal u, noise n, and near-end speech s, and adapt the filter weights θ using our learned Meta-AEC AF, g_φ. The time-domain signal model is d[t] = u[t] ∗ w + n[t] + s[t]. The AF loss is the ISE between the near-end and the predicted response. The FDAF near-end speech estimate is ŝ_k[τ] = d_k[τ] − w_k[τ]^H u_k[τ].

Fig. 4: AEC block diagram. System inputs are fed to the adaptive filter and true system (shaded box). The adaptive filter is updated to mimic the true system. The desired response can be noisy due to near-end noise and speech (n[τ], s[τ]).
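As an illustration of the multi-delay filtering just described, the sketch below maintains a stacked buffer of B past FD far-end frames per bin and forms the near-end estimate ŝ_k[τ] = d_k[τ] − w_k[τ]^H u_k[τ]. Shapes and names are our assumptions for illustration.

```python
import jax.numpy as jnp

def push_frame(u_buf, u_new):
    # u_buf: (B, K) stacked FD far-end frames, newest first.
    return jnp.concatenate([u_new[None, :], u_buf[:-1]], axis=0)

def near_end_estimate(w, u_buf, d):
    # w: (B, K) multi-block filter; echo_k = sum_b w_bk^H u_bk per bin.
    echo = jnp.sum(jnp.conj(w) * u_buf, axis=0)
    return d - echo                            # s_hat_k[tau], the AEC output
```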
B. Experimental Design

We compare our approach to LMS, RMSProp, NLMS, BD-RLS, a diagonal Kalman filter (D-KF) model [11], and Speex [12] for a variety of acoustic echo cancellation scenarios. Our scenarios, in increasing difficulty, include single-talk, double-talk, double-talk with a path change, and noisy double-talk with a path change and nonlinearity. Single-talk refers to the case where only the far-end input signal u is active. Double-talk refers to the case where both the far-end signal u and near-end talker signal s are active at the same time. A path change refers to the case where the true system transfer function is abruptly changed (e.g. during a phone call). Nonlinearities refer to the case where the true system is not strictly linear (e.g. harmonic loudspeaker distortion). We train a single g_φ for AEC and then test it for each scene type against all hyperparameter-tuned baselines.

We measure echo cancellation performance using segmental echo-return loss enhancement (ERLE) [93] and short-time objective intelligibility (STOI) [94]. Segmental ERLE is

  ERLE(d_u[τ], y[τ]) = 10 · log₁₀( ‖d_u[τ]‖² / ‖d_u[τ] − y[τ]‖² ),    (30)

where d_u is the noiseless system response and higher values are better. When averaging, we discard silent frames using an energy-threshold VAD. In scenes with near-end speech, we use STOI ∈ [0, 1] to measure the preservation of near-end speech quality. Higher STOI values are better. We train g_φ via Algorithm 1 using one GPU, which took ≈ 72 hours. We use a four-block multi-delay OLS filter (MDF) with window sizes of N = 1024 samples and a hop of R = 512 samples on 16 kHz audio. In double-talk scenarios, we use the noisy near-end, d, as the target and do not use oracle cancellation results (such as d_u).

With respect to datasets for single-talk, double-talk, and double-talk with path-change experiments, we re-mix the synthetic fold of the ICASSP 2021 AEC Challenge dataset (ICASSP-2021-AEC) [91] with impulse responses from [92]. We partition [92] into non-overlapping train, test, and validation folds and set the signal-to-echo ratio randomly between [−10, 10] with uniform distribution. To simulate a scene change, we splice two files such that the change occurs randomly between seconds four and six. For the noisy double-talk with nonlinearity experiments, we use the synthetic fold of [91]. We apply a random circular shift and random scale to all files, each ten seconds long. For each task, there are 9000 training, 500 validation, and 500 test files. Finally, we also use an unmodified version of the ICASSP-2021-AEC training, validation, and test set (does not include scene changes) to compare to other previously published works directly.

C. Results & Discussion

Overall, we find that our approach significantly outperforms all previous methods in all scenarios, but has a larger advantage in harder scenes; more details are discussed below.

1) Single-Talk: Our approach (light blue, x) exhibits strong single-talk performance and surpasses all baselines by ≈ 3 dB or more in both median and converged ERLE, as shown in Fig. 5. Additionally, Meta-AEC converges fastest, reaching steady-state ≈ 4 seconds before other baselines.

Fig. 5: AEC single-talk performance (median ERLE in dB, and ERLE over time). Meta-AEC converges rapidly and has better steady-state performance. We use this same legend for all AEC plots.

2) Double-Talk: Our method (light blue, x) converges fastest in double-talk and matches the D-KF in converged performance, as shown in Fig. 6. Meta-AEC converges ≈ 5 seconds faster while scoring better in STOI. This result is striking, as it is typically necessary to either explicitly freeze adaptation via double-talk detectors or implicitly freeze adaptation via carefully derived updates, as found in both the D-KF model (dark blue, down triangle) and Speex (green, up triangle). We hypothesize our method automatically learns how to adapt in double-talk in a completely autonomous fashion.

Fig. 6: AEC double-talk performance (median STOI, and ERLE over time). Meta-AEC converges fastest and has similar peak performance to the D-KF, while preserving near-end speech quality.

3) Double-Talk with Path Change: Our method (light blue, x) is able to more robustly handle double-talk with path changes compared to other methods, as shown in Fig. 7. Similar to straight double-talk, our approach effectively learns how to deal with adverse conditions (i.e. a path change) without explicit supervision, converging and reconverging in ≈ 2.5 seconds, with .044 better median STOI. All other algorithms similarly struggle, even Speex (green, up triangle), which has explicit self-resetting and dual-filter logic.

Fig. 7: AEC double-talk with a path change (shaded region) performance. Our approach re-converges rapidly with high speech quality.
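For reference, a sketch of the segmental ERLE metric (30) with the energy-threshold VAD mentioned in Section V-B is below. The frame length, threshold value, and median aggregation are our assumptions about reasonable settings, not the paper's exact evaluation code.

```python
import numpy as np

def segmental_erle(d_u, y, frame=512, vad_db=-40.0):
    # Eq. (30): d_u is the noiseless echo, y the estimated echo (time domain).
    peak = np.max(np.abs(d_u)) + 1e-12
    scores = []
    for i in range(0, len(d_u) - frame + 1, frame):
        du_f = d_u[i:i + frame]
        res_f = du_f - y[i:i + frame]
        # Energy-threshold VAD: skip frames far below the global peak.
        if 20.0 * np.log10(np.max(np.abs(du_f)) / peak + 1e-12) < vad_db:
            continue
        scores.append(10.0 * np.log10(np.sum(du_f ** 2) / (np.sum(res_f ** 2) + 1e-12)))
    return float(np.median(scores))
```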
4) Noisy Double-Talk with Nonlinearities & Path Change: When we evaluate scenes with noise, nonlinearities to simulate loudspeaker distortion, and path changes, we find that our Meta-AEC approach (light blue, x) continues to perform well, as shown in Fig. 8. That is, our peak performance is ≈ 2 dB above the nearest baseline. In STOI, Meta-AEC scores .027 above Speex (green, up triangle). We hypothesize that our approach effectively learns to compensate for the signal model inaccuracy, even if we only use a linear filter.

Fig. 8: AEC noisy double-talk with nonlinearities and a path change (shaded region) performance. Meta-AEC learns to compensate for the nonlinearity.

5) ICASSP 2021 AEC Challenge Results: In addition to testing with our own variant of the ICASSP-2021-AEC dataset that includes scene changes, we test our work with an unmodified version of the test set in Table IV. Furthermore, we evaluate performance when we train (or tune) on our custom training dataset versus when we train on the original training dataset (denoted with ★). See also a similar table in past work [81]. To the best of our knowledge, this dataset is the most recent and widely used dataset for benchmarking AEC algorithms. Here, results from WebRTC-AEC3 and wRLS, β = 0.2 come from past work [81]. All other methods have the same MDF filtering architecture as described above. Our approach outperforms all methods we compare against for both training datasets, including Speex and wRLS, which were the linear filters used in the first- and second-place winners of the ICASSP 2021 AEC Challenge [83]. Interestingly, there is a significant effect of training or tuning on data that includes scene changes (ours) vs. the original data (e.g. RMSProp and D-KF [61]). That is, because the ICASSP-2021-AEC train and test set does not include scene changes, most algorithms give better performance when trained/tuned on the matching, unmodified train set, even though such results are less realistic.

TABLE IV: ICASSP-2021-AEC test set linear filter results. Our proposed method outperforms several past comparable linear-filtering approaches. A ★ denotes results when models were trained/tuned on the ICASSP-2021-AEC data.

  Method                 STOI     STOI★    ERLE (dB)   ERLE★ (dB)
  LMS                    0.794    0.794    0.937       0.560
  RMSProp                0.802    0.856    1.02        4.63
  NLMS                   0.854    0.861    4.81        5.96
  BD-RLS                 0.829    0.835    3.66        4.07
  D-KF [11]              0.817    0.875    1.98        6.55
  wRLS, β = 0.2 [81]     N.A.     0.85     N.A.        N.A.
  WebRTC-AEC3 [82]       0.82     N.A.     N.A.        N.A.
  Speex [12]             0.869    N.A.     3.92        N.A.
  Meta-AEC               0.881    0.899    7.73        9.13

6) Computational Complexity: Our learned AF has a single CPU core real-time factor (RTF, computation time over signal time) of ≈ 0.36 and 32 ms latency (OLS hop size). Our optimizer network alone has ≈ 14K complex-valued parameters and a single CPU core RTF of ≈ 0.31. While this performance is already real-time capable, we suspect it could easily be improved with better optimized code.

VI. EQUALIZATION ABLATION

A. Problem Formulation

For our third task, we train a Meta-AF for the inverse modeling task of equalization (EQ). Here, our goal is to estimate the inverse of an unknown transfer function while only observing inputs and outputs of the forward system, as shown in Fig. 9. This is a common component of loudspeaker tuning. We model the unknown inverse transfer function with a linear frequency-domain filter h_θ, measure the response d to an input signal u, and adapt the filter weights θ using our learned Meta-EQ AF, g_φ. The AF loss is the ISE between the true and predicted responses. More precisely, the frequency-domain AF output is y_k[τ] = w_k[τ]^H u_k[τ].

Fig. 9: Inverse modeling block diagram. System outputs are fed to the adaptive filter. The adaptive filter is continually updated to invert the unknown system (shaded box).
B. Experimental Design

We compare our Meta-EQ approach to LMS, RMSProp, NLMS, and D-RLS on the task of frequency equalization. Additionally, we ablate the equalization filtering mechanics for two cases: constrained and unconstrained filters (optimizee architecture modifications). In the constrained case, we set h_θ to use standard OLS. However, in the unconstrained case, we set h_θ to use aliased OLS where Z_w = I_K. This comparison lets us test if Meta-EQ is automatically learning constraint-aware update rules. We train a new g_φ for each case (no separate tuning) and tune all baselines for each case.

We measure performance with signal SNR_d and system SNR_w. We define these as

  SNR_d(d, y) = 10 · log₁₀( ‖d‖² / ‖d − y‖² )    (31)
  SNR_w(ŵ⁻¹, w⁻¹) = 10 · log₁₀( ‖|w⁻¹|‖² / ‖|w⁻¹| − |ŵ⁻¹|‖² ),    (32)

respectively, where higher is better. We compute SNR_w using the inverse system magnitude, which ignores the phase. We train g_φ via Algorithm 1 on one GPU, which took at most 36 h. We use an OLS filter with a window size of N = 1024 samples and a hop of R = 512 samples on 16 kHz audio.

To construct the equalization dataset, we use speech from the DAPS dataset [95], take the cleanraw recordings as inputs, and apply random equalizer filters from the sox library to generate the outputs, where we randomly pick between [5, 15] filters with settings c ∈ [1, 8] kHz, g ∈ [−18, 18], and q ∈ [.1, 10]. All values are sampled uniformly at random to produce 16,384 train, 2048 validation, and 2048 test signals, all 5 seconds long. At train, validation, and test time, we truncate the system response to 512 taps.

C. Results & Discussion

We find our approach (blue, solid) outperforms LMS, RMSProp, NLMS, and D-RLS for our equalization task by a noticeable margin, as shown in Fig. 10, and further verify this with a qualitative analysis plot in Fig. 11.

Fig. 10: Equalization results for signal (SNR_d) and system (SNR_w) SNR, in both aliased and anti-aliased OLS configurations. Meta-EQ performance is the least impacted by constraints.

1) Constrained vs. Unconstrained: For the unconstrained case, our method outperforms D-RLS in SNR_d by .75 dB and by 4.67 dB in SNR_w. When we look at the constrained case, the performance for all models is degraded. Interestingly, however, our performance is proportionally degraded the least. We hypothesize that our approach learns to perform updates which are aware of the constraint.

2) Temporal Performance Analysis: We display final system and convergence results in Fig. 11. Our Meta-EQ model finds better solutions more rapidly than D-RLS. D-RLS diverged ≈ 300 times, but Meta-EQ never did.

Fig. 11: Comparison of true and estimated systems over time. The Meta-EQ system rapidly fits the correct inverse model. The top plot shows an example system (gain in dB vs. frequency in Hz) and the bottom shows SNR_w over time across the test set.

3) Computational Complexity: Our learned AF has a single CPU core RTF of ≈ 0.24 and 32 ms latency. Our optimizer network alone has ≈ 14K complex-valued parameters and a single CPU core RTF of ≈ 0.19.

VII. DEREVERBERATION ABLATION

A. Problem Formulation

For our fourth task, we train a Meta-AF to perform dereverberation via multi-channel linear prediction (MCLP) or the weighted prediction error (WPE) formulation, as is commonly used for speech-to-text pre-processing. The WPE formulation is based on the idea of being able to predict the reverberant part of a signal from a linear combination of past samples, most commonly in the frequency domain [25], [26], and is shown as a block diagram in Fig. 12. Using our method, we use a multi-channel linear frequency-domain filter h_θ and adapt the filter weights θ using a learned AF g_φ to minimize the normalized ISE AF loss below.

Fig. 12: Prediction block diagram. A buffer of past inputs is used to estimate a future, unknown signal. The delay, z^{−D}, signifies a delay by D frames.

Assuming an array of M microphones, we estimate a dereverberated signal with a linear model via

  ŝ_km[τ] = d_km[τ] − w_k[τ]^H u_k[τ],    (33)

where ŝ_km[τ] ∈ C is the current dereverberated signal estimate at frequency k and channel m, d_km[τ] ∈ C is the input microphone signal, w_k[τ] ∈ C^{BM} is a per-frequency filter with B time frames and M channels flattened into a vector, and u_k[τ] ∈ C^{BM} is a running flattened buffer of d_k[τ − D]. We then minimize a per channel and frequency loss

  L(ŝ_km[τ], λ_k[τ]) = ‖ŝ_km[τ]‖² / λ_k²[τ],    (34)
  λ_k²[τ] = 1/(M(B + D)) Σ_{n=τ−B−D}^{τ} d_k[n]^H d_k[n],    (35)

where λ_k²[τ] is a running average estimate of the signal power and d_k[τ] ∈ C^M. We use this formulation within our framework to perform online multi-channel dereverberation, or Meta-WPE, and focus on dereverberating a single output channel.
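A small sketch of the WPE quantities (33)–(35) used by Meta-WPE is shown below: the per-bin dereverberated estimate and the power-normalized ISE loss. Shapes and names are illustrative assumptions.

```python
import jax.numpy as jnp

def wpe_estimate(w, u_buf, d_now):
    # Eq. (33): w, u_buf are (B*M,) stacked per-bin vectors; d_now is d_km[tau].
    return d_now - jnp.vdot(w, u_buf)          # vdot conjugates w, i.e. w^H u

def wpe_loss(s_hat, d_hist, B, D, M):
    # Eqs. (34)-(35): ISE normalized by an average power over the last B+D frames.
    lam2 = jnp.sum(jnp.abs(d_hist) ** 2) / (M * (B + D))   # d_hist: (B+D, M)
    return jnp.abs(s_hat) ** 2 / lam2
```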
11

with B time frames and M channels flattened into a vector, -1.26 0.719

M=1 M=4 M=8


and uk [τ ] ∈ CBM is a running flattened buffer of dk [τ − D]. 7.82 0.850

Num. Mics.
We then minimize a per channel and frequency loss -0.77 0.727
8.90 0.857
kŝkm [τ ]k2 -1.70 0.679
L(ŝkm [τ ], λk [τ ]) = , (34) 11.99 0.863
λ2k [τ ]
0 10 0.6 0.8 1.0
τ SRR (dB) Median STOI
1 X
λ2k [τ ] = dk [n]H dk [n], (35) 15
M (B + D)

Median SRR (dB)


n=τ −B−D
10
where λ2k [τ ] is a running average estimate of the signal power
5
and dk [τ ] ∈ CM . We use this formation within our framework NARA-WPE:1 NARA-WPE:4 NARA-WPE:8
Meta-WPE:1 Meta-WPE:4 Meta-WPE:8
to perform online multi-channel dereverberation or Meta-WPE 0
and focus on dereverberating a single output channel. 0.0 1.1 2.2 3.4 4.5
Time (Seconds)
B. Experimental Design
Fig. 13: Dereverberation performance in terms of SRR. Meta-
We compare our Meta-WPE to frame-online NARA- WPE excels in SRR, a metric which measures energy re-
WPE [30], a BD-RLS based AF which uses the WPE formu- moved. However, in STOI, Meta-WPE scores worse.
lation. We ablate the filter size and inputs across: M = 1, 4, 8
microphones (optimizee size and input modification). We seek
to test if our method can scale from single- to multi- channel
tasks without modification. We train a new gφ for each M (no
tuning). We measure performance with two metrics, segmental
speech-to-reverberation ratio (SRR) [31] and STOI. SRR is
a signal level metric and measures how much energy was
removed from the signal. It is computed as,
kŝk [τ ]k2
 
SRR(dk [τ ], ŝk [τ ]) = 10 · log10 , (36)
kdk [τ ] − ŝk [τ ]k2
where smaller values indicate more removed energy and better Fig. 14: Informed interference cancellation block diagram. An
performance. STOI is computed between the dereverberated auxiliary signal is used as input to an adaptive filter which is
signal estimate and the ground truth anechoic signal. We train fit to an alternate signal.
gφ via Algorithm 1 on two GPUs, which took at most 24 h.
We use an OLA filter with a Hann window size of N = 512
samples and a hop of R = 256 samples on 16 kHz audio. We perceptually pleasing processing. We re-ran these experiments
fix the buffer size B = 5 taps and the delay to D = 2 frames. with a buffer of size B = 10 as well as with larger and
We use the simulated REVERB challenge dataset [96]. The smaller optimizer network capacities and found this trend
REVERB challenge contains echoic speech mixed with noise was consistent. As a result, we conclude our approach is very
at 20 dB in small (T60 = .25 sec), medium (T60 = .5 sec) and effectively improving the online optimization of the target
large (T60 = .7 sec) rooms at near and far distances. The array loss, but the instantaneous loss itself needs to be changed to
is circular with a diameter of 20 cm. Background noises are better align with perception.
generally stationary. The dataset has 7861 training files, 1484 2) Computational Complexity: The 4 channel learned AF
validation files, and 2176 test files. has a single CPU core RTF of ≈ 0.47, and 16 ms latency.
Our Meta-WPE optimizer network alone has ≈ 17K complex-
valued parameters and single CPU core RTF of ≈ 0.38.
C. Results & Discussion
Fig. 13: Dereverberation performance in terms of SRR. Meta-WPE excels in SRR, a metric which measures energy removed. However, in STOI, Meta-WPE scores worse.

We find our approach (blue, solid) outperforms NARA-WPE in SRR across all filter configurations, but is worse in STOI, as shown in Fig. 13. We discuss this below.

1) Overall and Temporal Performance: As shown in Fig. 13, Meta-WPE (blue, solid) scores strongly on SRR, where our single-channel Meta-WPE model scores better than the 4 and 8 channel NARA-WPE (red, dotted) models. However, as shown by STOI, the perceptual quality is poor. While Meta-WPE is solving the prediction task more rapidly, as shown by segmental SRR, it is not doing so in a perceptually pleasing manner. Previous studies [31], [97] have also encountered this phenomenon and propose a variety of regularization tools to align the instantaneous optimization objective with perceptually pleasing processing. We re-ran these experiments with a buffer of size B = 10 as well as with larger and smaller optimizer network capacities and found this trend was consistent. As a result, we conclude our approach is very effectively improving the online optimization of the target loss, but the instantaneous loss itself needs to be changed to better align with perception.

2) Computational Complexity: The 4 channel learned AF has a single CPU core RTF of ≈ 0.47 and 16 ms latency. Our Meta-WPE optimizer network alone has ≈ 17K complex-valued parameters and a single CPU core RTF of ≈ 0.38.

VIII. BEAMFORMING ABLATION

A. Problem Formulation

For our fifth and final task, we train a Meta-AF for interference cancellation using the minimum variance distortionless response (MVDR) beamformer. The MVDR beamformer can be implemented as an AF using the generalized sidelobe canceller (GSC) [38] formulation and is commonly used for far-field voice communication and speech-to-text pre-processing. We depict a version of this problem setup in Fig. 14 and use a linear frequency-domain filter hθ. We use the mixture d[τ] as the target, informed input u[τ], and adapt the filter weights θ[τ] using our learned Meta-GSC AF gφ and the ISE AF loss.

Fig. 14: Informed interference cancellation block diagram. An auxiliary signal is used as input to an adaptive filter which is fit to an alternate signal.
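To make the adaptation loop behind Fig. 14 concrete, here is a schematic sketch of one possible frame-by-frame realization. The helper names are hypothetical, and plain gradient descent stands in for the learned optimizer gφ; in Meta-AF, an optimizer network maps gradients and state to updates instead.

```python
import jax
import jax.numpy as jnp

def apply_filter(theta, u):
    # Linear filter applied to the informed (auxiliary) input frame.
    return jnp.dot(theta, u)

def af_loss(theta, u, d):
    # Instantaneous squared-error AF loss on the residual e = d - y.
    return (d - apply_filter(theta, u)) ** 2

def run_online_af(u_frames, d_frames, theta, lr=0.05):
    """u_frames: (T, L) informed inputs; d_frames: (T,) target mixture."""
    outputs = []
    for u, d in zip(u_frames, d_frames):
        outputs.append(d - apply_filter(theta, u))  # interference-reduced output
        grad = jax.grad(af_loss)(theta, u, d)       # self-supervised gradient
        theta = theta - lr * grad                   # stand-in for the g_phi update
    return jnp.stack(outputs), theta
```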
Assuming an array of M microphones, we have the time-domain signal model for mic m via

$$u_m[t] = r_m[t] * s[t] + n_m[t], \tag{37}$$

where u_m[t] ∈ ℝ is the input signal, n_m[t] ∈ ℝ is the noise signal, s[t] ∈ ℝ is the target signal, and r_m[t] ∈ ℝ is the impulse response from the source to mic m. In the time-frequency domain with a sufficiently long window, this can be reformulated as

$$u_{km}[\tau] = r_{km}[\tau]\, s_k[\tau] + n_{km}[\tau]. \tag{38}$$
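For intuition, Eq. (37) can be simulated directly; the sketch below uses random stand-ins for the source, impulse responses, and noise, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, L = 4, 16000, 512                   # mics, samples, RIR length (assumed)
s = rng.standard_normal(T)                # target source s[t]
r = 0.1 * rng.standard_normal((M, L))     # impulse responses r_m[t]
n = 0.01 * rng.standard_normal((M, T))    # per-mic noise n_m[t]

# u_m[t] = (r_m * s)[t] + n_m[t], Eq. (37)
u = np.stack([np.convolve(r[m], s)[:T] + n[m] for m in range(M)])
print(u.shape)  # (4, 16000)
```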
The GSC beamformer also assumes access to a steering vector, v_k. While estimating the steering vector is well studied [38], it remains non-trivial for real-world applications. For our case, we assume access to a clean speech recording s_k[τ] and first compute

$$\Phi_k^{ss}[\tau] = \gamma \Phi_k^{ss}[\tau-1] + (1-\gamma)\left(s_k[\tau]\, s_k[\tau]^{\mathsf{H}} + \lambda I_M\right), \tag{39}$$

where Φ_k^{ss}[τ] ∈ ℂ^{M×M} is a time-varying estimate of the target signal spatial covariance matrix, γ is a forgetting factor, and λ is a regularization parameter. We then estimate the steering vector by computing the normalized first principal component of the target source covariance matrix,

$$\tilde{v}_k[\tau] = \mathcal{P}(\Phi_k^{ss}[\tau]), \tag{40}$$

$$v_k[\tau] = \tilde{v}_k[\tau] / \tilde{v}_{k0}[\tau], \tag{41}$$

where P(·) extracts the principal component and v_k[τ] ∈ ℂ^M is the final steering vector. We then use the steering vector to estimate a blocking matrix B_k[τ]. The blocking matrix is orthogonal to the steering vector and can be constructed as

$$B_k[\tau] = \begin{bmatrix} -\left[v_{k1}[\tau], \cdots, v_{kM}[\tau]\right]^{\mathsf{H}} / v_{k0}[\tau]^{\mathsf{H}} \\ I_{M-1 \times M-1} \end{bmatrix} \in \mathbb{C}^{M \times M-1}. \tag{42}$$
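A compact sketch of (39)-(42) follows. The eigendecomposition realizes the principal-component operator P(·), and the forgetting factor and regularizer values are illustrative assumptions.

```python
import numpy as np

def update_covariance(phi, s_k, gamma=0.9, lam=1e-4):
    # Eq. (39): exponentially weighted spatial covariance of the target.
    M = s_k.shape[0]
    return gamma * phi + (1 - gamma) * (np.outer(s_k, s_k.conj()) + lam * np.eye(M))

def steering_vector(phi):
    # Eqs. (40)-(41): principal eigenvector, normalized by its first entry.
    _, V = np.linalg.eigh(phi)     # Hermitian eigendecomposition, ascending order
    v = V[:, -1]                   # eigenvector of the largest eigenvalue
    return v / v[0]

def blocking_matrix(v):
    # Eq. (42): M x (M-1) matrix whose columns are orthogonal to v.
    top = -(v[1:].conj() / v[0].conj())[None, :]
    return np.vstack([top, np.eye(v.shape[0] - 1)])
```

One can check orthogonality directly: for any column b of B_k[τ], v_k[τ]^H b = 0, so the adaptive lower path of the GSC only sees interference.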
The distortionless constraint is then satisfied by applying the GSC beamformer as

$$\hat{s}_k[\tau] = \left(v_k[\tau] - B_k[\tau]\, w_k[\tau]\right)^{\mathsf{H}} u_k[\tau], \tag{43}$$

where w_k[τ] ∈ ℂ^{M−1} is the adaptive filter weight, and the desired response for the loss is d_k[τ] = v_k[τ]^H u_k[τ].

Our objective is to learn an optimizer gφ that minimizes the AF ISE loss using this GSC filter implementation. By doing so, we learn an online, adaptive beamformer that listens in one direction and suppresses interferers from all others.

B. Experimental Design

We compare our Meta-GSC beamformer to LMS, RMSProp, NLMS, and BD-RLS beamformers in scenes with either diffuse or directional noise sources. We seek to test if our method can learn to process scenes with different spatial characteristics without modification. We train a single gφ and tune all baselines on a single dataset of all scene types. We measure performance using scale-invariant source-to-distortion ratio (SI-SDR) [98] and STOI. SI-SDR is computed as

$$a = (\hat{s}^{\top} s) / \|s\|^2, \tag{44}$$

$$\mathrm{SI\text{-}SDR}(s, \hat{s}) = 10 \cdot \log_{10}\left(\|a s\|^2 / \|a s - \hat{s}\|^2\right), \tag{45}$$

where larger values indicate better performance. STOI is computed between the output and desired speech signal. We also compute the BSS eval metrics: source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR) [99]. We train gφ via Algorithm 1 on one GPU, which took ≈ 24 h. We use an OLA filter with a Hann window size of N = 1024 samples and a hop of R = 512 samples on 6 channel 16 kHz audio.

We use the CHIME-3 challenge proposed in [100], [101]. This dataset contains scenes with simulated speech and relatively diffuse noise sources in a multi-channel environment. The array is rectangular and has six microphones spaced around the edge of a smart-tablet. There are 7,138 training files, 1,640 validation files, and 1,320 test files. When running this dataset with directional sources, we mix spatialized speech from one mixture with the spatialized speech from a random other mixture. We do not mix speech across folds.
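Before turning to results, the sketch below ties together the GSC output in (43) and the SI-SDR metric in (44)-(45); it mirrors the formulas only and is not the evaluation code used for the reported numbers.

```python
import numpy as np

def gsc_output(v, B, w, u):
    # Eq. (43): steer toward v while the adapted path B @ w cancels interference.
    return np.conj(v - B @ w) @ u

def si_sdr_db(s, s_hat):
    # Eqs. (44)-(45): scale-invariant SDR between time-domain signals, in dB.
    a = np.dot(s_hat, s) / np.dot(s, s)   # optimal scale, Eq. (44)
    return 10 * np.log10(np.sum((a * s) ** 2) / np.sum((a * s - s_hat) ** 2))
```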
Fig. 15: Performance comparison across interferers. The directional noise is the most challenging and diffuse is the easiest.

Fig. 16: BSS eval comparison across interferers. Meta-GSC provides more suppression with less distortion and artifacts.

C. Results and Discussion

We find that Meta-GSC outperforms LMS, RMSProp, NLMS, and BD-RLS in median performance metrics, as shown in Fig. 15 and Fig. 16 and in a qualitative analysis in Fig. 17.

1) Diffuse Interferers: The diffuse scenario tests the ability to suppress omnidirectional noise in a perceptually pleasant fashion. We show these comparisons in the "Diffuse Interferer" rows of Fig. 15 and Fig. 16. The median input STOI was 0.675 and the median input SI-SDR was −0.67 dB. Meta-GSC (blue, solid) outperforms BD-RLS (red, dotted) in SI-SDR performance, with Meta-GSC scoring a 12.54 dB improvement and BD-RLS scoring a 10.53 dB improvement. In STOI, Meta-GSC outperforms BD-RLS by 0.004. The BSS eval metrics show that Meta-GSC provides 5.8 dB more interferer suppression (SIR) while simultaneously introducing 1.4 dB fewer artifacts (SAR) and 2.02 dB less distortion (SDR) than BD-RLS. Typically, enhancement algorithms trade improved interference suppression for additional artifacts. However, the meta-training scheme produces an optimizer which simultaneously improves both.

2) Directional Interferers: The directional scenario tests the ability to suppress sources from one particular direction, typically achieved by steering nulls in the beam pattern. We show these comparisons in the "Directional Interferer" rows of Fig. 15 and Fig. 16. The median input STOI was 0.734 and the median input SI-SDR was −0.45 dB. Meta-GSC scores 11.03 dB on SI-SDR improvement whereas BD-RLS scores 8.58 dB. STOI performance trends similarly, with Meta-GSC outperforming BD-RLS by 0.029. The BSS eval metrics show a similar trend, with Meta-GSC providing 7.37 dB more SIR while simultaneously introducing 0.29 dB fewer artifacts (SAR) and 2.51 dB less distortion (SDR) than BD-RLS. We hypothesize Meta-GSC steers sharper nulls and learns an automatic VAD-like controller.

3) Beampattern Comparison: We compute beam plots for Meta-GSC and BD-RLS at ≈ 1 sec. and ≈ 2 sec. in a scene with a directional interferer. As expected, the models share the same look direction. However, our Meta-GSC method appears to steer more aggressive nulls, as shown in Fig. 17.

Fig. 17: Spatial response plots at ≈ 1 kHz for a directional interferer at ≈ 1 sec. (left) and ≈ 2 sec. (right).

4) Computational Complexity: Our Meta-GSC method has a single CPU core RTF of ≈ 0.54 and 32 ms latency. The optimizer network alone has ≈ 14K complex-valued parameters and a single CPU core RTF of ≈ 0.25.

IX. DISCUSSION, FUTURE WORK, AND CONCLUSION

A. Discussion

When we review the cumulative results of our approach, we note several interesting observations. First, the performance of our meta-learned AFs is strong and compares favorably to conventional optimizers across all tasks. Second, the performance difference between our meta-learned AFs and conventional AFs is larger for tasks that are traditionally more difficult to model by humans, including AEC double-talk, AEC path changes, and directional interference cancellation. Third, we found that we could use a single configuration of our method for all five tasks, which significantly reduced our development time. This suggests that our learned AFs are a viable replacement for human-derived AFs across a variety of audio signal processing tasks and are most valuable for complex AF tasks that typically require more human engineering effort.

When we reflect on how our learned optimizers are able to achieve this success, we note two core reasons. First, and most obvious, our meta-learned AFs are data-driven and trained on a particular class of signals (e.g. speech, directional noise, etc.). Thus, Meta-AF is limited by the capacity of our optimizer network and training data, not by signal modeling skill. Second, by framing AF development as a meta-learning problem, we effectively distill knowledge of our meta loss into the AF loss and corresponding learned update rules, thus enabling us to learn AFs which optimize objectives that would be very difficult (e.g. our frame-accumulated loss) or even impossible to develop manually (e.g. supervised losses, STOI, SI-SDR, etc.).

B. Future Work

The field of meta-learning and meta-learned optimizers is young and has a bright future for signal processing applications. Future directions of research include improving training methods, non-linear optimizee filtering, optimizer architectures, and the optimizer/meta loss. Particularly promising avenues for future work include identifying better optimization losses and better filter representations for Meta-AF style optimization. Overall, we are optimistic that our Meta-AF approach can benefit from both adaptive filtering advances as well as deep learning progress and will be an exciting research topic.

C. Conclusion

We present a general framework called Meta-AF to automatically develop adaptive filter update rules from data using meta-learning. Our proposed approach offers several benefits, including the first general-purpose method of learning AF update rules directly from data and a self-supervised training algorithm that does not require any labeled training data. To demonstrate the power of our framework, we test it on all four canonical adaptive filtering architectures and five unique tasks including system identification, acoustic echo cancellation, equalization, dereverberation, and GSC-based beamforming, all using a single configuration trained on different datasets. In all cases, we were able to train high-performing AFs, which outperformed conventional optimizers as well as certain state-of-the-art methods. We are excited about the future of deep learning combined with adaptive filters and hope our complete code release will stimulate further research and rapid progress.

X. ACKNOWLEDGMENTS

This work was generously supported by Adobe Research. We would also like to thank the anonymous reviewers for their feedback and comments, which greatly improved our manuscript.
REFERENCES

[1] B. Widrow and S. D. Stearns, Adaptive Signal Processing. Prentice-Hall, 1985.
[2] V. J. Mathews, "Circuits and systems tutorials: Adaptive polynomial filters," IEEE SPM, 1991.
[3] J.-S. Soo and K. K. Pang, "Multidelay block frequency domain adaptive filter," IEEE TASSP, 1990.
[4] S. S. Haykin, Adaptive Filter Theory. Pearson, 2008.
[5] J. A. Apolinário, J. A. Apolinário, and R. Rautmann, QRD-RLS Adaptive Filtering. Springer, 2009.
[6] L. R. Rabiner, B. Gold, and C. Yuen, Theory and Application of Digital Signal Processing. Prentice-Hall, 2016.
[7] N. N. Schraudolph, "Local gain adaptation in stochastic gradient descent," in ICANN, 1999.
[8] S. L. Gay, "An efficient, fast converging adaptive filter for network echo cancellation," in Asilomar Conf. on Sig., Sys. and Comp. IEEE, 1998.
[9] J. Benesty, T. Gänsler, D. R. Morgan, S. L. Gay, and M. M. Sondhi, Advances in Network and Acoustic Echo Cancellation. Springer, 2001.
[10] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. John Wiley & Sons, 2005.
[11] G. Enzner and P. Vary, "Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones," Elsevier Signal Processing, vol. 86, no. 6, 2006.
[12] J.-M. Valin, "On adjusting the learning rate in frequency domain echo cancellation with double-talk," IEEE TASLP, 2007.
[13] S. Malik and G. Enzner, "Model-based vs. traditional frequency-domain adaptive filtering in the presence of continuous double-talk and acoustic echo path variability," in IWAENC, 2008.
[14] ——, "Online maximum-likelihood learning of time-varying dynamical models in block-frequency-domain," in ICASSP. IEEE, 2010.
[15] F. Kuech, E. Mabande, and G. Enzner, "State-space architecture of the partitioned-block-based acoustic echo controller," in ICASSP. IEEE, 2014.
[16] F. Yang, G. Enzner, and J. Yang, "Frequency-domain adaptive Kalman filter with fast recovery of abrupt echo-path changes," IEEE SPL, vol. 24, no. 12, 2017.
[17] M. L. Valero and E. A. Habets, "Multi-microphone acoustic echo cancellation using relative echo transfer functions," in WASPAA. IEEE, 2017.
[18] T. Haubner, A. Brendel, M. Elminshawi, and W. Kellermann, "Noise-robust adaptation control for supervised acoustic system identification exploiting a noise dictionary," in ICASSP. IEEE, 2021.
[19] P. A. Nelson, H. Hamada, S. J. Elliott et al., "Adaptive inverse filters for stereophonic sound reproduction," IEEE TSP, 1992.
[20] P. A. Nelson, F. Orduna-Bustamante, and H. Hamada, "Inverse filter design and equalization zones in multichannel sound reproduction," IEEE TASLP, vol. 3, no. 3, 1995.
[21] M. Bouchard and S. Quednau, "Multichannel recursive-least-square algorithms and fast-transversal-filter algorithms for active noise control and sound reproduction systems," IEEE TSAP, vol. 8, no. 5, 2000.
[22] N. V. George and G. Panda, "Advances in active noise control: A survey, with emphasis on recent nonlinear techniques," Elsevier Signal Processing, vol. 93, no. 2, 2013.
[23] L. Lu, K.-L. Yin, R. C. de Lamare, Z. Zheng, Y. Yu, X. Yang, and B. Chen, "A survey on active noise control in the past decade, Part I: Linear systems," Elsevier Signal Processing, vol. 183, 2021.
[24] ——, "A survey on active noise control in the past decade, Part II: Nonlinear systems," Elsevier Signal Processing, vol. 181, 2021.
[25] T. Yoshioka, H. Tachibana, T. Nakatani, and M. Miyoshi, "Adaptive dereverberation of speech signals with speaker-position change detection," in ICASSP. IEEE, 2009.
[26] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE TASLP, vol. 18, no. 7, 2010.
[27] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE TASLP, vol. 20, no. 10, 2012.
[28] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. K. Chin et al., "Acoustic modeling for Google Home," in Interspeech, 2017.
[29] J. Caroselli, I. Shafran, A. Narayanan, and R. Rose, "Adaptive multichannel dereverberation for automatic speech recognition," in Interspeech, 2017.
[30] L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, "NARA-WPE: A Python package for weighted prediction error dereverberation in NumPy and TensorFlow for online and offline processing," in ITG-Symposium on Speech Communication. VDE, 2018.
[31] J. Wung, A. Jukić, S. Malik, M. Souden, R. Pichevar, J. Atkins, D. Naik, and A. Acero, "Robust multichannel linear prediction for online speech dereverberation using weighted householder least squares lattice adaptive filter," IEEE TSP, vol. 68, 2020.
[32] L. Griffiths and C. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE TAP, vol. 30, no. 1, 1982.
[33] O. Hoshuyama, A. Sugiyama, and A. Hirano, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," IEEE TSP, vol. 47, no. 10, 1999.
[34] J. Chen, J. Benesty, and Y. Huang, "A minimum distortion noise reduction algorithm with multiple microphones," IEEE TASLP, vol. 16, no. 3, 2008.
[35] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE TASLP, vol. 18, no. 2, 2009.
[36] Y. A. Huang and J. Benesty, "A multi-frame approach to the frequency-domain single-channel noise reduction problem," IEEE TASLP, vol. 20, no. 4, 2011.
[37] S. Markovich-Golan, S. Gannot, and I. Cohen, "A sparse blocking matrix for multiple constraints GSC beamformer," in ICASSP. IEEE, 2012.
[38] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE TASLP, vol. 25, no. 4, 2017.
[39] A. N. Birkett and R. A. Goubran, "Acoustic echo cancellation using NLMS-neural network structures," in ICASSP. IEEE, 1995.
[40] A. B. Rabaa and R. Tourki, "Acoustic echo cancellation based on a recurrent neural network and a fast affine projection algorithm," in IEEE IES, 1998.
[41] J. Malek and Z. Koldovský, "Hammerstein model-based nonlinear echo cancelation using a cascade of neural network and adaptive linear filter," in IWAENC. IEEE, 2016.
[42] S. Zhang and W. X. Zheng, "Recursive adaptive sparse exponential functional link neural network for nonlinear AEC in impulsive noise environment," IEEE TNNLS, 2017.
[43] M. M. Halimeh, C. Huemmer, and W. Kellermann, "A neural network-based nonlinear acoustic echo canceller," IEEE SPL, 2019.
[44] H. Zhang, K. Tan, and D. Wang, "Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions," in Interspeech, 2019.
[45] L. Ma, H. Huang, P. Zhao, and T. Su, "Acoustic echo cancellation by combining adaptive digital filter and recurrent neural network," arXiv:2005.09237, 2020.
[46] A. Ivry, I. Cohen, and B. Berdugo, "Nonlinear acoustic echo cancellation with deep learning," in Interspeech, 2021.
[47] J.-M. Valin, S. Tenneti, K. Helwani, U. Isik, and A. Krishnaswamy, "Low-complexity, real-time joint neural echo control and speech enhancement based on PercepNet," in ICASSP. IEEE, 2021.
[48] Y.-L. Zhou, Q.-Z. Zhang, X.-D. Li, and W.-S. Gan, "Analysis and DSP implementation of an ANC system using a filtered-error neural network," Journal of Sound and Vibration, vol. 285, no. 1-2, 2005.
[49] T. Krukowicz, "Active noise control algorithm based on a neural network and nonlinear input-output system identification model," Archives of Acoustics, vol. 35, no. 2, 2010.
[50] H. Zhang and D. Wang, "A deep learning approach to active noise control," in Interspeech, 2020.
[51] ——, "Deep ANC: A deep learning approach to active noise control," Neural Networks, vol. 141, 2021.
[52] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, and T. Nakatani, "Neural network-based spectrum estimation for online WPE dereverberation," in Interspeech, 2017.
[53] J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, and T. Nakatani, "Frame-online DNN-WPE dereverberation," in IWAENC. IEEE, 2018.
[54] L. Drude, C. Boeddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, and T. Nakatani, "Integrating neural network based beamforming and weighted prediction error dereverberation," in Interspeech, 2018.
[55] Z.-Q. Wang, G. Wichern, and J. Le Roux, "Convolutive prediction for reverberant speech separation," in WASPAA. IEEE, 2021.
[56] ——, "Convolutive prediction for monaural speech dereverberation and noisy-reverberant speaker separation," IEEE TASLP, vol. 29, 2021.
[57] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," in Interspeech, 2016.
[58] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP. IEEE, 2016.
[59] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florencio, and M. Hasegawa-Johnson, "Deep learning based speech beamforming," in ICASSP. IEEE, 2018.
[60] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, "ADL-MVDR: All deep learning MVDR beamformer for target speech separation," in ICASSP. IEEE, 2021.
[61] T. Haubner, A. Brendel, and W. Kellermann, "End-to-end deep learning-based adaptation control for frequency-domain adaptive system identification," in ICASSP. IEEE, 2022.
[62] H. Zhang, S. Kandadai, H. Rao, M. Kim, T. Pruthi, and T. Kristjansson, "Deep adaptive AEC: Hybrid of deep learning and adaptive acoustic echo cancellation," in ICASSP. IEEE, 2022.
[63] J. Casebeer, N. J. Bryan, and P. Smaragdis, "Auto-DSP: Learning to optimize acoustic echo cancellers," in WASPAA. IEEE, 2021.
[64] J. Casebeer, J. Donley, D. Wong, B. Xu, and A. Kumar, "NICE-beam: Neural integrated covariance estimators for time-varying beamformers," arXiv:2112.04613, 2021.
[65] R. Scheibler and M. Togami, "Surrogate source model learning for determined source separation," in ICASSP. IEEE, 2021.
[66] R. S. Sutton, "Adapting bias by gradient descent: An incremental version of delta-bar-delta," in AAAI, 1992.
[67] A. R. Mahmood, R. S. Sutton, T. Degris, and P. M. Pilarski, "Tuning-free step-size adaptation," in ICASSP. IEEE, 2012.
[68] J. Schmidhuber, "Learning to control fast-weight memories: An alternative to dynamic recurrent networks," Neural Computation, 1992.
[69] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, "Neural optimizer search with reinforcement learning," in ICML, 2017.
[70] D. Ha, A. Dai, and Q. V. Le, "Hypernetworks," in ICLR, 2017.
[71] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in ICML, 2017.
[72] M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in NeurIPS, 2016.
[73] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. Freitas, and J. Sohl-Dickstein, "Learned optimizers that scale and generalize," in ICML, 2017.
[74] L. Metz, N. Maheswaranathan, J. Nixon, D. Freeman, and J. Sohl-Dickstein, "Understanding and correcting pathologies in the training of learned optimizers," in ICML, 2019.
[75] T. Chen, W. Zhang, Z. Jingyang, S. Chang, S. Liu, L. Amini, and Z. Wang, "Training stronger baselines for learning to optimize," in NeurIPS, 2020.
[76] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Prentice-Hall, 1975.
[77] G. Hinton, N. Srivastava, and K. Swersky, "Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent," 2012.
[78] S. T. Alexander and A. L. Ghimikar, "A method for recursive least squares filtering based upon an inverse QR decomposition," IEEE TSP, 1993.
[79] P. Strobach, "Low-rank adaptive filters," IEEE TSP, 1996.
[80] F. Yang, G. Enzner, and J. Yang, "On the convergence behavior of partitioned-block frequency-domain adaptive filters," IEEE TSP, vol. 69, pp. 4906-4920, 2021.
[81] Z. Wang, Y. Na, Z. Liu, B. Tian, and Q. Fu, "Weighted recursive least square filter and neural network based residual echo suppression for the AEC-challenge," in ICASSP. IEEE, 2021.
[82] WebRTC Acoustic Echo Cancellation v3. https://webrtc.googlesource.com/src. Accessed: 2022-08-10.
[83] Acoustic Echo Cancellation Challenge - ICASSP 2021. https://www.microsoft.com/en-us/research/academic-program/acoustic-echo-cancellation-challenge-icassp-2021/results/. Accessed: 2022-08-10.
[84] V. Manohar, S.-J. Chen, Z. Wang, Y. Fujita, S. Watanabe, and S. Khudanpur, "Acoustic modeling for overlapping speech recognition: JHU CHiME-5 challenge system," in ICASSP. IEEE, 2019.
[85] J. Le Roux, N. Ono, and S. Sagayama, "Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction," in Interspeech, 2008.
[86] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne, "JAX: Composable transformations of Python+NumPy programs," 2018. [Online]. Available: http://github.com/google/jax
[87] T. Hennigan, T. Cai, T. Norman, and I. Babuschkin, "Haiku: Sonnet for JAX," 2020. [Online]. Available: http://github.com/deepmind/dm-haiku
[88] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proceedings of the IEEE, 1990.
[89] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2014.
[90] M. Wolter and A. Yao, "Complex gated recurrent neural networks," in NeurIPS, 2018.
[91] K. Sridhar, R. Cutler, A. Saabas, T. Parnamaa, M. Loide, H. Gamper, S. Braun, R. Aichner, and S. Srinivasan, "ICASSP 2021 acoustic echo cancellation challenge: Datasets, testing framework, and results," in ICASSP. IEEE, 2021.
[92] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in ICASSP. IEEE, 2017.
[93] G. Enzner, H. Buchner, A. Favrot, and F. Kuech, "Acoustic echo control," in Academic Press Library in Signal Processing. Elsevier, 2014, vol. 4.
[94] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE TASLP, vol. 19, no. 7, 2011.
[95] G. J. Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech? A dataset, insights, and challenges," IEEE SPL, vol. 22, no. 8, 2014.
[96] K. Kinoshita, M. Delcroix, S. Gannot, E. A. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., "A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research," EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, 2016.
[97] A. Jukić, T. van Waterschoot, and S. Doclo, "Adaptive speech dereverberation using constrained sparse multichannel linear prediction," IEEE SPL, vol. 24, no. 1, 2016.
[98] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR: Half-baked or well done?" in ICASSP. IEEE, 2019.
[99] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE TASLP, vol. 14, no. 4, 2006.
[100] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in ASRU. IEEE, 2015.
[101] ——, "The third CHiME speech separation and recognition challenge: Analysis and outcomes," Computer Speech & Language, 2017.