0% found this document useful (0 votes)
19 views28 pages

2782 On The Generalization of

This document discusses the generalization capabilities of E(n)-Equivariant Graph Neural Networks (EGNNs), establishing the first theoretical bounds for their generalization error. The authors highlight that the spectral norms of weight matrices significantly influence generalization, and propose a new regularization method that improves performance on various tasks. Empirical results validate the theoretical findings, demonstrating a strong correlation between theoretical and empirical generalization gaps.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views28 pages

2782 On The Generalization of

This document discusses the generalization capabilities of E(n)-Equivariant Graph Neural Networks (EGNNs), establishing the first theoretical bounds for their generalization error. The authors highlight that the spectral norms of weight matrices significantly influence generalization, and propose a new regularization method that improves performance on various tasks. Empirical results validate the theoretical findings, demonstrating a strong correlation between theoretical and empirical generalization gaps.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

On the Generalization of Equivariant Graph Neural Networks

Rafał Karczewski 1 Amauri H. Souza 1 2 Vikas Garg 1 3

Abstract in convolutional neural networks (CNNs) (Fukushima, 1980;


E(n)-Equivariant Graph Neural Networks (EG- LeCun et al., 1990) implement shift-equivariant functions;
NNs) are among the most widely used and suc- graph neural network (GNN) (Gori et al., 2005; Scarselli
cessful models for representation learning on et al., 2009; Gilmer et al., 2017) layers are equivariant to
geometric graphs (e.g., 3D molecules). How- node permutations. Remarkably, equivariant/invariant archi-
ever, while the expressivity of EGNNs has been tectures have led to breakthroughs in tasks such as drug dis-
explored in terms of geometric variants of the covery (Stokes et al., 2020), weather modeling (Verma et al.,
Weisfeiler-Leman isomorphism test, characteriz- 2024), simulation of physical systems (Sanchez-Gonzalez
ing their generalization capability remains open. et al., 2020), traffic forecasting (Derrow-Pinion et al., 2021),
In this work, we establish the first generalization and recommender systems (Ying et al., 2018).
bound for EGNNs. Our bound depicts a depen- Recently, E(n)-equivariant graph neural networks (EGNNs)
dence on the weighted sum of logarithms of the (Satorras et al., 2021) have emerged as an efficient and pow-
spectral norms of the weight matrices (EGNN pa- erful approach for learning representations of geometric
rameters). In addition, our main result reveals graphs — i.e., graphs embedded in Euclidean space. EG-
interesting novel insights: i) the spectral norms NNs employ a GNN-like message-passing scheme (Gilmer
of the initial layers may impact generalization et al., 2017), where the embeddings of each node are refined
more than the final ones; ii) ε-normalization is using messages from its neighbors in the graph at each layer.
beneficial to generalization — confirming prior Their design ensures that EGNNs inherent the permutation-
empirical evidence. We leverage these insights equivariance property of regular GNNs while also being
to introduce a spectral norm regularizer tailored equivariant to actions of the Euclidean group E(n), which
to EGNNs. Experiments on real-world datasets comprise all translations, rotations, and reflections in Rn .
substantiate our analysis, demonstrating a high EGNNs have been successfully applied to molecular prop-
correlation between theoretical and empirical gen- erty prediction (Satorras et al., 2021), drug binding structure
eralization gaps and the effectiveness of the pro- prediction (Stärk et al., 2022), generative modeling (Gar-
posed regularization scheme. cia Satorras et al., 2021), structure-based drug design (Fu
et al., 2022) and molecular dynamics (Arts et al., 2023).

1. Introduction Uncovering the strengths and limits of machine learning


models from a theoretical standpoint is imperative to seed-
Leveraging symmetries of the underlying domains/signals is ing a path to novel principled approaches. For instance, one
a key design principle underlying successful neural network of the most important aspects of any learning machine is its
architectures for structured data (Bronstein et al., 2021; Kipf ability to generalize beyond seen data. Outlining general-
& Welling, 2017; Cohen et al., 2018; Gilmer et al., 2017). ization guarantees for a given model class can profoundly
Typically, this boils down to identifying relevant symmetries impact its applicability. Another relevant aspect concerns
captured in the form of groups (e.g., groups of translations) the functions a model class can approximate. In this regard,
and then building predictive models by composing layers of Joshi et al. (2023) have recently analyzed the expressive
equivariant (or invariant) transformations to the actions of power of EGNNs in terms of a geometric version of the
such groups on the inputs. As classic examples, linear layers Weisfeiler-Leman isomorphism test (or 1-WL test) (Weis-
1 feiler & Lehman, 1968) — which has been extensively used
Department of Computer Science, Aalto University, Finland
2
Federal Institute of Ceará (Brazil) 3 YaiYai Ltd. Correspondence to study the expressivity of regular GNNs (Maron et al.,
to: Rafał Karczewski <rafal.karczewski@aalto.fi>, Vikas Garg 2019; Morris et al., 2019; Sato et al., 2019; Xu et al., 2019).
<vgarg@csail.mit.edu>. In contrast, establishing theoretical guarantees for the gen-
eralization capability of EGNNs remains an open problem.
Proceedings of the 41 st International Conference on Machine
Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by In this paper, we study the generalization of the EGNNs
the author(s).

1
On the Generalization of Equivariant Graph Neural Networks

from a theoretical perspective and derive a high probability Expressivity of geometric GNNs. Joshi et al. (2023) de-
bound for their generalization error using the PAC-learning velops a geometric Weisfeiler-Leman (GWL) test to eval-
framework (Valiant, 1984; Mohri et al., 2012). Our analysis uate the expressivity of geometric GNNs, revealing how
reveals three new insights: (i) the generalization gap de- equivariant and invariant layers affect their ability to distin-
pends on the weighted sum of logarithms of spectral norm guish complex geometric graphs. Meanwhile, Wang et al.
of weight matrices; (ii) bottom layers have higher weight (2024) introduces ViSNet, an efficient model that enhances
(aligning with common knowledge that bottom layers gener- molecular structure modeling through geometric informa-
alize better); and (iii) ε-normalization is essential to obtain tion, showcasing improved performance on molecular dy-
a bound polynomial in depth instead of exponential. We namics benchmarks.
compare this bound with existing results on Multilayer Per-
Equivariant neural networks. Equivariant Neural Net-
ceptrons (Bartlett et al., 2017; Neyshabur et al., 2018).
works have reshaped model design by embedding data sym-
Furthermore, we validate our results empirically on a real- metries directly into network architectures. Introduced by
world problem of molecular property prediction (Ramakrish- Cohen & Welling (2016), these networks ensure equivari-
nan et al., 2014), a task particularly well-suited for EGNNs. ance to group actions, a concept expanded in Euclidean
Specifically, we first establish that our theoretical bound spaces (Weiler & Cesa, 2019; Satorras et al., 2021) and Lie
highly correlates with the empirical one across different groups (Finzi et al., 2020). For 3D data, Tensor Field Net-
model hyperparameters. Second, we support our claims works (Thomas et al., 2018) and SE(3)-Transformers (Fuchs
that ε-normalization reduces the generalization gap when et al., 2020) exemplify their utility. The approach is theo-
increasing the depth of the model. retically deepened by Lang & Weiler (2021) and adapted
to dynamics with imperfect symmetries (Wang et al., 2022)
Finally, inspired by our theoretical findings, we propose a
and particle physics (Bogatskiy et al., 2020).
new regularization method. We evaluate it experimentally
and compare it with a commonly used regularization based Generalization of equivariant GNNs. Recent studies have
on the spectral norm. We find that ours leads to lower advanced our understanding of the generalization capabil-
test loss and generalization gap across different tasks and ities of equivariant GNNs. Petrache & Trivedi (2023) ex-
choices of model hyperparameters (e.g., number of layers). plore approximation-generalization trade-offs under group
equivariance, while Behboodi et al. (2022) establish a PAC-
In summary, our main contributions are: i) we derive the
Bayesian generalization bound for these networks. Bulusu
first generalization bounds for E(n)-Equivariant Graph Neu-
et al. (2021) focus on translation-equivariant neural net-
ral Networks; ii) we validate our theoretical analysis with
works, and Kondor & Trivedi (2018) consider equivariance
experiments on real-world data; iii) we show theoretically
to the action of compact groups. Additionally, Elesedy &
and empirically that ε-normalization helps generalization;
Zaidi (2021) demonstrate a strict generalization benefit for
iv) we propose a new regularization method tailored to EG-
equivariant models, and Liu et al. (2023) highlight how
NNs and assess its performance on twelve regression tasks.
physical inductive biases can enhance generalization. San-
nai et al. (2021) further provide improved generalization
bounds through quotient feature spaces.
2. Related Work
Expressivity and Generalization of GNNs. Understand- 3. Background
ing the expressivity and generalization capabilities of Graph
This section overviews E(n)-equivariant graph neural net-
Neural Networks (GNNs) is crucial for their application
works and provides definitions and results from learning
across diverse domains. Xu et al. (2019) demonstrated the
theory we leverage in our analysis. For readability, we
potential of GNNs to capture complex graph structures,
summarize the notation used in this work in Appendix A.
setting a benchmark for their expressivity. However, chal-
lenges in expressivity and generalization are highlighted
by Oono & Suzuki (2020) and Loukas (2020), who show 3.1. E(n)-Equivariant Graph Neural Network
that GNNs can lose expressive power or face limitations We consider geometric graphs and denote them by G =
based on their depth and width. Theoretical advances by (V, E, C, Z), where V is the set of vertices, E ⊆ V 2 is the
Barceló et al. (2020) and Garg et al. (2020) further dissect set of edges, C = {cv }v∈V ⊂ Rdc is an indexed set of
the logical expressiveness and representational boundaries vertex attributes (or colors), and Z = {zv }v∈V ⊂ Rdz is an
of GNNs, respectively. Recent studies, such as (Tang & indexed set of vertex coordinates. The neighborhood of a
Liu, 2023) and (Yang et al., 2023), offer new insights into vertex v is given by J (v) = {u : (u, v) ∈ E}.
GNNs’ generalization, suggesting inherent advantages of
GNN architectures over MLPs in graph learning tasks. Let G be a group acting on two sets X and X ′ by the
representations Φ and Ψ, respectively. We say a function

2
On the Generalization of Equivariant Graph Neural Networks

EG where ϕℓµ is commonly parameterized by multilayer percep-


n NN trons (MLPs). Whenever an MLP has multiple inputs, we
ta tio La
Ro ye assume that they are concatenated into a single vector as in
r
the original implementation 1 .
Next, the messages to each vertex v are used to recursively
update its coordinates using an auxiliary MLP ϕℓz and a
(ℓ) (ℓ)
normalization term γ(zu , zv , ε) as

zv(ℓ+1) = zv(ℓ) +
EG (ℓ) (ℓ)
NN 1 X zv − zu (2)
La n ϕℓz (µ(ℓ)
u→v )
io
tat |J (v)| (ℓ) (ℓ)
ye γ(z , z , ε)
r Ro u∈J (v) u v

and then combined using sum aggregation:


Figure 1. Visualization of rotation equivariance of the EGNN X
model. Colors depict node features. µ(ℓ)
v = µ(ℓ)
u→v . (3)
u∈J (v)

f : X → X ′ is G-equivariant if it commutes with the We then recursively update the embedding of vertex v using
group actions, i.e., for all g ∈ G and x ∈ X , we have that a MLP ϕℓh as
f (Φ(g)x) = Ψ(g)f (x). Here, we are interested in graph
models that are equivariant to: i) the Euclidean group E(n), h(ℓ+1)
v = ϕℓh (h(ℓ) (ℓ)
v , µv ). (4)
which comprises all translations, rotations, and reflections
of the n-dimensional Euclidean space; ii) the permutation
Remark 3.1 (ε-normalization). In their original formulation
group Σn . To represent actions of these groups on graphs,
(Satorras et al., 2021), EGNNs do not include normalization,
it is convenient to assume (WLOG) that V = {1, 2, . . . , n}, (ℓ) (ℓ)
the sets features C, Z are organized as matrices C ∈ Rn×dc , i.e. γ(zu , zv , ε) ≡ 1. In our analysis, we consider ε-
(ℓ) (ℓ) (ℓ) (ℓ)
Z ∈ Rn×dz , and the graph connectivity is given by the normalization, i.e. γ(zu , zv , ε) = ∥zu − zv ∥2 + ε.
adjacency matrix A ∈ {0, 1}n×n — in this case, we denote This variant is available in the original implementation, but
graphs as G = (A, C, Z). Thus, any g ∈ Σn can be repre- not discussed in the manuscript. We will see later in Section
sented by a corresponding permutation matrix Pg that acts 5 that ε-normalization plays a crucial role regarding the
on G by (Pg APg⊤ , Pg C, Pg Z). On the other hand, each generalization of EGNNs.
element g ∈ E(n) (or more precisely E(dz ) in our case)
In graph-level prediction tasks, we often obtain a repre-
is a translation (represented by a vector tg ∈ Rdz ×1 ) fol-
sentation for the entire graph by applying a mean readout
lowed by a linear transformation via an orthogonal matrix
function to the output of the last EGNN layer, i.e.,
Qg ∈ Rdz ×dz that acts on G by (A, C, (Z + 1n t⊤ g )Qg ),
where 1n is a n-dimensional column vector of ones. 1 X (Legnn )
hG = hv . (5)
E(n)-Equivariant GNNs (EGNNs, Satorras et al., 2021) are |V|
v∈V
arguably the most popular models for geometric graphs. The
basic idea consists of modifying message-passing GNNs Finally, we send the graph embedding hG through a final
by incorporating distances between vertices based on the MLP ϕout to achieve the graph-level prediction
geometric features into messages and recursively updating
g(G) = ϕout (hG ). (6)
the geometric features in an E(n)-equivariant fashion. This
way, EGNNs maintain the permutation equivariance of GNN Hereafter, we refer to the full mapping g(·) (i.e., EGNN +
layers while also being equivariant to E(n). final MLP) as the scoring model.
(0) (0)
Let hv = cv and zv = zv ∀v ∈ V. Also, assume that
each edge (u, v) ∈ E has an associated feature auv ∈ Rde . 3.2. Generalization bounds via Rademacher Complexity
At each layer ℓ = 0, 1, . . . , Legnn − 1, EGNNs compute The first important notion is that of generalization error (or
the incoming messages to each vertex v from its neighbors gap), defined as the difference between the expected (or
u ∈ J (v) as true) and empirical risks w.r.t. an arbitrary loss function.
1
 
µ(ℓ) ℓ (ℓ) (ℓ) (ℓ) (ℓ)
u→v = ϕµ hu , hv , ∥zu − zv ∥, auv , (1) github.com/vgsatorras/egnn/blob/main/
models/gcl.py, lines 203, 216

3
On the Generalization of Equivariant Graph Neural Networks

Definition 3.1 (Generalization Error). Let f : X → Y and Lemma 3.1 (Bartlett et al. (2017)). Let F ⊆ [−β, β]X be
L : Y × Y → R+ be a loss function. Let S = {(xi , yi )}m i=1 a class of functions taking values in [−β, β]. Also, assume
be a finite collection of i.i.d. samples from a distribution D that f0 ∈ F, where f0 (x) = 0 ∀x ∈ X . Define ∥f ∥∞ =
over X × Y. The generalization error of f is defined as the supx∈X ∥f (x)∥2 . Then, for any set S = {xi }m i=1 ⊆ X
difference between the expected loss and the sample loss:

1 X
m b S (F) ≤ inf
R √
RS,L (f ) = E(x,y)∼D [L(f (x), y)] − L(f (xi ), yi ).
α>0 m
m i=1 √ ! (9)
Z 2β m p
12
+ log N (F, r, ∥ · ∥∞ )dr .
m α
Deriving generalization bounds for a function class often
involves obtaining some measure of its size (or capacity). In
this regard, the Empirical Rademacher Complexity (ERC) is We will analyze function classes parametrized by weight
one of the most popular tools. In particular, ERC measures matrices, and the following result regarding the bound of
how well a function class can fit random noise. the covering number of sets of matrices will be useful.
Definition 3.2 (Empirical Rademacher complexity). Let Lemma 3.2 (Chen et al. (2020)). Let W = {W ∈ Rd1 ×d2 :
F ⊂ RX be a class of bounded functions and S = ∥W ∥2 ≤ λ} be a set of matrices with bounded spectral
norm, and r > 0 a constant. The covering number of W
i=1 ⊆ X a fixed set of size m. The empirical
{xi }m
Rademacher complexity of F with respect to S is can be bounded in terms of the Frobenius norm ∥ · ∥F as
√ √ d1 d2
min{ d1 , d2 }λ
" # 
m
1 X
N (W, r, ∥ · ∥F ) ≤ 1+2 .
b S (F) = Eσ sup
R σi f (xi ) , (7) r
f ∈F m i=1
(10)
where σ = [σ1 , . . . , σm ] is random vector of i.i.d. random
variables such that P(σi = 1) = P(σi = −1) = 0.5. 4. Main results
Notably, a fundamental result in learning theory bounds the To derive our generalization bound, we make the following
generalization error in terms of the ERC (Theorem 3.1). mild assumptions.
Theorem 3.1 (Mohri et al. (2012)). Let L : Y × Y → [0, 1] Assumption 4.1 (Inputs are bounded). The elements of the
be loss function and F ⊆ Y X a class of functions. Then for input graphs are contained in an Euclidean ball with a radius
any δ > 0, with probability at least 1 − δ over choosing a β. More specifically, there exists a β ≥ 1 such that for all
m-sized sample S ∼ Dm from a distribution D over X × Y, graphs G = (V, E, C, Z), and all v ∈ V and (v, u) ∈ E with
the following holds for any f ∈ F: feature vector avu , we have that
s
2
max{∥cv ∥2 , ∥zv − zu ∥2 , ∥auv ∥2 } ≤ β.
RS,L (f ) ≤ 2Rb S (FL ) + 3 log δ , (8)
2m Assumption 4.2 (EGNNs are parametrized with MLPs).
For all ℓ = 0, . . . , Legnn , the functions ϕℓz , ϕℓh , and ϕℓµ are
where MLPs with identical number of layers, denoted by Lϕ . The
scoring model ϕout is also an MLP with Lout layers. In
FL = {(x, y) 7→ L(f (x), y) : f ∈ F}. addition, all activation functions ψ(·) are Kψ -Lipschitz, for
some Kψ ≥ 1, and ψ(0) = 0.
From Theorem 3.1, finding a generalization bound reduces Assumption 4.3 (Weights have bounded spectral norm).
to bounding the ERC. We will use standard tools for bound- Let Wiϕ denote the weight matrix of the i-th (linear) layer
ing ERC, for which we need to introduce a concept of a Legnn
of the MLP ϕ. For all layers i of the MLPs ϕ ∈ {ϕℓz }ℓ=0 ∪
covering number. L L
{ϕh }ℓ=0 ∪ {ϕµ }ℓ=0 ∪ {ϕout }, there exists a βi,ϕ ≥ 1 such
ℓ egnn ℓ egnn

Definition 3.3 (Covering number). Let Θ be a set and ∥ · ∥


that ∥Wiϕ ∥2 ≤ βi,ϕ .
be a norm. We say that Θ is r-covered by a set Θ′ , with
respect to ∥ · ∥, if for all θ ∈ Θ there exists θ′ ∈ Θ′ with These assumptions are standard in the generalization litera-
∥θ − θ′ ∥ ≤ r. We define the covering number of Θ as the ture (Chen et al., 2020; Bartlett et al., 2017). In particular,
cardinality of the smallest Θ′ that r-covers Θ, and denote it commonly used activation functions are 1-Lipschitz and
by N (Θ, r, ∥ · ∥). vanish at 0 (e.g. ReLU, tanh, LeakyReLU, SiLU, and ELU).
Importantly, the result in Lemma 3.1 relates the ERC of a Our derivation of generalization bounds proceeds through
function class with its covering number. the following steps: (i) Show that the scoring function is

4
On the Generalization of Equivariant Graph Neural Networks

Our results
∥ · ∥ any norm. Then the ε-normalized function
Background
Lemma 4.1 Lemma 4.2
f (·)
-norm. preserves

Theorem 3.1 fε (·) :=


EGNNs are

bounded Lipschitz continuity


ERC bounds
∥f (·)∥ + ε
generalization gap

is Kfε -Lipschitz with Kfε = 3Kf/ε.


Lemma 4.3 Lemma 3.1

EGNNs are Lipschitz


Covering number The proof involves a telescoping trick and two forms of
wrt parameters
bounds ERC
triangle inequality. See Appendix D for the detailed proof.
Lemma 3.2
Now, we move on to the key result that will be used to derive
Lemma 4.4

Covering number
the generalization bounds for the EGNN model, namely
Lipschitz continuity of the EGNN node embeddings with
Scoring model is
for matrices
Lipschitz wrt parameters

respect to model parameters:


Proposition 4.1

Lemma 4.3 (Lipschitz continuity of EGNN wrt params).


Generalization bound
Consider EGNNs as defined in Equations (1)-(4). Let
(ℓ)
hv (W) denote the embedding of node v at layer ℓ produced
Figure 2. Overview of our results and their dependencies. by an EGNN with parameters W = (Wϕh , Wϕz , Wϕµ ),

where Wϕ = {Wiϕ }i=1 — recall that Wiϕ denotes the
weight matrix of the i-th (linear) layer of the MLP ϕ.
Lipschitz continuous w.r.t model parameters; (ii) Show that
the Cartesian product of coverings of weight matrices de- For any two EGNNs with parameters W and W̃, we have
fines a covering of function class of scoring models; and
(iii) Use Lemmas 3.2, 3.1 and Theorem 3.1 to establish ∥h(ℓ) (ℓ) ℓ (ℓ)
v (W) − hv (W̃)∥2 ≤ CQ B distℓ (W, W̃), (13)
the generalization bound. We present a diagram of our where
theoretical contributions and their dependencies in Figure 2.
ℓ Lϕ
X X X
4.1. Lemmata distℓ (W, W̃) = ∥Wiϕ − W̃iϕ ∥2 ,
ℓ′ =1 ϕ∈{ϕℓ′ −1 ,ϕℓ′ −1 ,ϕℓ′ } i=1
z µ
We begin by proving that the outputs of the EGNN model
h

remain bounded as long as inputs are bounded: D, C, β (ℓ) , M (ℓ) are as defined in Lemma 4.1, and
Lemma 4.1 (Boundedness of EGNN embeddings). Con- 
C

sider an EGNN model as described in Equations (1)-(4). Q = 224D2 max ,1
ε
For any graph G, we have that ∀v ∈ V, and u ∈ J (v):

!3 ℓ−1 (14)
Y Y
max{∥h(ℓ) (ℓ) (ℓ) (ℓ)
v ∥2 , ∥zv −zu ∥2 , D∥µu→v ∥2 } ≤ Cβ (ℓ)
, (11) B (ℓ) = M (i) β (i) ,
i=0 i=0
where D is the maximum degree in the graph, C = 8Dβ, with the convention that an empty product is equal to 1.
Q 2

and β (ℓ) = (20D)ℓ i=0 M (i)
with
The proof is in Appendix E.
Lϕ Finally, we can show that the scoring model defined in
(12)
Y
M (ℓ)
= max Kψ βi,ϕ Equation 6 is also Lipschitz continuous w.r.t. its parameters.
ϕ∈{ϕℓ−1 ,ϕℓ−1 ,ϕℓµ } i=1
Lemma 4.4 (Lipschitz continuity w.r.t. parameters of the
h z

scoring model). Let W (g) = (W, Wϕout ) denote the pa-


where βi,ϕ is the bound of ∥Wϕi ∥2 , and Lϕ , the number of
rameters of the scoring model g as defined in Equation (6),
layers of ϕ, and Kψ the Lipschitz constant of the activation
with W as defined in Lemma 4.3, Wϕout = {Wiϕout }L out
i=1 and
function. (g)
g(G; W ) the output of the scoring model with parame-
ters W (g) . Then, g is Lipschitz continuous w.r.t. the model
The proof consists of deriving a recursive relation for the
parameters:
bound as a function of the number of layers, determining its
growth rate, and an induction argument (Appendix C). ∥g(G; W (g) ) − g(G; W̃ (g) )∥ ≤ Kg dist(W (g) , W̃ (g) ),
Next, we show that ε-normalization preserves Lipschitz
where
continuity:
Lout
Lemma 4.2 (Lipschitz continuity preserved under
X
dist(W (g) , W̃ (g) ) = distLegnn (W, W̃)+ ∥Wiϕout −W̃iϕout ∥2
ε-normalization). Let f be a Kf -Lipschitz function and i=1

5
On the Generalization of Equivariant Graph Neural Networks

C, Q, B (·) are as defined in Lemma 4.3 and


Table 1. Comparison with existing generalization bounds. We
Lout compare with bounds for MLPs (Bartlett et al., 2017; Neyshabur
!
Y
Kg = 2 Kψ βi,ϕout CQLegnn B (Legnn ) . et al., 2018) and Equivariant EGNNs (G-EGNNs) (Behboodi et al.,
i=1 2022). All values are given in Big-O notation.
d (width) W (weights) L (depth)
The proof can be found in Appendix F. We now proceed to
PL  ∥Wi ∥2,1 2/3 /2
 3
our main result on the generalization of the scoring model.
QL
MLPs (2017) log(d) i=1 ∥Wi ∥2 i=1 ∥Wi ∥2 1

qP
QL L ∥Wi ∥2F
MLPs (2018)
p p
d log(d) i=1 ∥Wi ∥2 i=1 ∥Wi ∥22 L log(L)
4.2. Generalization bounds qP
QL L ∥Wi ∥2F
G-EGNNs (2022)
p p
d log(d) ∥Wi ∥2 L log(L)
We have developed all the necessary machinery to derive the i=1 i=1 ∥Wi ∥22

generalization bound. To simplify notation, we introduce


PL
EGNNs (ours)
p p
d log(d) i=1 log(max{1, ∥Wi ∥2 }) L log(L)

the stretching factor of a weight matrix:


κ(W ) := max{1, ∥W ∥2 }, (15)
Comparison with existing bounds. In Table 1 we compare
which we will see plays an important role in the bound. the derived bound with existing bounds for MLPs (Bartlett
Proposition 4.1 (Generalization bound of the scoring et al., 2017; Neyshabur et al., 2018). Most notably, our
model). Let G be the space of geometric graphs as de- bound depends on the sum of logarithms of the spectral
fined in Section 3.1, Y = [0, 1] the space of labels, norms of weight matrices as opposed to their product, which
g : G → R the makes it less sensitive to high spectral norms. On the other
 scoring model as defined in (6) and
hand, since our bound ignores weight matrices with a norm
L(by , y) = min (b y − y)2 , 1 the loss function. Then for
any δ > 0, with probability at least 1 − δ over choosing a lower than 1, it cannot get arbitrarily small. We provide
sample S ∼ Dm from a distribution D over G × Y of size more details on the comparison in Appendix H. Note that
m, the following holds: it is difficult to directly compare different bounds as they
 √ q  make different assumptions (Xu & Wang, 2018).
d L √ log 2δ ε-normalization. In our proofs and experiments we as-
RS,L (g) = O  √ ∆+ √ , (16)
m m sumed the EGNN layer to be defined by equations (1)-(4)
(ℓ) (ℓ) (ℓ) (ℓ)
with γ(zu , zv , ε) := ∥zu − zv ∥2 + ε. The original
where L = 3Lϕ Legnn + Lout is the total number of weight formulation (Satorras et al., 2021) includes this variant in
matrices, d is the maximum width across all layers, and the implementation, but the manuscript and experiments
Legnn Lϕ Lout
consider only the unnormalized model corresponding to
(ℓ) (ℓ)
γ(zu , zv , ε) ≡ 1. In subsequent work (Garcia Satorras
X X X X
∆= γi,ϕ + γi,ϕout
ℓ=0 ϕ∈{ϕℓ−1 ,ϕℓ−1 ,ϕℓµ } i=1 i=1 et al., 2021) it has been found empirically that using the
z
h
normalized variant with ε = 1 yields more stable results.
γi,ϕ = wi,ϕ log(Γ · κ(Wiϕ ))
( Note that ε-normalization does not affect the equivariance
1 for ϕ = ϕout properties nor the expressivity. The former follows since
wi,ϕ = (ℓ) (ℓ) (ℓ)
Legnn − ℓ + 1 for ϕ ∈ {ϕℓh , ϕℓz , ϕℓµ } γ(zu , zv , ε) is E(n)-invariant while the latter from µu→v
(ℓ) (ℓ)
dmLDβKψ depending on ∥zu − zv ∥2 . We now argue theoretically
Γ= . why normalization plays a crucial role in generalization.
ε̂
Concretely, following the same reasoning as in the proof of
To prove it, we (i) use the fact that g is Lipschitz continuous
Proposition 4.1, we obtain the bound, which is exponential
w.r.t. weight matrices to show that an appropriately chosen
in the number of layers as opposed to polynomial in the case
matrix covering yields the covering of the class of scoring
of the normalized model. See Appendix I for more details.
models (Lemma G.1); (ii) Use Lemma 3.2 to bound the
We provide additional empirical evidence in Section 6.
covering number; (iii) Leverage Lemma 3.1 to bound the
ERC and; (iv) Establish the bound via Theorem 3.1. The Spectral norm. Proposition 4.1 shows that the general-
exact bound together with the detailed proof can be found ization gap depends on the spectral norm of weight matri-
in Appendix G. ces, which is generally well known (Bartlett et al., 2017;
Neyshabur et al., 2018; Behboodi et al., 2022). However,
5. Discussion our derivation of the bound for EGNNs reveals new insights.
First, not all layers contribute equally. From Equation 16 it
In this section, we discuss the derived generalization bound can be seen that the bottom layers have a higher weight than
and some of its implications. the top layers. This aligns with the intuition that bottom

6
On the Generalization of Equivariant Graph Neural Networks

layers learn more abstract, high-level information, which We observe that the EGNN bound enjoys a better depen-
makes them generalize better as opposed to top layers learn- dence on the node degree D, the spectral norm of the
ing more fine-grained, task-specific information. weights W , and possibly better (but certainly not worse)
on the node depth d. However, it has a worse dependence
Second, the dependence on the spectral norms is logarithmic.
on the depth L. We leave further in-depth analysis of the
This implies that the bound is less sensitive to outliers than
impact of E(n)-equivariance for future work.
e.g., the sum of spectral norms. Furthermore, from the
definition of κ (Equation 15), we see that weight matrices
with spectral norms lower than 1 are ignored, implying that Table 2. Comparison with message-passing GNNs (MP-GNNs)
decreasing the norms further yields no additional gains. (Garg et al., 2020). The a, b, c subscripts denote different model
Dependence on the training set. We note that the spectral variants. All values are given in Big-O notation.
norms of the weight matrices depend on the (size of the) d (width) D (node degree) W (weights) L (depth)
√ √ √
training set, and in general, one cannot decouple this de- MP-GNNa
p
d log d D log D κ(W )3 log κ(W ) log L
pendence for an already trained network, e.g., we cannot MP-GNNb

d log d

D log D
p
κ(W )3 log κ(W )

L log L
assume the spectral norm would not change as we vary the √ √ √
MP-GNNc
p
d d log d D log D κ(W )3 log κ(W ) L log L
number of examples while evaluating the effect of sample
√ √ √
complexity (i.e., size of the training set size) on generaliza- EGNNs (ours)
p
d log d log D log κ(W ) L L log L
tion. The dependence of the spectral norm on the training
set, learning algorithm, etc. was not considered in this study.
Spectral regularization. Multiple regularization methods 6. Experiments
relying on the spectral norm have been proposed, including This section substantiates our theoretical analysis with ex-
the sum of squared spectral norms (Yoshida & Miyato, 2017) periments on real-world regression tasks. In particular, we
or explicitly enforcing the spectral norm to be 1 (Miyato first consider the generalization behavior of EGNNs as a
et al., 2018). We propose a new regularization method function of the model variables (e.g., number of layers). We
that leverages the findings unique to our study, i.e. explicit demonstrate that our bounds highly correlate with empir-
minimization of the generalization gap: ical generalization gaps. We also empirically assess the
beneficial impact of the ε-normalization on generalization,
X
R(W) = wi,ϕ log(κ(Wiϕ )). (17)
i,ϕ validating our findings. Lastly, we show the efficacy of regu-
larized EGNNs obtained from our analysis, highlighting per-
Since log is concave, this can intuitively be considered an
formance gains compared to another regularization method
example of concave regularization (Zhang & Zhang, 2012),
and empirical risk minimizers (no regularization). Our code
which generally favors sparse solutions. To see this, note
is available at https://github.com/Aalto-QuML/
that the derivative of log is x1 , and therefore gradient-based
GeneralizationEGNNs.
optimization will prioritize decreasing norms of weight
matrices, whose norm is the lowest. However, note that Evaluation setup. We consider four molecular property
in our context, instead of log we use log ◦κ, which is prediction tasks from the QM9 dataset (Ramakrishnan et al.,
no longer concave in the entire domain. We empirically 2014). From the data split available in (Satorras et al., 2021;
evaluate R in Section 6. Anderson et al., 2019), we select a subset of 2K molecules
for training and use the entire val/test partitions with ap-
Impact of equivariance. Since the EGNN network without
proximately 17K and 13K molecules, respectively. We train
the equivariant update (Equation 2) reduces to a regular
all models for 1000 epochs using the Adam optimizer. We
message passing GNN, we can compare our bound to the
run five independent trials with different seeds. We provide
Rademacher bound for message passing GNNs (Garg et al.,
further implementation details in Appendix J.
2020). To make that comparison possible, we need to make
the following simplifications to our setting: we assume that Empirical vs. theoretical generalization gaps. To demon-
(1) each MLP has only a single hidden layer; (2) different strate the practical relevance of the bounds extracted from
GNN layers share the same weight matrices; and (3) there our analysis, we contrast the empirical and theoretical gen-
is no output MLP ϕout . eralization gaps in terms of: i) the spectral norms across
epochs; ii) the maximum number of hidden units (hidden
We include three variants of the MP-GNN bound: MP-
dim, d); and iii) the number of EGNN layers (Legnn ). We
GNNa , MP-GNNb , and MP-GNNc , because Garg et al.
also report Pearson correlation coefficients between the gen-
(2020) report a different dependence on various parameters,
eralization curves. Figure 3 shows the results (mean and
depending on the value of the percolation compexity, which
standard deviation) over five runs.
depends on the Lipschitz constants of non-linearities and
the spectral norm. See Table 2 for the comparison. Regarding the spectral norm (top row in Figure 3), we ob-

7
On the Generalization of Equivariant Graph Neural Networks

HOM O ∆ µ LU M O
0.3 0.2
0.2 0.4
Generalization

Empirical gap 0.2


0.1 Our bound 0.2 0.1
0.1
ρ: 0.88±0.03 ρ: 0.90±0.02 ρ: 0.80±0.05 ρ: 0.86±0.06
0.0 0.0 0.0 0.0
1 250 500 750 1000 1 250 500 750 1000 1 250 500 750 1000 1 250 500 750 1000
Epochs Epochs Epochs Epochs
0.4
0.3 0.2
Generalization

0.2
0.2
0.2
0.1 0.1
0.1
ρ: 0.93±0.04 ρ: 0.96±0.03 ρ: 0.96±0.02 ρ: 0.98±0.01
0.0 0.0 0.0 0.0
4 8 16 24 32 4 8 16 24 32 4 8 16 24 32 4 8 16 24 32
# hidden dim, d # hidden dim, d # hidden dim, d # hidden dim, d

0.4 0.3
0.2
Generalization

0.2
0.2
0.2 0.1
0.1
0.1
ρ: 0.93±0.04 ρ: 0.96±0.04 ρ: 0.89±0.13 ρ: 0.95±0.08
0.0 0.0 0.0 0.0
1 3 5 7 1 3 5 7 1 3 5 7 1 3 5 7
# layers, Legnn # layers, Legnn # layers, Legnn # layers, Legnn

Figure 3. Empirical vs. theoretical generalization gaps as a function of the spectral norms over epochs (top), width (middle), and number
of EGNN layers (bottom). Overall, our bounds (black curves) highly correlate with the empirical generalization gap (blue curves). The
variables εHOMO , ∆ε, µ, and εLUMO denote molecular properties on the QM9 dataset.

HOM O ∆ To evaluate the impact of the ε-normalization, we run ex-


0.10 Non-norm. periments with and without the normalization for Legnn ∈
Generalization gap

-norm. 0.10 {1, 3, 5, 7}. Figure 4 reports bar plots for each property. As
0.05 our theory suggests, ε-normalization positively affects gen-
0.05
eralization, especially as we increase the number of layers.
0.00
0.00 Spectral norm regularization. To assess the effectiveness
µ LU M O of our bound on the spectral norm as a regularizer
0.3 (Equation 17), we compare it against a baseline called
Generalization gap

0.2
S PEC AVG that takes the average of the spectral norm over
0.2
all layers as the regularization term. Using the average
0.1
0.1 (or sum) of spectral norms (or their square) is a common
practice (Yoshida & Miyato, 2017). We also report results
0.0
1 3 5 7
0.0
1 3 5 7 without regularization. Again, we consider regression tasks
Legnn Legnn from QM9. For both regularizers, we select the optimal
penalization factor λ ∈ {1, 1e-3, 1e-5, 1e-7} using the
Figure 4. Impact of ε-normalization in terms of the number of validation set. Details are given in Appendix J.
layers. Using ε-normalization yields smaller generalization gaps Table 3 shows the results in terms of test error (MSE)
as the depth (Legnn ) increases.
and generalization gap for different numbers of layers and
tasks. Remarkably, our method significantly outperforms
serve that our bound can capture the trend of the empirical S PEC AVG in virtually all cases. Also, our approach was
gap, which settles down at epochs 500 (εHOMO and µ) or able to produce both smaller test errors and generalization
250 (∆ε and εLUMO ). In 3 out of 4 tasks, the average cor- gaps than the baselines. Overall, S PEC AVG achieves better
relation is greater than 0.8. We can observe an even better results than models without regularization.
correlation when looking at the dependence on hidden di-
mensionality and number of layers — in 7 out of 8 cases, the Additional results. In Appendix K, we provide additional
correlation is over 0.9. These outcomes support our theory. visualizations and experiments. More specifically, we show

8
On the Generalization of Equivariant Graph Neural Networks

Table 3. Test mean squared error (MSE) and generalization gap on QM9 for different regularization methods. ‘None’ denotes EGNNs
without regularization. We denote the best-performing methods in bold. In almost all cases, employing the method derived from our
theoretical analysis leads to the smallest test errors and generalization gaps. Results are multiplied by 100 for better visualization.
Test MSE Generalization gap
Task Legnn None S PEC AVG Ours None S PEC AVG Ours
3 5.72 ± 0.48 5.43 ± 0.37 5.76 ± 0.51 3.61 ± 0.50 3.33 ± 0.49 3.71 ± 0.48
εHOMO 5 9.66 ± 1.32 8.45 ± 1.28 6.71 ± 0.48 8.77 ± 1.40 7.29 ± 1.42 3.71 ± 1.56
7 11.13 ± 0.73 9.25 ± 0.70 7.90 ± 0.50 10.45 ± 0.90 8.33 ± 1.02 4.43 ± 1.12
3 12.40 ± 1.34 12.36 ± 1.34 11.61 ± 1.09 8.02 ± 1.52 7.94 ± 1.46 6.00 ± 1.90
∆ε 5 25.42 ± 7.53 25.38 ± 7.77 17.39 ± 1.37 22.92 ± 6.87 22.91 ± 7.04 10.22 ± 5.17
7 28.49 ± 4.35 26.28 ± 5.00 23.93 ± 2.73 27.46 ± 3.83 25.14 ± 4.34 18.16 ± 7.39
3 30.56 ± 1.18 30.03 ± 0.98 29.56 ± 0.47 8.11 ± 1.96 7.78 ± 1.47 5.76 ± 1.76
µ 5 30.94 ± 1.95 30.48 ± 2.26 28.43 ± 0.26 12.91 ± 2.27 11.80 ± 1.38 8.47 ± 2.85
7 31.73 ± 2.46 31.35 ± 2.59 28.86 ± 0.73 18.42 ± 1.71 17.40 ± 1.66 9.79 ± 4.45
3 5.68 ± 0.45 5.68 ± 0.45 5.11 ± 0.36 3.07 ± 0.53 3.07 ± 0.53 2.46 ± 0.41
εLUMO 5 7.51 ± 1.76 6.74 ± 1.07 5.85 ± 0.56 6.01 ± 1.68 5.17 ± 0.90 3.75 ± 1.07
7 7.87 ± 1.03 7.45 ± 0.73 7.16 ± 0.90 7.03 ± 1.17 6.71 ± 0.88 5.58 ± 2.29

the behavior of generalization gaps over training to illustrate (Flagship program). CSC – IT Center for Science, Finland,
that gaps from our method decrease with the values of λ, as provided computational support for this work. We thank the
expected. In contrast, S PEC AVG has little effect in reducing anonymous reviewers for their constructive feedback.
the gap compared to non-regularized models for most
penalization values. We also compare our regularizer with Impact Statement
S PEC AVG on other eight QM9 regression tasks. Overall,
our method achieves smaller generalization gaps and com- Incorporating symmetries such as equivariance in deep
petitive MSE values. Lastly, we report the average time per learning models is crucial for advancing fields like physics-
epoch of our regularizer for different tasks to show its com- inspired learning, drug and material discovery, protein de-
putational overhead. For the largest model, the regularized sign, and molecular dynamics simulation. Good generaliza-
vs. non-regularized time ratio is approximately 1.5. tion performance is a key goal of machine learning mod-
els. This work investigates the generalization of equivariant
7. Conclusion graph neural networks, providing several insights that we
hope would foster the design of better models. We do not
In this work, we analyzed the generalization capabilities of foresee any particularly negative societal impact of this
E(n)-Equivariant Graph Neural Networks (EGNNs) from work.
a theoretical perspective. We provided high probability
bounds of the generalization error and discussed in detail References
its implications. Specifically, its logarithmic dependence on
the spectral norm of weight matrices, the uneven impact of Anderson, B., Hy, T. S., and Kondor, R. Cormorant: Covari-
different layers, and the importance of ε-normalization. We ant molecular neural networks. In Advances in Neural
performed extensive experiments to validate our theory. Information Processing Systems (NeurIPS), 2019.
Additionally, we proposed a novel regularization method Arts, M., Garcia Satorras, V., Huang, C.-W., Zugner, D.,
inspired by our theoretical results. We show experimentally Federici, M., Clementi, C., Noé, F., Pinsler, R., and
that it helps reduce both the test loss and the generalization van den Berg, R. Two for one: Diffusion models and force
gap and performs better than the commonly used sum of fields for coarse-grained molecular dynamics. Journal of
spectral norms of weight matrices. Chemical Theory and Computation, 19(18):6151–6159,
2023.
Acknowledgements
Barceló, P., Kostylev, E. V., Monet, M., Pérez, J., Reutter, J.,
This work has been supported by the Jane and Aatos Erkko and Silva, J.-P. The logical expressiveness of graph neural
Foundation project (grant 7001703) on “Biodesign: Use of networks. In 8th International Conference on Learning
artificial intelligence in enzyme design for synthetic biol- Representations (ICLR 2020), 2020.
ogy”, and the Finnish Center for Artificial Intelligence FCAI
Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-

9
On the Generalization of Equivariant Graph Neural Networks

normalized margin bounds for neural networks. In Guyon, Fu, T., Gao, W., Coley, C., and Sun, J. Reinforced genetic
I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., algorithm for structure-based drug design. Advances
Vishwanathan, S., and Garnett, R. (eds.), Advances in in Neural Information Processing Systems, 35:12325–
Neural Information Processing Systems, volume 30. Cur- 12338, 2022.
ran Associates, Inc., 2017.
Fuchs, F., Worrall, D., Fischer, V., and Welling, M. Se
Behboodi, A., Cesa, G., and Cohen, T. S. A pac-bayesian (3)-transformers: 3d roto-translation equivariant attention
generalization bound for equivariant networks. Advances networks. Advances in neural information processing
in Neural Information Processing Systems, 35:5654– systems, 33:1970–1981, 2020.
5668, 2022.
Fukushima, K. Neocognitron: A self-organizing neural
Bogatskiy, A., Anderson, B., Offermann, J., Roussi, M., network model for a mechanism of pattern recognition
Miller, D., and Kondor, R. Lorentz group equivariant unaffected by shift in position. Biological Cybernetics,
neural network for particle physics. In International 36(4), 1980.
Conference on Machine Learning, pp. 992–1002, 2020.
Garcia Satorras, V., Hoogeboom, E., Fuchs, F., Posner, I.,
Bronstein, M. M., Bruna, J., Cohen, T., and Velickovic, and Welling, M. E(n) equivariant normalizing flows.
P. Geometric deep learning: Grids, groups, graphs, In Advances in Neural Information Processing Systems,
geodesics, and gauges. ArXiv e-prints: 2104.13478, volume 34, pp. 4181–4192. Curran Associates, Inc., 2021.
2021.
Garg, V. K., Jegelka, S., and Jaakkola, T. Generalization
Bulusu, S., Favoni, M., Ipp, A., Müller, D. I., and Schuh, D.
and representational limits of graph neural networks. In
Generalization capabilities of translationally equivariant
International Conference on Machine Learning, 2020.
neural networks. Physical Review D, 104(7):074504,
2021. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and
Dahl, G. E. Neural message passing for quantum chem-
Chen, M., Li, X., and Zhao, T. On generalization bounds of
istry. In International Conference on Machine Learning
a family of recurrent neural networks. In Proceedings of
(ICML), 2017.
the Twenty Third International Conference on Artificial
Intelligence and Statistics, pp. 1233–1243, 2020.
Gori, M., Monfardini, G., and Scarselli, F. A new model for
Cohen, T. and Welling, M. Group equivariant convolutional learning in graph domains. In IEEE International Joint
networks. In International conference on machine learn- Conference on Neural Networks (IJCNN), 2005.
ing, pp. 2990–2999. PMLR, 2016.
Joshi, C. K., Bodnar, C., Mathis, S. V., Cohen, T., and
Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spher- Liò, P. On the expressive power of geometric graph
ical cnns. In International Conference on Learning Rep- neural networks. In International Conference on Machine
resentations (ICLR), 2018. Learning, 2023.

Derrow-Pinion, A., She, J., Wong, D., Lange, O., Hester, Kipf, T. N. and Welling, M. Semi-supervised classifica-
T., Perez, L., Nunkesser, M., Lee, S., Guo, X., Wiltshire, tion with graph convolutional networks. In International
B., Battaglia, P. W., Gupta, V., Li, A., Xu, Z., Sanchez- Conference on Learning Representations (ICLR), 2017.
Gonzalez, A., Li, Y., and Velickovic, P. Eta prediction
with graph neural networks in google maps. In Pro- Kondor, R. and Trivedi, S. On the generalization of equivari-
ceedings of the 30th ACM International Conference on ance and convolution in neural networks to the action of
Information & Knowledge Management, pp. 3767–3776, compact groups. In International Conference on Machine
2021. Learning, pp. 2747–2755, 2018.

Elesedy, B. and Zaidi, S. Provably strict generalisation ben- Lang, L. and Weiler, M. A wigner-eckart theorem for group
efit for equivariant models. In International Conference equivariant convolution kernels. In International Confer-
on Machine Learning, pp. 2959–2969. PMLR, 2021. ence on Learning Representations, 2021.

Finzi, M., Stanton, S., Izmailov, P., and Wilson, A. G. Gen- LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
eralizing convolutional neural networks for equivariance R. E., Hubbard, W., and Jackel, L. D. Handwritten digit
to lie groups on arbitrary continuous data. In Interna- recognition with a back-propagation network. In Ad-
tional Conference on Machine Learning, pp. 3165–3176, vances in Neural Information Processing Systems (NIPS),
2020. 1990.

10
On the Generalization of Equivariant Graph Neural Networks

Liu, Y., Cheng, J., Zhao, H., Xu, T., Zhao, P., Tsung, F., Li, Sannai, A., Imaizumi, M., and Kawano, M. Improved
J., and Rong, Y. Improving generalization in equivariant generalization bounds of group invariant/equivariant deep
graph neural networks with physical inductive biases. networks via quotient feature spaces. In Uncertainty in
In The Twelfth International Conference on Learning Artificial Intelligence, pp. 771–780, 2021.
Representations, 2023.
Sato, R., Yamada, M., and Kashima, H. Approximation ra-
Loukas, A. What graph neural networks cannot learn: depth tios of graph neural networks for combinatorial problems.
vs width. In International Conference on Learning Rep- In Advances in Neural Information Processing Systems
resentations, 2020. (NeurIPS), 2019.

Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Satorras, V. G., Hoogeboom, E., and Welling, M. E(n) equiv-
Y. Provably powerful graph networks. In Advances in ariant graph neural networks. In International Conference
Neural Information Processing Systems (NeurIPS), 2019. on Machine Learning, 2021.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spec- Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and
tral normalization for generative adversarial networks. In Monfardini, G. The graph neural network model. IEEE
International Conference on Learning Representations, Transactions on Neural Networks, 20(1):61–80, 2009.
2018.
Stärk, H., Ganea, O., Pattanaik, L., Barzilay, D., and
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Founda- Jaakkola, T. EquiBind: Geometric deep learning for drug
tions of Machine Learning. The MIT Press, 2012. ISBN binding structure prediction. In Proceedings of the 39th
9780262018258. URL http://www.jstor.org/ International Conference on Machine Learning, volume
stable/j.ctt5hhcw1. 162, pp. 20503–20521, 2022.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, Stokes, J. M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz,
J. E., Rattan, G., and Grohe, M. Weisfeiler and leman A., Donghia, N. M., MacNair, C. R., French, S., Carfrae,
go neural: Higher-order graph neural networks. In AAAI L. A., Bloom-Ackermann, Z., Tran, V. M., Chiappino-
Conference on Artificial Intelligence (AAAI), 2019. Pepe, A., Badran, A. H., Andrews, I. W., Chory, E. J.,
Church, G. M., Brown, E. D., Jaakkola, T. S., Barzilay, R.,
Neyshabur, B., Bhojanapalli, S., and Srebro, N. A and Collins, J. J. A deep learning approach to antibiotic
PAC-bayesian approach to spectrally-normalized margin discovery. Cell, 180(4):688 – 702, 2020.
bounds for neural networks. In International Conference
on Learning Representations, 2018. Tang, H. and Liu, Y. Towards understanding generalization
of graph neural networks. In Proceedings of the 40th
Oono, K. and Suzuki, T. Graph neural networks exponen- International Conference on Machine Learning, volume
tially lose expressive power for node classification. In 202, pp. 33674–33719, 23–29 Jul 2023.
International Conference on Learning Representations,
2020. Thomas, N., Smidt, T., Kearnes, S., Yang, L., Li, L.,
Kohlhoff, K., and Riley, P. Tensor field networks:
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., Rotation-and translation-equivariant neural networks for
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, 3d point clouds. arXiv preprint arXiv:1802.08219, 2018.
A. Automatic differentiation in pytorch. In Advances
in Neural Information Processing Systems (NeurIPS - Valiant, L. G. A theory of the learnable. Communications
Workshop), 2017. of the ACM, 27(11):1134–1142, 1984.

Petrache, M. and Trivedi, S. Approximation-generalization Verma, Y., Heinonen, M., and Garg, V. ClimODE: Cli-
trade-offs under (approximate) group equivariance, 2023. mate and weather forecasting with physics-informed
neural ODEs. In The Twelfth International Confer-
Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, ence on Learning Representations, 2024. URL https:
O. A. Quantum chemistry structures and properties of //openreview.net/forum?id=xuY33XhEGR.
134 kilo molecules. Scientific Data, 1(1), 2014.
Wang, R., Walters, R., and Yu, R. Approximately equiv-
Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., ariant networks for imperfectly symmetric dynamics. In
Leskovec, J., and Battaglia, P. Learning to simulate Proceedings of the 39th International Conference on Ma-
complex physics with graph networks. In International chine Learning, volume 162, pp. 23078–23091. PMLR,
Conference on Machine Learning (ICML), 2020. 2022.

11
On the Generalization of Equivariant Graph Neural Networks

Wang, Y., Wang, T., Li, S., He, X., Li, M., Wang, Z., Zheng,
N., Shao, B., and Liu, T.-Y. Enhancing geometric repre-
sentations for molecules with equivariant vector-scalar
interactive message passing. Nature Communications, 15
(1):313, 2024.
Weiler, M. and Cesa, G. General e (2)-equivariant steerable
cnns. Advances in neural information processing systems,
32, 2019.
Weisfeiler, B. and Lehman, A. A. A reduction of a graph
to a canonical form and an algebra arising during this
reduction. Nauchno-Technicheskaya Informatsia, 2(9):
12–16, 1968.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful
are graph neural networks? In International Conference
on Learning Representations (ICLR), 2019.
Xu, Y. and Wang, X. Understanding weight normalized deep
neural networks with rectified linear units. In Bengio, S.,
Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R. (eds.), Advances in Neural Informa-
tion Processing Systems, volume 31. Curran Associates,
Inc., 2018.
Yang, C., Wu, Q., Wang, J., and Yan, J. Graph neural
networks are inherently good generalizers: Insights by
bridging gnns and mlps. In International Conference on
Learning Representations (ICLR), 2023.
Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton,
W. L., and Leskovec, J. Graph convolutional neural
networks for web-scale recommender systems. In In-
ternational Conference on Knowledge Discovery & Data
Mining (KDD), 2018.
Yoshida, Y. and Miyato, T. Spectral norm regularization for
improving the generalizability of deep learning. ArXiv,
abs/1705.10941, 2017.
Zhang, C.-H. and Zhang, T. A general theory of concave reg-
ularization for high-dimensional sparse estimation prob-
lems. Statistical Science, 27(4):576–593, 2012.

12
On the Generalization of Equivariant Graph Neural Networks

A. Notation
Table 4: Summary of notation and abbreviations.

Notation Description
L loss function
G = (V, E, C, Z) Geometric graph
V Set of nodes in the graph
E ⊆V 2
Neighbourhood structure
C Set of node features
Z Set of node coordinates
f ∈F f is a function from the function class F
S, m S is a training set consisting of m input-output samples, i.e., S = {(xi , yi )}m
i=1

∥W ∥2 spectral norm of a matrix W ∈ Rd1 ×d2


∥x∥2 Euclidean norm of a vector x ∈ Rd
D Maximal degree in the graph G
ϕ Multilayer Perceptron (MLP)
ψ Activation function
Kψ ≥ 1 Lipschitz constant of ψ
Wiϕ weight matrix of i-th layer of ϕ
βi,ϕ ≥ 1 bound of the spectral norm of Wiϕ , i.e., ∥Wiϕ ∥2 ≤ βi,ϕ
Lϕ , Lout , Legnn Number of layers of ϕ, # layers of the scoring MLP ϕout , # layers of the EGNN
ϕℓ−1 ℓ−1
h , ϕz , ϕℓµ MLPs in the ℓ-th EGNN layer
(ℓ) (ℓ) (ℓ) (ℓ)
hv , zv , µv , µu→v EGNN embeddings and messages after ℓ layers (Equations (1-4))
β≥1 Bound of the input - Assumption 4.1

B. Lipschitz Continuity of Multilayer Perceptrons


In this section, we show that multilayer perceptrons (MLPs) are Lipschitz continuous, both with respect to their inputs as
well as their parameter matrices. We begin with defining the MLP.
Definition B.1 (MLP). A Multi-layer perceptron ϕ with Lϕ layers and activation function ψ is given by

ϕ = fW ϕ ◦ · · · ◦ fW ϕ ,
Lϕ 1

where Wiϕ is the i-th weight matrix and f is given by:

fW (x) = ψ(W x),

where ψ is the activation function. Additionally, we will write

ϕk = fW ϕ ◦ · · · ◦ fW ϕ
k 1

to denote the output of the k-th layer (ϕ ≡ ϕLϕ ) and

ϕ(x) = ϕ(x; Wϕ )
L
for Wϕ = {Wiϕ }i=1
ϕ
whenever we want to emphasize which weight matrices were used.

13
On the Generalization of Equivariant Graph Neural Networks

Consider now a function given by


f(W,b) (x) = ψ(W x + b).

One can augment the input space by appending a constant 1 to x to obtain x̃ and redefining the weight matrix as W̃ =
concat(W, b). Thus the function can be expressed as

f(W,b) (x) = fW̃ (x̃).

Given this representation, we can assume WLOG that b = 0.


Lemma B.1 (Boundedness of MLP). Consider an MLP ϕ with Lϕ layers, where ψ is the activation function, Wiϕ is the
i-the weight matrix. Suppose that ψ is Kψ -Lipschitz and satisfying ψ(0) = 0. Then for all x:
 

Y
∥ϕ(x)∥2 ≤  Kψ ∥Wiϕ ∥2  ∥x∥2 .
i=1

Proof.

∥ϕ(x)∥2 = ∥ϕLϕ (x)∥2 = ∥ψ(WLϕϕ ϕLϕ −1 (x))∥2 = ∥ψ(WLϕϕ ϕLϕ −1 (x)) − ψ(0)∥2 ≤ Kψ ∥WLϕϕ ϕLϕ −1 (x)∥2

Y
≤ Kψ ∥WLϕϕ ∥2 ∥ϕLϕ −1 (x)∥2 ≤ · · · ≤ ∥x∥2 Kψ ∥Wiϕ ∥2 .
i=1

Lemma B.2 (Lipschitz continuity of MLP w.r.t. model parameters). Consider two Lϕ -layer MLPs given by two sets of
parameters: Wϕ and W̃ϕ sharing the same activation function ψ with Lipschitz constant Kψ and ψ(0) = 0. Assume further
that all weight matrices have bounded norm, i.e. for all i, max{∥Wiϕ ∥2 , ∥W̃iϕ ∥2 } ≤ βi,ϕ for some βi,ϕ ≥ 1. Then for all
x, Wϕ , W̃ϕ :
 

Y
∥ϕ(x; Wϕ ) − ϕ(x; W̃ϕ )∥2 ≤ ∥x∥2  Kψ βi,ϕ  dist(Wϕ , W̃ϕ ), (18)
i=1

where

X
dist(Wϕ , W̃ϕ ) := ∥Wiϕ − W̃iϕ ∥2 .
i=1

Proof. For presentation clarity, we assume throughout this proof ∥∥ ≡ ∥∥2 - spectral norm for matrices and Euclidean norm
for vectors.

∥ϕ(x; Wϕ ) − ϕ(x; W̃ϕ )∥ = ∥ψ(WLϕϕ ϕLϕ −1 (x; Wϕ )) − ψ(W̃Lϕϕ ϕLϕ −1 (x; W̃ϕ ))∥
≤ Kψ ∥WLϕϕ ϕLϕ −1 (x; Wϕ ) − W̃Lϕϕ ϕLϕ −1 (x; W̃ϕ )∥.

Now using the telescoping technique:

∥WLϕϕ ϕLϕ −1 (x; Wϕ ) − W̃Lϕϕ ϕLϕ −1 (x; W̃ϕ )∥


≤ ∥WLϕϕ ϕLϕ −1 (x; Wϕ ) − WLϕϕ ϕLϕ −1 (x; W̃ϕ )∥ + ∥WLϕϕ ϕLϕ −1 (x; W̃ϕ ) − W̃Lϕϕ ϕLϕ −1 (x; W̃ϕ )∥
≤ ∥WLϕϕ ∥ · ∥ϕLϕ −1 (x; Wϕ ) − ϕLϕ −1 (x; W̃ϕ )∥ + ∥ϕLϕ −1 (x; W̃ϕ )∥ · ∥WLϕϕ − W̃Lϕϕ ∥
 
Lϕ −1
Y
≤ ∥WLϕϕ ∥ · ∥ϕLϕ −1 (x; Wϕ ) − ϕLϕ −1 (x; W̃ϕ )∥ + ∥x∥  Kψ ∥W̃iϕ ∥ ∥WLϕϕ − W̃Lϕϕ ∥.
i=1

14
On the Generalization of Equivariant Graph Neural Networks

Therefore
 
Lϕ −1
L
Y
∥ϕ(x; Wϕ ) − ϕ(x; W̃ϕ )∥ ≤ ∥x∥Kψ ϕ  ∥W̃iϕ ∥ ∥WLϕϕ − W̃Lϕϕ ∥ + Kψ ∥WLϕϕ ∥ · ∥ϕLϕ −1 (x; Wϕ ) − ϕLϕ −1 (x; W̃ϕ )∥
i=1

and by induction we get



L
X
∥ϕ(x; Wϕ ) − ϕ(x; W̃ϕ )∥ ≤ ∥x∥Kψ ϕ ai ∥Wiϕ − W̃iϕ ∥,
i=1

where   
i−1 Lϕ Lϕ
Y Y Y
ai =  ∥W̃jϕ ∥  ∥Wjϕ′ ∥ ≤ βj,ϕ .
j=1 j ′ =i+1 j=1

Lemma B.3 (Lipschitz continuity of MLP w.r.t. model input). Let $\phi$ be an $L_\phi$-layer MLP with a $K_\psi$-Lipschitz activation function $\psi$. Then $\phi$ is Lipschitz continuous with respect to the input:

$$\|\phi(x) - \phi(y)\|_2 \le \left(\prod_{i=1}^{L_\phi} K_\psi \|W_i^\phi\|_2\right)\|x - y\|_2.$$

Proof. Below we assume $\|\cdot\| \equiv \|\cdot\|_2$.

$$\|\phi(x) - \phi(y)\| = \|\phi_{L_\phi}(x) - \phi_{L_\phi}(y)\| \le K_\psi \|W_{L_\phi}^\phi\| \cdot \|\phi_{L_\phi-1}(x) - \phi_{L_\phi-1}(y)\| \le \cdots \le \left(\prod_{i=1}^{L_\phi} K_\psi \|W_i^\phi\|\right)\|x - y\|.$$

Lemma B.4 (Multiple-input MLP). Suppose an $L_\phi$-layer MLP $\phi$ takes multiple inputs $x_1, \ldots, x_K$ by concatenating them into a single vector $x = \mathrm{CONCAT}([x_1, \ldots, x_K])$. Then the output is bounded:

$$\|\phi(x_1,\ldots,x_K)\|_2 \le \sqrt{K}\left(\prod_{i=1}^{L_\phi} K_\psi\|W_i^\phi\|_2\right)\sum_{i=1}^{K}\|x_i\|_2.$$

Proof. The claim follows from Lemma B.1 and the inequality

$$\|x\|_2 = \sqrt{\|x_1\|_2^2 + \cdots + \|x_K\|_2^2} \le \sqrt{K}\max_i\|x_i\|_2 \le \sqrt{K}\sum_{i=1}^{K}\|x_i\|_2.$$

C. Proof of Lemma 4.1


In this section, we prove Lemma 4.1. We recall it for completeness:
Lemma 4.1 (Boundedness of EGNN embeddings). Consider an EGNN model as described in Equations (1)-(4). For any
graph G, we have that ∀v ∈ V, and u ∈ J (v):

$$\max\left\{\|h_v^{(\ell)}\|_2,\ \|z_v^{(\ell)} - z_u^{(\ell)}\|_2,\ D\|\mu_{u\to v}^{(\ell)}\|_2\right\} \le C\beta^{(\ell)}, \tag{11}$$

where $D$ is the maximum degree in the graph, $C = 8D\beta$, and $\beta^{(\ell)} = (20D)^\ell\left(\prod_{i=0}^{\ell} M^{(i)}\right)^2$ with

$$M^{(\ell)} = \max_{\phi\in\{\phi_h^{\ell-1},\,\phi_z^{\ell-1},\,\phi_\mu^{\ell}\}}\ \prod_{i=1}^{L_\phi} K_\psi\beta_{i,\phi}, \tag{12}$$

where $\beta_{i,\phi}$ is the bound on $\|W_i^\phi\|_2$, $L_\phi$ is the number of layers of $\phi$, and $K_\psi$ is the Lipschitz constant of the activation function.

Proof. In the proof we assume $\|\cdot\| \equiv \|\cdot\|_2$. First note that (11) implies that for all $v$ and $\ell$: $\|\mu_v^{(\ell)}\| \le C\beta^{(\ell)}$, due to the definition of $\mu_v^{(\ell)}$ as a sum aggregation of the $\mu_{u\to v}^{(\ell)}$ over at most $D$ neighbors. We will show the result by induction. Note that, from Lemmas B.3 and B.4,

$$\|\mu_{u\to v}^{(0)}\| = \left\|\phi_\mu^0\left(h_u^{(0)}, h_v^{(0)}, \|z_u^{(0)} - z_v^{(0)}\|, a_{uv}\right)\right\| \le 2(2\beta + \beta + \beta)M^{(0)} \le \frac{1}{D}C\beta^{(0)},$$

which establishes the base case $\ell = 0$. Now assume that (11) holds for all $\ell' < \ell$.
From Lemmas B.3, B.4 and the inductive hypothesis:

$$\|h_v^{(\ell)}\| \le \sqrt{2}M^{(\ell)}\left(\|h_v^{(\ell-1)}\| + \|\mu_v^{(\ell-1)}\|\right) \le 2\sqrt{2}M^{(\ell)}C\beta^{(\ell-1)} \le C\beta^{(\ell)},$$

where the last inequality holds because $2\sqrt{2} \le 20D$ and $M^{(\ell)} \le (M^{(\ell)})^2$. Similarly for $z$:

$$z_u^{(\ell)} - z_v^{(\ell)} = z_u^{(\ell-1)} - z_v^{(\ell-1)} + \frac{1}{|\mathcal{J}(u)|}\sum_{u'\in\mathcal{J}(u)}\frac{z_u^{(\ell-1)} - z_{u'}^{(\ell-1)}}{\|z_u^{(\ell-1)} - z_{u'}^{(\ell-1)}\| + \varepsilon}\,\phi_z^{\ell-1}(\mu_{u'\to u}^{(\ell-1)}) - \frac{1}{|\mathcal{J}(v)|}\sum_{v'\in\mathcal{J}(v)}\frac{z_v^{(\ell-1)} - z_{v'}^{(\ell-1)}}{\|z_v^{(\ell-1)} - z_{v'}^{(\ell-1)}\| + \varepsilon}\,\phi_z^{\ell-1}(\mu_{v'\to v}^{(\ell-1)}) \tag{19}$$

and therefore:

$$\|z_u^{(\ell)} - z_v^{(\ell)}\| \le C\beta^{(\ell-1)} + 2M^{(\ell)}C\beta^{(\ell-1)} \le 3M^{(\ell)}C\beta^{(\ell-1)} \le C\beta^{(\ell)},$$

because $3 \le 20D$. Finally, for $\mu$:

$$\|\mu_{u\to v}^{(\ell)}\| \le 2M^{(\ell)}\left(\|h_v^{(\ell)}\| + \|h_u^{(\ell)}\| + \|z_v^{(\ell)} - z_u^{(\ell)}\| + \|a_{uv}\|\right) \le 2M^{(\ell)}\left(4\sqrt{2}M^{(\ell)} + 1 + 2M^{(\ell)} + 1\right)C\beta^{(\ell-1)} \le 20\left(M^{(\ell)}\right)^2 C\beta^{(\ell-1)} = \frac{1}{D}C\beta^{(\ell)},$$

where the last inequality holds because $4 + 4\sqrt{2} \le 10$.

D. Proof of Lemma 4.2


In this section we prove Lemma 4.2. We recall it for completeness:
Lemma 4.2 (Lipschitz continuity preserved under ε-normalization). Let f be a Kf -Lipschitz function and ∥ · ∥ any norm.
Then the ε-normalized function
$$f_\varepsilon(\cdot) := \frac{f(\cdot)}{\|f(\cdot)\| + \varepsilon}$$

is $K_{f_\varepsilon}$-Lipschitz with $K_{f_\varepsilon} = 3K_f/\varepsilon$.


Proof.

$$\|f_\varepsilon(x) - f_\varepsilon(y)\| = \left\|\frac{f(x)}{\|f(x)\|+\varepsilon} - \frac{f(y)}{\|f(y)\|+\varepsilon}\right\| = \frac{\left\|(\|f(y)\|+\varepsilon)f(x) - (\|f(x)\|+\varepsilon)f(y)\right\|}{(\|f(x)\|+\varepsilon)(\|f(y)\|+\varepsilon)}$$
$$\le \frac{\varepsilon}{(\|f(x)\|+\varepsilon)(\|f(y)\|+\varepsilon)}\|f(x) - f(y)\| + \frac{\left\|\|f(y)\|f(x) - \|f(x)\|f(y)\right\|}{(\|f(x)\|+\varepsilon)(\|f(y)\|+\varepsilon)}$$
$$\le \frac{K_f}{\varepsilon}\|x - y\| + \frac{\left\|\|f(y)\|f(x) - \|f(x)\|f(y)\right\|}{(\|f(x)\|+\varepsilon)(\|f(y)\|+\varepsilon)}.$$

Focusing on the numerator of the last term:

$$\left\|\|f(y)\|f(x) - \|f(x)\|f(y)\right\| = \left\|\|f(y)\|f(x) - \|f(x)\|f(x) + \|f(x)\|f(x) - \|f(x)\|f(y)\right\|$$
$$\le \big|\|f(y)\| - \|f(x)\|\big|\cdot\|f(x)\| + \|f(x)\|\cdot\|f(x) - f(y)\| \le \|f(x)\|\left(K_f\|x-y\| + \big|\|f(x)\| - \|f(y)\|\big|\right) \le 2\|f(x)\|K_f\|x-y\|,$$

where in the last step we used the fact that $|\|a\| - \|b\|| \le \|a - b\|$. Therefore:

$$\frac{\left\|\|f(y)\|f(x) - \|f(x)\|f(y)\right\|}{(\|f(x)\|+\varepsilon)(\|f(y)\|+\varepsilon)} \le \frac{2\|f(x)\|K_f\|x-y\|}{(\|f(x)\|+\varepsilon)(\|f(y)\|+\varepsilon)} \le \frac{2K_f}{\varepsilon}\|x-y\|$$

and

$$\|f_\varepsilon(x) - f_\varepsilon(y)\| \le \frac{3K_f}{\varepsilon}\|x - y\|.$$
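For reference, ε-normalization is a one-line operation; the sketch below applies it row-wise, mirroring the role it plays in Equation (2), and checks the Lipschitz behavior of Lemma 4.2 empirically with $f$ taken to be the identity (an illustrative choice, not the paper's exact update).

```python
import torch

def eps_normalize(v, eps=1.0):
    """epsilon-normalization: v / (||v|| + eps), applied row-wise."""
    return v / (v.norm(dim=-1, keepdim=True) + eps)

# The normalized map inherits Lipschitz continuity (Lemma 4.2): a quick
# empirical check on random pairs of inputs with f = identity (K_f = 1).
eps = 1.0
x, y = torch.randn(1000, 3), torch.randn(1000, 3)
ratios = (eps_normalize(x, eps) - eps_normalize(y, eps)).norm(dim=-1) / (x - y).norm(dim=-1)
print(ratios.max())  # stays below 3 * K_f / eps = 3
```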

E. Proof of Lemma 4.3


In this section we prove Lemma 4.3. We recall it for completeness:
Lemma 4.3 (Lipschitz continuity of EGNN w.r.t. parameters). Consider EGNNs as defined in Equations (1)-(4). Let $h_v^{(\ell)}(\mathcal{W})$ denote the embedding of node $v$ at layer $\ell$ produced by an EGNN with parameters $\mathcal{W} = (\mathcal{W}_{\phi_h}, \mathcal{W}_{\phi_z}, \mathcal{W}_{\phi_\mu})$, where $\mathcal{W}_\phi = \{W_i^\phi\}_{i=1}^{L_\phi}$ (recall that $W_i^\phi$ denotes the weight matrix of the $i$-th linear layer of the MLP $\phi$).
For any two EGNNs with parameters $\mathcal{W}$ and $\tilde{\mathcal{W}}$, we have

$$\|h_v^{(\ell)}(\mathcal{W}) - h_v^{(\ell)}(\tilde{\mathcal{W}})\|_2 \le CQ^\ell B^{(\ell)}\,\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}), \tag{13}$$

where

$$\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}) = \sum_{\ell'=0}^{\ell}\ \sum_{\phi\in\{\phi_z^{\ell'-1},\,\phi_h^{\ell'-1},\,\phi_\mu^{\ell'}\}}\ \sum_{i=1}^{L_\phi}\|W_i^\phi - \tilde{W}_i^\phi\|_2,$$

$D$, $C$, $\beta^{(\ell)}$, $M^{(\ell)}$ are as defined in Lemma 4.1, and

$$Q = 224D^2\max\left\{\frac{C}{\varepsilon}, 1\right\}, \qquad B^{(\ell)} = \left(\prod_{i=0}^{\ell} M^{(i)}\right)^3\prod_{i=0}^{\ell-1}\beta^{(i)}, \tag{14}$$

with the convention that an empty product is equal to 1.

Proof. Throughout the proof we assume $\|\cdot\| \equiv \|\cdot\|_2$. We proceed by induction, proving the stronger statement that $h_v^{(\ell)}$, $z_v^{(\ell)}$, and $\mu_{u\to v}^{(\ell)}$ are all Lipschitz continuous in the parameters. The base case obviously holds for $h_v^{(0)}$ and $z_v^{(0)}$ as they do not depend on the model parameters.


We see from Lemma B.2 and our assumption that inputs are bounded:

$$\|\mu_{u\to v}^{(0)}(\mathcal{W}) - \mu_{u\to v}^{(0)}(\tilde{\mathcal{W}})\| \le M^{(0)}\,8\beta\,\mathrm{dist}(\mathcal{W}_{\phi_\mu^0},\tilde{\mathcal{W}}_{\phi_\mu^0}) \le \frac{1}{D}CM^{(0)}\,\mathrm{dist}(\mathcal{W}_{\phi_\mu^0},\tilde{\mathcal{W}}_{\phi_\mu^0}) \le \frac{1}{D}CM^{(0)}\,\mathrm{dist}_0(\mathcal{W},\tilde{\mathcal{W}}), \tag{20}$$
thus establishing the base case for the induction. Assume now that for all $\ell' < \ell$ there exist $K_h^{(\ell')}$, $K_z^{(\ell')}$, $K_\mu^{(\ell')}$ such that for all $u$, $v$ and all $\mathcal{W}$, $\tilde{\mathcal{W}}$:

$$\|h_v^{(\ell')}(\mathcal{W}) - h_v^{(\ell')}(\tilde{\mathcal{W}})\| \le K_h^{(\ell')}\,\mathrm{dist}_{\ell'}(\mathcal{W},\tilde{\mathcal{W}})$$
$$\|z_v^{(\ell')}(\mathcal{W}) - z_v^{(\ell')}(\tilde{\mathcal{W}})\| \le K_z^{(\ell')}\,\mathrm{dist}_{\ell'}(\mathcal{W},\tilde{\mathcal{W}}) \tag{21}$$
$$\|\mu_{u\to v}^{(\ell')}(\mathcal{W}) - \mu_{u\to v}^{(\ell')}(\tilde{\mathcal{W}})\| \le \frac{1}{D}K_\mu^{(\ell')}\,\mathrm{dist}_{\ell'}(\mathcal{W},\tilde{\mathcal{W}}).$$

Note that (21) implies that $\|\mu_v^{(\ell')}(\mathcal{W}) - \mu_v^{(\ell')}(\tilde{\mathcal{W}})\| \le K_\mu^{(\ell')}\,\mathrm{dist}_{\ell'}(\mathcal{W},\tilde{\mathcal{W}})$. We start with $h$ and proceed with a telescoping argument:

$$\|h_v^{(\ell)}(\mathcal{W}) - h_v^{(\ell)}(\tilde{\mathcal{W}})\| = \left\|\phi_h^{\ell-1}\big(h_v^{(\ell-1)}(\mathcal{W}), \mu_v^{(\ell-1)}(\mathcal{W}); \mathcal{W}_{\phi_h^{\ell-1}}\big) - \phi_h^{\ell-1}\big(h_v^{(\ell-1)}(\tilde{\mathcal{W}}), \mu_v^{(\ell-1)}(\tilde{\mathcal{W}}); \tilde{\mathcal{W}}_{\phi_h^{\ell-1}}\big)\right\|$$
$$\le \left\|\phi_h^{\ell-1}\big(h_v^{(\ell-1)}(\mathcal{W}), \mu_v^{(\ell-1)}(\mathcal{W}); \mathcal{W}_{\phi_h^{\ell-1}}\big) - \phi_h^{\ell-1}\big(h_v^{(\ell-1)}(\mathcal{W}), \mu_v^{(\ell-1)}(\mathcal{W}); \tilde{\mathcal{W}}_{\phi_h^{\ell-1}}\big)\right\|$$
$$+ \left\|\phi_h^{\ell-1}\big(h_v^{(\ell-1)}(\mathcal{W}), \mu_v^{(\ell-1)}(\mathcal{W}); \tilde{\mathcal{W}}_{\phi_h^{\ell-1}}\big) - \phi_h^{\ell-1}\big(h_v^{(\ell-1)}(\tilde{\mathcal{W}}), \mu_v^{(\ell-1)}(\tilde{\mathcal{W}}); \tilde{\mathcal{W}}_{\phi_h^{\ell-1}}\big)\right\|.$$

Now we use Lemmas B.2, B.4 and 4.1 to bound the first term and Lemma B.3 to bound the second

$$\|h_v^{(\ell)}(\mathcal{W}) - h_v^{(\ell)}(\tilde{\mathcal{W}})\| \le 2M^{(\ell)}\left(\|h_v^{(\ell-1)}(\mathcal{W})\| + \|\mu_v^{(\ell-1)}(\mathcal{W})\|\right)\mathrm{dist}(\mathcal{W}_{\phi_h^{\ell-1}},\tilde{\mathcal{W}}_{\phi_h^{\ell-1}})$$
$$+ \sqrt{2}M^{(\ell)}\left(\|h_v^{(\ell-1)}(\mathcal{W}) - h_v^{(\ell-1)}(\tilde{\mathcal{W}})\| + \|\mu_v^{(\ell-1)}(\mathcal{W}) - \mu_v^{(\ell-1)}(\tilde{\mathcal{W}})\|\right) \tag{22}$$
$$\le 2\sqrt{2}M^{(\ell)}C\beta^{(\ell-1)}\,\mathrm{dist}(\mathcal{W}_{\phi_h^{\ell-1}},\tilde{\mathcal{W}}_{\phi_h^{\ell-1}}) + \sqrt{2}M^{(\ell)}\left(K_h^{(\ell-1)} + K_\mu^{(\ell-1)}\right)\mathrm{dist}_{\ell-1}(\mathcal{W},\tilde{\mathcal{W}})$$
$$\le \sqrt{2}M^{(\ell)}\left(2C\beta^{(\ell-1)} + K_h^{(\ell-1)} + K_\mu^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}).$$

Moving on to z:

$$\|z_v^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\tilde{\mathcal{W}})\|$$
$$= \Bigg\|z_v^{(\ell-1)}(\mathcal{W}) - z_v^{(\ell-1)}(\tilde{\mathcal{W}}) + \frac{1}{|\mathcal{J}(v)|}\sum_{u\in\mathcal{J}(v)}\Bigg(\frac{z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})}{\|z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})\| + \varepsilon}\,\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\mathcal{W}_{\phi_z^{\ell-1}}\big) - \frac{z_v^{(\ell-1)}(\tilde{\mathcal{W}}) - z_u^{(\ell-1)}(\tilde{\mathcal{W}})}{\|z_v^{(\ell-1)}(\tilde{\mathcal{W}}) - z_u^{(\ell-1)}(\tilde{\mathcal{W}})\| + \varepsilon}\,\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\Bigg)\Bigg\|$$
$$\le K_z^{(\ell-1)}\,\mathrm{dist}_{\ell-1}(\mathcal{W},\tilde{\mathcal{W}})$$
$$+ \frac{1}{|\mathcal{J}(v)|}\sum_{u\in\mathcal{J}(v)}\left\|\frac{z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})}{\|z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})\| + \varepsilon}\right\|\cdot\left\|\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\mathcal{W}_{\phi_z^{\ell-1}}\big) - \phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\right\|$$
$$+ \frac{1}{|\mathcal{J}(v)|}\sum_{u\in\mathcal{J}(v)}\left\|\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\right\|\cdot\left\|\frac{z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})}{\|z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})\| + \varepsilon} - \frac{z_v^{(\ell-1)}(\tilde{\mathcal{W}}) - z_u^{(\ell-1)}(\tilde{\mathcal{W}})}{\|z_v^{(\ell-1)}(\tilde{\mathcal{W}}) - z_u^{(\ell-1)}(\tilde{\mathcal{W}})\| + \varepsilon}\right\|.$$

Now, since (from the inductive assumption) $z_v^{(\ell-1)}$ and $z_u^{(\ell-1)}$ are both Lipschitz continuous with constant $K_z^{(\ell-1)}$, it follows that $z_v^{(\ell-1)} - z_u^{(\ell-1)}$ is Lipschitz continuous with constant $2K_z^{(\ell-1)}$. Therefore, using Lemma 4.2, we know that


$\dfrac{z_v^{(\ell-1)} - z_u^{(\ell-1)}}{\|z_v^{(\ell-1)} - z_u^{(\ell-1)}\| + \varepsilon}$ is Lipschitz continuous with constant $\dfrac{6K_z^{(\ell-1)}}{\varepsilon}$ and we can bound:

$$\left\|\frac{z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})}{\|z_v^{(\ell-1)}(\mathcal{W}) - z_u^{(\ell-1)}(\mathcal{W})\| + \varepsilon} - \frac{z_v^{(\ell-1)}(\tilde{\mathcal{W}}) - z_u^{(\ell-1)}(\tilde{\mathcal{W}})}{\|z_v^{(\ell-1)}(\tilde{\mathcal{W}}) - z_u^{(\ell-1)}(\tilde{\mathcal{W}})\| + \varepsilon}\right\| \le \frac{6K_z^{(\ell-1)}}{\varepsilon}\,\mathrm{dist}_{\ell-1}(\mathcal{W},\tilde{\mathcal{W}}),$$

and consequently:

$$\|z_v^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\tilde{\mathcal{W}})\| \le \left(K_z^{(\ell-1)} + \frac{6K_z^{(\ell-1)}}{\varepsilon}M^{(\ell)}\|\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}})\|\right)\mathrm{dist}_{\ell-1}(\mathcal{W},\tilde{\mathcal{W}})$$
$$+ \frac{1}{|\mathcal{J}(v)|}\sum_{u\in\mathcal{J}(v)}\left\|\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\mathcal{W}_{\phi_z^{\ell-1}}\big) - \phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\right\|. \tag{23}$$

To bound the last term we perform an additional telescoping step:

$$\left\|\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\mathcal{W}_{\phi_z^{\ell-1}}\big) - \phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\right\|$$
$$\le \left\|\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\mathcal{W}_{\phi_z^{\ell-1}}\big) - \phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\right\| + \left\|\phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\mathcal{W});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big) - \phi_z^{\ell-1}\big(\mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}\big)\right\| \tag{24}$$
$$\le M^{(\ell)}\|\mu_{u\to v}^{(\ell-1)}(\mathcal{W})\|\,\mathrm{dist}(\mathcal{W}_{\phi_z^{\ell-1}},\tilde{\mathcal{W}}_{\phi_z^{\ell-1}}) + M^{(\ell)}\|\mu_{u\to v}^{(\ell-1)}(\mathcal{W}) - \mu_{u\to v}^{(\ell-1)}(\tilde{\mathcal{W}})\|.$$

Now from Lemma 4.1 we have $\|\mu_{u\to v}^{(\ell-1)}(\mathcal{W})\| \le C\beta^{(\ell-1)}$ and, by applying (24) to (23), we obtain:

$$\|z_v^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\tilde{\mathcal{W}})\| \le \left(K_z^{(\ell-1)} + \frac{6K_z^{(\ell-1)}}{\varepsilon}M^{(\ell)}C\beta^{(\ell-1)} + M^{(\ell)}C\beta^{(\ell-1)} + M^{(\ell)}K_\mu^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}})$$
$$\le M^{(\ell)}\left(\frac{6K_z^{(\ell-1)}}{\varepsilon}C\beta^{(\ell-1)} + K_\mu^{(\ell-1)} + K_z^{(\ell-1)} + C\beta^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}).$$

Finally, for $\mu$ we will simplify notation with $\omega_{uv}^{(\ell)}(\mathcal{W}) := \big(h_u^{(\ell)}(\mathcal{W}), h_v^{(\ell)}(\mathcal{W}), \|z_u^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\mathcal{W})\|, a_{uv}\big)$:

$$\left\|\mu_{u\to v}^{(\ell)}(\mathcal{W}) - \mu_{u\to v}^{(\ell)}(\tilde{\mathcal{W}})\right\| = \left\|\phi_\mu^\ell\big(\omega_{uv}^{(\ell)}(\mathcal{W});\mathcal{W}_{\phi_\mu^\ell}\big) - \phi_\mu^\ell\big(\omega_{uv}^{(\ell)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_\mu^\ell}\big)\right\|$$
$$\le \left\|\phi_\mu^\ell\big(\omega_{uv}^{(\ell)}(\mathcal{W});\mathcal{W}_{\phi_\mu^\ell}\big) - \phi_\mu^\ell\big(\omega_{uv}^{(\ell)}(\mathcal{W});\tilde{\mathcal{W}}_{\phi_\mu^\ell}\big)\right\| + \left\|\phi_\mu^\ell\big(\omega_{uv}^{(\ell)}(\mathcal{W});\tilde{\mathcal{W}}_{\phi_\mu^\ell}\big) - \phi_\mu^\ell\big(\omega_{uv}^{(\ell)}(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_\mu^\ell}\big)\right\|$$
$$\le M^{(\ell)}\|\omega_{uv}^{(\ell)}(\mathcal{W})\|\,\mathrm{dist}(\mathcal{W}_{\phi_\mu^\ell},\tilde{\mathcal{W}}_{\phi_\mu^\ell}) + M^{(\ell)}\|\omega_{uv}^{(\ell)}(\mathcal{W}) - \omega_{uv}^{(\ell)}(\tilde{\mathcal{W}})\|.$$

We will now determine bounds for $\|\omega_{uv}^{(\ell)}(\mathcal{W})\|$ and $\|\omega_{uv}^{(\ell)}(\mathcal{W}) - \omega_{uv}^{(\ell)}(\tilde{\mathcal{W}})\|$. Starting with the first:

$$\|\omega_{uv}^{(\ell)}(\mathcal{W})\| \le 2\left(\|h_u^{(\ell)}(\mathcal{W})\| + \|h_v^{(\ell)}(\mathcal{W})\| + \|z_u^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\mathcal{W})\| + \|a_{uv}\|\right) \le 8C\beta^{(\ell)} = 160D\left(M^{(\ell)}\right)^2 C\beta^{(\ell-1)} \tag{25}$$


and the second:

$$\|\omega_{uv}^{(\ell)}(\mathcal{W}) - \omega_{uv}^{(\ell)}(\tilde{\mathcal{W}})\|$$
$$\le 2\left(\|h_u^{(\ell)}(\mathcal{W}) - h_u^{(\ell)}(\tilde{\mathcal{W}})\| + \|h_v^{(\ell)}(\mathcal{W}) - h_v^{(\ell)}(\tilde{\mathcal{W}})\|\right) + 2\,\Big|\|z_u^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\mathcal{W})\| - \|z_u^{(\ell)}(\tilde{\mathcal{W}}) - z_v^{(\ell)}(\tilde{\mathcal{W}})\|\Big|$$
$$\le 2\left(\|h_u^{(\ell)}(\mathcal{W}) - h_u^{(\ell)}(\tilde{\mathcal{W}})\| + \|h_v^{(\ell)}(\mathcal{W}) - h_v^{(\ell)}(\tilde{\mathcal{W}})\| + \|z_u^{(\ell)}(\mathcal{W}) - z_u^{(\ell)}(\tilde{\mathcal{W}})\| + \|z_v^{(\ell)}(\mathcal{W}) - z_v^{(\ell)}(\tilde{\mathcal{W}})\|\right)$$
$$\le 4\sqrt{2}M^{(\ell)}\left(2C\beta^{(\ell-1)} + K_h^{(\ell-1)} + K_\mu^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}) + 4M^{(\ell)}\left(\frac{6K_z^{(\ell-1)}}{\varepsilon}C\beta^{(\ell-1)} + K_\mu^{(\ell-1)} + K_z^{(\ell-1)} + C\beta^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}})$$
$$\le 4M^{(\ell)}\left(4C\beta^{(\ell-1)} + 2K_h^{(\ell-1)} + 3K_\mu^{(\ell-1)} + K_z^{(\ell-1)} + \frac{6K_z^{(\ell-1)}}{\varepsilon}C\beta^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}})$$

and therefore:

$$\left\|\mu_{u\to v}^{(\ell)}(\mathcal{W}) - \mu_{u\to v}^{(\ell)}(\tilde{\mathcal{W}})\right\| \le 160D\left(M^{(\ell)}\right)^3 C\beta^{(\ell-1)}\,\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}) + 4\left(M^{(\ell)}\right)^2\left(4C\beta^{(\ell-1)} + 2K_h^{(\ell-1)} + 3K_\mu^{(\ell-1)} + K_z^{(\ell-1)} + \frac{6K_z^{(\ell-1)}}{\varepsilon}C\beta^{(\ell-1)}\right)\mathrm{dist}_\ell(\mathcal{W},\tilde{\mathcal{W}}). \tag{26}$$

In summary, we have the following recursive relationships:

$$K_h^{(\ell)} = \sqrt{2}M^{(\ell)}\left(2C\beta^{(\ell-1)} + K_h^{(\ell-1)} + K_\mu^{(\ell-1)}\right)$$
$$K_z^{(\ell)} = M^{(\ell)}\left(\frac{6K_z^{(\ell-1)}}{\varepsilon}C\beta^{(\ell-1)} + K_\mu^{(\ell-1)} + K_z^{(\ell-1)} + C\beta^{(\ell-1)}\right) \tag{27}$$
$$\frac{1}{D}K_\mu^{(\ell)} = 160D\left(M^{(\ell)}\right)^3 C\beta^{(\ell-1)} + 4\left(M^{(\ell)}\right)^2\left(4C\beta^{(\ell-1)} + 2K_h^{(\ell-1)} + 3K_\mu^{(\ell-1)} + K_z^{(\ell-1)} + \frac{6K_z^{(\ell-1)}}{\varepsilon}C\beta^{(\ell-1)}\right).$$

We now determine the growth rates of the constants. Specifically, we will now show by induction that

$$K_h^{(\ell)} \le CQ^\ell B^{(\ell)}, \qquad K_\mu^{(\ell)} \le CQ^\ell B^{(\ell)}, \qquad K_z^{(\ell)} \le CQ^\ell B^{(\ell)}, \tag{28}$$

where $Q = 224D^2\max\{\frac{C}{\varepsilon}, 1\}$ and $B^{(\ell)} = \left(\prod_{i=0}^{\ell} M^{(i)}\right)^3\prod_{i=0}^{\ell-1}\beta^{(i)}$, with the convention that an empty product is equal to 1.
To show the base case, we recall that $K_h^{(0)} = K_z^{(0)} = 0$ and $K_\mu^{(0)} = CM^{(0)}$.


Now, assuming that (28) holds for all $\ell' < \ell$, we will show that it also holds for $\ell$. Using the fact that $M^{(\ell)} \ge 1$, $\beta^{(\ell)} \ge 1$ and $B^{(\ell)} \ge 1$ for all $\ell$, we obtain:

$$K_h^{(\ell)} \le \sqrt{2}M^{(\ell)}\left(2C\beta^{(\ell-1)} + 2CQ^{\ell-1}B^{(\ell-1)}\right) \le 4\sqrt{2}\,CQ^{\ell-1}\left(M^{(\ell)}\right)^3\beta^{(\ell-1)}B^{(\ell-1)} \le 4\sqrt{2}\max\left\{\frac{C}{\varepsilon},1\right\}CQ^{\ell-1}B^{(\ell)} \le CQ^\ell B^{(\ell)}$$

$$K_z^{(\ell)} \le M^{(\ell)}\left(C\beta^{(\ell-1)} + 2CQ^{\ell-1}B^{(\ell-1)} + \frac{6}{\varepsilon}CQ^{\ell-1}B^{(\ell-1)}C\beta^{(\ell-1)}\right) \le \left(1 + 2 + \frac{6C}{\varepsilon}\right)CQ^{\ell-1}\left(M^{(\ell)}\right)^3\beta^{(\ell-1)}B^{(\ell-1)} \le 9\max\left\{\frac{C}{\varepsilon},1\right\}CQ^{\ell-1}B^{(\ell)} \le CQ^\ell B^{(\ell)} \tag{29}$$

$$K_\mu^{(\ell)} \le 160D^2\left(M^{(\ell)}\right)^3 C\beta^{(\ell-1)} + 4D\left(M^{(\ell)}\right)^2\left(4C\beta^{(\ell-1)} + 6CQ^{\ell-1}B^{(\ell-1)} + \frac{6}{\varepsilon}C\beta^{(\ell-1)}CQ^{\ell-1}B^{(\ell-1)}\right)$$
$$\le \left(160D^2 + 16D + 24D + \frac{24DC}{\varepsilon}\right)CQ^{\ell-1}\left(M^{(\ell)}\right)^3\beta^{(\ell-1)}B^{(\ell-1)} \le \left(160D^2 + 64D\right)\max\left\{\frac{C}{\varepsilon},1\right\}CQ^{\ell-1}B^{(\ell)} \le CQ^\ell B^{(\ell)}.$$

F. Proof of Lemma 4.4


In this section, we prove Lemma 4.4. We recall it for completeness:
Lemma 4.4 (Lipschitz continuity w.r.t. parameters of the scoring model). Let $\mathcal{W}^{(g)} = (\mathcal{W}, \mathcal{W}_{\phi_{\text{out}}})$ denote the parameters of the scoring model $g$ as defined in Equation (6), with $\mathcal{W}$ as defined in Lemma 4.3, $\mathcal{W}_{\phi_{\text{out}}} = \{W_i^{\phi_{\text{out}}}\}_{i=1}^{L_{\text{out}}}$, and $g(G;\mathcal{W}^{(g)})$ the output of the scoring model with parameters $\mathcal{W}^{(g)}$. Then, $g$ is Lipschitz continuous w.r.t. the model parameters:

$$\|g(G;\mathcal{W}^{(g)}) - g(G;\tilde{\mathcal{W}}^{(g)})\| \le K_g\,\mathrm{dist}(\mathcal{W}^{(g)},\tilde{\mathcal{W}}^{(g)}),$$

where

$$\mathrm{dist}(\mathcal{W}^{(g)},\tilde{\mathcal{W}}^{(g)}) = \mathrm{dist}_{L_{\text{egnn}}}(\mathcal{W},\tilde{\mathcal{W}}) + \sum_{i=1}^{L_{\text{out}}}\|W_i^{\phi_{\text{out}}} - \tilde{W}_i^{\phi_{\text{out}}}\|_2,$$

$C$, $Q$, $B^{(\cdot)}$ are as defined in Lemma 4.3, and

$$K_g = 2\left(\prod_{i=1}^{L_{\text{out}}} K_\psi\beta_{i,\phi_{\text{out}}}\right)CQ^{L_{\text{egnn}}}B^{(L_{\text{egnn}})}.$$

Proof. In the proof we assume $\|\cdot\| \equiv \|\cdot\|_2$. From Equation (5), $h_G = \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} h_v^{(L_{\text{egnn}})}$, and therefore $h_G$ retains the Lipschitz continuity of each individual $h_v^{(L_{\text{egnn}})}$ with the same constant. It holds that:

$$\|g(G;\mathcal{W}^{(g)}) - g(G;\tilde{\mathcal{W}}^{(g)})\| = \|\phi_{\text{out}}(h_G(\mathcal{W});\mathcal{W}_{\phi_{\text{out}}}) - \phi_{\text{out}}(h_G(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_{\text{out}}})\|$$
$$\le \|\phi_{\text{out}}(h_G(\mathcal{W});\mathcal{W}_{\phi_{\text{out}}}) - \phi_{\text{out}}(h_G(\mathcal{W});\tilde{\mathcal{W}}_{\phi_{\text{out}}})\| + \|\phi_{\text{out}}(h_G(\mathcal{W});\tilde{\mathcal{W}}_{\phi_{\text{out}}}) - \phi_{\text{out}}(h_G(\tilde{\mathcal{W}});\tilde{\mathcal{W}}_{\phi_{\text{out}}})\|$$
$$\le \left(\prod_{i=1}^{L_{\text{out}}}K_\psi\beta_{i,\phi_{\text{out}}}\right)\|h_G(\mathcal{W})\|\,\mathrm{dist}(\mathcal{W}_{\phi_{\text{out}}},\tilde{\mathcal{W}}_{\phi_{\text{out}}}) + \left(\prod_{i=1}^{L_{\text{out}}}K_\psi\beta_{i,\phi_{\text{out}}}\right)\|h_G(\mathcal{W}) - h_G(\tilde{\mathcal{W}})\|$$
$$\le \left(\prod_{i=1}^{L_{\text{out}}}K_\psi\beta_{i,\phi_{\text{out}}}\right)C\left(\beta^{(L_{\text{egnn}})} + Q^{L_{\text{egnn}}}B^{(L_{\text{egnn}})}\right)\mathrm{dist}(\mathcal{W}^{(g)},\tilde{\mathcal{W}}^{(g)}) \le 2\left(\prod_{i=1}^{L_{\text{out}}}K_\psi\beta_{i,\phi_{\text{out}}}\right)CQ^{L_{\text{egnn}}}B^{(L_{\text{egnn}})}\,\mathrm{dist}(\mathcal{W}^{(g)},\tilde{\mathcal{W}}^{(g)}).$$


G. Proof of Proposition 4.1


In this section, we prove our main result, i.e. Proposition 4.1. We begin with a useful lemma:
Lemma G.1 (Weight matrices' coverings yield a function class covering). Let $\mathcal{F} = \{f_{W_1,\ldots,W_n} : W_i \in \mathcal{W}_i\}$ be a class of functions parametrized with $n$ weight matrices. Suppose further that for all $W_i$, $\tilde{W}_i$:

$$\|f_{W_1,\ldots,W_n} - f_{\tilde{W}_1,\ldots,\tilde{W}_n}\| = \sup_{x\in\mathcal{X}}\|f_{W_1,\ldots,W_n}(x) - f_{\tilde{W}_1,\ldots,\tilde{W}_n}(x)\| \le K\left(\|W_1 - \tilde{W}_1\|_F + \cdots + \|W_n - \tilde{W}_n\|_F\right).$$

Then

$$N(\mathcal{F},\varepsilon,\|\cdot\|_\infty) \le \prod_{i=1}^{n} N\left(\mathcal{W}_i, \frac{\varepsilon}{nK}, \|\cdot\|_F\right).$$

Proof. Let $C_i$ be an $\frac{\varepsilon}{nK}$-covering of $\mathcal{W}_i$ such that $|C_i| = N\left(\mathcal{W}_i, \frac{\varepsilon}{nK}, \|\cdot\|_F\right)$ and define

$$C = \{f_{W_1,\ldots,W_n} : W_i \in C_i\}.$$

Let $f_{W_1,\ldots,W_n} \in \mathcal{F}$. For all $i$ we can choose $\hat{W}_i \in C_i$ such that $\|W_i - \hat{W}_i\|_F \le \frac{\varepsilon}{nK}$. Therefore

$$\|f_{W_1,\ldots,W_n} - f_{\hat{W}_1,\ldots,\hat{W}_n}\|_\infty \le K\left(\|W_1 - \hat{W}_1\|_F + \cdots + \|W_n - \hat{W}_n\|_F\right) \le \varepsilon$$

and $f_{\hat{W}_1,\ldots,\hat{W}_n} \in C$. Hence, $C$ is an $\varepsilon$-covering of $\mathcal{F}$ w.r.t. $\|\cdot\|_\infty$ and

$$N(\mathcal{F},\varepsilon,\|\cdot\|_\infty) \le |C| = \prod_{i=1}^{n}|C_i| = \prod_{i=1}^{n} N\left(\mathcal{W}_i, \frac{\varepsilon}{nK}, \|\cdot\|_F\right).$$

We now move on to proving Proposition 4.1:


Proposition 4.1 (Generalization bound of the scoring model). Let $\mathcal{G}$ be the space of geometric graphs as defined in Section 3.1, $\mathcal{Y} = [0,1]$ the space of labels, $g : \mathcal{G} \to \mathbb{R}$ the scoring model as defined in (6), and $\mathcal{L}(\hat{y}, y) = \min\left\{(\hat{y} - y)^2, 1\right\}$ the loss function. Then for any $\delta > 0$, with probability at least $1 - \delta$ over choosing a sample $S \sim \mathcal{D}^m$ of size $m$ from a distribution $\mathcal{D}$ over $\mathcal{G}\times\mathcal{Y}$, the following holds:

$$R_{S,\mathcal{L}}(g) = O\left(\frac{d\sqrt{L}}{\sqrt{m}}\sqrt{\Delta} + \sqrt{\frac{\log\frac{2}{\delta}}{m}}\right), \tag{16}$$

where $L = 3L_\phi L_{\text{egnn}} + L_{\text{out}}$ is the total number of weight matrices, $d$ is the maximum width across all layers, and

$$\Delta = \sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\gamma_{i,\phi} + \sum_{i=1}^{L_{\text{out}}}\gamma_{i,\phi_{\text{out}}}$$
$$\gamma_{i,\phi} = w_{i,\phi}\log\left(\Gamma\cdot\kappa(W_i^\phi)\right)$$
$$w_{i,\phi} = \begin{cases} 1 & \text{for } \phi = \phi_{\text{out}} \\ L_{\text{egnn}} - \ell + 1 & \text{for } \phi\in\{\phi_h^{\ell},\phi_z^{\ell},\phi_\mu^{\ell}\}\end{cases}$$
$$\Gamma = \frac{dmLD\beta K_\psi}{\hat{\varepsilon}}.$$

Proof. From Theorem 3.1 we can see that in order to bound $R_{S,\mathcal{L}}$ it suffices to find a bound for $\widehat{\mathcal{R}}_S(\mathcal{F}_\mathcal{L})$, where

$$\mathcal{F}_\mathcal{L} = \{(x,y)\mapsto\mathcal{L}(f(x),y) : f\in\mathcal{F}\}$$

and

$$\mathcal{F} = \left\{g(\cdot\,;\mathcal{W}^{(g)}) : \forall\ell\ \forall\phi\in\{\phi_h^\ell,\phi_z^\ell,\phi_\mu^\ell,\phi_{\text{out}}\}\ \forall i\ \|W_i^\phi\|\le\beta_{i,\phi}\right\}$$

is the family of scoring models with bounded spectral norm of the weight matrices. The primary tool that we will use for bounding $\widehat{\mathcal{R}}_S(\mathcal{F}_\mathcal{L})$ is Lemma 3.1. However, to use it directly, the assumption $f_0\in\mathcal{F}_\mathcal{L}$ would need to be satisfied, which is very restrictive as it assumes the existence of a perfect classifier in the model class. Instead, we observe that

$$\mathcal{F}_\mathcal{L} = \{(x,y)\mapsto 1 - \hat{f}(x,y) : \hat{f}\in\hat{\mathcal{F}}_\mathcal{L}\}, \tag{30}$$

where

$$\hat{\mathcal{F}}_\mathcal{L} = \{(x,y)\mapsto 1 - \mathcal{L}(f(x),y) : f\in\mathcal{F}\}.$$

It may seem tautological, but now there exists $\hat{f}_0\in\hat{\mathcal{F}}_\mathcal{L}$, because we can take $f_0\equiv -1$ and, since $y\in[0,1]$, it holds that $\hat{f}_0(x,y) := 1 - \mathcal{L}(f_0(x),y) = 0$ for all $(x,y)$. We can thus use Lemma 3.1 for $\hat{\mathcal{F}}_\mathcal{L}$ and obtain

$$\widehat{\mathcal{R}}_S(\hat{\mathcal{F}}_\mathcal{L}) \le \inf_{\alpha>0}\left(\frac{4\alpha}{\sqrt{m}} + \frac{12}{m}\int_\alpha^{2\sqrt{m}}\sqrt{\log N(\hat{\mathcal{F}}_\mathcal{L}, r, \|\cdot\|_\infty)}\,dr\right).$$

Now since $(x,y)\mapsto 1 - \mathcal{L}(x,y)$ is 2-Lipschitz in its first argument, an $r/2$-covering of $\mathcal{F}$ yields an $r$-covering of $\hat{\mathcal{F}}_\mathcal{L}$ and therefore

$$\log N(\hat{\mathcal{F}}_\mathcal{L}, r, \|\cdot\|_\infty) \le \log N\left(\mathcal{F}, \frac{r}{2}, \|\cdot\|_\infty\right).$$

Furthermore, from Equation (30), one can easily check that $\widehat{\mathcal{R}}_S(\mathcal{F}_\mathcal{L}) = \widehat{\mathcal{R}}_S(\hat{\mathcal{F}}_\mathcal{L})$, and in summary it holds that

$$\widehat{\mathcal{R}}_S(\mathcal{F}_\mathcal{L}) \le \inf_{\alpha>0}\left(\frac{4\alpha}{\sqrt{m}} + \frac{12}{m}\int_\alpha^{2\sqrt{m}}\sqrt{\log N\left(\mathcal{F}, \frac{r}{2}, \|\cdot\|_\infty\right)}\,dr\right).$$

However, since $N(\mathcal{F}, r, \|\cdot\|_\infty)$ is a decreasing function of $r$, we see that

$$\int_\alpha^{2\sqrt{m}}\sqrt{\log N\left(\mathcal{F},\frac{r}{2},\|\cdot\|_\infty\right)}\,dr \le \int_\alpha^{2\sqrt{m}}\sqrt{\log N\left(\mathcal{F},\frac{\alpha}{2},\|\cdot\|_\infty\right)}\,dr \le \int_0^{2\sqrt{m}}\sqrt{\log N\left(\mathcal{F},\frac{\alpha}{2},\|\cdot\|_\infty\right)}\,dr = 2\sqrt{m}\sqrt{\log N\left(\mathcal{F},\frac{\alpha}{2},\|\cdot\|_\infty\right)}.$$

By choosing $\alpha = \frac{1}{\sqrt{m}}$, we obtain a simplified bound

$$\widehat{\mathcal{R}}_S(\mathcal{F}_\mathcal{L}) \le \frac{4}{m} + \frac{24}{\sqrt{m}}\sqrt{\log N\left(\mathcal{F},\frac{1}{2\sqrt{m}},\|\cdot\|_\infty\right)}. \tag{31}$$
 
We now move on to finding a bound for $\log N\left(\mathcal{F},\frac{1}{2\sqrt{m}},\|\cdot\|_\infty\right)$. From Lemma 4.4:

$$\|g(\cdot\,;\mathcal{W}^{(g)}) - g(\cdot\,;\tilde{\mathcal{W}}^{(g)})\|_\infty = \sup_{G}\|g(G;\mathcal{W}^{(g)}) - g(G;\tilde{\mathcal{W}}^{(g)})\|_2 \le K_g\,\mathrm{dist}(\mathcal{W}^{(g)},\tilde{\mathcal{W}}^{(g)})$$
$$= K_g\left(\sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\|W_i^\phi - \tilde{W}_i^\phi\|_2 + \sum_{i=1}^{L_{\text{out}}}\|W_i^{\phi_{\text{out}}} - \tilde{W}_i^{\phi_{\text{out}}}\|_2\right)$$
$$\le K_g\left(\sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\|W_i^\phi - \tilde{W}_i^\phi\|_F + \sum_{i=1}^{L_{\text{out}}}\|W_i^{\phi_{\text{out}}} - \tilde{W}_i^{\phi_{\text{out}}}\|_F\right),$$

where $K_g = 2\left(\prod_{i=1}^{L_{\text{out}}} K_\psi\beta_{i,\phi_{\text{out}}}\right)CQ^{L_{\text{egnn}}}B^{(L_{\text{egnn}})}$ and the last inequality comes from the relationship between the spectral and Frobenius norms. Thus, for $L = 3L_{\text{egnn}}L_\phi + L_{\text{out}}$ (the total number of weight matrices), we can use Lemma G.1, i.e., it suffices to find an $\frac{r}{LK_g}$-covering for each weight matrix $W_i^\phi$, and their Cartesian product yields an $r$-covering of $\mathcal{F}$:

$$\log N(\mathcal{F}, r, \|\cdot\|_\infty) \le \sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\log N\left(\mathcal{W}_i^\phi, \frac{r}{LK_g}, \|\cdot\|_F\right) + \sum_{i=1}^{L_{\text{out}}}\log N\left(\mathcal{W}_i^{\phi_{\text{out}}}, \frac{r}{LK_g}, \|\cdot\|_F\right).$$

Now, since the spectral norm of the weight matrices is bounded, we can use Lemma 3.2 and obtain

$$\log N\left(\mathcal{W}_i^\phi, \frac{r}{LK_g}, \|\cdot\|_F\right) \le d^2\log\left(1 + \frac{2\sqrt{d}\,LK_g\beta_{i,\phi}}{r}\right),$$

where $d = \max_{\ell}\max_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell},\phi_{\text{out}}\}}\max_{i}\dim(W_i^\phi)$ is the width of the scoring model and $\dim(W) = \max(d_1, d_2)$ for $W\in\mathbb{R}^{d_1\times d_2}$. Choosing $r = \frac{1}{2\sqrt{m}}$ yields

$$\log N\left(\mathcal{F},\frac{1}{2\sqrt{m}},\|\cdot\|_\infty\right) \le \sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi} d^2\log\left(1 + 4\sqrt{dm}\,LK_g\beta_{i,\phi}\right) + \sum_{i=1}^{L_{\text{out}}} d^2\log\left(1 + 4\sqrt{dm}\,LK_g\beta_{i,\phi_{\text{out}}}\right)$$
$$\le d^2\left(\sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\log\left(5\sqrt{dm}\,LK_g\beta_{i,\phi}\right) + \sum_{i=1}^{L_{\text{out}}}\log\left(5\sqrt{dm}\,LK_g\beta_{i,\phi_{\text{out}}}\right)\right) \tag{32}$$
and plugging into (31) gives

$$\widehat{\mathcal{R}}_S(\mathcal{F}_\mathcal{L}) \le \frac{4}{m} + \frac{24d}{\sqrt{m}}\sqrt{\Sigma}, \qquad \Sigma := \sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\log\left(5\sqrt{dm}\,LK_g\beta_{i,\phi}\right) + \sum_{i=1}^{L_{\text{out}}}\log\left(5\sqrt{dm}\,LK_g\beta_{i,\phi_{\text{out}}}\right). \tag{33}$$

We now proceed to simplify the bound. Note that

$$\Sigma = L\log(5\sqrt{dm}\,L) + L\log(K_g) + \sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\log(\beta_{i,\phi}) + \sum_{i=1}^{L_{\text{out}}}\log(\beta_{i,\phi_{\text{out}}})$$

and

$$\log(K_g) = \log(2) + \sum_{i=1}^{L_{\text{out}}}\log(K_\psi\beta_{i,\phi_{\text{out}}}) + \log(C) + L_{\text{egnn}}\log(Q) + \log\left(B^{(L_{\text{egnn}})}\right)$$
$$= \log(16) + \log(D\beta) + \sum_{i=1}^{L_{\text{out}}}\log(K_\psi\beta_{i,\phi_{\text{out}}}) + L_{\text{egnn}}\log\left(224D^2\max\left\{\frac{8D\beta}{\varepsilon},1\right\}\right) + 3\sum_{\ell=0}^{L_{\text{egnn}}}\log\left(M^{(\ell)}\right) + \sum_{\ell=0}^{L_{\text{egnn}}-1}\log\left((20D)^\ell\left(\prod_{i=0}^{\ell}M^{(i)}\right)^2\right)$$
$$\le C_1\frac{L_{\text{egnn}}(L_{\text{egnn}}-1)}{2}\log\left(\frac{D\beta}{\hat{\varepsilon}}\right) + \sum_{i=1}^{L_{\text{out}}}\log(K_\psi\beta_{i,\phi_{\text{out}}}) + 3\sum_{\ell=0}^{L_{\text{egnn}}}\sum_{i=0}^{\ell}\log\left(M^{(i)}\right),$$


where C1 does not depend on the parameters of the model and ε̂ = min{1, ε}.
Since $M^{(\ell)} = \max_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\prod_{i=1}^{L_\phi}K_\psi\beta_{i,\phi}$, we have

$$\log\left(M^{(i)}\right) \le \sum_{\phi\in\{\phi_h^{i-1},\phi_z^{i-1},\phi_\mu^{i}\}}\ \sum_{j=1}^{L_\phi}\log(K_\psi\beta_{j,\phi})$$

and

$$\log(K_g) \le C_2\frac{L_{\text{egnn}}(L_{\text{egnn}}-1)}{2}\log\left(\frac{D\beta}{\hat{\varepsilon}}\right) + \sum_{i=1}^{L_{\text{out}}}\log(K_\psi\beta_{i,\phi_{\text{out}}}) + 3\sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\left(L_{\text{egnn}} - \ell + 1\right)\log(K_\psi\beta_{i,\phi}).$$

In summary:

$$\Sigma \le C_3 L\left(\sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi} w_{i,\phi}\log\left(\Gamma\beta_{i,\phi}\right) + \sum_{i=1}^{L_{\text{out}}} w_{i,\phi_{\text{out}}}\log\left(\Gamma\beta_{i,\phi_{\text{out}}}\right)\right),$$

where

$$w_{i,\phi} = \begin{cases} 1 & \text{for } \phi = \phi_{\text{out}} \\ L_{\text{egnn}} - \ell + 1 & \text{for } \phi\in\{\phi_h^{\ell},\phi_z^{\ell},\phi_\mu^{\ell}\}\end{cases}, \qquad \Gamma = \frac{dmLD\beta K_\psi}{\hat{\varepsilon}}.$$
We can now use Equation (33) and Theorem 3.1 to obtain the bound for the generalization error of the scoring model $g$ with probability at least $1-\delta$:

$$R_{S,\mathcal{L}}(g) \le \frac{8}{m} + \frac{48d}{\sqrt{m}}\sqrt{\Sigma} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Furthermore, for a given $g(\cdot\,;\mathcal{W})$, we observe that the stretching factor as defined in Equation (15) can serve as the bound, i.e., we can set $\beta_{i,\phi} = \kappa(W_i^\phi)$ and get

$$R_{S,\mathcal{L}}(g) \le \frac{8}{m} + C_3\frac{48d\sqrt{L}}{\sqrt{m}}\sqrt{\sum_{\ell=0}^{L_{\text{egnn}}}\ \sum_{\phi\in\{\phi_h^{\ell-1},\phi_z^{\ell-1},\phi_\mu^{\ell}\}}\ \sum_{i=1}^{L_\phi}\gamma_{i,\phi} + \sum_{i=1}^{L_{\text{out}}}\gamma_{i,\phi_{\text{out}}}} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$

where

$$L = 3L_\phi L_{\text{egnn}} + L_{\text{out}}$$
$$\gamma_{i,\phi} = w_{i,\phi}\log\left(\Gamma\,\kappa(W_i^\phi)\right)$$
$$\Gamma = \frac{dmLD\beta K_\psi}{\hat{\varepsilon}}$$
$$w_{i,\phi} = \begin{cases} 1 & \text{for } \phi = \phi_{\text{out}} \\ L_{\text{egnn}} - \ell + 1 & \text{for } \phi\in\{\phi_h^{\ell},\phi_z^{\ell},\phi_\mu^{\ell}\}\end{cases}$$
$$\kappa(W_i^\phi) = \max\{1, \|W_i^\phi\|_2\}$$
$$d = \max_{i,\phi}\dim(W_i^\phi)$$
$$\dim(W) = \max(d_1, d_2)\ \text{for}\ W\in\mathbb{R}^{d_1\times d_2}.$$


H. Logarithmic dependence on spectral norms


In Section 5 we discuss how our bound differs from existing ones, specifically the logarithmic dependence on the norm of the weight matrices. Here we provide more details on where this difference stems from. The logarithmic dependence comes from Lemma 3.2 (Lemma 8 in (Chen et al., 2020)), which states:

$$\log N\left(\left\{A\in\mathbb{R}^{d_1\times d_2} : \|A\|_2\le\lambda\right\}, \varepsilon, \|\cdot\|_F\right) \le d_1 d_2\log\left(1 + \frac{2\sqrt{\min\{d_1,d_2\}}\,\lambda}{\varepsilon}\right).$$

This is in contrast with perhaps a more commonly used result (Lemma 10 in (Chen et al., 2020), a special case of Lemma 3.2 in (Bartlett et al., 2017)), which states:

$$\log N\left(\left\{A\in\mathbb{R}^{d_1\times d_2} : \|A\|_{2,1}\le\lambda\right\}, \varepsilon, \|\cdot\|_2\right) \le \frac{\lambda^2}{\varepsilon^2}\log(2d_1 d_2).$$
The former yields a logarithmic dependence on the norm, while the latter produces a multiplicative one. However, the two are not directly comparable. Firstly, there is a tradeoff between the dependence on the norm and on the width of the network $d = \max\{d_1, d_2\}$: the former is logarithmic in the norm but quadratic in the width, and it is the other way around for the latter. In particular, for large values of $\lambda$ and small values of $d$, the former may be preferred, whereas the latter would be better for small values of $\lambda$ and large values of $d$.
Furthermore, the two lemmas bound the covering numbers of different sets of matrices under different metrics: the former assumes that the spectral norm is bounded and gives a covering bound w.r.t. the Frobenius norm, while the latter assumes that the $\|\cdot\|_{2,1}$ norm is bounded and gives a covering bound w.r.t. the spectral norm, which makes a direct comparison difficult.
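Purely as an illustration of the scalings (treating $\lambda$ as the same numerical value in both bounds, which glosses over the fact that they constrain different norms), plugging in concrete numbers gives, using natural logarithms:

$$d_1 = d_2 = 8,\ \lambda = 10^3,\ \varepsilon = 1:\quad d_1 d_2\log\left(1 + 2\sqrt{8}\cdot 10^3\right)\approx 5.5\times 10^2 \quad\text{vs.}\quad \frac{\lambda^2}{\varepsilon^2}\log(2d_1 d_2)\approx 4.9\times 10^6,$$
$$d_1 = d_2 = 1024,\ \lambda = 2,\ \varepsilon = 1:\quad d_1 d_2\log\left(1 + 2\cdot 32\cdot 2\right)\approx 5.1\times 10^6 \quad\text{vs.}\quad \frac{\lambda^2}{\varepsilon^2}\log(2d_1 d_2)\approx 58,$$

so the spectral-norm-based bound is much smaller in the large-norm, small-width regime and much larger in the small-norm, large-width regime, as stated above.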

I. Impact of ε-normalization
In Section 5 we argue that ε-normalization plays an important role in the derivation of the generalization bound. Here, we provide more details to support this claim. Specifically, suppose we follow the same reasoning as in Lemmas 4.1-4.4 for the unnormalized EGNN model, i.e., $\gamma(z_v^{(\ell)}, z_u^{(\ell)}, \varepsilon) \equiv 1$ in Equation (2). Assume further that for all $i, \phi$, it holds that $\|W_i^\phi\|_2 \le \beta_\phi$.
Consider the bound $\beta^{(\ell)}$ on the $\ell$-layer EGNN embeddings, i.e., $\|h_v^{(\ell)}\|_2 \le \beta^{(\ell)}$. From Lemma 4.1, we know that $\beta^{(\ell)} = O(C_1^\ell)$ for some $C_1 > 1$ in the ε-normalized EGNN model. However, from Equation (19), we can see that in the unnormalized model $\beta^{(\ell)}$ would satisfy the following recursive relationship:

$$\beta^{(\ell)} = 20D\left(M^{(\ell)}\beta^{(\ell-1)}\right)^2,$$

which implies

$$\beta^{(\ell)} = O\left(C_2^{2^\ell}\right)$$

for some $C_2 > 1$. The super-exponential growth of $\beta^{(\ell)}$ leads to a super-exponential Lipschitz constant of the unnormalized EGNN, and therefore an exponential dependence of the generalization gap on the number of EGNN layers.
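To spell out the last step, take logarithms of the recursion and write $M := \max_\ell M^{(\ell)}$ (an upper bound introduced only for this calculation):

$$\log\beta^{(\ell)} = \log(20D) + 2\log M^{(\ell)} + 2\log\beta^{(\ell-1)} \le 2\log\beta^{(\ell-1)} + \log\left(20DM^2\right),$$

so, unrolling the recursion,

$$\log\beta^{(\ell)} \le 2^\ell\log\beta^{(0)} + \left(2^\ell - 1\right)\log\left(20DM^2\right) \le 2^\ell\log\left(20DM^2\beta^{(0)}\right),$$

i.e., $\beta^{(\ell)} \le C_2^{2^\ell}$ with $C_2 = 20DM^2\beta^{(0)}$, as claimed.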

J. Implementation details
The QM9 dataset (Ramakrishnan et al., 2014) comprises small molecules with a maximum of 29 atoms in 3D space. Each atom is defined by a 3D position and a five-dimensional one-hot node embedding that represents the atom type, indicated as (H, C, N, O, F). The dataset has several different chemical properties (outputs), from which we arbitrarily chose four to assess the generalization of EGNNs. We report the results for the remaining properties in Appendix K.
We implement all models using PyTorch (Paszke et al., 2017) and train them for 1000 epochs using the Adam optimizer and the MSE loss function. We use a batch size of 96 and cosine decay for the learning rate starting at $10^{-3}$, except for the Mu (µ) property, where the initial value was set to $5\times 10^{-4}$. We run five independent trials with different seeds.
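A minimal sketch of this training configuration is given below. The model and data loader are placeholders, and the actual training script may differ in details (e.g., evaluation frequency or device handling).

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=1000, lr=1e-3, device="cuda"):
    # Adam + cosine learning-rate decay + MSE loss, as described above.
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.MSELoss()

    for _ in range(epochs):
        for batch, targets in train_loader:  # batch size 96 in our setup
            optimizer.zero_grad()
            preds = model(batch.to(device))
            loss = criterion(preds, targets.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()  # one cosine-decay step per epoch
    return model
```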
For the experiments in Figure 3, we use width d = 64 (for all layers) and Legnn = 5 for the experiments regarding the spectral norm, Legnn = 3 for the one regarding the width, and width d = 16 for assessing the generalization gap in terms of the number of layers. We apply ε-normalization with ε = 1. All internal MLPs (ϕh, ϕz, ϕµ, ϕout) have two layers with the SiLU activation function. Following the original repo, we use the identity activation function at the last layer of ϕz. For the experiments on the impact of ε-normalization, we use ε = 1 and d = 128 (width).
We note that in their original work, Satorras et al. (2021) do not update the positional features z over the layers in the experiments on QM9. In our experiments, we consider full EGNN models.
For the experiments in Table 3, we use width d = 16, ε-normalization (ε = 1), and internal MLPs with two layers (i.e.,
Lϕ = Lout = 2) with SiLU activation functions. We consider 2K samples for training, and full validation and test sets. The
statistics are obtained from three independent runs, with different seeds. We perform model selection for the regularization
factor λ ∈ {1, 1e-3, 1e-5, 1e-7} using the validation set.
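The regularizer we use is defined in the main text; as a rough sketch of the kind of penalty suggested by Proposition 4.1 (a weighted sum of logarithms of the spectral norms, mirroring the $\kappa(W_i^\phi) = \max\{1, \|W_i^\phi\|_2\}$ terms in the bound), one could add a term of the following form to the training loss. The helper below is illustrative only: the layer weights and the exact set of matrices included should follow the definition of $\gamma_{i,\phi}$ in the paper, and the paper's regularizer may differ.

```python
import torch
import torch.nn as nn

def log_spectral_penalty(model, weight_fn=None):
    """Sum of w(name) * log(max(1, ||W||_2)) over all linear layers.

    weight_fn (optional) maps a module name to a scalar weight, e.g. to
    emphasize early EGNN layers as suggested by the bound (L_egnn - l + 1).
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            sigma = torch.linalg.matrix_norm(module.weight, ord=2)  # spectral norm
            w = 1.0 if weight_fn is None else weight_fn(name)
            penalty = penalty + w * torch.log(torch.clamp(sigma, min=1.0))
    return penalty

# Usage inside the training loop (lambda_reg is the regularization factor):
# loss = criterion(preds, targets) + lambda_reg * log_spectral_penalty(model)
```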

K. Additional visualizations and experiments


Figure 5 and Figure 6 show the learning curves (for a single run) obtained using our spectral regularizer and SpecAvg, respectively. Here, we use Legnn = 5. Notably, our regularizer produces generalization gaps that decrease with the regularization factor λ. In contrast, SpecAvg shows a hard-threshold behavior: for λ ≠ 1, it has little effect; for λ = 1, it produces a very small generalization gap but a high test error.

[Figure 5: four panels (HOMO, ∆, µ, LUMO) plotting the generalization gap (y-axis) over training epochs (x-axis) for λ ∈ {1.0, 0.001, 1e-05, 1e-07, 0}.]

Figure 5. Generalization gap over training epochs for the regularization method proposed in this work. Circles at the end of training denote the final test MSE error.

[Figure 6: four panels (HOMO, ∆, µ, LUMO) plotting the generalization gap (y-axis) over training epochs (x-axis) for λ ∈ {1.0, 0.001, 1e-05, 1e-07, 0}.]

Figure 6. Generalization gap over training epochs for SpecAvg. Circles at the end of training denote the final test MSE error.

Time comparison. We note that regularization does not affect inference time as it only applies during training. The impact of the proposed method on training time is reported in Table 5. For the largest model (7 layers), the regularized version incurs a computational overhead of approximately 50%, i.e., the ratio of regularized to non-regularized time per epoch is around 1.5.

Additional QM9 properties. Table 6 reports results (from three independent runs) for the remaining eight QM9 tasks. For these experiments, we used width d = 32, cosine decay with an initial learning rate of $10^{-3}$, a batch size of 96, and 2000 epochs. Overall, we observe that SpecAvg and our regularizer achieve similar test MSE values, while our method obtains smaller generalization gaps.


Table 5. Time per epoch (in seconds).

Task    Legnn   None           Our Regularizer
alpha   3       1.37 ± 0.04    1.76 ± 0.29
alpha   5       1.52 ± 0.04    2.08 ± 0.21
alpha   7       1.64 ± 0.04    2.42 ± 0.06
Cv      3       1.38 ± 0.04    1.75 ± 0.32
Cv      5       1.44 ± 0.04    2.08 ± 0.11
Cv      7       1.58 ± 0.05    2.44 ± 0.08
G       3       1.37 ± 0.29    1.73 ± 0.05
G       5       1.46 ± 0.05    2.07 ± 0.10
G       7       1.57 ± 0.13    2.38 ± 0.10
H       3       1.32 ± 0.05    3.54 ± 0.09
H       5       1.44 ± 0.04    2.07 ± 0.04
H       7       1.59 ± 0.27    2.36 ± 0.05

Table 6. Test mean squared error (MSE) and generalization gap on additional QM9 tasks for different regularization methods. We denote the best-performing methods in bold. Results are multiplied by 100 for better visualization.

                Test MSE                                            Generalization gap
Task   Legnn   None           SpecAvg        Ours             None            SpecAvg         Ours
α      3       42.87 ± 0.99   42.72 ± 0.88   41.76 ± 1.11     29.29 ± 17.10   29.05 ± 16.87   16.46 ± 5.65
α      5       41.47 ± 0.92   41.31 ± 0.40   41.24 ± 1.30     34.79 ± 10.96   34.34 ± 11.37   34.41 ± 11.19
α      7       44.11 ± 1.97   43.10 ± 0.93   41.91 ± 2.27     38.98 ± 5.76    28.67 ± 14.88   18.78 ± 22.20
Cv     3       7.59 ± 1.38    7.10 ± 1.71    7.09 ± 1.63      7.04 ± 1.64     6.32 ± 2.36     6.49 ± 1.91
Cv     5       12.11 ± 1.00   11.32 ± 1.03   11.28 ± 1.21     12.09 ± 1.01    11.30 ± 1.02    10.03 ± 1.57
Cv     7       10.10 ± 2.28   10.24 ± 2.45   9.08 ± 0.51      10.08 ± 2.28    10.23 ± 2.45    9.04 ± 0.48
G      3       16.60 ± 7.35   15.71 ± 6.62   16.00 ± 6.08     15.17 ± 6.91    13.91 ± 5.81    10.74 ± 3.43
G      5       21.88 ± 9.19   20.63 ± 7.84   21.55 ± 5.24     21.57 ± 9.00    19.93 ± 7.18    17.26 ± 4.95
G      7       28.92 ± 1.89   26.89 ± 3.39   26.11 ± 4.67     28.83 ± 1.94    26.29 ± 4.01    22.84 ± 9.23
H      3       15.49 ± 8.10   14.91 ± 6.92   15.19 ± 4.33     13.23 ± 7.48    12.36 ± 5.74    12.97 ± 3.85
H      5       22.60 ± 8.35   22.39 ± 7.58   23.00 ± 4.53     22.20 ± 8.11    21.95 ± 7.28    11.67 ± 5.77
H      7       25.75 ± 0.59   26.81 ± 2.47   26.91 ± 5.16     25.64 ± 0.60    26.67 ± 2.61    14.67 ± 9.72
r2     3       56.63 ± 4.99   56.62 ± 4.99   56.65 ± 4.94     3.80 ± 2.29     3.81 ± 2.31     3.79 ± 2.31
r2     5       55.31 ± 1.07   55.22 ± 1.24   55.53 ± 1.17     5.03 ± 1.45     5.02 ± 1.45     4.95 ± 1.63
r2     7       58.28 ± 3.76   58.25 ± 3.78   57.61 ± 2.50     6.86 ± 3.36     7.20 ± 3.02     5.22 ± 0.54
U      3       15.80 ± 7.34   15.46 ± 6.59   16.44 ± 3.67     13.59 ± 6.58    12.82 ± 5.15    14.03 ± 2.68
U      5       22.22 ± 6.94   19.66 ± 7.46   23.34 ± 4.77     21.84 ± 6.68    19.39 ± 7.16    17.75 ± 7.28
U      7       27.43 ± 1.49   25.63 ± 0.90   26.18 ± 2.99     27.31 ± 1.52    25.09 ± 1.77    22.98 ± 8.18
U0     3       15.70 ± 7.18   15.10 ± 5.99   15.87 ± 6.51     13.62 ± 6.32    12.98 ± 5.08    13.80 ± 5.80
U0     5       22.03 ± 8.80   21.66 ± 8.44   22.79 ± 4.57     21.74 ± 8.63    21.30 ± 8.22    12.67 ± 5.19
U0     7       26.81 ± 3.12   21.71 ± 9.39   21.33 ± 9.74     26.71 ± 3.20    21.64 ± 9.46    21.13 ± 9.86
zpve   3       94.14 ± 0.26   94.25 ± 0.29   94.21 ± 0.45     2.17 ± 0.81     2.13 ± 0.92     1.81 ± 1.06
zpve   5       95.29 ± 0.41   94.82 ± 0.25   95.23 ± 0.29     7.18 ± 2.95     1.67 ± 0.50     3.82 ± 2.67
zpve   7       95.74 ± 0.32   95.03 ± 0.29   95.14 ± 0.22     19.62 ± 13.81   4.95 ± 3.97     2.03 ± 0.97
